Machine learning anomaly detection limitations
The following limitations and known problems apply to the 8.16.2 release of the Elastic machine learning features. The limitations are grouped into four categories:
- Platform limitations are related to the platform that hosts the machine learning feature of the Elastic Stack.
- Configuration limitations apply to the configuration process of the anomaly detection jobs.
- Operational limitations affect the behavior of the anomaly detection jobs that are running.
- Limitations in Kibana only apply to anomaly detection jobs managed via the user interface.
Platform limitations
CPUs must support SSE4.2
Machine learning uses Streaming SIMD Extensions (SSE) 4.2 instructions, so it works only on machines whose CPUs support SSE4.2. If you run Elasticsearch on older hardware, you must disable machine learning by setting xpack.ml.enabled to false. See Machine learning settings in Elasticsearch.
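On Linux, one rough way to check for SSE4.2 before starting a node is to look for the sse4_2 flag in /proc/cpuinfo. The following is a minimal sketch, not an official tool: the helper name cpu_supports_sse42 is hypothetical, and the parsing assumes the usual Linux /proc/cpuinfo layout. Other operating systems expose CPU features differently.

```python
from pathlib import Path


def cpu_supports_sse42(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists sse4_2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "sse4_2" in value.split():
            return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only (assumption)
    if cpuinfo.exists():
        if cpu_supports_sse42(cpuinfo.read_text()):
            print("SSE4.2 is available")
        else:
            print("SSE4.2 is missing: set xpack.ml.enabled to false")
    else:
        print("/proc/cpuinfo not found; check CPU features another way")
```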
CPU scheduling improvements apply to Linux and macOS only
When many machine learning jobs run at the same time and CPU resources are insufficient, JVM performance must be prioritized so that search and indexing latency remain acceptable. To that end, when CPU is constrained on Linux and macOS environments, the CPU scheduling priority of native analysis processes is reduced to favor the Elasticsearch JVM. This improvement does not apply to Windows environments.
License limitations for machine learning jobs with CCS
You must have an appropriate license to initiate machine learning jobs on datasets from remote clusters accessed through cross-cluster search (CCS). Refer to the [Subscriptions](https://www.elastic.co/subscriptions) page for details on features available with different subscription levels.
Configuration limitations
Terms aggregation size affects data analysis
By default, the terms aggregation returns the buckets for the top ten terms. You can change this default behavior by setting the size parameter.
If you send pre-aggregated data to a job for analysis, you must ensure that the size is configured correctly. Otherwise, some data might not be analyzed.
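As an illustration, the sketch below builds the aggregations part of a datafeed configuration as a Python dictionary, with an explicit size on the terms aggregation. The field names (time, airline, responsetime), the 5m interval, and the bound of 100 distinct terms are assumptions made for the example; choose a size that covers the number of distinct values in your own data.

```python
import json

# Assumption: no more than 100 distinct airline values need to be analyzed.
EXPECTED_DISTINCT_TERMS = 100

# Hypothetical field names, for illustration only.
datafeed_aggregations = {
    "buckets": {
        "date_histogram": {"field": "time", "fixed_interval": "5m"},
        "aggregations": {
            # Datafeed aggregations include a max aggregation on the time field.
            "time": {"max": {"field": "time"}},
            "airline": {
                # Without an explicit size, only the top ten terms are returned.
                "terms": {"field": "airline", "size": EXPECTED_DISTINCT_TERMS},
                "aggregations": {
                    "responsetime": {"avg": {"field": "responsetime"}}
                },
            },
        },
    }
}

print(json.dumps(datafeed_aggregations, indent=2))
```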
Scripted metric aggregations are not supported
Using scripted metric aggregations in datafeeds is not supported. Refer to the Aggregating data for faster performance page to learn more about aggregations in datafeeds.
Fields named "by", "count", or "over" cannot be used to split data
You cannot use the following field names in the by_field_name or over_field_name properties in a job: by, count, over. This limitation also applies to those properties when you create advanced jobs in Kibana.
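A job configuration can be checked for these names before it is submitted. The following sketch is a hypothetical client-side helper (invalid_split_fields is not part of any Elastic API) that scans a list of detector definitions.

```python
RESERVED_SPLIT_FIELD_NAMES = {"by", "count", "over"}


def invalid_split_fields(detectors):
    """List by_field_name/over_field_name values that use a reserved name."""
    problems = []
    for index, detector in enumerate(detectors):
        for prop in ("by_field_name", "over_field_name"):
            if detector.get(prop) in RESERVED_SPLIT_FIELD_NAMES:
                problems.append(f"detector {index}: {prop}={detector[prop]!r}")
    return problems


# Hypothetical detectors: the first is fine, the second splits on "count".
detectors = [
    {"function": "mean", "field_name": "bytes", "by_field_name": "host"},
    {"function": "count", "by_field_name": "count"},
]
print(invalid_split_fields(detectors))
```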
Arrays in analyzed fields are turned into comma-separated strings
If an anomaly detection job is configured to analyze an aggregatable field (a field that is part of the index mapping definition), and this field contains an array, then the array is turned into a comma-separated concatenated string. The items in the array are sorted alphabetically and the duplicated items are removed. For example, the array ["zebra", "dog", "cat", "alligator", "cat"] becomes alligator,cat,dog,zebra. The Anomaly Explorer charts don’t display any results for the job as the string does not exist in the source data. The Single Metric Viewer displays results if the model plot is enabled.
If an array field is not aggregatable and is retrieved from _source, the array is also turned into a comma-separated, concatenated list. However, the list items are not sorted alphabetically, nor are they deduplicated. Taking the example above, the comma-separated list, in this case, would be zebra,dog,cat,alligator,cat.
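The two behaviors described above can be mimicked in a few lines. This is only a model of the documented outcome, not the code that the datafeed runs; flatten_array is a hypothetical name.

```python
def flatten_array(values, aggregatable):
    """Mimic the described handling of array fields.

    Aggregatable fields: items are deduplicated and sorted alphabetically.
    Fields read from _source: items keep their original order and duplicates.
    """
    if aggregatable:
        values = sorted(set(values))
    return ",".join(values)


example = ["zebra", "dog", "cat", "alligator", "cat"]
print(flatten_array(example, aggregatable=True))   # alligator,cat,dog,zebra
print(flatten_array(example, aggregatable=False))  # zebra,dog,cat,alligator,cat
```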
Analyzing large arrays results in long strings which may require more system resources. Consider using a query in the datafeed that filters on the relevant items of the array.
Anomaly detection jobs on frozen tier data cannot be created in Kibana
You cannot create anomaly detection jobs on frozen tier data through the job wizards in Kibana. If you want to create such jobs, use the APIs instead.
Unsupported forecast configurations
There are some limitations that affect your ability to create a forecast:
- You can generate only three forecasts per anomaly detection job concurrently. There is no limit to the number of forecasts that you retain. Existing forecasts are not overwritten when you create new forecasts. Rather, they are automatically deleted when they expire.
- If you use an over_field_name property in your anomaly detection job (that is to say, it’s a population job), you cannot create a forecast.
- If you use any of the following analytical functions in your anomaly detection job, you cannot create a forecast:
  - lat_long
  - rare and freq_rare
  - time_of_day and time_of_week
  For more information about any of these functions, see Function reference.
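The configuration-related restrictions above (population jobs and the unsupported functions) can be checked against a job definition before a forecast is requested. The sketch below assumes the job configuration is available as a dictionary with the usual analysis_config.detectors layout; forecast_blockers is a hypothetical helper, and it does not cover the limit on concurrent forecasts.

```python
UNSUPPORTED_FORECAST_FUNCTIONS = {
    "lat_long", "rare", "freq_rare", "time_of_day", "time_of_week",
}


def forecast_blockers(job_config):
    """Return the reasons why this job configuration cannot be forecast."""
    reasons = []
    detectors = job_config["analysis_config"]["detectors"]
    for index, detector in enumerate(detectors):
        if detector.get("over_field_name"):
            reasons.append(f"detector {index} uses over_field_name (population job)")
        if detector.get("function") in UNSUPPORTED_FORECAST_FUNCTIONS:
            reasons.append(f"detector {index} uses function {detector['function']}")
    return reasons


# Hypothetical job: a rare detector in a population analysis.
job = {"analysis_config": {"detectors": [
    {"function": "rare", "by_field_name": "status", "over_field_name": "clientip"},
]}}
for reason in forecast_blockers(job):
    print(reason)
```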
Anomaly detection performs better on indexed fields
Anomaly detection jobs sort all data by a user-defined time field, which is frequently accessed. If the time field is a runtime field, the performance impact of calculating field values at query time can significantly slow the job. Use an indexed field as a time field when running anomaly detection jobs.
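A partial check for this situation is to see whether the time field of the job is defined under runtime_mappings in the datafeed. The sketch below is hypothetical and incomplete by design: it assumes both configurations are available as dictionaries, it falls back to a time field named time when none is set, and it cannot see runtime fields that are defined in the index mappings themselves.

```python
def time_field_is_runtime(job_config, datafeed_config):
    """True if the job's time field is defined in the datafeed's runtime_mappings."""
    # Assumption: fall back to "time" when data_description.time_field is absent.
    time_field = job_config.get("data_description", {}).get("time_field", "time")
    return time_field in datafeed_config.get("runtime_mappings", {})


# Hypothetical configurations for illustration.
job = {"data_description": {"time_field": "event_time"}}
datafeed = {"runtime_mappings": {"event_time": {"type": "date"}}}
if time_field_is_runtime(job, datafeed):
    print("The time field is a runtime field; consider indexing it instead")
```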
Deprecation warnings for Painless scripts in datafeeds
If a datafeed contains Painless scripts that use deprecated syntax, deprecation warnings are displayed when the datafeed is previewed or started. However, it is not possible to check for deprecation warnings across all datafeeds as a bulk action because running the required queries might be a resource-intensive process. Therefore, deprecation warnings caused by deprecated Painless syntax are not available in the Upgrade assistant.
Operational limitations
Categorization uses English dictionary words
Categorization identifies static parts of unstructured logs and groups similar messages together. The default categorization tokenizer assumes English language log messages. For other languages you must define a different categorization_analyzer for your job.
Additionally, a dictionary used to influence the categorization process contains only English words. This means categorization might work better in English than in other languages. The ability to customize the dictionary will be added in a future release.
Misleading high missing field counts
One of the counts associated with a machine learning job is missing_field_count, which indicates the number of records that are missing a configured field.
Since jobs analyze JSON data, the missing_field_count might be misleading. Missing fields might be expected due to the structure of the data and therefore do not generate poor results.
For more information about missing_field_count, see the get anomaly detection job statistics API.
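To put the count in context, it can help to compare it with the number of processed records. The sketch below assumes a data_counts object, as returned by the job statistics API, has already been parsed into a dictionary; the helper name and the sample numbers are made up for illustration.

```python
def missing_field_ratio(data_counts):
    """Fraction of processed records that were missing a configured field."""
    processed = data_counts.get("processed_record_count", 0)
    if processed == 0:
        return 0.0
    return data_counts.get("missing_field_count", 0) / processed


# Made-up numbers, not real job output.
sample_data_counts = {"processed_record_count": 2000, "missing_field_count": 500}
print(f"{missing_field_ratio(sample_data_counts):.1%} of records miss a field")
```

A high ratio is not necessarily a problem. Whether it matters depends on whether the missing fields are expected for the structure of your data.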
Security integration
When the Elasticsearch security features are enabled, a datafeed stores the roles that the user had at the time they created or last updated the datafeed. If the roles assigned to that user change afterwards, the datafeed continues to run with the stored roles, without change. However, if the permissions associated with the stored roles change, the datafeed is affected. For more information, see Datafeeds.
Job and datafeed APIs have a maximum search size
In 6.6 and later releases, the get jobs API and the get job statistics API return a maximum of 10,000 jobs. Likewise, the get datafeeds API and the get datafeed statistics API return a maximum of 10,000 datafeeds.
Forecast operational limitations
There are some factors to consider when you run forecasts:
- Forecasts run concurrently with real-time machine learning analysis. That is to say, machine learning analysis does not stop while forecasts are generated. Forecasts can have an impact on anomaly detection jobs, however, especially in terms of memory usage. For this reason, forecasts run only if the model memory status is acceptable.
- The anomaly detection job must be open when you create a forecast. Otherwise, an error occurs.
- If there is insufficient data to generate any meaningful predictions, an error occurs. In general, forecasts that are created early in the learning phase of the data analysis are less accurate.
Limitations in Kibana
Pop-ups must be enabled in browsers
The machine learning features in Kibana use pop-ups. You must configure your web browser so that it does not block pop-up windows or create an exception for your Kibana URL.
Anomaly Explorer and Single Metric Viewer omissions and limitations
In Kibana, Anomaly Explorer and Single Metric Viewer charts are not displayed:
- for anomalies that were due to categorization (if model plot is not enabled),
- if the datafeed uses scripted fields and model plot is not enabled (except for scripts that define metric fields),
- if the datafeed uses composite aggregations that have composite sources other than terms and date_histogram,
- if your datafeed uses aggregations with nested terms aggs and model plot is not enabled,
- for freq_rare functions,
- for info_content, high_info_content, low_info_content functions,
- for lat_long geographic functions,
- for time_of_day, time_of_week functions,
- for varp, high_varp, low_varp functions.
Refer to the table below for a more detailed view of supported detector functions.
The charts can also look odd in circumstances where there is very little data to plot. For example, if there is only one data point, it is represented as a single dot. If there are only two data points, they are joined by a line. The following table shows which detector functions are supported in the Single Metric Viewer.
Table 1. Detector function support in the Anomaly Explorer and the Single Metric Viewer
Detector functions | Function description | Supported |
---|---|---|
count, high_count, low_count, non_zero_count, low_non_zero_count | | yes |
count, high_count, low_count, non_zero_count, low_non_zero_count with summary_count_field_name that is not doc_count (model plot not enabled) | | yes |
non_zero_count with summary_count_field that is not doc_count using cardinality aggregation in datafeed config (model plot not enabled) | | yes |
distinct_count, high_distinct_count, low_distinct_count | | yes |
mean, high_mean, low_mean | | yes |
min | | yes |
max | | yes |
metric | | yes |
median, high_median, low_median | | yes |
sum, high_sum, low_sum, non_null_sum, high_non_null_sum, low_non_null_sum | | yes |
varp, high_varp, low_varp | | yes (only if model plot is enabled) |
lat_long | | no (but map is displayed in the Anomaly Explorer) |
info_content, high_info_content, low_info_content | | yes (only if model plot is enabled) |
rare | | yes |
freq_rare | | no |
time_of_day, time_of_week | | no |
Jobs created in Kibana must use datafeeds
If you create jobs in Kibana, you must use datafeeds. If the data that you want to analyze is not stored in Elasticsearch, you cannot use datafeeds and therefore you cannot create your jobs in Kibana. You can, however, use the machine learning APIs to create jobs. For more information, see Datafeeds and API quick reference.
Jobs created in Kibana use model plot config and pre-aggregated data
If you create single or multi-metric jobs in Kibana, Kibana might enable some options behind the scenes that you might want to reconsider for large or long-running jobs.
For example, when you create a single metric job in Kibana, it generally enables the model_plot_config advanced configuration option. That configuration option causes model information to be stored along with the results and provides a more detailed view into anomaly detection. It is specifically used by the Single Metric Viewer in Kibana. When this option is enabled, however, it can add considerable overhead to the performance of the system. If you have jobs with many entities, for example data from tens of thousands of servers, storing this additional model information for every bucket might be problematic. If you are not certain that you need this option or if you experience performance issues, edit your job configuration to disable this option.
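One way to disable the option outside Kibana is through the update anomaly detection jobs API. The sketch below only builds the request (method, path, and JSON body) and sends nothing; the helper name and the job ID my-job are hypothetical, and how you send the request depends on your client.

```python
import json


def disable_model_plot_request(job_id):
    """Return (method, path, body) for an update-job request that disables model plot."""
    body = json.dumps({"model_plot_config": {"enabled": False}})
    return "POST", f"/_ml/anomaly_detectors/{job_id}/_update", body


method, path, body = disable_model_plot_request("my-job")  # hypothetical job ID
print(method, path)
print(body)
```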
Likewise, when you create a single or multi-metric job in Kibana, in some cases it uses aggregations on the data that it retrieves from Elasticsearch. One of the benefits of summarizing data this way is that Elasticsearch automatically distributes these calculations across your cluster. This summarized data is then fed into machine learning instead of raw results, which reduces the volume of data that must be considered while detecting anomalies. However, if you have two jobs, one of which uses pre-aggregated data and another that does not, their results might differ. This difference is due to the difference in precision of the input data. The machine learning analytics are designed to be aggregation-aware and the likely increase in performance that is gained by pre-aggregating the data makes the potentially poorer precision worthwhile. If you want to view or change the aggregations that are used in your job, refer to the aggregations property in your datafeed.
When the aggregation interval of the datafeed and the bucket span of the job don’t match, the values of the chart plotted in both the Single Metric Viewer and the Anomaly Explorer differ from the actual values of the job. To avoid this behavior, make sure that the aggregation interval in the datafeed configuration and the bucket span in the anomaly detection job configuration have the same values.
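A quick consistency check is to convert both values to the same unit and compare them. The sketch below is a simplified, hypothetical helper: it assumes both values are plain time strings with one of the units ms, s, m, h, or d (for example 300s or 5m) and does not handle calendar intervals.

```python
import re

# Milliseconds per supported unit (simplified set, an assumption of this sketch).
_UNIT_MILLIS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def to_millis(time_value):
    """Convert a value such as '5m' or '300s' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", time_value)
    if match is None:
        raise ValueError(f"unsupported time value: {time_value!r}")
    return int(match.group(1)) * _UNIT_MILLIS[match.group(2)]


def interval_matches_bucket_span(fixed_interval, bucket_span):
    """True if the datafeed aggregation interval equals the job bucket span."""
    return to_millis(fixed_interval) == to_millis(bucket_span)


print(interval_matches_bucket_span("300s", "5m"))  # True
print(interval_matches_bucket_span("1m", "5m"))    # False
```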
Calendars and filters are visible in all Kibana spaces
Spaces enable you to organize your anomaly detection jobs in Kibana and to see only the jobs and other saved objects that belong to your space. However, this limited scope does not apply to calendars and filters; they are visible in all spaces.
Rollup indices are not supported in Kibana
Rollup indices and data views with rolled up indices cannot be used in anomaly detection jobs or datafeeds in Kibana. If you try to analyze data that exists in an index that uses the experimental data rollup features, the anomaly detection job creation wizards fail. If you use APIs to create anomaly detection jobs that use data rollup features, the job results might not display properly in the Single Metric Viewer or Anomaly Explorer in Kibana.