IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Sampler Aggregation

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.

Example use cases:

Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
Reducing the running cost of aggregations that can produce useful results from a sample alone, e.g. significant_terms

Example:

A query on StackOverflow data for the popular term javascript OR the rarer term kibana will match many documents, most of them missing the word Kibana. To focus the significant_terms aggregation on the top-scoring documents, which are more likely to match the most interesting parts of our query, we use a sample.

POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["kibana", "javascript"]
                    }
                }
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 1000,
            "keywords": {
                "doc_count": 1000,
                "buckets": [
                    {
                        "key": "elasticsearch",
                        "doc_count": 150,
                        "score": 1.078125,
                        "bg_count": 200
                    },
                    {
                        "key": "logstash",
                        "doc_count": 50,
                        "score": 0.5625,
                        "bg_count": 50
                    }
                ]
            }
        }
    }
}

1000 documents were sampled in total because we asked for a maximum of 200 per shard from an index with 5 shards. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.
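
The arithmetic behind that total can be written out as a small Python sketch. The function name max_sampled_docs is hypothetical and only for illustration. It gives an upper bound, since a shard holding fewer matching documents than shard_size contributes fewer.

```python
def max_sampled_docs(shard_size: int, num_shards: int) -> int:
    """Upper bound on documents the sampler passes to its sub-aggregations."""
    # Each shard contributes at most `shard_size` top-scoring documents.
    return shard_size * num_shards


# Values from the example above: shard_size of 200 on an index with 5 shards.
print(max_sampled_docs(shard_size=200, num_shards=5))  # 1000
```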

Without the sampler aggregation, the request considers the full "long tail" of low-quality matches and therefore identifies less significant terms such as jquery and angular, rather than focusing on the more insightful Kibana-related terms.

POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "low_quality_keywords": {
            "significant_terms": {
                "field": "tags",
                "size": 3,
                "exclude": ["kibana", "javascript"]
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "low_quality_keywords": {
            "doc_count": 1000,
            "buckets": [
                {
                    "key": "angular",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "jquery",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "logstash",
                    "doc_count": 50,
                    "score": 0.0069,
                    "bg_count": 50
                }
            ]
        }
    }
}

shard_size

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
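
As a rough illustration of what "top-scoring documents per shard" means, the following Python sketch keeps the highest-scoring entries from one shard's matches. It is a conceptual model only, not Elasticsearch's implementation; the sample_shard function and the (doc_id, score) tuples are assumptions made for this example.

```python
import heapq


def sample_shard(scored_docs, shard_size=100):
    """Keep the `shard_size` highest-scoring (doc_id, score) pairs from one shard."""
    # nlargest returns the selected pairs ordered from highest to lowest score.
    return heapq.nlargest(shard_size, scored_docs, key=lambda doc: doc[1])


# Hypothetical matches on a single shard.
shard = [("a", 0.2), ("b", 3.1), ("c", 1.7), ("d", 0.9)]
print(sample_shard(shard, shard_size=2))  # [('b', 3.1), ('c', 1.7)]
```

With the default of 100, a shard holding fewer than 100 matches simply contributes all of them.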

Limitations

Cannot be nested under `breadth_first` aggregations

Being a quality-based filter, the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation whose collect_mode has been switched from the default depth_first mode to breadth_first, because that mode discards scores. In this situation an error is thrown.
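
To catch this nesting before a request is sent, a client could walk the aggregation tree itself. The Python sketch below is a hypothetical client-side check, not part of any Elasticsearch client library. It assumes the request body is a plain dict that uses the "aggs" key, as in the examples above, and the by_tag aggregation is invented for illustration.

```python
def sampler_under_breadth_first(aggs, under_breadth_first=False):
    """Return names of sampler aggs nested under a breadth_first terms agg."""
    offenders = []
    for name, body in aggs.items():
        if "sampler" in body and under_breadth_first:
            offenders.append(name)
        # Once a breadth_first terms agg is seen, everything below it is affected.
        terms = body.get("terms", {})
        flag = under_breadth_first or terms.get("collect_mode") == "breadth_first"
        offenders += sampler_under_breadth_first(body.get("aggs", {}), flag)
    return offenders


# Hypothetical request fragment showing the nesting described above.
request_aggs = {
    "by_tag": {
        "terms": {"field": "tags", "collect_mode": "breadth_first"},
        "aggs": {
            "sample": {
                "sampler": {"shard_size": 200},
                "aggs": {"keywords": {"significant_terms": {"field": "tags"}}},
            }
        },
    }
}
print(sampler_under_breadth_first(request_aggs))  # ['sample']
```

Leaving collect_mode at its depth_first default makes the same tree pass the check.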
