Using the annotated-text field
The annotated-text field tokenizes text content in the same way as the more common text field (see
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}
Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text
and structured tokens. The annotations use a markdown-like syntax, with one or more
URL-encoded values separated by the & symbol.
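To make the syntax concrete, here is a minimal Python sketch that builds the markup for a piece of text and its annotation values. The helper name annotate is hypothetical and not part of any Elasticsearch client. It assumes that quote_plus from the Python standard library produces an encoding the field accepts, and it conservatively refuses values containing an = sign, since the limitations described later on this page say such documents are rejected.

```python
from urllib.parse import quote_plus


def annotate(text, *values):
    """Wrap `text` in annotation markup carrying one or more values.

    Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("at least one annotation value is required")
    for value in values:
        # '=' in annotation values is rejected at index time, so fail early.
        if "=" in value:
            raise ValueError(f"'=' is not allowed in annotation values: {value!r}")
    # URL-encode each value, then join the values with '&'.
    encoded = "&".join(quote_plus(value) for value in values)
    return f"[{text}]({encoded})"


print(annotate("Apple", "Apple Inc."))
print(annotate("Jeff Beck", "Jeff Beck", "Guitarist"))
```

The two calls print [Apple](Apple+Inc.) and [Jeff Beck](Jeff+Beck&Guitarist), the same markup used in the examples on this page.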
We can use the _analyze API to test how an example annotation would be stored as tokens in the search index:
GET my-index-000001/_analyze
{
  "field": "my_field",
  "text": "Investors in [Apple](Apple+Inc.) rejoiced."
}
Response:
{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.",
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
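Note the whole annotation token Apple Inc. is placed, unchanged, as a single token in the
token stream and at the same position (position 2) as the text token (apple) it annotates.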
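The offsets in the response refer to the text with the markup removed. The following Python sketch shows one way to reproduce them. It is an illustration only, not the plugin's parser: the regular expression and the function name strip_annotations are assumptions made for this example.

```python
import re
from urllib.parse import unquote_plus

# Assumed pattern for [anchor text](value1&value2) markup.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def strip_annotations(markup):
    """Return (plain_text, [(value, start_offset, end_offset), ...])."""
    plain_parts = []
    annotations = []
    cursor = 0        # read position in the marked-up input
    plain_length = 0  # length of the plain text produced so far
    for match in ANNOTATION.finditer(markup):
        before = markup[cursor:match.start()]
        plain_parts.append(before)
        plain_length += len(before)
        start = plain_length
        anchor = match.group(1)
        plain_parts.append(anchor)
        plain_length += len(anchor)
        # Every value shares the offsets of the anchor text it annotates.
        for raw in match.group(2).split("&"):
            if raw:
                annotations.append((unquote_plus(raw), start, plain_length))
        cursor = match.end()
    plain_parts.append(markup[cursor:])
    return "".join(plain_parts), annotations


text, found = strip_annotations("Investors in [Apple](Apple+Inc.) rejoiced.")
print(text)
print(found)
```

For the example sentence this yields the plain text Investors in Apple rejoiced. and a single annotation Apple Inc. covering offsets 13 to 18, the same offsets as the apple text token in the response.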
We can now perform searches for annotations using regular term queries that don’t tokenize
the provided search values. Annotations are a more precise way of matching, as can be seen
in this example where a search for Beck will not match Jeff Beck:
# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"
}
PUT my-index-000001/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"
}
# Example search
GET my-index-000001/_search
{
  "query": {
    "term": {
      "my_field": "Beck"
    }
  }
}
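As well as tokenising the plain text into single words, e.g. beck, here we inject the
single token value Beck at the same position as beck in the token stream.
Note annotations can inject multiple tokens at the same position - here we inject both
the very specific value Jeff Beck and the broader term Guitarist. This enables broader
positional queries, e.g. finding mentions of a Guitarist near to strat.
A benefit of searching with these carefully defined annotation tokens is that a query for
Beck will not match document 2, which contains the tokens jeff, beck and Jeff Beck.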
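The matching behaviour of the two example documents can be imitated with a small Python model. This is a deliberately simplified sketch: it assumes plain text is split into lowercased word tokens and that annotation values are indexed verbatim, and it ignores positions and scoring. The names index_terms and term_query are hypothetical.

```python
import re
from urllib.parse import unquote_plus

# Assumed pattern for [anchor text](value1&value2) markup.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def index_terms(markup):
    """Approximate the set of terms one field value contributes to the index."""
    terms = set()
    # Annotation values are kept exactly as written (after URL decoding).
    for match in ANNOTATION.finditer(markup):
        terms.update(unquote_plus(v) for v in match.group(2).split("&") if v)
    # The remaining plain text is split into lowercased words.
    plain = ANNOTATION.sub(lambda match: match.group(1), markup)
    terms.update(word.lower() for word in re.findall(r"\w+", plain))
    return terms


docs = {
    1: "[Beck](Beck) announced a new tour",
    2: "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat",
}


def term_query(value):
    """Return ids of documents containing exactly this term (no analysis)."""
    return sorted(doc_id for doc_id, markup in docs.items()
                  if value in index_terms(markup))


print(term_query("Beck"))       # [1]
print(term_query("beck"))       # [1, 2]
print(term_query("Guitarist"))  # [2]
```

In this model the exact term Beck exists only in document 1, while the lowercased text token beck is present in both documents, which is why the annotation gives the more precise match.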
Any use of = signs in annotation values, e.g. [Prince](person=Prince), will
cause the document to be rejected with a parse failure. We hope to have a use for
the equals sign in future, so documents that contain it are actively rejected today.
Synthetic _source
If using a keyword sub-field, the values are sorted in the same way as
a keyword field’s values are sorted. By default, that means sorted with
duplicates removed. So:
PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "annotated_text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
Will become:
{
  "text": [
    "jumped over the lazy dog",
    "the quick brown fox"
  ]
}
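The effect on the returned values can be modelled in a couple of lines of Python. This is only a sketch of the observable result (sorted, duplicates removed), not of how Elasticsearch stores or rebuilds the field. The function name is hypothetical, and Python's default string ordering is assumed to agree with keyword ordering for plain ASCII values such as these.

```python
def synthetic_keyword_values(values):
    """Model of what comes back: duplicates dropped, remainder sorted."""
    return sorted(set(values))


original = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]
print(synthetic_keyword_values(original))
# ['jumped over the lazy dog', 'the quick brown fox']
```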
Reordering text fields can have an effect on phrase
and span queries. See the discussion about position_increment_gap for more detail. You
can avoid this by making sure the slop parameter on the phrase queries
is lower than the position_increment_gap. This is the case with the default settings.
If the annotated_text field sets store to true then order and duplicates
are preserved.
PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "annotated_text", "store": true }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
Will become:
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}