WARNING: Version 6.2 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
UAX URL Email Tokenizer
The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
Example output
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
The above sentence would produce the following terms:
[ Email, me, at, john.smith@global-international.com ]
while the standard tokenizer would produce:
[ Email, me, at, john.smith, global, international.com ]
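The difference between the two outputs can be imitated with a small Python sketch. This is only a toy model: the regular expressions and the function names toy_standard and toy_uax_url_email are assumptions made up for illustration, URLs are not handled at all, and the real tokenizers follow the much richer Unicode Text Segmentation rules rather than these patterns.

```python
import re

# Simplified, hypothetical patterns for illustration only.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
WORD = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"  # letters/digits, dots allowed inside


def toy_standard(text):
    """Emit word-like runs; '@' and '-' act as separators."""
    return re.findall(WORD, text)


def toy_uax_url_email(text):
    """Try the email pattern first, then fall back to plain words."""
    return re.findall(f"{EMAIL}|{WORD}", text)


text = "Email me at john.smith@global-international.com"
print(toy_standard(text))
print(toy_uax_url_email(text))
```

The only difference between the two functions is that the second one tries to match a whole email address before falling back to word-level matching, which is why the address survives as a single term.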
Configuration
The uax_url_email tokenizer accepts the following parameters:
| max_token_length | The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255. |
Example configuration
In this example, we configure the uax_url_email tokenizer to have a max_token_length of 5 (for demonstration purposes):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john.smith@global-international.com"
}
The above example produces the following terms:
[ john, smith, globa, l, inter, natio, nal.c, om ]
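The terms above are not simple five-character chunks of the input: the separators are dropped and a dot is kept only when letters surround it inside one term (nal.c). One way to reason about this is a toy scanner that only ever looks at a window of max_token_length characters. The Python sketch below is an assumption-laden model, not the tokenizer's actual implementation; the WORD pattern and the toy_tokenize function are hypothetical, and email recognition is left out because the address is far longer than the five-character limit.

```python
import re

# Hypothetical word pattern: letters/digits, with dots allowed only
# between alphanumeric runs. Illustration only.
WORD = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def toy_tokenize(text, max_token_length):
    """Greedily match a word inside a sliding window of limited size."""
    tokens = []
    i = 0
    while i < len(text):
        # The scanner never sees more than max_token_length characters.
        window = text[i:i + max_token_length]
        match = WORD.match(window)
        if match:
            tokens.append(match.group())
            i += match.end()
        else:
            i += 1  # skip a separator such as '.', '@' or '-'
    return tokens


print(toy_tokenize("john.smith@global-international.com", 5))
```

Under this model, john. cannot be a term because a trailing dot is not followed by a letter inside the window, so john is emitted and the dot is skipped; nal.c fits in the window with the dot in the middle, so it is kept whole.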