UAX URL email tokenizer
The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
Example output
response = client.indices.analyze(
  body: {
    tokenizer: 'uax_url_email',
    text: 'Email me at [email protected]'
  }
)
puts response

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at [email protected]"
}
The above sentence would produce the following terms:
[ Email, me, at, [email protected] ]
while the standard tokenizer would produce:
[ Email, me, at, john.smith, global, international.com ]
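For comparison, the standard tokenizer output shown above can be reproduced by running the same text through an analyze request with the standard tokenizer. This is a minimal sketch following the same Ruby client pattern as the example above:

# Analyze the same sentence with the standard tokenizer to compare its output
# with the uax_url_email tokenizer.
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    text: 'Email me at [email protected]'
  }
)
puts response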
Configuration
The uax_url_email tokenizer accepts the following parameters:
max_token_length
The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
Example configuration
In this example, we configure the uax_url_email tokenizer to have a max_token_length of 5 (for demonstration purposes):
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'uax_url_email',
            max_token_length: 5
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: '[email protected]'
  }
)
puts response

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "[email protected]"
}
The above example produces the following terms:
[ john, smith, globa, l, inter, natio, nal.c, om ]
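To apply a custom analyzer built on the uax_url_email tokenizer at index time, it can be referenced from a field mapping. The following is a minimal sketch, not part of the original example; the index name my-index-000002 and the email_body field are illustrative assumptions:

# Create an index whose email_body field uses an analyzer based on the
# uax_url_email tokenizer, so URLs and email addresses are indexed as
# single tokens. Index and field names here are hypothetical.
response = client.indices.create(
  index: 'my-index-000002',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'uax_url_email'
          }
        }
      }
    },
    mappings: {
      properties: {
        email_body: {
          type: 'text',
          analyzer: 'my_analyzer'
        }
      }
    }
  }
)
puts response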