UAX URL email tokenizer
The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
Example output
Python:

resp = client.indices.analyze(
    tokenizer="uax_url_email",
    text="Email me at john.smith@global-international.com",
)
print(resp)

Ruby:

response = client.indices.analyze(
  body: {
    tokenizer: 'uax_url_email',
    text: 'Email me at john.smith@global-international.com'
  }
)
puts response

JavaScript:

const response = await client.indices.analyze({
  tokenizer: "uax_url_email",
  text: "Email me at john.smith@global-international.com",
});
console.log(response);

Console:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
The above sentence would produce the following terms:
[ Email, me, at, john.smith@global-international.com ]
while the standard tokenizer would produce:
[ Email, me, at, john.smith, global, international.com ]
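The difference between the two tokenizers can be approximated in plain Python. This is only a rough sketch: the real tokenizer follows the Unicode word-break rules (UAX #29), whereas the regular expressions EMAIL, URL and WORD and both helper functions below are assumptions made up for illustration.

```python
import re

# Illustrative patterns only (assumptions), not the grammar Elasticsearch uses.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
URL = r"https?://\S+"
# A word: letters/digits, optionally joined by single dots or apostrophes.
WORD = r"\w+(?:[.']\w+)*"


def standard_like(text):
    """Approximate the standard tokenizer: words only."""
    return re.findall(WORD, text)


def uax_url_email_like(text):
    """Approximate uax_url_email: try URL and email patterns before words."""
    return re.findall(f"{URL}|{EMAIL}|{WORD}", text)


text = "Email me at john.smith@global-international.com"
print(standard_like(text))
# ['Email', 'me', 'at', 'john.smith', 'global', 'international.com']
print(uax_url_email_like(text))
# ['Email', 'me', 'at', 'john.smith@global-international.com']
```

Trying the URL and email patterns before the word pattern is what keeps the address in one piece. With the word pattern alone, the @ and the hyphen act as token boundaries.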
Configuration
The uax_url_email tokenizer accepts the following parameters:
max_token_length
    The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
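Splitting at fixed intervals can be pictured as plain string slicing. The helper split_at_intervals below is a hypothetical illustration, not part of any Elasticsearch client. The real tokenizer also applies its word-break rules to the pieces, so its output can differ from this simple slicing around punctuation.

```python
def split_at_intervals(token, max_token_length=255):
    """Cut a token into consecutive pieces of at most max_token_length characters.

    Hypothetical helper for illustration; 255 mirrors the documented default.
    """
    if max_token_length < 1:
        raise ValueError("max_token_length must be at least 1")
    return [
        token[start:start + max_token_length]
        for start in range(0, len(token), max_token_length)
    ]


print(split_at_intervals("global", 5))             # ['globa', 'l']
print(split_at_intervals("international.com", 5))  # ['inter', 'natio', 'nal.c', 'om']
```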
Example configuration
In this example, we configure the uax_url_email tokenizer to have a max_token_length of 5 (for demonstration purposes):
Python:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "uax_url_email",
                    "max_token_length": 5
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="john.smith@global-international.com",
)
print(resp1)

Ruby:

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'uax_url_email',
            max_token_length: 5
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: 'john.smith@global-international.com'
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "uax_url_email",
          max_token_length: 5,
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "john.smith@global-international.com",
});
console.log(response1);

Console:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john.smith@global-international.com"
}
The above example produces the following terms:
[ john, smith, globa, l, inter, natio, nal.c, om ]
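The analyze API returns a list of token objects rather than a bare list of terms. The term lists on this page correspond to the token field of each entry. Below is a minimal sketch of extracting them, assuming the response behaves like a mapping with a "tokens" list. The helper name terms is hypothetical, and sample_response is a hand-written, trimmed stand-in (real entries also carry offsets, a type and a position), not captured output.

```python
def terms(analyze_response):
    """Return the term strings from an _analyze-style response mapping."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Hand-written stand-in for illustration; only the "token" field is kept.
sample_response = {
    "tokens": [
        {"token": "john"},
        {"token": "smith"},
        {"token": "globa"},
        {"token": "l"},
    ]
}

print(terms(sample_response))  # ['john', 'smith', 'globa', 'l']
```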