Fingerprint Analyzer
The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
Definition
It consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters (in order)
  - Lower Case Token Filter
  - ASCII Folding Token Filter
  - Stop Token Filter (disabled by default)
  - Fingerprint Token Filter
Example output
POST _analyze
{
"analyzer": "fingerprint",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
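To make the steps concrete, the following Python sketch approximates the analysis chain for simple Latin-script input. It is an illustration under stated assumptions, not the analyzer's actual implementation: the function name `fingerprint` is hypothetical, a regular expression stands in for the real tokenizer, and Unicode decomposition stands in for ASCII folding. The keyword arguments are named after the analyzer's parameters for readability only.

```python
import re
import unicodedata


def fingerprint(text, stopwords=(), separator=" ", max_output_size=255):
    """Approximate the fingerprint analysis chain in plain Python.

    Hypothetical helper for illustration; behaviour on non-Latin text or
    unusual punctuation may differ from the real analyzer.
    """
    # Tokenizer (approximation): split on runs of word characters.
    tokens = re.findall(r"\w+", text)

    # Lower Case Token Filter.
    tokens = [t.lower() for t in tokens]

    # ASCII Folding Token Filter (approximation): decompose characters
    # and drop the combining marks, so "ö" becomes "o".
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))

    # Stop Token Filter: a no-op when no stop words are configured.
    stop = set(stopwords)
    kept = [t for t in folded if t not in stop]

    # Fingerprint Token Filter: sort, deduplicate, and concatenate.
    joined = separator.join(sorted(set(kept)))

    # Assumption: an oversized fingerprint is dropped rather than truncated.
    if len(joined) > max_output_size:
        return None
    return joined


sentence = "Yes yes, G\u00f6del said this sentence is consistent and."
print(fingerprint(sentence))
# prints: and consistent godel is said sentence this yes
```

Sorting and deduplicating before joining is what makes the result independent of word order and repetition, which is why two differently worded variants of the same phrase can collapse to the same term.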
Configuration
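The fingerprint analyzer accepts the following parameters:
- separator: The character to use to concatenate the terms. Defaults to a space.
- max_output_size: The maximum token size to emit. Defaults to 255. Tokens larger than this size will be discarded.
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
- stopwords_path: The path to a file containing stop words.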
See the Stop Token Filter for more information about stop word configuration.
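As a rough illustration of how these parameters fit together, the Python sketch below builds an index settings body that sets a custom separator and output size limit alongside a stop word list. The analyzer name and the chosen values are hypothetical, and the script only prints the JSON body; it does not send it to a cluster.

```python
import json

# Hypothetical settings body: the analyzer name "my_fingerprint_analyzer"
# and the values below are illustrative, not recommended defaults.
body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_fingerprint_analyzer": {
                    "type": "fingerprint",
                    # Join the sorted terms with "+" instead of a space.
                    "separator": "+",
                    # Limit the size of the emitted token.
                    "max_output_size": 100,
                    # Use the pre-defined English stop word list.
                    "stopwords": "_english_",
                }
            }
        }
    }
}

# Serialize to JSON, suitable as the body of an index creation request.
print(json.dumps(body, indent=2))
```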
Example configuration
In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_fingerprint_analyzer": {
"type": "fingerprint",
"stopwords": "_english_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_fingerprint_analyzer",
"text": "Yes yes, Gödel said this sentence is consistent and."
}
The above example produces the following term:
[ consistent godel said sentence yes ]