How to Use Fuzzy Searches in Elasticsearch
UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.
Elasticsearch's Fuzzy query is a powerful tool for a multitude of situations. Username searches, misspellings, and other funky problems can oftentimes be solved with this unconventional query. In this article we clarify the sometimes confusing options for fuzzy searches, as well as dive into the internals of Lucene's FuzzyQuery.
Introduction
Searching natural language is inherently imprecise. Since computers can't comprehend natural language, there are a plethora of approaches to search, each with its own advantages and drawbacks. Lucene, the technology underlying Elasticsearch, is a swiss-army knife composed of many text processing tools. Each tool within it is a heuristic, an algorithmic shortcut in lieu of true linguistic comprehension. Some of these tools, like the Snowball stemmer and the Metaphone phonetic analyzer, are quite sophisticated. These tools mimic grammatical and phonetic aspects of language comprehension respectively. Other tools are very basic, like the prefix query type, which simply matches the beginning letters of words. Fuzzy queries sit somewhere in the middle of this toolchest in terms of sophistication; they find words that need at most a certain number of character modifications, known as 'edits', to match the query. For instance, a fuzzy search for 'ax' would match the word 'axe', since only a single deletion, removing the 'e', is required to match the two words.
A Simple Fuzzy Match
Fuzzy queries can most easily be performed through additional arguments to the match query type, as seen in the example below this paragraph. In the example, the final request, a search for 'vaacuum', which has an extra A, should still turn up the 'Vacuum' product. The fuzziness argument specifies that the results match with a maximum edit distance of 2. It should be noted that fuzziness should only be used with values of 1 and 2, meaning a maximum of 2 edits between the query and a term in a document is allowed. Larger differences are far more expensive to compute efficiently and are not processed by Lucene. The official documentation still refers to setting fuzziness to float values, like 0.5, but these values are in-fact deprecated, and are harder to reason about.
# Create the index
PUT /fuzzy_products
# Create the product mapping
PUT /fuzzy_products/product/_mapping
{
"product": {
"properties": {
"name": {
"type": "string",
"analyzer": "simple"
}
}
}
}
# Upload some documents
PUT /fuzzy_products/product/1
{"name": "Vacuum Cleaner"}
PUT /fuzzy_products/product/2
{"name": "Turkey Baster"}
# Perform a fuzzy search!
POST /fuzzy_products/product/_search
{
"query": {
"match": {
"name": {
"query": "Vacuummm",
"fuzziness": 2,
"prefix_length": 1
}
}
}
}
Determining Edit Distance
The metric used by fuzzy queries to determine a match is the Damerau-Levenshtein distance formula. Simply put, the Damerau-Levenshtein distance between two pieces of text is the number of insertions, deletions, substitutions, and transpositions needed to make one string match the other. As an example, the Levenshtein distance between the words “ax” and “axe” is 1 due to the single deletion required.
The Damerau-Levenshtein distance formula is a modification of the classic Levenshtein distance formula, altering it by adding transposition as a valid operation. Both formulas are supported, with Damerau-Levenshtein being the default, and classic Levenshtein being selectable by setting transpositions to false in the query. The utility of transpositions can be seen in the case of comparing the strings 'aex' and 'axe'. When using the classic Levenshtein distance formula, 'aex' is not one, but two edits away; the 'e' must be deleted, after which a new 'e' is inserted in the proper place, while in Damerau-Levenshtein, a single operation, swapping the 'e' and the 'x', suffices. Extrapolating out from this, using classic Levenshtein would mean that 'aex' is as far away from 'axe' as 'faxes' is; an example showing why Damerau-Levenshtein makes greater intuitive sense in most cases.
When dealing with fuzzy searches in particular, it is vital to understand that in elasticsearch text is first run through an analyzer before being made available for search. When data is indexed it is processed into what are known as 'terms', the actual searchable units in the database. It is the analyzed terms (what terms are is covered in Elasticsearch from the Bottom Up), not the actual stored documents that are searched. That means that when performing fuzzy queries, the query text may be compared to an unanticipated term value as a result of analysis, leading to sometimes confusing results. This also means that if synonyms are enabled on a field the synonyms may be matched, even if that word does not appear at all in the source text. For instance, if one were to use a fuzzy query over an ngram analyzed field, the results would likely be bizarre, as ngrams break words up into many small letter combinations, many of which are only an edit or two away, though the actual words involved are quite dissimilar. This also means that if using say, a snowball analyzer, a fuzzy search for 'running', will be stemmed to 'run', but will not match the misspelled word 'runninga', which stems to 'runninga', because 'run' is more than 2 edits away from 'runninga'. This can cause quite a bit of confusion, and for this reason, it often makes sense only to use the simple analyzer on text intended for use with fuzzy queries, possibly disabling synonyms as well. To clarify this, a diagram showing a fuzzy query running against a snowball analyzed document can be seen below.
The Different Types of Fuzzy Searches
Multiple types of fuzzy search are supported by elasticsearch and the differences can be confusing. The list below attempts to disambiguate these various types.
- match query + fuzziness option: Adding the fuzziness parameter to a match query turns a plain match query into a fuzzy one. Analyzes the query text before performing the search.
- fuzzy query: The elasticsearch fuzzy query type should generally be avoided. Acts much like a term query. Does not analyze the query text first.
- fuzzy_like_this/fuzzy_like_this_field: A more_like_this query, but supports fuzziness, and has a tuned scoring algorithm that better handles the characteristics of fuzzy matched results.*
- suggesters: Suggesters are not an actual query type, but rather a separate type of operation (internally built on top of fuzzy queries) that can be run either alongside a query, or independently. Suggesters are great for 'did you mean' style functionality.
A match query with the fuzziness parameter set is perhaps the most versatile of the fuzzy queries. The fuzzy query type supports the exact same behavior, except it does not allow for any analysis for the query text. Additionally, the fuzzy query type is a subset of the functionality of a match query makes it more confusing than useful.
The fuzzy_like_this, or FLT, queries are useful for providing recommendations based on a large piece of source text that may contain misspellings or other inaccuracies separated by edit distance. There are two different queries in this category fuzzy_like_this and fuzzy_like_this_field, which both provide the same functionality, the latter query simplifying the syntax for the case where only a single field is used. These queries take a parameter, like_text, consisting of a large piece of text, say the body of an article, and try to find documents 'like' that one. The text in the article is checked, and query terms are weighted based their frequency in the like_text with special tweaks (disabling the coord factor, source text based IDF) for combining the various terms from the source text. FLT queries work best for cases where the corpus contains a large number of misspellings, otherwise a standard more_like_this query will have better performance. More detail can be found on the official documentation for Fuzzy More Like This.
Suggesters are fantastic in the case where a user types in 'New Yrok', for a search, having intended to type in 'New York'. Applications can suggesters to present a 'did you mean' box in the UI recommending a potential correction. Beware that for the specific case of type-ahead search the completion API offers better performance given the latency sensitive nature of type-ahead searches. More information on suggesters can be found on the search suggesters page.
Performance Considerations
Even though Lucene's Levenshtein distance implementation is state of the art, and quite fast, it is still much slower than a plain match query. The runtime of the query grows with the number of unique terms in the index. That is to say, when performing a fuzzy search the main criteria is not how many documents will be returned, but how many unique terms across the cluster there are for the fields being searched. If there are 100 documents with 10,000 unique words apiece, searching that index will be slower than searching 10,000 documents where the field being searched only has 100 unique words.
The primary reason for this slowness is that a standard match query can quickly check the terms index, an internal datastructure used by Lucene, for a match and find documents extremely quickly using binary search. This process is fast even for large dictionaries since binary searches scale well. Fuzzy queries, on the other hand, use a more advanced algorithm involving a DFA which must process a large number of terms. Processing the much larger number of terms required for a fuzzy search is always slower than a simple binary search. As an example, running a fuzzy search on the English language Wikipedia dataset takes approximately 320ms for the term “history” given a min_similarity of 2, while a plain match search for the same term runs in a mere 35ms, an order of magnitude difference. Given a min_similarity of 1, the search time is only 150ms. It should be noted that these tests were not conducted in a statistically sound manner, and these times can be affected significantly by factors particular to individual applications. The Wikipedia dataset in particluar, has many more terms than a large number of common use cases.
Performance can be significantly improved by requiring that matches have exact prefix matches with the query. This dramatically shortens the search space at the expense of not finding words with a misspelling toward the front. The longer the prefix length, the faster the speedup. Specifying a prefix_length of 1 cuts the 320ms query mentioned above down to a mere 104ms. Increasing this number yielded further speed gains. Generally, it is advisable to require a prefix for datasets with large numbers of terms or performance will be poor, oftentimes in the 100s of milliseconds.
The max_expansions setting, which defines the maximum number of terms the fuzzy query will match before halting the search, can also have dramatic effects on the performance of a fuzzy query. Cutting down the query terms has a negative effect, however, in that some valid results may not be found due to early termination of the query. It is important to understand that the max_expansions query limit works at the shard level, meaning that even if set to 1, multiple terms may match, all coming from different shards. This behavior can make it seem as if max_expansions is not in effect, so beware that counting unique terms that come are returned is not a valid way to determine if max_expansions is working.
Alternatives to Fuzzy Matching
Fuzzy matching isn't always the right tool for the job, oftentimes imprecise matches can be found through other techniques. The phonetic analysis plugin contains a number of fascinating tools for approximating matches, such as the metaphone analyzer, which finds words that sound similar to other words. For instance, if what is required is making sure words like run and ran are both considered equivalent, a smart analyzer like a Snowball analyzer is preferable. Alternatively, for checking misspellings, N-gram analysis as described in this short tutorial can run quite a bit faster at query time, depending on the dataset. N-Grams do, however, come at the cost of additional storage / memory usage, slightly more index-time processing, and a long tail of false positives after good matches.
More Reading
- For a more in-depth look at the performance of Fuzzy Queries in elasticsearch, Lucene core contributor Mike McCandless has written a fascinating blog post on the topic. There is also a video, Finite State Automata in Lucene and Solr, of a presentation by Dawid Weiss that covers Levenshtein and other automata in Lucene.
- If you would like to learn more about Levenshtein Automata, Nick Johnson has a great post Damn Cool Algorithms: Levenshtein Automata.