MinHash Token Filter

edit

The min_hash token filter hashes each token of the token stream and divides the resulting hashes into buckets, keeping the lowest-valued hashes per bucket. It then returns these hashes as tokens.

The following are settings that can be set for a min_hash token filter.

Setting Description

hash_count

The number of hashes to hash the token stream with. Defaults to 1.

bucket_count

The number of buckets to divide the minhashes into. Defaults to 512.

hash_set_size

The number of minhashes to keep per bucket. Defaults to 1.

with_rotation

Whether or not to fill empty buckets with the value of the first non-empty bucket to its circular right. Only takes effect if hash_set_size is equal to one. Defaults to true if bucket_count is greater than one, else false.

Some points to consider while setting up a min_hash filter:

  • min_hash filter input tokens should typically be k-words shingles produced from shingle token filter. You should choose k large enough so that the probability of any given shingle occurring in a document is low. At the same time, as internally each shingle is hashed into to 128-bit hash, you should choose k small enough so that all possible different k-words shingles can be hashed to 128-bit hash with minimal collision.
  • choosing the right settings for hash_count, bucket_count and hash_set_size needs some experimentation.

    • to improve the precision, you should increase bucket_count or hash_set_size. Higher values of bucket_count or hash_set_size will provide a higher guarantee that different tokens are indexed to different buckets.
    • to improve the recall, you should increase hash_count parameter. For example, setting hash_count=2, will make each token to be hashed in two different ways, thus increasing the number of potential candidates for search.
  • the default settings makes the min_hash filter to produce for each document 512 min_hash tokens, each is of size 16 bytes. Thus, each document’s size will be increased by around 8Kb.
  • min_hash filter is used to hash for Jaccard similarity. This means that it doesn’t matter how many times a document contains a certain token, only that if it contains it or not.

Theory

edit

MinHash token filter allows you to hash documents for similarity search. Similarity search, or nearest neighbor search is a complex problem. A naive solution requires an exhaustive pairwise comparison between a query document and every document in an index. This is a prohibitive operation if the index is large. A number of approximate nearest neighbor search solutions have been developed to make similarity search more practical and computationally feasible. One of these solutions involves hashing of documents.

Documents are hashed in a way that similar documents are more likely to produce the same hash code and are put into the same hash bucket, while dissimilar documents are more likely to be hashed into different hash buckets. This type of hashing is known as locality sensitive hashing (LSH).

Depending on what constitutes the similarity between documents, various LSH functions have been proposed. For Jaccard similarity, a popular LSH function is MinHash. A general idea of the way MinHash produces a signature for a document is by applying a random permutation over the whole index vocabulary (random numbering for the vocabulary), and recording the minimum value for this permutation for the document (the minimum number for a vocabulary word that is present in the document). The permutations are run several times; combining the minimum values for all of them will constitute a signature for the document.

In practice, instead of random permutations, a number of hash functions are chosen. A hash function calculates a hash code for each of a document’s tokens and chooses the minimum hash code among them. The minimum hash codes from all hash functions are combined to form a signature for the document.

Example of setting MinHash Token Filter in Elasticsearch

edit

Here is an example of setting up a min_hash filter:

POST /index1
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": { 
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,   
          "bucket_count": 512, 
          "hash_set_size": 1, 
          "with_rotation": true 
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_shingle_filter",
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fingerprint": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

setting a shingle filter with 5-word shingles

setting min_hash filter to hash with 1 hash

setting min_hash filter to hash tokens into 512 buckets

setting min_hash filter to keep only a single smallest hash in each bucket

setting min_hash filter to fill empty buckets with values from neighboring buckets