Phrase synonyms like a boss with the synonyms API

Learn how to use phrase synonyms with the synonyms API in real scenarios.

Synonyms have been a core Elasticsearch functionality since forever and you can wield them to your advantage to get great search results. People usually think that a synonym is just a pair of words equal to each other, but is that all there is?

The new synonyms API allows you to create and update synonyms quickly and easily, and the synonym_graph filter lets you handle multi-word synonyms smoothly. In this article, we’ll explore different ways to configure synonyms and use them to solve a common yet tricky problem.

Index

  1. The AI problem
  2. Creating synonyms with the synonyms API
  3. Testing it out
  4. Phrase synonyms support
  5. Expansion synonyms

The AI problem

Here’s the situation: We’re in the AI boom and you want your documents related to this technology to be the first search results people get. Picture a system with AI docs together with business intelligence docs. It has AI articles but also Adobe Illustrator (AI) articles. Let’s see how synonyms can help us create a user experience that meets today’s requirements.

Creating synonyms with the synonyms API

The new synonyms API allows you to create synonyms without uploading files or running additional commands on the nodes to update them. The file-based approach often causes issues when the files are not consistent across nodes or when you are working with Elastic Serverless.

Let’s begin by creating the synonyms:

PUT _synonyms/my-synonyms-set
{
  "synonyms_set": [
    {
      "synonyms": "AI, Artificial Intelligence"
    }
  ]
}

It’s very important to create the synonym set before the analyzer that’s going to use it.
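
One way to confirm the set is in place before moving on is to read it back. This is an optional check rather than a required step. Because we did not provide a rule id, the id shown in the response is generated by Elasticsearch.

```console
# Optional check: read back the synonyms set created above.
GET _synonyms/my-synonyms-set
```

If the set does not exist, this request should come back with a not-found error instead of the list of rules.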

Now, let’s configure our index to use synonyms. To have more flexible queries, we’ll create one field with synonyms and another one without. So title won’t have synonyms, while title.synonyms will have them. The request is below:

PUT /synonyms-index
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms_filter": {
          "type": "synonym",
          "synonyms_set": "my-synonyms-set",
          "updateable": true
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "synonyms_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "synonyms": {
            "type": "text",
            "analyzer": "standard",
            "search_analyzer": "my_search_analyzer"
          }
        }
      }
    }
  }
}

Note that we use the synonyms in the field’s search analyzer and not in its index analyzer. The synonyms are therefore not stored in the index but generated with each query, trading some query-time performance for greater flexibility. Applying them at search time instead of indexing them lets you update synonyms with the new API without reindexing, and uses less disk space.

Testing it out

Let’s add some documents:

POST _bulk
{ "index" : { "_index" : "synonyms-index", "_id" : "1" } }
{ "title" : "Adobe Illustrator (AI) tutorial" }
{ "index" : { "_index" : "synonyms-index", "_id" : "2" } }
{ "title" : "Artificial Intelligence from zero to hero: The best techniques to master machine learning algorithms." }
{ "index" : { "_index" : "synonyms-index", "_id" : "3" } }
{ "title" : "Business Intelligence: Course for young professionals" }

Our star document is #2. It talks about AI and is the one we want to promote among our users.

Now, let’s start our search without using synonyms:

GET synonyms-index/_search 
{
  "query": {
    "match": {
      "title": "AI"
    }
  }
}

As expected, we get the Adobe Illustrator course in the results:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.2330425,
    "hits": [
      {
        "_index": "synonyms-index",
        "_id": "1",
        "_score": 1.2330425,
        "_source": {
          "title": "Adobe Illustrator (AI) tutorial"
        }
      }
    ]
  }
}

What if we now try using our field with synonyms?

GET synonyms-index/_search 
{
  "query": {
    "match": {
      "title.synonyms": "AI"
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1.2330425,
    "hits": [
      {
        "_index": "synonyms-index",
        "_id": "1",
        "_score": 1.2330425,
        "_source": {
          "title": "Adobe Illustrator (AI) tutorial"
        }
      },
      {
        "_index": "synonyms-index",
        "_id": "2",
        "_score": 1.1102026,
        "_source": {
          "title": "Artificial Intelligence from zero to hero: The best techniques to master machine learning algorithms."
        }
      },
      {
        "_index": "synonyms-index",
        "_id": "3",
        "_score": 0.52354836,
        "_source": {
          "title": "Business Intelligence: Course for young professionals"
        }
      }
    ]
  }
}

This is better, but what is the Business Intelligence document doing here?

Phrase synonyms support

In the previous example we used the synonym token filter, which does not handle multi-word (phrase) synonyms correctly. This is why Business Intelligence matched AI: artificial and intelligence were treated as independent tokens rather than as a phrase, so any title containing intelligence was a hit. Let’s fix this!
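
Before fixing it, you could see the flattening for yourself by running the search analyzer over the query text with the _analyze API while the filter is still of type synonym. This is an optional check. The expectation, under that assumption, is that artificial and intelligence come back as separate tokens with nothing tying them together as a phrase.

```console
# Optional check, assuming synonyms_filter is still of type "synonym".
POST synonyms-index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "AI"
}
```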

It’s an easy fix. We need to use synonym_graph, a variant of the filter that handles multi-word synonyms. It only works as part of a search analyzer, but applying synonyms in the search phase can be advantageous compared to applying them while indexing.

We can update the search analyzer without reindexing data by running the sequence below:

Closing the index:

POST /synonyms-index/_close

Editing the settings. Note that the filter type is now synonym_graph instead of synonym:

PUT /synonyms-index/_settings
{
  "analysis": {
    "filter": {
      "synonyms_filter": {
        "type": "synonym_graph",
        "synonyms_set": "my-synonyms-set",
        "updateable": true
      }
    },
    "analyzer": {
      "my_search_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "synonyms_filter"
        ]
      }
    }
  }
}

Let’s open the index again:

POST /synonyms-index/_open

And now let’s run the search:

GET synonyms-index/_search 
{
  "query": {
    "match": {
      "title.synonyms": "AI"
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.2330425,
    "hits": [
      {
        "_index": "synonyms-index",
        "_id": "1",
        "_score": 1.2330425,
        "_source": {
          "title": "Adobe Illustrator (AI) tutorial"
        }
      },
      {
        "_index": "synonyms-index",
        "_id": "2",
        "_score": 1.1102026,
        "_source": {
          "title": "Artificial Intelligence from zero to hero: The best techniques to master machine learning algorithms."
        }
      }
    ]
  }
}

Perfect! The Business Intelligence article is no longer there.

What happens if we explicitly search for artificial intelligence?

GET synonyms-index/_search 
{
  "query": {
    "match": {
      "title.synonyms": "artificial intelligence"
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.2330425,
    "hits": [
      {
        "_index": "synonyms-index",
        "_id": "1",
        "_score": 1.2330425,
        "_source": {
          "title": "Adobe Illustrator (AI) tutorial"
        }
      },
      {
        "_index": "synonyms-index",
        "_id": "2",
        "_score": 1.1102026,
        "_source": {
          "title": "Artificial Intelligence from zero to hero: The best techniques to master machine learning algorithms."
        }
      }
    ]
  }
}

Adobe Illustrator?

So what happened now? This is what I was talking about at the beginning of the article when I said we need to challenge the notion that synonyms are just two equivalent words. We also need to take into account the directionality of the expansion.

By default, if we say AI, Artificial Intelligence, it implies two things:

  1. AI is the same as Artificial Intelligence
  2. Artificial Intelligence is the same as AI

Number 2 is not true in this case. Adobe Illustrator is definitely not the same as Artificial Intelligence.

To corroborate this, we can use the _analyze API to see how our search terms are being transformed:

POST synonyms-index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "artificial intelligence"
}

{
  "tokens": [
    {
      "token": "ai",
      "start_offset": 0,
      "end_offset": 23,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "artificial",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "intelligence",
      "start_offset": 11,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

As you can see, we’re generating an ai token, which is present in the Adobe Illustrator document, thus creating an unwanted match.

Expansion synonyms

How do we finally fix it, then? We need to use one-directional synonym rules (written with =>), where AI is the same as Artificial Intelligence but Artificial Intelligence is NOT the same as AI.

Thanks to the synonyms API this is very straightforward. We can make a PUT call to the existing synonym set to update it:

PUT _synonyms/my-synonyms-set
{
  "synonyms_set": [
    {
      "synonyms": "AI => Artificial Intelligence"
    }
  ]
}

This change replaces any AI in the search query with Artificial Intelligence. This way, the Adobe Illustrator document will NOT show up, even when the user searches for AI. If we did want it to show up, we could write the synonym rule as "AI => AI, Artificial Intelligence".
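
As a sketch of that alternative, the rule could be sent with an explicit id so that it can later be updated on its own. The id ai-expansion is a made-up name for illustration. The rest of this article assumes the AI => Artificial Intelligence rule, so treat the first request as an optional variation; the second request puts the one-directional rule back.

```console
# Optional variation: keep "ai" in the query and add the phrase as well.
# "ai-expansion" is a hypothetical rule id chosen for this example.
PUT _synonyms/my-synonyms-set
{
  "synonyms_set": [
    {
      "id": "ai-expansion",
      "synonyms": "AI => AI, Artificial Intelligence"
    }
  ]
}

# With an id in place, a single rule can be replaced without resending the whole set.
PUT _synonyms/my-synonyms-set/ai-expansion
{
  "synonyms": "AI => Artificial Intelligence"
}
```

Keep in mind that a PUT on the whole set replaces every rule in it, while a PUT on a single rule id only touches that rule.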

Let’s analyze again:

POST synonyms-index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "artificial intelligence"
}

{
  "tokens": [
    {
      "token": "artificial",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "intelligence",
      "start_offset": 11,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

The ai token is gone.
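
The opposite direction can be checked from the query side as well. Assuming the AI => Artificial Intelligence rule is the one currently in the set, the validate API with explain=true should show a search for AI being rewritten into a phrase query on title.synonyms rather than a lookup of the ai term.

```console
# Ask Elasticsearch to explain how the match query is rewritten at search time.
# Assumes the set currently holds the "AI => Artificial Intelligence" rule.
GET synonyms-index/_validate/query?explain=true
{
  "query": {
    "match": {
      "title.synonyms": "AI"
    }
  }
}
```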

Now, let’s check if search works as expected:

GET synonyms-index/_search 
{
  "query": {
    "match": {
      "title.synonyms": "artificial intelligence"
    }
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.1102026,
    "hits": [
      {
        "_index": "synonyms-index",
        "_id": "2",
        "_score": 1.1102026,
        "_source": {
          "title": "Artificial Intelligence from zero to hero: The best techniques to master machine learning algorithms."
        }
      },
      {
        "_index": "synonyms-index",
        "_id": "3",
        "_score": 0.52354836,
        "_source": {
          "title": "Business Intelligence: Course for young professionals"
        }
      }
    ]
  }
}

Business Intelligence is there again! But this time, for a different reason. The default match query operator is OR, and both documents include the word intelligence. If we want to run a stricter search and make sure all keywords are present, we can use match_phrase or change the operator parameter to AND:

GET synonyms-index/_search 
{
  "query": {
    "match": {
      "title.synonyms": {
        "query": "artificial intelligence",
        "operator": "AND"
      }
    }
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.1102026,
    "hits": [
      {
        "_index": "synonyms-index",
        "_id": "2",
        "_score": 1.1102026,
        "_source": {
          "title": "Artificial Intelligence from zero to hero: The best techniques to master machine learning algorithms."
        }
      }
    ]
  }
}
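
The other option mentioned above, match_phrase, could look like the sketch below. Unlike the AND operator, it also requires the words to appear next to each other and in order. With our three documents it should return only document 2.

```console
# Stricter alternative to "operator": "AND": the terms must be adjacent and in order.
GET synonyms-index/_search
{
  "query": {
    "match_phrase": {
      "title.synonyms": "artificial intelligence"
    }
  }
}
```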

Now finally, no matter how people search for it, our AI document will be the star of the results.

To finish, let’s clean up the index and the synonyms set we created:

DELETE synonyms-index
DELETE _synonyms/my-synonyms-set

You can read more about the synonyms API in this article.

Conclusion

Synonyms are a very powerful tool to customize the search experience, so it is crucial to understand the available configurations to get the results that you want. The synonyms API allows you to create and update synonyms quickly and easily, and the synonym_graph filter lets you handle multi-word synonyms smoothly.
