Document Processing and Elasticsearch
UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.
An Overview
Raw documents from the source database or server may need some extra processing before being indexed in Elasticsearch. In this article, we consider a few different options for this processing.
Introduction
When working with a full-text search engine such as Elasticsearch, you are likely, at some point, to need to transform the incoming data before it’s indexed. This transformation is often called “Document Processing”, and it is an important piece of functionality for many applications.
Document processing can range from simply tagging documents with a pre-configured attribute, through dynamically calculating one or more attributes based on document contents, to complete document rewrites, both structural and content-wise.
Elasticsearch allows for several different ways to process documents (or document contents), and which method to use depends highly on the application and system requirements. In the following sections we’ll look at three specific approaches: in the mapping, as a custom plugin, or using an external service.
What is Document Processing?
Document processing is distinct from text processing and happens as a separate, earlier step. Where text processing works on individual fields, document processing is usually scoped to an entire document containing multiple fields.
Usually, the text processing is performed on the resulting field data after the document has been transformed. While it’s possible to tag data as already analyzed and thus skip the usual text processing, the details of this are outside the scope of this article.
If you’re already using Elasticsearch, chances are that you’re doing some kind of document processing already, as the initial creation of the source documents on your application servers can certainly be considered a simple form of document processing. The initial fields are seeded, and often this includes some de-normalization: contextual information or metadata about the current document is pulled in from elsewhere.
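To make this concrete, here is a minimal sketch of what such seeding and de-normalization might look like in application code. This is a hypothetical Python example; the record shape, field names and build_document helper are all made up for illustration:

```python
from datetime import datetime, timezone

def build_document(record, author):
    """Seed the initial fields and de-normalize some context."""
    return {
        "title": record["title"],
        "body": record["body"],
        # De-normalized metadata: copied in from a related record so the
        # document is self-contained at search time.
        "author_name": author["name"],
        # Contextual metadata added at creation time.
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source": "crm-database",  # a pre-configured tag
    }
```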
Transform Field Using Script
A relatively unknown feature of the Elasticsearch mapping is the transform field, which runs a small script on the source document before indexing it. This ties into the scripting module, so scripts can either be provided inline, as part of the mapping, or pre-compiled and loaded when Elasticsearch initially bootstraps.
There are some limitations to be aware of: the transform field must be configured when the index is initially created, and it’s not possible to change or add transformed fields afterwards. This is different from most other parts of the mapping, which usually allow new fields to be added to an existing index – part of the “schemaless” feature of Elasticsearch.
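To illustrate, here is a minimal sketch of configuring a transform script at index creation time, using the Python elasticsearch client. It assumes Elasticsearch 1.x (where the transform feature is available) with Groovy scripting enabled; the index, type and field names are made up:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# The transform script runs against ctx._source for each incoming
# document, before the resulting fields go through text processing.
es.indices.create(
    index="articles",
    body={
        "mappings": {
            "article": {
                "transform": {
                    "lang": "groovy",
                    "script": (
                        "ctx._source['title_length'] = "
                        "ctx._source['title']?.length()"
                    ),
                },
                "properties": {
                    "title": {"type": "string"},
                    "title_length": {"type": "integer"},
                },
            }
        }
    },
)
```

Note that the transform only affects what is indexed; the stored _source remains the document as it was originally sent.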
Custom Plugin
If the use case calls for more elaborate processing than the transform field allows for, such as asynchronous processing or enrichment of the data, you may want to consider writing a custom plugin.
If the processing is simple enough, you can get by with a plugin that hooks into the text processing of Elasticsearch by providing a custom analyzer. However, that analyzer has to work within the confines of the Lucene text processing system and the environment it runs in: it is executed synchronously from an indexing thread, so it sits in a path that is likely to be highly performance critical.
Not everything can be done in the context of an analyzer, though. The most common example is processing that must – or at least should – be performed asynchronously. This may include using other (non-local) services or I/O to provide context, for example consulting a separate database or service in order to add fields that are relevant to the document as part of the processing.
For these cases, the processing has to be done before the document is actually handed off to Elasticsearch for indexing. It may still be accomplished within the Elasticsearch process by writing a custom plugin. For example, we can write a plugin that adds a custom endpoint to Elasticsearch that pre-processes the incoming data before using the internal Elasticsearch Client interface to perform the actual indexing. At that level, doing asynchronous processing and leveraging other services is rather trivial.
Doing document processing within Elasticsearch ties the scaling of the document processing to the scaling of Elasticsearch. The workloads of these two parts may differ a lot, so it may be beneficial to extract the document processing into an external service. This also makes it easier to update the document processing pipeline without having to make any changes to your Elasticsearch cluster.

This is the main reason why we don’t recommend relying on plugins such as mapper-attachments in production. While it’s great for testing and getting started – it takes almost no time to set up and enables indexing a whole host of file formats – it does not necessarily make for an efficient use of the available resources. It also ties up memory on index-specific work rather than on speeding up searches, which is probably what you’d prefer.
External Systems and Pipelines
So, the preferred way to go – with respect to both flexibility and scalability – is to perform the document processing outside of Elasticsearch. This decouples both the scaling and the upgrading of Elasticsearch from the document processing. Elasticsearch and Lucene are well known to benefit from ample resources, and having to compete with other processes or services on the same node for those resources is counterproductive.
There are a ton of external systems that can perform different transformations. One such system is Logstash, Elasticsearch’s official log processing tool, which is intended to pre-process logs, converting them from unstructured, flat log lines into better-structured JSON documents suitable for indexing.
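As a rough illustration of the kind of transformation Logstash performs, here is a sketch in Python that turns a flat, Apache-style log line into a structured document ready for indexing. The pattern and field names are simplified assumptions, nowhere near Logstash’s full grok support:

```python
import json
import re

# A simplified pattern for an Apache "common log format" line.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_log_line(line):
    """Convert an unstructured log line into an indexable document."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # a real pipeline would route these to a dead-letter queue
    doc = match.groupdict()
    doc["status"] = int(doc["status"])
    doc["bytes"] = int(doc["bytes"])
    return doc

line = ('127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] '
        '"GET /index.html HTTP/1.1" 200 2326')
print(json.dumps(parse_log_line(line), indent=2))
```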
For more complex document pre-processing, it’s possible to either use Hadoop, Spark, Storm and similar systems, or to hook up a completely custom document preprocessing system using some kind of message broker, such as RabbitMQ. In most of these cases, the final step of actually shipping the finished documents to Elasticsearch for indexing is left as an exercise for the reader. To do this quickly and reliably, a lot of time (and hair) can be saved by using a message broker that already has official support from both Elasticsearch and the document processing system. One such example is the RabbitMQ Elasticsearch plugin.
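As a sketch of what that glue might look like, here is a small consumer that reads fully processed documents off a RabbitMQ queue and indexes them, using the pika and elasticsearch Python clients. The queue name, index name and document shape are assumptions, and pika’s consume signature has changed between major versions (this uses the older 0.x style):

```python
import json

import pika
from elasticsearch import Elasticsearch

es = Elasticsearch()
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="processed-documents", durable=True)

def handle_message(ch, method, properties, body):
    """Index one fully processed document, then acknowledge it."""
    doc = json.loads(body)
    es.index(index="documents", doc_type="document", body=doc)
    # Only ack after a successful index, so failed messages are redelivered.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(handle_message, queue="processed-documents")
channel.start_consuming()
```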
Using these external systems, it’s feasible to run tools like docsplit on one or more worker nodes to perform OCR, generate thumbnails and so on. Given the number of system dependencies and the resource usage of such tools, they’re not a good fit for the Elasticsearch server itself, but it’s rather easy to package them in Docker containers and integrate those with your document processing system or service of choice. This gives us a superset of the features of the mapper-attachments plugin, with the flexibility and scalability of an external system.
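As an example of what such a worker step might look like, here is a sketch that shells out to the docsplit command line to extract text and generate page thumbnails from a PDF. The paths are made up, and it assumes docsplit and its dependencies (including an OCR engine) are installed in the worker’s container:

```python
import subprocess

def extract_with_docsplit(pdf_path, output_dir):
    """Extract text and page thumbnails from a PDF using docsplit."""
    # Extract the text of each page; docsplit falls back to OCR for
    # pages without an embedded text layer.
    subprocess.check_call(
        ["docsplit", "text", pdf_path, "--output", output_dir]
    )
    # Render a small PNG thumbnail of each page.
    subprocess.check_call(
        ["docsplit", "images", pdf_path,
         "--size", "120x160", "--format", "png", "--output", output_dir]
    )

extract_with_docsplit("/data/incoming/report.pdf", "/data/processed/report")
```

The resulting text and thumbnail paths can then be added as fields on the document before it is shipped to Elasticsearch.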
Summary
Document processing makes it possible to enhance the documents being indexed into Elasticsearch in a variety of ways.
There’s no “ultimate” solution that handles all document processing requirements for everyone, but that doesn’t mean document processing is an unsolved problem. Instead, there are multiple solutions to pick from depending on requirements and pre-existing services and expertise within the organization.
For small-scale transformations, the transform field may be just fine, and in some circumstances doing the necessary processing in a plugin – whether hooking into the Elasticsearch analysis service or adding a custom input endpoint – may be a simple, workable solution. More elaborate document processing calls for an external system to perform the task. While it is more complex to set up and manage, it is significantly more flexible with regard to scaling and to managing updates and changes to the processing pipeline.