The Future of Attachments for Elasticsearch and .NET
For a long time, Elasticsearch has supported indexing attachments through the mapper-attachments plugin. Installing this plugin provides the capability to index Word documents, PDFs and many other text-based document attachments, extracting the content from each file, along with metadata such as content type, author and keywords, and making it all searchable in Elasticsearch.
The Not Too Distant Past
Working with the mapper-attachments plugin has historically been a slightly awkward affair with NEST, the official high level .NET Elasticsearch client. From NEST 2.3.3 onwards, we've introduced an Attachment type to make working with attachments a much smoother experience. In this post, we'll walk through some typical use cases for the plugin and the Attachment type with Elasticsearch 5.0, and provide an introduction to the ingest-attachment processor plugin, one of the processors available in the suite of processors for the ingest node. Since the mapper-attachments plugin is deprecated in 5.0 and will be removed in 6.0, the ingest-attachment processor plugin is the recommended way to index attachments in Elasticsearch 5.0.
Installation
To get started with indexing attachments, the first step is to install the mapper-attachments plugin. For the purposes of this post, I'm going to use Elasticsearch 5.0. As with all plugins in Elasticsearch, installation is handled by calling the elasticsearch-plugin.bat script (elasticsearch-plugin on Linux and macOS) within the Elasticsearch bin directory

elasticsearch-plugin.bat install mapper-attachments
After successfully installing the plugin, it will be available to use when the node is started, or shut down and restarted. If you're using the plugin with a version of Elasticsearch prior to 2.2, a specific version of the mapper-attachments plugin is needed; consult the legacy documentation to find which version to install for your environment.
Document Definition and Mapping
Once the plugin is installed and our node is running, we're all ready to index our first attachment. To keep things simple, we'll use a simple Word document saved in .docx format whose content contains the following

The Present and Future of Attachments

This is a sample document to demonstrate indexing attachments using NEST and the new Attachment type

Our document Plain Old CLR Object (POCO) type looks like the following
public class Document
{
    public int Id { get; set; }
    public string Path { get; set; }
    public Attachment Attachment { get; set; }
}
It contains an id to uniquely identify the document, a path specifying where the original file is located on a file share and finally, the attachment that will be indexed.
Now that we have a POCO type definition for the document, let's create an index and a mapping for it. For working with Elasticsearch 5.0 from .NET, we can use the 5.x release candidate of NEST
var documentsIndex = "documents";

var connectionSettings = new ConnectionSettings()
    .InferMappingFor<Document>(m => m
        .IndexName(documentsIndex)
    );

var client = new ElasticClient(connectionSettings);

var indexResponse = client.CreateIndex(documentsIndex, c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(ad => ad
                .Custom("windows_path_hierarchy_analyzer", ca => ca
                    .Tokenizer("windows_path_hierarchy_tokenizer")
                )
            )
            .Tokenizers(t => t
                .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
                    .Delimiter('\\')
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mp => mp
            .AutoMap()
            .AllField(all => all
                .Enabled(false)
            )
            .Properties(ps => ps
                .Text(s => s
                    .Name(n => n.Path)
                    .Analyzer("windows_path_hierarchy_analyzer")
                )
                .Attachment(a => a
                    .Name(n => n.Attachment)
                    .NameField(nf => nf
                        .Name(n => n.Attachment.Name)
                        .Store()
                    )
                    .FileField(ff => ff
                        .Name(n => n.Attachment.Content)
                        .Store()
                    )
                    .ContentTypeField(ct => ct
                        .Name(n => n.Attachment.ContentType)
                        .Store()
                    )
                    .ContentLengthField(clf => clf
                        .Name(n => n.Attachment.ContentLength)
                        .Store()
                    )
                    .DateField(df => df
                        .Name(n => n.Attachment.Date)
                        .Store()
                    )
                    .AuthorField(af => af
                        .Name(n => n.Attachment.Author)
                        .Store()
                    )
                    .TitleField(tf => tf
                        .Name(n => n.Attachment.Title)
                        .Store()
                    )
                    .KeywordsField(kf => kf
                        .Name(n => n.Attachment.Keywords)
                        .Store()
                    )
                )
            )
        )
    )
);
The connection settings use a neat feature of the NEST client that allows a POCO type to be associated with a particular index name; that is, when a Document type is specified as the generic type to be indexed, searched, etc., NEST will use the inferred index name specified on the connection settings if no index name is specified on the request. Document type names can also be inferred in this way if you want to use a different type name to the camel cased name that NEST infers from the POCO name by default.
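As a sketch of overriding that inferred type name (the "doc" name here is just an illustrative choice, not something used elsewhere in this post), the connection settings can specify it alongside the index name:

```csharp
var connectionSettings = new ConnectionSettings()
    .InferMappingFor<Document>(m => m
        .IndexName("documents")
        // index into the "doc" type rather than the inferred "document" type
        .TypeName("doc")
    );
```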
The mapping for the Document type defines a custom analyzer for the Path property that uses the path_hierarchy tokenizer to provide search across path hierarchies. Since this example is running on Windows, the tokenizer uses the \ character as the path delimiter. Additionally, the _all field has been disabled within the mapping as it is not needed in our example. Finally, the metadata fields that we are interested in are mapped for the attachment type.
After the create index request is executed, the index will be created. The mapping for the Document type can be inspected with the following
var mappingResponse = client.GetMapping<Document>();
This returns the following, demonstrating that the mapping has been created as expected
{
  "documents" : {
    "mappings" : {
      "document" : {
        "_all" : {
          "enabled" : false
        },
        "properties" : {
          "attachment" : {
            "type" : "attachment",
            "fields" : {
              "content" : { "type" : "text", "store" : true },
              "author" : { "type" : "text", "store" : true },
              "title" : { "type" : "text", "store" : true },
              "name" : { "type" : "text", "store" : true },
              "date" : { "type" : "date", "store" : true },
              "keywords" : { "type" : "text", "store" : true },
              "content_type" : { "type" : "text", "store" : true },
              "content_length" : { "type" : "float", "store" : true },
              "language" : { "type" : "text" }
            }
          },
          "id" : { "type" : "integer" },
          "path" : {
            "type" : "text",
            "analyzer" : "windows_path_hierarchy_analyzer"
          }
        }
      }
    }
  }
}
Indexing and Searching our first Attachment
Now that the index and mapping are in place, it's time to index the attachment
var directory = Directory.GetCurrentDirectory();
var base64File = Convert.ToBase64String(File.ReadAllBytes(Path.Combine(directory, "example_one.docx")));

client.Index(new Document
{
    Id = 1,
    Path = @"\\share\documents\examples\example_one.docx",
    Attachment = new Attachment
    {
        Content = base64File
    }
});
This is equivalent to the following curl request
curl -XPUT "http://localhost:9200/documents/document/1" -d'
{
  "id": 1,
  "path": "\\\\share\\documents\\examples\\example_one.docx",
  "attachment": "... base64 encoded attachment ..."
}'
Once indexed, searching the content of the attachment is a straightforward affair
var searchResponse = client.Search<Document>(s => s
    .Query(q => q
        .Match(m => m
            .Field(a => a.Attachment.Content)
            .Query("NEST")
        )
    )
);
Using NEST, a document field within Elasticsearch can be referenced using a member access lambda expression against the respective POCO type property name. The search result returned for the query is as follows
{
  "took" : 31,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2568969,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "document",
        "_id" : "1",
        "_score" : 0.2568969,
        "_source" : {
          "id" : 1,
          "path" : "\\\\share\\documents\\examples\\example_one.docx",
          "attachment" : "... base64 encoded attachment ..."
        }
      }
    ]
  }
}
and a Document instance constructed from the _source can be accessed via searchResponse.Documents.First().
We can also search on metadata extracted from the attachment
searchResponse = client.Search<Document>(s => s
    .StoredFields(f => f
        .Field(d => d.Attachment.Content)
        .Field(d => d.Attachment.ContentType)
        .Field(d => d.Attachment.ContentLength)
        .Field(d => d.Attachment.Author)
        .Field(d => d.Attachment.Title)
        .Field(d => d.Attachment.Date)
    )
    .Query(q => q
        .Match(m => m
            .Field(a => a.Attachment.ContentType)
            .Query("application")
        )
    )
);
Since all of the fields in the attachment have store enabled in the mapping, the extracted metadata field values can be returned as above using the stored_fields parameter on the search request. Mapping the content field with store enabled is useful in scenarios where you want to retrieve the extracted content or perform highlighting on it.
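To illustrate the highlighting case, a minimal sketch (assuming the same client and mapping as above) combines the match query from earlier with a highlight request on the stored content field:

```csharp
var highlightResponse = client.Search<Document>(s => s
    .Query(q => q
        .Match(m => m
            .Field(d => d.Attachment.Content)
            .Query("NEST")
        )
    )
    // build highlight fragments from the extracted attachment content
    .Highlight(h => h
        .Fields(f => f
            .Field(d => d.Attachment.Content)
        )
    )
);
```

Each hit in the response then carries highlight fragments for attachment.content with the matched terms wrapped in emphasis tags.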
The search result for the previous query is as follows
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25316024,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "document",
        "_id" : "1",
        "_score" : 0.25316024,
        "fields" : {
          "attachment.date" : [ "2016-08-30T05:50:00.000Z" ],
          "attachment.content" : [ "The Present and Future of Attachments\n\nThis is a sample document to demonstrate indexing attachments using NEST and the new Attachment type\n\n" ],
          "attachment.content_type" : [ "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ],
          "attachment.author" : [ "Russ Cam" ],
          "attachment.content_length" : [ 11726.0 ]
        }
      }
    ]
  }
}
And it is possible to construct Attachment instances from the values in the hit fields using
var attachments = searchResponse.Hits.Select(h =>
    new Attachment
    {
        Author = h.Fields.ValueOf<Document, string>(d => d.Attachment.Author),
        Content = h.Fields.ValueOf<Document, string>(d => d.Attachment.Content),
        ContentLength = h.Fields.ValueOf<Document, long?>(d => d.Attachment.ContentLength),
        ContentType = h.Fields.ValueOf<Document, string>(d => d.Attachment.ContentType),
        Date = h.Fields.ValueOf<Document, DateTime?>(d => d.Attachment.Date),
        Title = h.Fields.ValueOf<Document, string>(d => d.Attachment.Title)
    }
);
The Attachment type within NEST takes care of accessing the correct values from the fields property in the response, based on member access lambda expressions on the properties of the Attachment type.
Explicit Metadata fields
The mapper-attachments plugin also allows explicit metadata fields to be sent at index time, along with the base64 encoded attachment. This can be useful in cases where you don't want to rely on a metadata value extracted from the attachment. For example, we may be indexing Microsoft Word documents in both the older .doc and newer .docx formats and wish to explicitly control the content type for the latter to align it with the content type of the former. The NEST Attachment type can handle this for us
client.Index(new Document
{
    Id = 1,
    Path = @"\\share\documents\examples\example_one.docx",
    Attachment = new Attachment
    {
        Content = base64File,
        ContentType = "application/msword"
    }
});
Search can then be performed on content type
var searchResponse = client.Search<Document>(s => s
    .Query(q => q
        .Match(m => m
            .Field(a => a.Attachment.ContentType)
            .Query("msword")
        )
    )
);
The document _source with explicit metadata fields now contains two properties, _content and _content_type, the names used when explicit metadata fields are sent for content and content type, respectively.
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25811607,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "document",
        "_id" : "1",
        "_score" : 0.25811607,
        "_source" : {
          "id" : 1,
          "path" : "\\\\share\\documents\\examples\\example_one.docx",
          "attachment" : {
            "_content" : "... base64 encoded attachment ...",
            "_content_type" : "application/msword"
          }
        }
      }
    ]
  }
}
Again, the Attachment type takes care of deserializing the source of an attachment into an Attachment instance, with properties set to those explicitly specified in the source.
Gotchas
Whilst the mapper-attachments plugin works well, there are some gotchas to be aware of. For example, if no attachment fields are mapped with store enabled and a document is indexed with only the base64 encoded attachment sent, then a search query that requests metadata fields using the stored_fields parameter will return the original base64 encoded attachment as the value for each requested field. The original base64 encoded content can be excluded from the _source document using the source excludes feature to mitigate this.
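A sketch of that mitigation, using source excludes in the mapping to drop the original base64 encoded attachment value from _source (field names follow the earlier example; the rest of the mapping is elided):

```json
{
  "mappings": {
    "document": {
      "_source": {
        "excludes": [ "attachment" ]
      }
    }
  }
}
```

With this in place, the stored attachment metadata fields are still retrievable via stored_fields, but the base64 payload is no longer persisted in or returned with _source.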
The Future
As previously mentioned, the mapper-attachments plugin is deprecated in Elasticsearch 5.0 and will be removed in 6.0. But fear not, for the future is bright! The ingest-attachment processor plugin, part of the ingest node in Elasticsearch 5.0, replaces the mapper-attachments plugin, providing a more predictable experience than its predecessor. Since the extraction process now happens before the document is indexed within an index request, the extracted metadata fields are stored in the _source field and returned with the rest of the source document in a search request. Let's see an example.
Installation
Similarly to the mapper-attachments plugin, installation of the ingest-attachment plugin is handled by calling the elasticsearch-plugin.bat script within the Elasticsearch bin directory

elasticsearch-plugin.bat install ingest-attachment
Again, start or restart your node after installing this plugin. Now, with at least one ingest node in the Elasticsearch cluster, we're ready to start working with attachments.
Mappings and Pipelines
Mapping with ingest-attachment is a little different to how attachments are mapped with the mapper-attachments plugin. Gone is the need to map the attachment using the bespoke attachment type; instead, we specify the field in which we are going to send the base64 encoded attachment to Elasticsearch, along with an object mapping that will receive the attachment metadata extracted by the ingest-attachment processor pipeline.
With a slightly updated POCO, the mapping now looks like the following
public class Document
{
    public int Id { get; set; }
    public string Path { get; set; }
    public string Content { get; set; }
    public Attachment Attachment { get; set; }
}

var indexResponse = client.CreateIndex(documentsIndex, c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(ad => ad
                .Custom("windows_path_hierarchy_analyzer", ca => ca
                    .Tokenizer("windows_path_hierarchy_tokenizer")
                )
            )
            .Tokenizers(t => t
                .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
                    .Delimiter('\\')
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mp => mp
            .AllField(all => all
                .Enabled(false)
            )
            .Properties(ps => ps
                .Number(n => n
                    .Name(nn => nn.Id)
                )
                .Text(s => s
                    .Name(n => n.Path)
                    .Analyzer("windows_path_hierarchy_analyzer")
                )
                .Object<Attachment>(a => a
                    .Name(n => n.Attachment)
                    .Properties(p => p
                        .Text(t => t
                            .Name(n => n.Name)
                        )
                        .Text(t => t
                            .Name(n => n.Content)
                        )
                        .Text(t => t
                            .Name(n => n.ContentType)
                        )
                        .Number(n => n
                            .Name(nn => nn.ContentLength)
                        )
                        .Date(d => d
                            .Name(n => n.Date)
                        )
                        .Text(t => t
                            .Name(n => n.Author)
                        )
                        .Text(t => t
                            .Name(n => n.Title)
                        )
                        .Text(t => t
                            .Name(n => n.Keywords)
                        )
                    )
                )
            )
        )
    )
);
The mapping uses the text type to map all of the string properties so that they are analyzed. In fact, we can take advantage of a feature within NEST known as automapping to simplify this mapping further; automapping will infer the document mapping to send to Elasticsearch based on the types of the properties on the POCO. The simpler mapping is
var indexResponse = client.CreateIndex(documentsIndex, c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(ad => ad
                .Custom("windows_path_hierarchy_analyzer", ca => ca
                    .Tokenizer("windows_path_hierarchy_tokenizer")
                )
            )
            .Tokenizers(t => t
                .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
                    .Delimiter('\\')
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mp => mp
            .AutoMap()
            .AllField(all => all
                .Enabled(false)
            )
            .Properties(ps => ps
                .Text(s => s
                    .Name(n => n.Path)
                    .Analyzer("windows_path_hierarchy_analyzer")
                )
                .Object<Attachment>(a => a
                    .Name(n => n.Attachment)
                    .AutoMap()
                )
            )
        )
    )
);
As before, the mapping can be checked with
var mappingResponse = client.GetMapping<Document>();
Now that the mapping is in place, an ingest pipeline can be created to use for attachment processing
client.PutPipeline("attachments", p => p
    .Description("Document attachment pipeline")
    .Processors(pr => pr
        .Attachment<Document>(a => a
            .Field(f => f.Content)
            .TargetField(f => f.Attachment)
        )
        .Remove<Document>(r => r
            .Field(f => f.Content)
        )
    )
);
This is akin to the following curl request
curl -XPUT "http://localhost:9200/_ingest/pipeline/attachments" -d'
{
  "description": "Document attachment pipeline",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "target_field": "attachment"
      }
    },
    {
      "remove": {
        "field": "content"
      }
    }
  ]
}'
The attachment processor configuration allows control over which properties to extract, extracting all properties by default. A remove processor is also added to the ingest pipeline to remove, and hence not store, the base64 encoded attachment sent in the content field; since the file already exists on a file share and the extracted content will be indexed into the content field of the attachment object, keeping the original attachment content around in Elasticsearch is superfluous to our use case. In fact, we could simplify this example further by sending the base64 encoded attachment as the value of the attachment field to Elasticsearch instead of using the content field, still specifying attachment as the target field as before, and removing the content field altogether. A series of blog posts will be diving deeper into ingest node and pipelines if you're eager for more details and use cases; for a teaser, take a look at Ingesting and Exploring Scientific Papers with the ingest-attachment processor plugin and our Elasticsearch as a service offering, Elastic Cloud.
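A sketch of that simplification (assuming the processor can write its extracted object over the same field it reads the base64 value from): the pipeline reduces to a single attachment processor whose source and target fields are both attachment, so no remove processor is needed.

```shell
curl -XPUT "http://localhost:9200/_ingest/pipeline/attachments" -d'
{
  "description": "Document attachment pipeline",
  "processors": [
    {
      "attachment": {
        "field": "attachment",
        "target_field": "attachment"
      }
    }
  ]
}'
```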
Indexing and Searching in the Brave New World
We're now ready to roll with indexing our attachment! The base64 encoded attachment is now passed in the Content field on the Document POCO and the id of the pipeline to use is also specified on the request.
var directory = Directory.GetCurrentDirectory();
var base64File = Convert.ToBase64String(File.ReadAllBytes(Path.Combine(directory, "example_one.docx")));

client.Index(new Document
{
    Id = 1,
    Path = @"\\share\documents\examples\example_one.docx",
    Content = base64File
}, i => i.Pipeline("attachments"));
For this to all work, our Elasticsearch cluster needs to have at least one ingest node in it and, if you need to process lots of attachments, it is recommended to have dedicated ingest nodes since the extraction process can be a resource intensive operation.
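As an illustration, a dedicated ingest node can be configured in elasticsearch.yml by disabling the other node roles (these are the Elasticsearch 5.0 node role settings):

```yaml
# a node that only runs ingest pipelines:
# not master-eligible, holds no data, accepts ingest work
node.master: false
node.data: false
node.ingest: true
```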
With our document indexed, searching is as straightforward as before
var searchResponse = client.Search<Document>(s => s
    .Query(q => q
        .Match(m => m
            .Field(a => a.Attachment.Content)
            .Query("NEST")
        )
    )
);
which returns the following search result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2568969,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "document",
        "_id" : "1",
        "_score" : 0.2568969,
        "_source" : {
          "path" : "\\\\share\\documents\\examples\\example_one.docx",
          "attachment" : {
            "date" : "2016-08-30T05:48:00Z",
            "content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "author" : "Russ Cam",
            "language" : "en",
            "content" : "The Present and Future of Attachments\n\nThis is a sample document to demonstrate indexing attachments using NEST and the new Attachment type",
            "content_length" : 141
          },
          "id" : 1
        }
      }
    ]
  }
}
All of the extracted metadata from the attachment appears within the _source under the attachment field, and the NEST Attachment type takes care of correctly deserializing this into an Attachment instance on our Document POCO.
Conclusion
PDFs, Word documents, PowerPoint presentations, Excel spreadsheets and the like, brace yourselves, ingest is here! No longer will your content remain locked away.