Content extraction

Connectors use the Elastic ingest attachment processor to extract file contents. The processor extracts text from files using the Apache Tika text extraction library. The logic for content extraction is defined in utils.py.

Although the processor is intended primarily for PDF and Microsoft Office formats, you can use it with any of the supported formats.
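
As a rough illustration of what the attachment processor works with, the sketch below builds a pipeline definition and a document whose file content is base64-encoded, which is the form the processor reads. The field names `data` and `attachment`, the description text, and the `build_document` helper are assumptions made for this example; they are not taken from the connectors' utils.py.

```python
import base64
import json

# Illustrative pipeline body. "data" and "attachment" are assumed field names.
PIPELINE_BODY = {
    "description": "Extract text from a base64-encoded file (illustrative)",
    "processors": [
        {
            "attachment": {
                "field": "data",               # source field holding the base64 string
                "target_field": "attachment",  # where extracted text and metadata are stored
                "indexed_chars": 100000,       # cap on the number of extracted characters
                "ignore_missing": True,        # skip documents that lack the source field
            }
        }
    ],
}


def build_document(raw_bytes: bytes) -> dict:
    """Wrap raw file bytes in the base64 form the attachment processor reads."""
    return {"data": base64.b64encode(raw_bytes).decode("ascii")}


doc = build_document(b"Hello, Tika")
print(json.dumps(PIPELINE_BODY, indent=2))
print(json.dumps(doc))
```

Sending `PIPELINE_BODY` to Elasticsearch and indexing `doc` through it is left out here, since that depends on your client and deployment.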

Enterprise Search uses an Elasticsearch ingest pipeline to power the web crawler’s binary content extraction. The default pipeline, ent-search-generic-ingestion, is automatically created when Enterprise Search first starts.

You can view this pipeline in Kibana and customize how it is used. See Index-specific ingest pipelines.
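
Besides Kibana, a pipeline definition can also be fetched through the Elasticsearch ingest pipeline API. The minimal sketch below only builds the request URL for the default pipeline; the base URL `http://localhost:9200` and the `pipeline_url` helper are assumptions for illustration, and authentication is omitted.

```python
from urllib.parse import quote

DEFAULT_PIPELINE = "ent-search-generic-ingestion"


def pipeline_url(base_url: str, pipeline_id: str = DEFAULT_PIPELINE) -> str:
    """Build the GET URL for an ingest pipeline definition."""
    # Percent-encode the ID so unusual characters cannot alter the path.
    return f"{base_url.rstrip('/')}/_ingest/pipeline/{quote(pipeline_id, safe='')}"


# Assumed local deployment; replace with your own Elasticsearch endpoint.
print(pipeline_url("http://localhost:9200"))
```

A GET request to the resulting URL returns the pipeline's JSON definition if the pipeline exists.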

Supported file types

The following file types are supported:

  • .txt
  • .py
  • .rst
  • .html
  • .markdown
  • .json
  • .xml
  • .csv
  • .md
  • .ppt
  • .rtf
  • .docx
  • .odt
  • .xls
  • .xlsx
  • .rb
  • .paper
  • .sh
  • .pptx
  • .pdf
  • .doc
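
The list above can be turned into a simple pre-check before files are handed to a connector. This is a minimal sketch, not the check the connectors perform: the `is_supported` helper is hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions copied from the supported file types list.
SUPPORTED_EXTENSIONS = frozenset({
    ".txt", ".py", ".rst", ".html", ".markdown", ".json", ".xml", ".csv",
    ".md", ".ppt", ".rtf", ".docx", ".odt", ".xls", ".xlsx", ".rb",
    ".paper", ".sh", ".pptx", ".pdf", ".doc",
})


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is in the supported list."""
    # Assumption: extension matching ignores case (".PDF" counts as ".pdf").
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ("report.pdf", "slides.PPTX", "bundle.zip", "README"):
    print(name, is_supported(name))
```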

The ingest attachment processor does not support compressed files, such as an archive file containing a set of PDFs. Expand the archive and make the individual uncompressed files available for the connector to process.
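
One way to expand an archive ahead of time is sketched below for ZIP files, using Python's standard `zipfile` module. The `expand_archive` helper and the example paths are hypothetical, and other archive formats would need a different module.

```python
import zipfile
from pathlib import Path
from typing import List


def expand_archive(archive_path: str, dest_dir: str) -> List[Path]:
    """Extract a ZIP archive and return the paths of the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        # Directory entries end with "/"; keep only regular file entries.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return sorted(dest / n for n in names)


# Example usage with assumed paths:
# files = expand_archive("reports.zip", "expanded/reports")
```

The extracted files can then be placed wherever the connector reads its source content.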