Adding data to Elasticsearch
You have a number of options for getting your data into Elasticsearch, which is commonly referred to as ingesting or indexing your data. You can use Elastic Agent, Beats, Logstash, Elastic language clients, Workplace Search content connectors, or the Enterprise Search web crawler. Which option (or combination) you use largely depends on whether you are indexing general content or timestamped data.
- General content: To index content like HTML pages, catalogs, and other files, you can use Workplace Search content connectors, the Enterprise Search web crawler, or send data directly to Elasticsearch from your application using one of the Elastic language clients, as sketched below.
- Timestamped data: The preferred way to index timestamped data is to use Elastic Agent. Elastic Agent is a single, unified way to add monitoring for logs, metrics, and other types of data to a host. It can also protect hosts from security threats, query data from operating systems, forward data from remote services or hardware, and more. Each Elastic Agent-based integration includes default ingestion rules, dashboards, and visualizations so you can start analyzing your data right away. Fleet enables you to centrally manage all of your deployed Elastic Agents from Kibana.
If no Elastic Agent integration is available for your data source, you can use Beats to collect your data. Beats are data shippers designed to collect and ship a particular type of data from a server—you install a separate Beat for each type of data you want to collect. Modules that provide default configurations, Elasticsearch ingest pipeline definitions, and Kibana dashboards are available for some Beats, such as Filebeat and Metricbeat. No Fleet management capabilities are provided for Beats.
If neither Elastic Agent nor Beats supports your data source, another alternative is to use Logstash. Logstash is an open source data collection engine with real-time pipelining capabilities that supports a wide variety of data sources. You might also use Logstash to persist incoming data so it is not lost during an ingestion spike, or when you need to send the data to multiple destinations.
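For the direct-indexing option mentioned under General content above, a minimal sketch using the official Python client might look like the following. The endpoint, API key, index name, and document fields are placeholders, not values from this guide.

```python
from elasticsearch import Elasticsearch

# Connect to your deployment. The URL and API key below are placeholders.
client = Elasticsearch("https://localhost:9200", api_key="YOUR_API_KEY")

# Index a single catalog document into a hypothetical "products" index.
# Elasticsearch creates the index on first use if it does not already exist.
client.index(
    index="products",
    document={
        "name": "Example widget",
        "category": "widgets",
        "price": 19.99,
    },
)
```

For timestamped data collected by Elastic Agent, Beats, or Logstash, you typically do not write this code yourself; the shippers handle indexing for you.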
Designing a data ingestion pipeline
While you can send data directly to Elasticsearch, data ingestion pipelines often include additional steps to manipulate the data, ensure data integrity, or manage the data flow.
Data manipulation
It’s often necessary to sanitize, normalize, transform, or enrich your data before it’s indexed and stored in Elasticsearch.
- Elastic Agent and Beats processors enable you to manipulate the data at the edge. This is useful if you need to control what data is sent across the wire, or need to enrich the raw data with information available on the host.
- Elasticsearch ingest pipelines enable you to manipulate the data as it comes in. This avoids putting additional processing overhead on the hosts from which you’re collecting data (see the pipeline sketch after this list).
- Logstash enables you to avoid heavyweight processing at the edge, but still manipulate the data before sending it to Elasticsearch. This also enables you to send the processed data to multiple destinations.
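As a rough illustration of the ingest pipeline option above, the following sketch uses the Python client to create a pipeline and then references it when indexing. The pipeline id, field names, and processors are hypothetical; a real pipeline would be tailored to your data.

```python
from elasticsearch import Elasticsearch

# Placeholder connection details.
client = Elasticsearch("https://localhost:9200", api_key="YOUR_API_KEY")

# Create a hypothetical pipeline that lowercases a field and renames it
# to an ECS-style field name before the document is stored.
client.ingest.put_pipeline(
    id="normalize-hostnames",
    description="Lowercase hostnames and map them to ECS host.name",
    processors=[
        {"lowercase": {"field": "hostname"}},
        {"rename": {"field": "hostname", "target_field": "host.name"}},
    ],
)

# Reference the pipeline at index time so the processors run inside
# Elasticsearch, not on the host that collected the data.
client.index(
    index="my-logs",
    pipeline="normalize-hostnames",
    document={"hostname": "WEB-01", "message": "user logged in"},
)
```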
One reason for preprocessing your data is to control the structure of the data that’s indexed into Elasticsearch—the data schema. For example, you might use an ingest pipeline to map your data to the Elastic Common Schema (ECS). Alternatively, you can use runtime fields at query time to:
- Start working with your data without needing to understand how it’s structured
- Add fields to existing documents without reindexing your data
- Override the value returned from an indexed field
- Define fields for a specific use without modifying the underlying schema
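As a sketch of the points above, a runtime field can be defined directly in a search request without touching the index mapping. The index, field names, and script below are hypothetical.

```python
from elasticsearch import Elasticsearch

# Placeholder connection details.
client = Elasticsearch("https://localhost:9200", api_key="YOUR_API_KEY")

# Define a runtime field for this search only: derive seconds from an
# indexed "duration_ms" field. The underlying schema is not modified.
response = client.search(
    index="my-logs",
    runtime_mappings={
        "duration_seconds": {
            "type": "double",
            "script": {"source": "emit(doc['duration_ms'].value / 1000.0)"},
        }
    },
    fields=["duration_seconds"],
    query={"match_all": {}},
)

for hit in response["hits"]["hits"]:
    print(hit["fields"]["duration_seconds"])
```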
Data integrity
Logstash can boost data resiliency for important data that you don’t want to lose. Logstash offers an on-disk persistent queue (PQ) that can absorb bursts of events without an external buffering mechanism. It attempts to deliver messages stored in the PQ until delivery succeeds at least once.
The Logstash dead letter queue (DLQ) provides on-disk storage for events that Logstash can’t process, giving you a chance to evaluate them. You can use the dead_letter_queue input plugin to easily reprocess DLQ events.
Data flow
If you need to collect data from multiple Beats or Elastic Agents, consider using Logstash as a proxy. Logstash can receive data from multiple endpoints, even on different networks, and send the data on to Elasticsearch through a single firewall rule. You get more security for less work than if you set up individual rules for each endpoint.
Logstash can send to multiple outputs from a single pipeline to help you get the most value from your data.