How to create maintainable and reusable Logstash pipelines
Logstash is an open source data processing pipeline that ingests events from one or more inputs, transforms them, and then sends each event to one or more outputs. Some Logstash implementations may have many lines of code and may process events from multiple input sources. In order to make such implementations more maintainable, I will show how to increase code reusability by creating pipelines from modular components.
Motivation
It is often necessary for Logstash to apply a common subset of logic to events from multiple input sources. This is commonly achieved in one of the following two ways:
- Process events from several different input sources in a single pipeline so that common logic can easily be applied to all events from all sources. In such implementations, in addition to the common logic there is usually a significant amount of conditional logic. This approach may therefore result in Logstash implementations that are complicated and difficult to understand.
- Execute a unique pipeline for processing events from each unique input source. This approach requires duplicating common functionality into each pipeline, which makes it difficult to maintain the common portions of the code.
The technique presented in this blog addresses the shortcomings in the above approaches by storing modular pipeline components in different files, and then constructing pipelines by combining these components. This technique can reduce pipeline complexity and can eliminate code duplication.
Modular pipeline construction
A Logstash configuration file consists of inputs, filters, and outputs which are executed by a Logstash pipeline. In more advanced setups it is common to have a Logstash instance executing multiple pipelines. By default, when you start Logstash without arguments, it will read a file called pipelines.yml and will instantiate the specified pipelines.
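For reference, a minimal pipelines.yml might look something like the following sketch (the pipeline names and paths here are hypothetical); each entry declares a pipeline id and the configuration that pipeline should run:

- pipeline.id: pipeline-a
  path.config: "/etc/logstash/pipeline-a.cfg"
- pipeline.id: pipeline-b
  path.config: "/etc/logstash/pipeline-b.cfg"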
Logstash inputs, filters, and outputs can be stored in multiple files which can be selected for inclusion into a pipeline by specifying a glob expression. The files that match a glob expression will be combined in alphabetical order. As the order of execution of filters is often important, it may be helpful to include numeric identifiers in file names to ensure that files are combined in the desired order.
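For example, assuming a hypothetical directory /etc/logstash/modular containing the files 10_input.cfg, 20_filter.cfg, and 30_output.cfg, a wildcard glob would select all three files and combine them in that (alphabetical) order, with the numeric prefixes making the intended ordering explicit:

- pipeline.id: ordered-example
  path.config: "/etc/logstash/modular/*.cfg"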
Below we will define two unique pipelines that are a combination of several modular Logstash components. We store our Logstash components in the following files:
- Input declarations: 01_in.cfg, 02_in.cfg
- Filter declarations: 01_filter.cfg, 02_filter.cfg, 03_filter.cfg
- Output declarations: 01_out.cfg
Using glob expressions, we then define pipelines in pipelines.yml to be composed of the desired components as follows:
- pipeline.id: my-pipeline_1
  path.config: "<path>/{01_in,01_filter,02_filter,01_out}.cfg"
- pipeline.id: my-pipeline_2
  path.config: "<path>/{02_in,02_filter,03_filter,01_out}.cfg"
In the above pipelines configuration, the file 02_filter.cfg is present in both pipelines, which demonstrates how the code that is common to both pipelines can be defined and maintained in a single file and also be executed by multiple pipelines.
Testing the pipelines
In this section we provide a concrete example of the files that will be combined into the unique pipelines defined in the above pipelines.yml. We then run Logstash with these files and present the generated output.
Configuration files
Input file: 01_in.cfg
This file defines an input that is a generator. The generator input is designed for testing Logstash, and in this case it will generate a single event.
input {
  generator {
    lines => ["Generated line"]
    count => 1
  }
}
Input file: 02_in.cfg
This file defines a Logstash input that listens on stdin.
input {
  stdin {}
}
Filter file: 01_filter.cfg
filter {
  mutate {
    add_field => { "filter_name" => "Filter 01" }
  }
}
Filter file: 02_filter.cfg
filter {
  mutate {
    add_field => { "filter_name" => "Filter 02" }
  }
}
Filter file: 03_filter.cfg
filter {
  mutate {
    add_field => { "filter_name" => "Filter 03" }
  }
}
Output file: 01_out.cfg
output {
  stdout {
    codec => "rubydebug"
  }
}
Execute the pipeline
Starting Logstash without any options will cause it to read the pipelines.yml file that we previously defined and start the pipelines it declares. Run Logstash as follows:
./bin/logstash
As the pipeline my-pipeline_1 is executing a generator to simulate an input event, we should see the following output as soon as Logstash has finished initializing. This shows that the contents of 01_filter.cfg and 02_filter.cfg are executed by this pipeline as expected.
{ "sequence" => 0, "host" => "alexandersmbp2.lan", "message" => "Generated line", "@timestamp" => 2020-02-05T22:10:09.495Z, "@version" => "1", "filter_name" => [ [0] "Filter 01", [1] "Filter 02" ] }
As the other pipeline, called my-pipeline_2, is waiting for input on stdin, we have not seen any events processed by that pipeline yet. Type something into the terminal where Logstash is running, and press Return to create an event for this pipeline. Once you have done this, you should see something like the following:
{ "filter_name" => [ [0] "Filter 02", [1] "Filter 03" ], "host" => "alexandersmbp2.lan", "message" => "I’m testing my-pipeline_2", "@timestamp" => 2020-02-05T22:20:43.250Z, "@version" => "1" }
We can see from the above that the logic from 02_filter.cfg and 03_filter.cfg is applied as expected.
Order of execution
Be aware that Logstash does not pay attention to the order of the files in the glob expression. It only uses the glob expression to determine which files to include, and then orders them alphabetically. That is to say, even if we were to change the definition of my-pipeline_2 so that 03_filter.cfg appears in the glob expression before 02_filter.cfg, each event would still pass through the filter in 02_filter.cfg before the filter defined in 03_filter.cfg.
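To illustrate, a hypothetical variant of my-pipeline_2 with the two filter files swapped in the glob expression would behave identically, because Logstash still combines the matched files alphabetically:

- pipeline.id: my-pipeline_2
  path.config: "<path>/{02_in,03_filter,02_filter,01_out}.cfg"

With this definition, 02_filter.cfg still executes before 03_filter.cfg, and the filter_name array in the output is unchanged.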
Conclusion
Using glob expressions allows Logstash pipelines to be composed from modular components, which are stored as individual files. This can improve code maintainability, reusability, and readability.
As a side note, in addition to the technique documented in this blog, pipeline-to-pipeline communication is also worth considering, as it may further improve the modularity of a Logstash implementation.
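As a minimal sketch of that approach (the pipeline address common-processing and the surrounding pipelines are assumed for illustration and are not part of the setup above), an upstream pipeline can forward events to a shared downstream pipeline using the pipeline output and input plugins:

output {
  # Upstream pipeline: send events to the downstream pipeline's virtual address
  pipeline { send_to => ["common-processing"] }
}

input {
  # Downstream pipeline: receive events addressed to "common-processing"
  pipeline { address => "common-processing" }
}

The common filter logic would then live only in the downstream pipeline, while each source-specific pipeline forwards its events to it.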