Make Your Config Cleaner and Your Log Processing Faster with Logstash Metadata
With the release of Logstash 1.5 we have added the ability to add metadata to an event. The difference between regular event data and metadata is that metadata is not serialized by any outputs. This means any metadata you add is transient in the Logstash pipeline and will not be included in the output. Using this feature, one can add custom data to an event, perform additional filtering or add conditionals based on the metadata while the event flows through the Logstash pipeline. This will simplify your configuration and remove the need to define temporary fields.
To access the metadata fields you can use the standard field syntax:
[@metadata][foo]
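For example, here is a minimal sketch (the foo field and its value are placeholders) that stashes a value in @metadata and copies it back into a regular field:

filter {
  mutate {
    # Write to the transient @metadata hash; this never reaches an output
    add_field => { "[@metadata][foo]" => "bar" }
  }
  mutate {
    # Read it back via sprintf into a field that will be serialized
    add_field => { "copy_of_foo" => "%{[@metadata][foo]}" }
  }
}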
Use Cases
Let's consider some use cases to illustrate the power of metadata. In all of them, we will be using the rubydebug codec with the stdout output to check our transformations, so make sure you define the output codec with the metadata option set to true.
Note: The rubydebug codec used in the stdout output is currently the only way to see what is in @metadata at output time.
output { stdout { codec => rubydebug { metadata => true } } }
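As a quick smoke test, here is a minimal sketch (the [@metadata][test] field is purely illustrative) that confirms the codec displays metadata:

input { stdin { } }
filter {
  mutate {
    # This field lives only in the transient @metadata hash
    add_field => { "[@metadata][test]" => "Hello" }
  }
}
output { stdout { codec => rubydebug { metadata => true } } }

Typing any line on stdin should print an event whose @metadata hash contains the test field; without metadata => true, or in any other output, it would be invisible.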
Date filter
Since logs arrive in a wide variety of formats, grok is used to extract their timestamps, and the date filter to convert them to ISO8601 and overwrite the @timestamp field with the timestamp from the log event. However, users frequently forget to remove the source timestamp field after the conversion and overwrite. Here's a rough example of how the new @metadata field can be used with the date filter to prevent a temporary timestamp field from making it into Elasticsearch:
grok {
  match => { "message" => '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:[@metadata][timestamp]}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}' }
}
date {
  match => [ "[@metadata][timestamp]", "dd/MMM/YYYY:HH:mm:ss Z" ]
}
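Here the Apache-style timestamp is captured directly into [@metadata][timestamp], the date filter reads it from there, and because @metadata is never serialized, no cleanup step is needed.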
Before Logstash 1.5, you would capture the timestamp into a regular field and then remove that redundant field by adding a remove_field line to the date filter. That extra mutation is, in theory, a slower operation than never creating the field at all, which makes the @metadata field a performance booster!
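For comparison, the pre-1.5 pattern would look roughly like this, with the capture going into a regular timestamp field and an explicit cleanup step:

grok {
  match => { "message" => '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}' }
}
date {
  match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
  # The cleanup step that @metadata makes unnecessary
  remove_field => [ "timestamp" ]
}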
The @metadata field acts like a normal field, and you can perform all the usual operations and filtering on it. Use it as a scratchpad if you don't need to persist the information.
# Log sample:
# 213.113.233.227 - server=A id=1234 memory_load=300 error_code=13 payload=12 event_start=1417193566 event_stop=1417793586
input {
  file {
    sincedb_path => '/dev/null'
    path => "/source/test.log"
    start_position => 'beginning'
  }
}
filter {
  grok {
    match => { "message" => "%{IP:ip} - %{DATA:[@metadata][components]}$" }
  }
  kv {
    source => "[@metadata][components]"
  }
  date {
    match => ["event_start", "UNIX"]
    target => "event_start"
  }
  date {
    match => ["event_stop", "UNIX"]
    target => "event_stop"
  }
  ruby {
    code => "event['@metadata']['duration'] = event['event_stop'] - event['event_start']"
  }
  if [@metadata][duration] > 100 {
    mutate {
      add_tag => "slow_query"
      add_field => { "[@metadata][speed]" => "slow_query" }
    }
  } else {
    mutate {
      add_field => { "[@metadata][speed]" => "normal" }
    }
  }
}
output {
  stdout {
    codec => rubydebug { metadata => true }
  }
}
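With the sample line above, event_stop - event_start works out to 1417793586 - 1417193566 = 600020 seconds, well above the threshold of 100, so the event is tagged slow_query and [@metadata][speed] is set to slow_query. Neither [@metadata][components], [@metadata][duration], nor [@metadata][speed] would survive into a persisting output.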
Elasticsearch output
Some plugins leverage metadata, like the elasticsearch input, which lets you keep each document's information in a predefined @metadata field. This information is available to various parts of the Logstash pipeline, but will not be persisted in Elasticsearch documents.
input {
  elasticsearch {
    host => "localhost"
    # Store ES document metadata (_index, _type, _id) in metadata
    docinfo_in_metadata => true
  }
}
output {
  elasticsearch {
    document_id => "%{[@metadata][_id]}"
    index => "transformed-%{[@metadata][_index]}"
    type => "%{[@metadata][_type]}"
  }
}
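This combination is handy for reindexing: a filter section can transform each document however you like, while the output writes it back under its original _id and _type into a renamed index, so nothing gets duplicated.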
Create your own id from your event data
Out of the box, Elasticsearch provides an efficient way to create unique IDs for every document you insert, and in most cases you should let Elasticsearch generate the IDs. However, there are scenarios where you may want to generate a unique identifier in Logstash based on the content of the event. Using IDs derived from event data lets Elasticsearch perform de-duplication, since re-sending the same event overwrites the same document. In our example, we will generate the IDs using the logstash-filter-fingerprint plugin with its default hash method (SHA1).
To test it, use the following JSON event with this configuration:
{ "IP": "127.0.0.1", "message": "testing generated id"}
input {
  stdin { codec => json }
}
filter {
  fingerprint {
    source => ["IP", "@timestamp", "message"]
    target => "[@metadata][generated_id]"
    key => "my-key"
  }
}
output {
  elasticsearch {
    protocol => "http"
    host => "127.0.0.1"
    document_id => "%{[@metadata][generated_id]}"
  }
  stdout {
    codec => rubydebug { metadata => true }
  }
}
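To try it (assuming the configuration above is saved as fingerprint.conf), you can pipe the sample event straight into Logstash:

echo '{"IP": "127.0.0.1", "message": "testing generated id"}' | bin/logstash -f fingerprint.conf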
Like in the previous examples, we are using the fieldref syntax to access the generated_id entry in the @metadata hash. The Elasticsearch output will use this value as the document id, but the intermediate value generated_id will not be saved as part of the _source inside Elasticsearch.
If you query for the specific document using the generated ID, you should see a similar document showing the saved information:
# curl -XGET "http://localhost:9200/logstash*/_search?q=_id:5f5b8e63da13c17405e940b5e8db703a19cd4485&pretty=1"
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 35,
    "successful" : 35,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash-2015.09.03",
      "_type" : "logs",
      "_id" : "5f5b8e63da13c17405e940b5e8db703a19cd4485",
      "_score" : 1.0,
      "_source":{"IP":"127.0.0.1","message":"testing generated id","@version":"1","@timestamp":"2015-09-03T20:27:25.206Z","host":"sashimi"}
    } ]
  }
}
Similarly, you can reference @metadata fields in sprintf format strings anywhere in your configuration, just like any other field:

"from server: %{[@metadata][source]}"
Conclusion
As you have seen in the examples above, the addition of metadata provides a simple yet convenient way to store intermediate results. This makes configurations less complex, since you no longer have to use remove_field explicitly to clean up temporary fields. It also reduces the storage of unnecessary fields in Elasticsearch, which helps keep your index size down. Metadata is a powerful addition to your Logstash toolset. Start using this feature today in your configuration!