Watching the watches: Writing, debugging and testing watches
Editor's Note (August 3, 2021): This post uses deprecated features. Please refer to the current alerting documentation for up-to-date instructions.
TL;DR: This blog post outlines how to efficiently write, execute, and debug watches. This will be useful if you are writing your own watches or if you have to debug watches written by your colleagues.
The correct workflow
A watch is the unit of work for the alerting features in X-Pack. A watch contains information on where to get the data that could trigger the alert, a condition to trigger upon, which actions to take when that condition holds true, and how often the watch should be executed. Let's say we want to add a watch to our system and see whether it triggers the way we want it to. Let's go with the simplest watch we can imagine:
PUT _xpack/watcher/watch/my-watch
{
  "trigger" : {
    "schedule" : {
      "interval": "5m"
    }
  },
  "input" : {
    "simple" : {
      "foo": "bar"
    }
  },
  "actions" : {
    "log-me-action" : {
      "logging" : {
        "text" : "Payload is {{ctx.payload}}"
      }
    }
  }
}
This is the most barebones watch. We have to define a trigger, an input, and a single action. If you do not specify a condition, an always condition will be used, so the actions are always executed.
Now you have two choices. First, you could wait for the scheduled execution run to check if everything works as expected. This, however, would make your testing cycle painfully slow, so you should not do that. Instead, you can use the Execute Watch API to execute the watch manually like this:
POST _xpack/watcher/watch/my-watch/_execute
You will get a longish output that shows the execution path of the watch. In this case it will simply return the configured input; a search or a webhook input, however, might return the HTTP response from the destination host. You will see whether the condition was met, and lastly, for each action, whether it was executed and what its return value was. This entry has exactly the same structure as the one that is stored in the watch history when the watch is triggered automatically.
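By default, such a manual run only returns the watch record in the response and is not persisted. If you also want the run to show up in the watch history, the Execute Watch API has a record_execution flag you can set in the request body. A minimal sketch; check the documentation of your version for the exact behavior:

POST _xpack/watcher/watch/my-watch/_execute
{
  "record_execution" : true
}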
However, this workflow still has a weakness: you have to execute two calls, first storing the watch and then executing it. To overcome this, the Execute Watch API allows you to specify a watch inline, like this:
PUT _xpack/watcher/watch/_execute
{
  "watch" : {
    "trigger" : {
      "schedule" : {
        "interval": "5m"
      }
    },
    "input" : {
      "simple" : {
        "foo": "bar"
      }
    },
    "actions" : {
      "log-me-action" : {
        "logging" : {
          "text" : "Payload is {{ctx.payload}}"
        }
      }
    }
  }
}
When you check the Execute Watch API response, you will see that the watch_record.watch_id field is set to _inlined_ because we supplied the watch with the request. This saves you the additional round-trip of creating a watch and then executing it. But this is just the first step! Let's do a quick excursion into the watch history, and then we'll come back for more and better testing.
A journey into the watch history
The watch history is stored in time-based indices that are created automatically by Watcher on a daily basis. Whenever a watch is executed, a corresponding watch history entry is created for that execution. Let's examine the watch history entry for the above watch, created when we ran the Execute Watch API.
{ "_id" : "_inlined__7310b96e-b693-4092-b54f-c54cf178fde5-2017-03-06T13:13:35.467Z", "watch_record" : { "watch_id" : "_inlined_", "state" : "executed", "trigger_event" : { "type" : "manual", "triggered_time" : "2017-03-06T13:13:35.467Z", "manual" : { "schedule" : { "scheduled_time" : "2017-03-06T13:13:35.467Z" } } }, "input" : { "simple" : { "foo" : "bar" } }, "condition" : { "always" : { } }, "result" : { "execution_time" : "2017-03-06T13:13:35.467Z", "execution_duration" : 2, "input" : { "type" : "simple", "status" : "success", "payload" : { "foo" : "bar" } }, "condition" : { "type" : "always", "status" : "success", "met" : true }, "actions" : [ { "id" : "log-me-action", "type" : "logging", "status" : "success", "logging" : { "logged_text" : "Payload is {foo=bar}" } } ] }, "messages" : [ ] } }
The state field indicates whether the watch was executed successfully. Usually you will see the executed state. What else could you see? The state could also indicate that a watch was acknowledged or throttled, that the condition was not met, or that the dedicated thread pool and its queue were full when the watch was about to be executed.
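If you want to hunt for problematic executions after the fact, you can also search the watch history indices directly. Here is a minimal sketch, assuming the default .watcher-history-* index pattern and the field names shown in the record above; adjust it to the mappings of your version:

GET .watcher-history-*/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must_not": [
        { "term": { "state": "executed" } }
      ]
    }
  },
  "sort": [
    { "trigger_event.triggered_time": { "order": "desc" } }
  ]
}

This returns the ten most recent watch records whose state is anything other than executed.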
The trigger_event field contains information on what triggered this watch, which is usually a schedule-based trigger (unless you use the Execute Watch API yourself).
The input structure only contains the information defined in our input. This ensures that you know exactly what was requested for this execution of the watch, as this information would otherwise be lost when you update your watch.
The condition structure in this concrete example is just expanded to an always condition. Again, the same applies as with the input field: this ensures you know the condition of the watch at the time of execution, even if you change it later on.
Now on to the interesting part: the result field. For each execution, the data gathered by the input, the condition, and each action is logged. Let's take a look at the simple input result first:
"input" : { "type" : "simple", "status" : "success", "payload" : { "foo" : "bar" } },
This output does not yield any further information and thus is a bit boring, but so is the simple input. However, the output of the http input is vastly different:
"input" : { "http" : { "request": { "url" : "http://localhost:9200/_cluster/health" } } }
With this input, the output is much more verbose:
"result" : { ... "input" : { "type" : "http", "status" : "success", "payload" : { "number_of_pending_tasks" : 0, "cluster_name" : "elasticsearch", "active_shards" : 3, "active_primary_shards" : 3, "unassigned_shards" : 3, "delayed_unassigned_shards" : 0, "timed_out" : false, "relocating_shards" : 0, "_headers" : { "content-length" : [ "389" ], "content-type" : [ "application/json; charset=UTF-8" ] }, "initializing_shards" : 0, "task_max_waiting_in_queue_millis" : 0, "number_of_data_nodes" : 1, "number_of_in_flight_fetch" : 0, "active_shards_percent_as_number" : 50.0, "_status_code" : 200, "status" : "yellow", "number_of_nodes" : 1 }, "http" : { "request" : { "host" : "localhost", "port" : 9200, "scheme" : "http", "method" : "get", "path" : "/_cluster/health" }, "status_code" : 200 } } }
As you can see, the payload field, which you can access via ctx.payload in the condition and the actions, has been enriched with the JSON response from a cluster health request, along with a few more fields like _status_code and the HTTP response headers in _headers. There is also a generic status field for each input that indicates whether the input execution was successful.
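Because all of this ends up in ctx.payload, you can reference these extra fields directly in your conditions and actions. As a purely hypothetical sketch (the action name is made up), a logging action could include the HTTP status code and the cluster status like this:

"actions" : {
  "log-health-action" : {
    "logging" : {
      "text" : "Cluster health returned HTTP {{ctx.payload._status_code}}, cluster status is {{ctx.payload.status}}"
    }
  }
}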
Next is the condition in the result. Again, our omitted condition is somewhat boring, as it only marks the condition as met because always evaluates to true. However, let's take a look at this scripted condition, which is only met if the cluster health API reports more than three nodes:
"condition" : { "script" : { "inline" : "return ctx.payload.number_of_nodes > 3", "lang" : "painless" } }
As we run a single-node cluster, this condition will not be met, and this is what the output looks like:
"condition" : { "type" : "script", "status" : "success", "met" : false }
The output status is marked as success because the script itself was executed successfully. A nice feature of the painless scripting language is its exact stack traces with spot-on error messages. As we do not want to lose this information in case you made a mistake, the exception field in the watch history will be enriched with the whole stack trace. For example, specifying a non-existing field as in return ctx.payload.numbr_of_nodes > 3 will produce an additional script_stack field that contains the exact location where the script failed:
"script_stack" : [ "return ctx.payload.numbr_of_nodes > 3", " ^---- HERE" ],
If you are interested in further painless debugging, you should take a look at the painless documentation.
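One trick from those docs that is worth knowing when you are unsure what ctx.payload actually contains: assuming your Painless version ships the Debug class, you can temporarily abort a script with Debug.explain, which fails the execution with a response describing the object you passed in:

"condition" : {
  "script" : {
    "inline" : "Debug.explain(ctx.payload)",
    "lang" : "painless"
  }
}

Remember to remove this again once you know what the payload looks like, as the watch execution will fail with an error while it is in place.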
The next part of the result field contains information for each action being executed. In our simple example, we only see the text that is also logged into the Elasticsearch log file. However, when your output becomes a little more complex, there is a pile of useful information in there. Let's use the index action as an example:
"actions" : { "index-me-action" : { "index" : { "index" : "foo", "doc_type" : "bar" } } }
This action yields the following output:
"actions" : [ { "id" : "index-me-action", "type" : "index", "status" : "success", "index" : { "response" : { "created" : true, "result" : "created", "id" : "AVqj7nBXM9pxFWYxpNpM", "version" : 1, "type" : "bar", "index" : "foo" } } } ]
Last but not least, the result field also contains the time the watch was executed in the execution_time field, as well as the duration of the execution in milliseconds in the execution_duration field.
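These two fields come in handy when a watch misbehaves or slows down. For example, you could sort the watch history by execution duration to find your slowest watches. Again a minimal sketch that assumes the default history index pattern and that execution_duration is mapped as a numeric field:

GET .watcher-history-*/_search
{
  "size": 5,
  "_source": [ "watch_id", "result.execution_time", "result.execution_duration" ],
  "sort": [
    { "result.execution_duration": { "order": "desc" } }
  ]
}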
Using alternative inputs to test time-based queries
Now we know how to decipher the watch history and how to run a watch, but how can we actually debug watches? Also, we can still improve our workflow for creating a watch. Let's imagine we are writing a watch to monitor disk space usage from Metricbeat data. In order to detect machines that are running low on disk space, we need to search for documents containing disk space information which have been created in the last five minutes and report a disk usage of more than 90%. Then we want to aggregate the results per host. Note that this is a very simple example; you may actually want to aggregate per host and per mount point to get more exact alerts and/or better filter capabilities.
The query might look similar to the following one:
GET metricbeat-2017.03.06/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "from": "now-5m"
            }
          }
        },
        {
          "term": {
            "metricset.name": "filesystem"
          }
        },
        {
          "range": {
            "system.filesystem.used.pct": {
              "from": 0.9
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_host": {
      "terms": {
        "field": "beat.hostname",
        "size": "100"
      }
    }
  }
}
The response might be something like this:
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 60, "max_score": 0, "hits": [] }, "aggregations": { "by_host": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "orca", "doc_count": 60 }, { "key": "dolphin", "doc_count": 120 } ] } } }
In this example, two hosts have a fair share of documents above the threshold. So we could go easy and simply wrap the watch around the written query now. However, in order to properly test our watch, we would need to test both the case of a system being low on disk space and the case where everything is fine. Therefore, we need a way to emulate the above input. Watcher supports exactly this with the alternative_input field of the Execute Watch API. Again, we are using the Execute Watch API without storing the watch, as we are still not sure whether this is going to be our final watch.
POST _xpack/watcher/watch/_execute
{
  "alternative_input" : {
    "took": 5,
    "timed_out": false,
    "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
    },
    "hits": {
      "total": 60,
      "max_score": 0,
      "hits": []
    },
    "aggregations": {
      "by_host": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "orca",
            "doc_count": 60
          },
          {
            "key": "dolphin",
            "doc_count": 120
          }
        ]
      }
    }
  },
  "watch" : {
    "trigger" : {
      "schedule" : {
        "interval": "5m"
      }
    },
    "input" : {
      "search" : {
        "request": {
          "indices" : [ "<metricbeat-{now/d}>" ],
          "body" : {
            "query": {
              "bool": {
                "filter": [
                  {
                    "range": {
                      "@timestamp": {
                        "from": "{{ctx.trigger.scheduled_time}}||-5m"
                      }
                    }
                  },
                  {
                    "term": {
                      "metricset.name": "filesystem"
                    }
                  },
                  {
                    "range": {
                      "system.filesystem.used.pct": {
                        "from": 0.5
                      }
                    }
                  }
                ]
              }
            },
            "size": 0,
            "aggs": {
              "by_host": {
                "terms": {
                  "field": "beat.hostname",
                  "size": "100"
                }
              }
            }
          }
        }
      }
    },
    "condition" : {
      "script" : "return ctx.payload.aggregations.by_host.buckets.size() > 0"
    },
    "actions" : {
      "logme-me-action" : {
        "transform" : {
          "script" : {
            "inline" : "return [ 'hosts' : ctx.payload.aggregations.by_host.buckets.stream().map(b -> b.key).collect(Collectors.toList())]"
          }
        },
        "logging" : {
          "text" : "Over the limit: {{#join delimiter=', '}}ctx.payload.hosts{{/join delimiter=', '}}"
        }
      }
    }
  }
}
Great, we are now able to test a watch with arbitrary input data without actually executing the query. There are two things you might still want to improve. First, this mechanism requires you to execute a query and then paste its response into the alternative_input field. We can do better than this by taking more control over the query. The range query we used filters on the scheduled time of the watch execution. So, if we can control the scheduled time from the outside, there is no need to control the whole input: we can just go back in time by changing the scheduled time. By removing the alternative_input field and adding trigger_data to the Execute Watch API request you can achieve exactly that:
"trigger_data" : { "scheduled_time" : "2017-03-06T12:34:56.055Z" }
This way, the specified scheduled time will be used. Note that we are still using now/d in the index name (which could also roll over to the next day), so this is not yet completely foolproof, but it is a good start. In case you are wondering about the difference between scheduled_time and triggered_time: the first is the time when the watch was supposed to run (e.g. in five minutes), whereas the second is the time when the watch was actually triggered for execution (e.g. in five minutes and fifteen seconds).
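For a stored watch, the whole request is as small as the following sketch. It assumes the my-watch id from the beginning of this post; trigger_data accepts both timestamps:

POST _xpack/watcher/watch/my-watch/_execute
{
  "trigger_data" : {
    "triggered_time" : "2017-03-06T12:34:56.055Z",
    "scheduled_time" : "2017-03-06T12:34:56.055Z"
  }
}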
Reading the watch we have set up so far, there is one more issue to solve. Do we just want to log data in our action configuration? I do not think so. But if we add an email action, do you really want to send out an email to the person on call right now? Maybe not, so we need a way to check whether an action would be executed, without actually executing it. Let's go back to our index action from the earlier example and simulate it.
Simulating actions
Remember our nice little index action? This one:

"actions" : {
  "index-me-action" : {
    "index" : {
      "index" : "foo",
      "doc_type" : "bar"
    }
  }
}
Running this with the Execute Watch API creates a new document for every execution. This might skew metrics or run times, so what if we could simulate this instead? It turns out we can solve this problem by specifying the action mode as part of the Execute Watch API call:
"action_modes" : { "index-me-action" : "simulate" }
If you now check the result for the action in the response, you will see that the action got the status simulated. This means the request has not been sent, but we know the action would have been triggered. Please see the documentation for further action modes, which you might need to circumvent throttling during testing.
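If your watch has several actions and you want to simulate all of them at once, you should be able to use the special _all key instead of a single action id (again, check the documentation for your version). A quick sketch against our stored watch:

POST _xpack/watcher/watch/my-watch/_execute
{
  "action_modes" : {
    "_all" : "simulate"
  }
}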
Using the croneval tool
There is one last helper for the cron interval schedule that might save you a ton of headaches. There is a small CLI tool called croneval in the bin/x-pack/ directory. It allows you to supply a cron expression and see when it will trigger next. Take this example:
./bin/x-pack/croneval -c 4 '0 5 9 ? * 6#1'
Valid!
Now is [Mon, 6 Mar 2017 15:30:02]
Here are the next 4 times this cron expression will trigger:
1. Fri, 7 Apr 2017 11:05:00
2. Fri, 5 May 2017 11:05:00
3. Fri, 2 Jun 2017 11:05:00
4. Fri, 7 Jul 2017 11:05:00
Final words
I hope you have enjoyed our journey into alerting. We still have not covered everything: think of conditions per action, transforms per action, the possibility to ignore the condition when executing a watch via the Execute Watch API, and acking and throttling, which are also taken into account when simulating actions. We might leave this for a more advanced post in the future. Happy alerting!
Last but not least, with the release of the Elastic Stack 5.4 there is also a new Watcher UI that you should definitely check out!
If you have further questions about this blog post, please join the X-Pack discussion forum and we will be happy to help!