Elasticsearch Audit Trail and Log File Filter Policies Explained

This post provides a thorough overview of the audit functionality of an Elasticsearch cluster. It defines some audit trail concepts, clarifies some implementation decisions and delves into some of the different configuration options. The takeaway is that the audit trail goes beyond a bunch of printing statements intertwined with the code that handles authentication and authorization: log events track a client’s interaction with the cluster. This post is aimed at users who need to configure the auditing functionality and who are also concerned with its impact on system performance.

Audit logs provide a trail of records describing the actions of all the agents interacting with system resources. Once generated, the trail is consumed at a later date, sometimes by a human investigator. A comprehensive audit trail is required to ensure accountability and to demonstrate compliance.

In the context of this blog, the system is represented by the Elasticsearch cluster. The resources are the indexed documents as well as various types of metadata. One type of metadata resource is the user and role information required internally for authentication and authorization. Another type describes background computations such as watches, machine learning and rollup jobs. Yet another is the settings of the cluster. Each resource type has its own set of permitted actions.

Audit records are nothing more than structured text rows that point to the when, who, what and how for each action acting on resources. Agents (the who) might be services that are intrinsic to the system, or they might be humans acting through software clients. Actions (the how) denote the most granular operations that can be authorized. A request comprises several operations, which in turn comprise actions, and each of these is recorded as an individual entry. In addition to actions, a few other event types are audited, the most notable being the authentication of REST requests. The guide to the Elastic Stack contains the complete list of audited events.

It is worth pointing out that there is currently no way to tie an action audit entry to the audit entry of the request that caused the action. There is no concept of a “session” that ties all the related audit events together. In particular, the concepts of login and logout are undefined: the credentials of an agent are authenticated with each request.

The log file audit trail consists of multiple regular files, one for each node in the cluster, each stored on its respective node. For performance, security and durability reasons, each audit log file should be written to a local filesystem. Each node's log file contains only audit entries for the local events and actions, and there is no central store collating and ordering the audit entries for all the cluster's nodes. Each file is only appended to, never overwritten, and at some point it will be rotated. Rotation means that the old file is closed, potentially compressed, and a new empty file is created and used for further recording. This behavior is standard for any logging framework. In fact, the log file audit trail, like all the other Elasticsearch log trails, uses the popular log4j2 logging framework. For more details and examples about how to configure logging in Elasticsearch, including logfile auditing, see the logging guide.

Besides appending audit events to a file, there is the option to store these events straight into Elasticsearch via the audit indexing functionality. Please note that this feature will eventually be deprecated and replaced in 7.0, as this method does not have the guaranteed durability which is required for a reliable audit trail. Most information presented next does not apply to the indexed audit output type.

The audit trail is verbose by design, but this brings the inevitable drawback of an I/O performance penalty. The log file audit trail "wastes" I/O operations that could otherwise be spent processing user queries. The busier the cluster is, the more I/O auditing consumes. If I/O is a limiting factor for the type of load and the hardware used, and this is negatively impacting SLAs, the audit records that the administrator deems unimportant can be excluded. There are two configuration options, which may be used together, that selectively drop audit entries before they are enqueued for writing. These options let you tune audit verbosity. When using any such setting, you are accepting potential accountability gaps that could render illegitimate actions undetectable. Please take time to review these audit policies whenever your system architecture changes.

Option 1 — Audit Event Types

The first configuration option available allows the administrator to selectively pick entire classes of events. The possible audit event classes are: anonymous_access_denied, authentication_failed, realm_authentication_failed, access_granted, access_denied, tampered_request, connection_granted, connection_denied, system_access_granted, authentication_success, run_as_granted and run_as_denied. The auditing section in the Elastic Stack Overview docs explains when each type of event is logged. Audit event classes are controlled by the following two configuration options: xpack.security.audit.logfile.events.include, defaulting to access_denied, access_granted, anonymous_access_denied, authentication_failed, connection_denied, tampered_request, run_as_denied and run_as_granted, and xpack.security.audit.logfile.events.exclude defaulting to the empty list. Both options accept lists of event classes as arguments. The exclude option takes precedence, i.e. if an audit event class is listed in both include and exclude it will be excluded.

These are dynamic cluster settings. For example:

PUT /_cluster/settings
{
  "transient": {
    "xpack.security.audit.logfile.events.exclude": [
      "run_as_granted",
      "anonymous_access_denied"
    ]
  }
}

After this update, run_as_granted and anonymous_access_denied events are no longer recorded until the setting is changed again. Because the update is marked as transient, it does not survive a cluster restart, after which audit records of these two types would be recorded again. This assumes that the include setting is left at its default, or explicitly lists them, i.e.:

PUT /_cluster/settings
{
  "persistent": {
    "xpack.security.audit.logfile.events.include": [
      "run_as_granted",
      "anonymous_access_denied"
    ]
  }
}
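The interplay between the include and exclude lists can be modelled in a few lines. This is only an illustrative sketch, not Elasticsearch code: the function name effective_event_classes is hypothetical, and the default list is copied from the text above.

```python
# Default value of the include setting, as listed in this post.
DEFAULT_INCLUDE = [
    "access_denied", "access_granted", "anonymous_access_denied",
    "authentication_failed", "connection_denied", "tampered_request",
    "run_as_denied", "run_as_granted",
]


def effective_event_classes(include=None, exclude=None):
    """Return the set of event classes that would still be logged.

    Hypothetical helper: models the rule that exclude takes precedence
    over include when a class appears in both lists.
    """
    include = DEFAULT_INCLUDE if include is None else include
    exclude = [] if exclude is None else exclude
    return set(include) - set(exclude)


# Mirrors the transient update shown earlier, with include left at its default.
logged = effective_event_classes(
    exclude=["run_as_granted", "anonymous_access_denied"]
)
print(sorted(logged))
```

With include left at its default, the two excluded classes disappear from the result while the other six defaults remain.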

Requirements vary, but typically you would want access_denied and access_granted event classes to be included. These enable recording authorization audit events of the actions executed by the agents while accessing resources, as previously described. However, in a typical cluster, audit entries of these classes constitute the bulk of the corpus. So, even if excluding all the other audit events, which is not recommended, the remaining corpus might still be too much of an I/O burden.

Option 2 — Ignore Policies

This is where configuration option two, the audit log ignore policies, is useful. This option allows the administrator to exclude audit records based on the contents of the audit record itself. This option only allows for entries to be excluded, i.e. there is no include counterpart option. In essence, it is similar to an embedded grep engine, but enhanced with domain knowledge about the audit event's attributes. Compared to a general purpose regex, this allows for easier writing of exclude rules, which are more readable, thus minimizing the probability of errors. In addition, record pruning is done earlier than an external grep utility or the log4j2 filters would do, thus saving processing time.

Ignore policies are defined under the xpack.security.audit.logfile.events.ignore_filters settings namespace. An ignore policy is a named dictionary of filter rules, with each filter rule applying to a specific audit record attribute. The filter rule is keyed inside the policy map by the attribute it applies to. Currently, only the following attributes are permitted for rules: users, realms, roles or indices, although audit records contain other attributes as well. The value of a filter rule is a list of Lucene regular expressions. The rule matches an audit record if any regular expression from the list matches the value of the attribute for which the rule is defined. The policy matches an audit record if all the rules comprising it match their respective attributes. If any policy matches a particular audit record, then the record is excluded and will not be printed.
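The matching logic just described (any pattern within a rule, all rules within a policy, any policy across the configuration) can be sketched as follows. This is a simplified model under stated assumptions, not the actual Elasticsearch implementation: each record attribute holds a single string value, patterns are treated as simple * wildcards via Python's fnmatch rather than full Lucene regular expressions, and the policy name, user names and realm names are made up for illustration.

```python
from fnmatch import fnmatchcase


def rule_matches(patterns, value):
    """A rule matches if any of its patterns matches the attribute value."""
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def policy_matches(policy, record):
    """A policy matches only if every one of its rules matches."""
    for attribute, patterns in policy.items():
        if attribute not in record:
            # A record without the attribute cannot match the rule.
            return False
        if not rule_matches(patterns, record[attribute]):
            return False
    return True


def is_ignored(policies, record):
    """A record is dropped if any policy matches it."""
    return any(policy_matches(policy, record) for policy in policies.values())


# Hypothetical policy: one rule on users and one rule on realms.
policies = {
    "exclude_monitoring": {"users": ["monitor_*"], "realms": ["native"]},
}

print(is_ignored(policies, {"users": "monitor_agent", "realms": "native"}))  # True
print(is_ignored(policies, {"users": "monitor_agent", "realms": "ldap"}))    # False
print(is_ignored(policies, {"users": "monitor_agent"}))                      # False
print(is_ignored(policies, {"users": "alice", "realms": "native"}))          # False
```

Only the first record is dropped, because it is the only one for which both rules of the policy match.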

Before jumping to the first example, it is important to mention that these are dynamic cluster settings as well.

Example — Single Policy Audit

PUT /_cluster/settings 
{
    "persistent": {
        "xpack.security.audit.logfile.events.ignore_filters": {
            "exclude_admin_users": {
                "users": [
                    "logstashUser",
                    "elasticSuperUser",
                    "kibanaAdmin",
                    "_xpack_*"
                ]
            }
        }
    }
}

This creates a single logfile audit filter policy named exclude_admin_users with a single rule that applies to the users attribute of audit records. If the user attribute of an audit record matches the _xpack_* wildcard or any of logstashUser, elasticSuperUser or kibanaAdmin, then that record will not be printed.

Example — Multi-Policy Audit

PUT /_cluster/settings 
{
    "transient": {
        "xpack.security.audit.logfile.events.ignore_filters": {
            "exclude_system_users": {
                "users": [
                    "_system",
                    "_xpack_security"
                ]
            },
            "exclude_kibana_index": {
                "indices": [
                    ".kibana"
                ]
            }
        }
    }
}

The second example defines two policies. In this case, all audit records touching the .kibana index or ascribed to the _system or the _xpack_security users will be ignored. This is because the stream of audit records is sifted through each policy in turn. In the next example, the two rules are collated under a single policy:

PUT /_cluster/settings 
{
    "transient": {
        "xpack.security.audit.logfile.events.ignore_filters": {
            "single_policy": {
                "users": [
                    "_system",
                    "_xpack_security"
                ],
                "indices": [
                    ".kibana"
                ]
            }
        }
    }
}

This means that only the events ascribed to the _system or the _xpack_security users and also touching the .kibana index will be excluded. This is because, for a policy to match a record, all of its rules must match the respective attributes of that record.
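To make the contrast between the two configurations concrete, here is a small sketch that runs four sample records through both. It rests on the same simplifying assumptions as any toy model of this feature: it is not Elasticsearch code, each attribute holds a single string value, patterns are matched as plain wildcards with Python's fnmatch, and the user alice and the index logs are invented for the example.

```python
from fnmatch import fnmatchcase


def ignored(policies, record):
    """True if any policy has all of its rules matching the record."""
    return any(
        all(
            attr in record
            and any(fnmatchcase(record[attr], p) for p in patterns)
            for attr, patterns in rules.items()
        )
        for rules in policies.values()
    )


# Two separate policies: a record is dropped if either one matches.
two_policies = {
    "exclude_system_users": {"users": ["_system", "_xpack_security"]},
    "exclude_kibana_index": {"indices": [".kibana"]},
}

# One policy with two rules: a record is dropped only if both rules match.
single_policy = {
    "single_policy": {
        "users": ["_system", "_xpack_security"],
        "indices": [".kibana"],
    },
}

records = [
    {"users": "_system", "indices": ".kibana"},
    {"users": "_system", "indices": "logs"},
    {"users": "alice", "indices": ".kibana"},
    {"users": "alice", "indices": "logs"},
]

for record in records:
    print(record, ignored(two_policies, record), ignored(single_policy, record))
```

The two-policy configuration drops the first three records, while the single-policy configuration drops only the first one.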

Improve Your Audits

Audit records of different classes have different attribute sets. If a record does not contain an attribute for which some policy defines rules, the audit record will not match the policy. For a few more corner case examples, check the auditing-log-ignore-policy section in the guide.

Hopefully, these features will help you get more out of the auditing feature. The dynamic nature of these settings should allow for iterative development, as you’ll be able to trim records piece by piece while you inspect the log file audit trail of a cluster running under normal operating circumstances. Finally, remember that ignore policies should be as specific as possible, as sloppy policies and rules could create accountability gaps.