Google BigQuery output plugin

  • Plugin version: v4.1.1
  • Released on: 2018-10-25
  • Changelog

Installation

If this plugin is not bundled with your Logstash distribution, install it by running bin/logstash-plugin install logstash-output-google_bigquery. See Working with plugins for more details.

Getting Help

For questions about the plugin, open a topic in the Discuss forums. For bugs or feature requests, open an issue in GitHub. For the list of Elastic supported plugins, consult the Elastic Support Matrix.

Description

Summary

This Logstash plugin uploads events to Google BigQuery using the streaming API, so data becomes available to query almost immediately.

You can configure it to flush periodically, after N events or after a certain amount of data is ingested.

Environment Configuration

You must enable BigQuery on your Google Cloud account and create a dataset to hold the tables this plugin generates.

You must also grant the service account this plugin uses access to the dataset.

You can use Logstash conditionals and multiple configuration blocks to upload events with different structures.
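
As a rough sketch of that approach, the following output section routes two kinds of events to separate tables, each with its own schema. The [type] values, table prefixes, project ID, and schemas are hypothetical placeholders, not values defined by the plugin.

```logstash
output {
  # Hypothetical routing: assumes events carry a "type" field set upstream.
  if [type] == "access" {
    google_bigquery {
      project_id => "my-project-id"            # placeholder
      dataset => "logs"
      table_prefix => "access"                 # hypothetical prefix
      csv_schema => "path:STRING,status:INTEGER,score:FLOAT"
      error_directory => "/tmp/bigquery-errors/access"
    }
  } else if [type] == "audit" {
    google_bigquery {
      project_id => "my-project-id"            # placeholder
      dataset => "logs"
      table_prefix => "audit"                  # hypothetical prefix
      csv_schema => "user:STRING,action:STRING"
      error_directory => "/tmp/bigquery-errors/audit"
    }
  }
}
```

Events that match neither condition are not sent to BigQuery in this sketch; add an else branch if they need a destination.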

Usage

This is an example of Logstash config:

output {
   google_bigquery {
     project_id => "folkloric-guru-278"                        # required
     dataset => "logs"                                         # required
     csv_schema => "path:STRING,status:INTEGER,score:FLOAT"    # required (or use json_schema)
     json_key_file => "/path/to/key.json"                      # optional
     error_directory => "/tmp/bigquery-errors"                 # required
     date_pattern => "%Y-%m-%dT%H:00"                          # optional
     flush_interval_secs => 30                                 # optional
   }
}

Specify either a csv_schema or a json_schema.

If json_key_file is not set, the plugin tries to find Application Default Credentials.

Considerations

  • There is a small fee to insert data into BigQuery using the streaming API.
  • This plugin buffers events in memory, so make sure the flush configurations are appropriate for your use case and consider using Logstash Persistent Queues.
  • Events are flushed when batch_size, batch_size_bytes, or flush_interval_secs is met, whichever comes first. If you notice a delay in your processing or low throughput, try adjusting those settings.

Google BigQuery Output Configuration Options

This plugin supports the following configuration options plus the Common Options described later, which are supported by all output plugins.

batch_size

Added in 4.0.0.

  • Value type is number
  • Default value is 128

The maximum number of messages to upload in a single request. This number must be less than 10,000. Batching can increase performance and throughput to a point, but at the cost of per-request latency. With too few rows per request, the overhead of each request can make ingestion inefficient. With too many rows per request, throughput may drop. BigQuery recommends using about 500 rows per request, but experimentation with representative data (schema and data sizes) will help you determine the ideal batch size.

batch_size_bytes

Added in 4.0.0.

  • Value type is number
  • Default value is 1_000_000

An approximate number of bytes to upload as part of a batch. This number should be less than 10 MB or inserts may fail.
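
To illustrate how the batching settings fit together, here is a sketch that raises batch_size to the roughly 500 rows mentioned above while leaving the byte limit and flush interval at their documented defaults. The project ID, dataset, and schema are placeholders, and the numbers are a starting point for experimentation rather than a recommendation.

```logstash
output {
  google_bigquery {
    project_id => "my-project-id"     # placeholder
    dataset => "logs"
    csv_schema => "path:STRING,status:INTEGER,score:FLOAT"
    error_directory => "/tmp/bigquery-errors"
    batch_size => 500                 # flush after 500 events ...
    batch_size_bytes => 1000000       # ... or after roughly 1 MB of data ...
    flush_interval_secs => 5          # ... or after 5 seconds, whichever comes first
  }
}
```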

csv_schema

  • Value type is string
  • Default value is nil

Schema for log data. It must follow the format name1:type1(,name2:type2)*. For example, path:STRING,status:INTEGER,score:FLOAT.

dataset

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

The BigQuery dataset to which the tables for the events are added.

date_pattern

  • Value type is string
  • Default value is "%Y-%m-%dT%H:00"

Time pattern for the BigQuery table name; defaults to hourly tables. Must use Time.strftime patterns: www.ruby-doc.org/core-2.0/Time.html#method-i-strftime

deleter_interval_secs

Deprecated in 4.0.0.

Events are uploaded in real-time without being stored to disk.

error_directory

Added in 4.0.0.

  • This is a required setting.
  • Value type is string
  • Default value is "/tmp/bigquery".

The location to store events that could not be uploaded due to errors. By default, if any message in an insert is invalid, the whole insert fails. You can use skip_invalid_rows to allow partial inserts.

Consider using an additional Logstash input to pipe the contents of these files to an alert platform so you can manually fix the events.

Alternatively, use GCS FUSE to transparently upload the files to a GCS bucket.

File names follow the pattern [table name]-[UNIX timestamp].log
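
One possible way to wire this up is sketched below: a file input reads the error directory and tags what it finds, and a conditional keeps those events out of the BigQuery output. The tag name and the stdout output (a stand-in for a real alerting output) are assumptions for illustration. The format of the error files is not described here, so each line is read as plain text.

```logstash
input {
  file {
    path => "/tmp/bigquery-errors/*.log"   # matches [table name]-[UNIX timestamp].log
    start_position => "beginning"
    tags => ["bigquery_error"]             # hypothetical tag
  }
}

output {
  if "bigquery_error" in [tags] {
    # Stand-in for an alerting output of your choice.
    stdout { codec => rubydebug }
  } else {
    google_bigquery {
      project_id => "my-project-id"        # placeholder
      dataset => "logs"
      csv_schema => "path:STRING,status:INTEGER,score:FLOAT"
      error_directory => "/tmp/bigquery-errors"
      skip_invalid_rows => true            # insert the valid rows even if some rows are invalid
    }
  }
}
```

The conditional matters: without it, events read back from the error directory would be sent to BigQuery again.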

flush_interval_secs

  • Value type is number
  • Default value is 5

Uploads all data this often even if other upload criteria aren’t met.

ignore_unknown_values

  • Value type is boolean
  • Default value is false

Indicates if BigQuery should ignore values that are not represented in the table schema. If true, the extra values are discarded. If false, BigQuery will reject the records with extra fields and the job will fail. The default value is false.

You may want to add a Logstash filter like the following to remove common fields that Logstash adds:

filter {
  # Remove common fields before the event is sent to BigQuery.
  mutate {
    remove_field => ["@version","@timestamp","path","host","type", "message"]
  }
}

json_key_file

Added in 4.0.0.

Replaces key_password

  • Value type is string
  • Default value is nil

If Logstash is running within Google Compute Engine, the plugin can use GCE’s Application Default Credentials. Outside of GCE, you will need to specify a Service Account JSON key file.
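
The two cases might look like the sketch below, which differs only in whether json_key_file is present. The project ID, dataset, schema, and key path are placeholders.

```logstash
output {
  google_bigquery {
    project_id => "my-project-id"     # placeholder
    dataset => "logs"
    csv_schema => "path:STRING,status:INTEGER,score:FLOAT"
    error_directory => "/tmp/bigquery-errors"
    # Outside GCE: path to a Service Account JSON key file (placeholder path).
    # On GCE, this line can be left out so that the plugin uses
    # Application Default Credentials instead.
    json_key_file => "/path/to/key.json"
  }
}
```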

json_schema

  • Value type is hash
  • Default value is nil

Schema for log data as a hash. These can include nested records, descriptions, and modes.

Example:

json_schema => {
  fields => [{
    name => "endpoint"
    type => "STRING"
    description => "Request route"
  }, {
    name => "status"
    type => "INTEGER"
    mode => "NULLABLE"
  }, {
    name => "params"
    type => "RECORD"
    mode => "REPEATED"
    fields => [{
      name => "key"
      type => "STRING"
    }, {
      name => "value"
      type => "STRING"
    }]
  }]
}

key_password

Deprecated in 4.0.0.

Replaced by json_key_file or by using ADC. See json_key_file.

key_path

Obsolete: The PKCS12 key file format is no longer supported.

Please use one of the following mechanisms:

  • Application Default Credentials (ADC), configured via environment variables on Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.
  • A JSON authentication key file. You can generate one in the console for the service account, as you did with the .P12 file, or with the following command, replacing SERVICE_ACCOUNT_EMAIL with the email address of your service account: gcloud iam service-accounts keys create key.json --iam-account SERVICE_ACCOUNT_EMAIL

project_id

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

Google Cloud Project ID (number, not Project Name!).

service_account

Deprecated in 4.0.0.

Replaced by json_key_file or by using ADC. See json_key_file.

skip_invalid_rows

Added in 4.1.0.

  • Value type is boolean
  • Default value is false

Insert all valid rows of a request, even if invalid rows exist. The default value is false, which causes the entire request to fail if any invalid rows exist.

table_prefix

  • Value type is string
  • Default value is "logstash"

BigQuery table ID prefix to be used when creating new tables for log data. The table name will be <table_prefix><table_separator><date>.

table_separator

  • Value type is string
  • Default value is "_"

BigQuery table separator to be added between the table_prefix and the date suffix.
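
As an illustration of how table_prefix, table_separator, and date_pattern combine, the sketch below switches from hourly to daily tables. The prefix, project ID, and schema are placeholders. Following the naming rule <table_prefix><table_separator><date>, a batch uploaded on 25 October 2018 would be expected to land in a table named weblogs_20181025.

```logstash
output {
  google_bigquery {
    project_id => "my-project-id"     # placeholder
    dataset => "logs"
    csv_schema => "path:STRING,status:INTEGER,score:FLOAT"
    error_directory => "/tmp/bigquery-errors"
    table_prefix => "weblogs"         # hypothetical prefix
    table_separator => "_"            # the documented default, shown for clarity
    date_pattern => "%Y%m%d"          # one table per day instead of per hour
  }
}
```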

temp_directory

Deprecated in 4.0.0.

Events are uploaded in real-time without being stored to disk.

temp_file_prefix

Deprecated in 4.0.0.

Events are uploaded in real-time without being stored to disk.

uploader_interval_secs

Deprecated in 4.0.0.

This field is no longer used.

  • Value type is number
  • Default value is 60

Uploader interval when uploading new files to BigQuery. Adjust time based on your time pattern (for example, for hourly files, this interval can be around one hour).

Common Options

The following configuration options are supported by all output plugins:

Setting          Input type   Required
codec            codec        No
enable_metric    boolean      No
id               string       No
workers          number       No

codec

  • Value type is codec
  • Default value is "plain"

The codec used for output data. Output codecs are a convenient method for encoding your data before it leaves the output, without needing a separate filter in your Logstash pipeline.

enable_metric

  • Value type is boolean
  • Default value is true

Disable or enable metric logging for this specific plugin instance. By default, we record all the metrics we can, but you can disable metrics collection for a specific plugin.

id

  • Value type is string
  • There is no default value for this setting.

Add a unique ID to the plugin configuration. If no ID is specified, Logstash will generate one. It is strongly recommended to set this ID in your configuration. This is particularly useful when you have two or more plugins of the same type, for example, if you have two google_bigquery outputs. Adding a named ID in this case will help in monitoring Logstash when using the monitoring APIs.

output {
  google_bigquery {
    id => "my_plugin_id"
  }
}

workers

  • Value type is number
  • Default value is 1