SAP Concur: Elastic Logging as a DevOps Strategy
Editor's Note: With the release of Elastic Stack 7.11, the new alerting framework is now generally available. In addition to existing connectors to third-party platforms like Slack, PagerDuty, and ServiceNow, 7.11 adds Microsoft Teams to the list of built-in alerting integrations. Read more about this update in our alerting release blog.
This post is a recap of a community talk given at Elastic{ON} 2018. Interested in seeing more talks like this? Check out the conference archive or find out when the Elastic{ON} Tour is coming to a city near you.
If you've ever entered an expense report, there's a good chance you've done it through SAP Concur. With over 45 million users spanning more than 150 countries (including 70% of the Fortune 500), Concur is a top travel and expense solution. In 2016 alone, the SaaS offering processed over $87 billion USD in expenses, meaning over 2.4 million receipts and $187 million USD in invoicing every day. That may seem like a lot of line items for accounting, but it creates even more log lines for a logging solution to handle on a daily basis.
Concur has been around for over 20 years, and as their product offerings have grown and evolved, so has their logging solution, not just in the technology it uses but also in the scope and intent of its usage. What began as a SQL-based system for simple log storage is now a logging solution built on the Elastic Stack that helps promote end-to-end application ownership and aligns development, testing, and operations. And in the future, Concur's LAMA (Logging, Alerting, Monitoring, and Analytics) team plans to use Elastic machine learning for operational analytics and insight as well as for automating rollouts and rollbacks. They've taken great leaps in logging, but they didn't get from log storage to analytics overnight.
Originally built on a relational database, their logging solution ingested log data as XML via RabbitMQ, and their users loved that they could easily query for logs using SQL. But as the popularity of the service grew, so did usage. As peak ingest grew to 200 GB/day (with rates upwards of 1,500 docs/sec), the service reached its limits, and performance lags could force users to wait up to 20 minutes for a log to become available in the system. In response, all the logging team could do was move their database to more powerful hardware, which was an unsustainable process. What they needed was horizontal scalability, so they set out to find a better solution.
After researching Elasticsearch and hearing about different success stories from companies in similar situations, Concur chose the Elastic Stack as their logging solution. It was fast, it was powerful, and it was scalable. Possibly more importantly to their internal users, it also had a visualization component that they loved. Previously, different teams would build their own interfaces and dashboards, often incurring licensing fees for the tools they had to use to get the job done. With Kibana, Concur had a unified visualization solution, removing the need for homegrown or third-party visualization tools.
The first implementation of Elastic was with Elasticsearch 1.1 and Kibana 3, with ingest coming from Logstash, RabbitMQ (same as they'd used with the SQL solution), and Fluentd. The logging team was also able to build their own alerting plugin (a benefit of the open source nature of Elastic), as one did not yet exist within the Elastic Stack. Between the increased speed of Elasticsearch, the visualizations of Kibana, and the alerting features of their homegrown alerting plugin, service adoption increased across Concur and ingest skyrocketed to 5,000 docs/sec. That's something their SQL solution couldn't have come close to handling.
Growing from Solution to Strategy with Elastic
Since that initial implementation, Concur's logging solution has grown with the Elastic Stack. In 2015, they upgraded to Elasticsearch 2.3 and Kibana 4.5, purchased a Gold subscription, and began using Beats (as a replacement for Fluentd), Watcher (to replace their homegrown solution), and Shield (for security). They also built another custom plugin, this time a custom aggregation UI. As their logging solution improved, so did adoption, and by 2017, their ingest rate was up to 60,000 docs/sec (4 TB/day).
After attending Elastic{ON} 2017, Concur upgraded again, this time to take advantage of cross-cluster search, improved security (needed to ensure GDPR compliance), and other new Elastic Stack features they'd learned about during the conference. Using cross-cluster search, they were able to break up their monolithic cluster into multiple, smaller clusters spread across multiple regions. This version upgrade, along with their move to a Platinum subscription, has helped them establish the environment they use today, with a variety of ingest sources, Elasticsearch clusters across multiple regions (5 TB/day in the US), and Kibana dashboards used by operations, SREs, support, executive leadership, and more. And all of that is managed by a LAMA team made up of six engineers and two managers.
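To make the cluster split concrete, here is a minimal sketch of how one search can address several regional clusters at once. Elasticsearch's cross-cluster search accepts index names prefixed with a remote cluster alias, in the form alias:index. The aliases (us_west, emea), the logs-* index pattern, and the message field below are hypothetical placeholders, not Concur's actual configuration, and the code only builds the request description rather than sending it.

```python
from typing import Dict, List


def cross_cluster_index_expression(cluster_aliases: List[str], index_pattern: str) -> str:
    """Join '<alias>:<pattern>' entries into one comma-separated index expression."""
    if not cluster_aliases:
        raise ValueError("at least one remote cluster alias is required")
    return ",".join(f"{alias}:{index_pattern}" for alias in cluster_aliases)


def build_search_request(cluster_aliases: List[str], index_pattern: str, text: str) -> Dict:
    """Describe a single _search call that fans out to every listed remote cluster."""
    expression = cross_cluster_index_expression(cluster_aliases, index_pattern)
    return {
        "method": "GET",
        "path": f"/{expression}/_search",
        # 'message' is a hypothetical field name for the log text.
        "body": {"query": {"match": {"message": text}}},
    }


# Hypothetical aliases; real ones are whatever the remote clusters were registered as.
request = build_search_request(["us_west", "emea"], "logs-*", "timeout")
print(request["path"])  # prints /us_west:logs-*,emea:logs-*/_search
```

Because the fan-out lives in the index expression, dashboards and queries can keep a single entry point while the data sits in smaller, regional clusters.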
Learn about how Concur went from log storage to ownership enablement by watching Elastic @ SAP Concur: Driving the Journey to DevOps and End-to-End Ownership from Elastic{ON} 2018. You'll also learn how they enabled one-click logging service deployment, how they configured mappings (non-dynamic) and fields for over 200 teams, and what their plans are for leveraging the power of Elastic machine learning.
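For readers unfamiliar with non-dynamic mappings, the sketch below shows one way such a mapping could look. It assumes the typeless mapping format of Elasticsearch 7.x and the "strict" setting of the dynamic parameter, which rejects documents containing unmapped fields (the alternative, false, ignores them instead). The talk may well describe a different setup, and the field names here are hypothetical. The unmapped_fields helper is a local illustration of which top-level fields a strict mapping would object to, not an Elasticsearch API.

```python
from typing import Dict, List


def strict_index_body(field_types: Dict[str, str]) -> Dict:
    """Build an index-creation body whose mapping disallows unmapped fields."""
    return {
        "mappings": {
            "dynamic": "strict",  # documents with unknown fields are rejected
            "properties": {name: {"type": ftype} for name, ftype in field_types.items()},
        }
    }


def unmapped_fields(index_body: Dict, document: Dict) -> List[str]:
    """List top-level document fields that the mapping does not declare."""
    known = index_body["mappings"]["properties"]
    return sorted(name for name in document if name not in known)


# Hypothetical field set for a team's logs.
body = strict_index_body({"@timestamp": "date", "service": "keyword", "message": "text"})
doc = {"service": "expense", "message": "report saved", "debug_blob": "..."}
print(unmapped_fields(body, doc))  # prints ['debug_blob']
```

Declaring fields up front trades some flexibility for predictable index size and field types, which matters when many teams write to shared clusters.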