As the newest generally available signal in OpenTelemetry (OTel), logging currently lags behind tracing and metrics in feature scope and maturity. At Elastic, we bring years of experience with logging use cases and the challenges they present, and we are committed to putting that experience to work on OpenTelemetry's logging capabilities.
Over the past few months, we have examined the capabilities of the filelog receiver in the OpenTelemetry Collector, drawing on our expertise as the maintainers of Filebeat to help refine and expand its potential. Our goal is to contribute meaningfully to the evolution of OpenTelemetry's logging features, ensuring they meet the high standards required for robust observability.
Specifically, we focused on verifying that the receiver is well covered in the areas that have been painful for us in the past with Filebeat, such as fail-over handling, self-telemetry, test coverage, documentation and usability. Based on our exploration, we started conversations with the OTel project's maintainers, sharing our thoughts and any suggestions from our experience that could be useful. We have also started opening PRs to add documentation, make enhancements, improve tests, fix bugs, and even implement completely new features.
In this blog post we'll provide a sneak preview of the work that we've done so far in collaboration with the OpenTelemetry community and what's coming next as we continue to explore ways to improve the OpenTelemetry Collector for log collection.
Enhancing the filelog receiver's telemetry
Observability tools are software components like any other and therefore need to be monitored themselves, so that problems can be debugged and relevant settings tuned. In particular, users of the filelog receiver will want to know how it's performing. It's important that the filelog receiver emits sufficient telemetry data for common troubleshooting and optimization use cases. This includes sufficient logging as well as observable metrics that provide insight into the filelog receiver's internal state.
While the filelog receiver already provided a good set of self-telemetry data, we identified some areas for improvement. In particular, we contributed functionality to emit self-telemetry logs on crucial events, such as when log files are discovered, moved or truncated. Another contribution adds observable metrics about the filelog receiver's internal state, such as how many files are open and being harvested. You can find more information on the respective tracking issue.
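To make these kinds of events more concrete, here is a minimal Python sketch of how a file-tailing component might classify them by comparing two snapshots of the files it watches. It is an illustration only, not the filelog receiver's implementation: `FileState`, `classify_events`, and the integer file identity (standing in for something like an inode number) are hypothetical names and assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileState:
    """What the (hypothetical) tailer remembers about one file."""
    path: str
    size: int


def classify_events(previous, current):
    """Compare two snapshots keyed by a stable file identity.

    Returns a list of (event, detail) tuples that could be emitted
    as self-telemetry logs.
    """
    events = []
    for file_id, now in current.items():
        before = previous.get(file_id)
        if before is None:
            # Identity not seen in the previous snapshot: a new file.
            events.append(("discovered", now.path))
            continue
        if before.path != now.path:
            # Same identity under a different name, e.g. after rotation.
            events.append(("moved", f"{before.path} -> {now.path}"))
        if now.size < before.size:
            # The file shrank, so earlier content was cut away.
            events.append(("truncated", now.path))
    return events


previous = {1: FileState("/var/log/app.log", 500)}
current = {
    1: FileState("/var/log/app.log.1", 500),  # rotated away
    2: FileState("/var/log/app.log", 20),     # fresh file under the old name
}
for event, detail in classify_events(previous, current):
    print(event, detail)

# A simple gauge-style metric: how many files are currently tracked.
print("open_files", len(current))
```

Keying the snapshots by identity rather than by path is what lets the sketch tell a rotated file apart from a newly created one that reuses the same name.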
Improving the Kubernetes container logs parsing
The filelog receiver has been able to parse Kubernetes container logs for some time now. However, properly parsing logs from Kubernetes Pods required a fair bit of configuration to deal with different runtime formats and to extract important meta information, such as
You can learn more about the details of the new container logs parser in the corresponding OpenTelemetry blog post.
Evaluating test coverage
Collecting logs from files can run into a variety of unexpected situations, such as restarts, overload and errors. To ensure reliable and consistent log collection, it's important that tests cover these kinds of scenarios. Based on our experience with testing Filebeat, we evaluated the existing filelog receiver tests with respect to those scenarios. While most of the use cases and scenarios were already well tested, we identified a few for which test coverage could be improved. At the time of writing, we were working on contributing additional tests to close the identified coverage gaps. You can learn more about it in this GitHub issue.
Persistence evaluation
Another important aspect of log collection that we often hear about from Elastic's log users is failover handling and the delivery guarantees for logs. Some logging use cases, for example audit logging, have strict delivery guarantee requirements. Hence, it's important that the filelog receiver provides functionality to reliably handle situations such as temporary unavailability of the logging backend or unexpected restarts of the OTel Collector.
Overall, the filelog receiver already has the functionality needed to deal with such situations. However, user documentation on how to set up reliable log collection, with tangible examples, was an area with potential for improvement.
In this regard, beyond verifying the persistence and offset tracking capabilities, we worked on improving the respective documentation (1, 2) and are also collaborating on a community-reported issue to ensure delivery guarantees for logs.
Helping users help themselves
Elastic has a long and varied history of supporting customers who use our products for log ingestion. Drawing from this experience, we've proposed a couple of documentation improvements to the OpenTelemetry Collector to help logging users get out of some tricky situations.
Documenting the structure of the tracking file
For every log file the filelog receiver ingests, it needs to track how far into the file it has already read, so it knows where to start reading from when new contents are added to the file. By default, the filelog receiver doesn't persist this tracking information to disk, but it can be configured to do so. We felt it would be useful to document the structure of this tracking file. When ingestion stops unexpectedly, peeking into this tracking file can often provide clues as to where the problem may lie.
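The idea of offset tracking can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the filelog receiver's actual logic or on-disk format: the function name `read_new_lines`, the JSON state file, and keying offsets by file path are all hypothetical choices made for this example.

```python
import json
import os


def read_new_lines(log_path, state_path):
    """Return complete lines appended to log_path since the last call.

    The read offset is persisted in a JSON file at state_path
    (a hypothetical format chosen for this sketch).
    """
    # Load previously persisted offsets; start empty if none exist yet.
    offsets = {}
    if os.path.exists(state_path):
        with open(state_path, "r", encoding="utf-8") as f:
            offsets = json.load(f)

    offset = offsets.get(log_path, 0)
    # If the file is now smaller than the stored offset, assume it was
    # truncated and start reading from the beginning again.
    if os.path.getsize(log_path) < offset:
        offset = 0

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only complete lines; a partial trailing line is left in
    # place and picked up once its newline has been written.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()

    # Persist the new offset via a temporary file and an atomic rename,
    # so a crash cannot leave a half-written state file behind.
    offsets[log_path] = offset + end
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(offsets, f)
    os.replace(tmp_path, state_path)
    return lines
```

Because the offset lives on disk, a restarted process calling `read_new_lines` with the same state file resumes where the previous one stopped instead of re-reading the whole log. Inspecting such a state file when ingestion stalls shows at a glance which files are tracked and how far each has been read.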
Challenges with symlink target changes
The filelog receiver periodically refreshes its memory of the files it's supposed to be ingesting. The interval at which these refreshes happen is controlled by the
Planning ahead for the receiver's GA
Last but not least, we have raised the topic of making the filelog receiver a generally available (GA) component. Users need to be able to rely on the stability of the functionality they use, without having to deal with the risk of breaking changes in minor version updates. To this end, we have kicked off an initial plan with the maintainers of the filelog receiver to mark any issue that is a blocker for stability with a
Conclusion
Overall, OTel's filelog receiver component is in good shape and provides important functionality for most log collection use cases. Where there are still minor gaps or room for improvement in the filelog receiver, we are glad to contribute our expertise and experience from Filebeat use cases. The above is just the beginning of our effort to help the OpenTelemetry Collector, and log collection in particular, get closer to a stable version. Moreover, we are happy to help the filelog receiver maintainers with general maintenance of the component, such as handling community issues and PRs and jointly working on the component's roadmap.
We'd like to thank the OTel Collector group and, in particular, Daniel Jaglowski for the great and constructive collaboration on the filelog receiver so far!
Stay tuned to learn more about our future contributions and involvement in OpenTelemetry.