We are excited to announce the general availability of the Azure OpenAI integration, which provides comprehensive observability into the performance and usage of the Azure OpenAI Service! Also check out Part 2 of this blog.
While we have offered visibility into LLM environments for a while now, the addition of our Azure OpenAI integration enables richer out-of-the-box visibility into the performance and usage of your Azure OpenAI-based applications, further enhancing LLM observability.
The Azure OpenAI integration leverages Elastic Agent's Azure integration capabilities to collect both logs (via Azure Event Hubs) and metrics (via Azure Monitor), providing deep visibility into how the Azure OpenAI Service is being used.
The integration includes an out-of-the-box dashboard that summarizes the most relevant aspects of service usage, including request and error rates, token usage, and chat completion latency.
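Before relying on the dashboard, it can be useful to confirm that log and metric documents are actually arriving. Here is a minimal sketch using the Python Elasticsearch client; the deployment endpoint, API key, and the broad `logs-azure*` / `metrics-azure*` data stream patterns are placeholders, and the exact dataset names depend on your integration version.

```python
from elasticsearch import Elasticsearch

# Connect to your deployment (endpoint and API key are placeholders).
es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:443",
    api_key="YOUR_API_KEY",
)

# Count recent documents in the Azure data streams populated by Elastic Agent.
# We match broadly on the Azure log and metric data streams here; narrow the
# pattern to the Azure OpenAI datasets used in your deployment.
for pattern in ("logs-azure*", "metrics-azure*"):
    resp = es.count(
        index=pattern,
        query={"range": {"@timestamp": {"gte": "now-15m"}}},
    )
    print(f"{pattern}: {resp['count']} documents in the last 15 minutes")
```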
Creating Alerts and SLOs to monitor Azure OpenAI
As with every other Elastic integration, all of the log and metric data is fully available across every Elastic Observability capability, including SLOs, alerting, custom dashboards, and in-depth log exploration.
To create an alert that monitors token usage, for example, start with the Custom Threshold rule on the Azure OpenAI data stream and set an aggregation condition that reports a violation whenever token usage exceeds a certain threshold.
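The sketch below illustrates the shape of the aggregation such a rule evaluates: a sum of token usage over a recent window compared against a threshold. It is not the rule itself (which you would create in Kibana or via the alerting API); the threshold value, the `metrics-azure*` index pattern, and the token-usage field name are placeholders you should replace with the field present in your own data stream.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:443",
    api_key="YOUR_API_KEY",
)

TOKEN_THRESHOLD = 100_000  # example threshold; pick one that fits your quota
TOKEN_FIELD = "azure.open_ai.token_transaction.total"  # placeholder: use the
                                                       # token-usage field in your data

# Sum token usage over the last hour -- the same kind of aggregation a
# Custom Threshold rule evaluates against the Azure OpenAI data stream.
resp = es.search(
    index="metrics-azure*",
    size=0,
    query={"range": {"@timestamp": {"gte": "now-1h"}}},
    aggs={"tokens": {"sum": {"field": TOKEN_FIELD}}},
)

total_tokens = resp["aggregations"]["tokens"]["value"] or 0
if total_tokens > TOKEN_THRESHOLD:
    print(f"Token usage {total_tokens:.0f} exceeded threshold {TOKEN_THRESHOLD}")
```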
When a violation occurs, the Alert Details view linked in the alert notification provides rich context, such as when the violation started, its current status, and any history of previous violations, enabling quick triage, investigation, and root cause analysis.
Similarly, to create an SLO that monitors error rates in Azure OpenAI calls, start with the custom query SLI definition, counting as good events any response with a result signature below 400 over a total that includes all responses. Then set an appropriate SLO target, such as 99%, and monitor your Azure OpenAI error rate SLO over a period of 7, 30, or 90 days to track degradation and take action before it becomes a pervasive problem.
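SLOs can also be created programmatically through the Kibana SLO API. The sketch below shows what such a request might look like, assuming a custom KQL SLI; the Kibana URL, API key, `logs-azure*` index pattern, and the result-signature field name are placeholders, and the payload field names may vary between Kibana versions, so treat this as a starting point rather than a definitive recipe.

```python
import requests

KIBANA_URL = "https://my-deployment.kb.us-east-1.aws.found.io"  # placeholder
API_KEY = "YOUR_API_KEY"
STATUS_FIELD = "azure.open_ai.result_signature"  # placeholder: the result-signature
                                                 # field in your Azure OpenAI logs

slo = {
    "name": "Azure OpenAI error rate",
    "description": "99% of Azure OpenAI calls should return a result signature below 400",
    "indicator": {
        "type": "sli.kql.custom",
        "params": {
            "index": "logs-azure*",           # placeholder data stream pattern
            "good": f"{STATUS_FIELD} < 400",  # successful calls count as good events
            "total": "",                      # empty KQL filter matches all calls
            "timestampField": "@timestamp",
        },
    },
    "timeWindow": {"duration": "30d", "type": "rolling"},
    "budgetingMethod": "occurrences",
    "objective": {"target": 0.99},
}

resp = requests.post(
    f"{KIBANA_URL}/api/observability/slos",
    json=slo,
    headers={"kbn-xsrf": "true", "Authorization": f"ApiKey {API_KEY}"},
)
resp.raise_for_status()
print("Created SLO:", resp.json().get("id"))
```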
Please refer to the User Guide to learn more and to get started!