
Introducing OpenTelemetry collector self-monitoring dashboards

We're excited to release OpenTelemetry Collector Self-Monitoring, which gives you the power to track the health and performance of your OTel collectors directly in Dynatrace. By leveraging internal telemetry, you get immediate access to ready-made dashboards that visualize your entire collector fleet or let you drill down into specific collectors for detailed performance analysis and capacity planning.

In this blog post, we’ll walk you through how to observe your collectors using Dynatrace Dashboards and highlight the valuable insights you can gain from their internal telemetry. Next, we’ll show you how to set up Davis® Anomaly Detection to continuously track your collectors. You’ll learn how to configure alerts that notify you of any deviations from healthy behavior, ensuring proactive issue detection and faster response times.

Activate collector telemetry and access the dashboards

To try out the OpenTelemetry collector self-monitoring dashboards, first activate the collector’s internal telemetry. For details, follow the configuration example in our documentation.

Once your collector begins exporting its internal telemetry, go to Dynatrace Hub and install OpenTelemetry Dashboards.

Figure 1. Install OpenTelemetry Dashboards from Dynatrace Hub

Once the app is installed, the following dashboards are available in the Ready-made dashboards section:

  • OTel Collector self-monitoring (all collectors): Shows an overview of all detected OpenTelemetry Collector instances.
  • OTel Collector self-monitoring (single collector): Allows you to analyze a specific collector instance.

Collector insights in the dashboards

The Dynatrace Collector dashboards deliver comprehensive insights into the operational status and efficiency of your collector instances. Here are the key features:

  • Collector uptime and CPU time since start: Monitor how long the process has been running and the CPU time consumed.
  • Collector memory and heap usage: Track memory consumption and heap usage to ensure optimal performance.
  • Receivers: View signals accepted and refused, categorized by data type.
  • Processors: Monitor incoming and outgoing signals to understand data flow.
  • Exporters: Track signals sent, failed to enqueue, and failed to send, categorized by data type. Additionally, monitor queue size and capacity.
  • HTTP/gRPC requests and responses: Count, duration, and size of requests and responses to analyze communication efficiency.
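
All of these tiles are built from the collector's internal metrics, so you can also query them directly in a Notebook. As a small sketch, the following query compares the spans an exporter sent successfully with those that failed to be enqueued or sent, per collector instance (the otelcol_exporter_* metric names follow the collector's default internal telemetry naming and may vary slightly between collector versions):

// Exporter health per collector instance: sent vs. failed spans
timeseries {
    sent           = sum(otelcol_exporter_sent_spans),
    enqueue_failed = sum(otelcol_exporter_enqueue_failed_spans),
    send_failed    = sum(otelcol_exporter_send_failed_spans)
  }, by:{service.instance.id}

If the failed series are consistently above zero, the corresponding exporter deserves a closer look in the single-collector dashboard.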

Figure 2. OpenTelemetry Collector dashboard for multiple collectors

The dashboards rely on the internal telemetry exported by any standard OpenTelemetry Collector and have been verified with the Dynatrace OTel Collector and OTel Collector Contrib. If your Collector distribution reports a service.name value that isn't in the dashboard's drop-down list (top-left corner), you can extend the list with your custom value and then save a copy of the dashboard.
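
If you're unsure which service.name your distribution reports, a quick query over one of the internal metrics reveals which values are currently reporting. A minimal sketch, assuming your collector exports otelcol_process_uptime (one of its default internal metrics):

// Which service.name values are reporting collector self-monitoring metrics?
timeseries { uptime = max(otelcol_process_uptime) }, by:{service.name}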

The value of internal telemetry

One of the most valuable aspects of collector self-monitoring is its ability to prevent data loss by helping you scale your infrastructure at the right time. When telemetry data is dropped due to resource constraints—for example, memory exhaustion or queue overflows—it can lead to blind spots in your observability pipeline, making it harder to troubleshoot issues or ensure compliance.

The Single Collector Dashboard plays a crucial role in avoiding this. It surfaces key performance indicators that highlight when your collectors are under pressure, enabling you to take proactive action before problems escalate. This ensures your system remains resilient, responsive, and capable of handling increasing workloads without sacrificing data integrity.

Let’s take a closer look at some of these indicators.

The Telemetry data passing through the collector section shows the number of requests that are accepted, sent, and refused for each signal. As a practical example, an increase in refused spans often indicates that the collector has exceeded its memory limit, and you should scale your collector instances to handle the increased load. Ignoring this can lead to continuous data loss at the collector’s ingestion stage.
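
To inspect this ratio for a specific time range, for example in a Notebook, a query along these lines does the job (a sketch; otelcol_receiver_accepted_spans follows the same naming convention as the refused-spans metric shown in the dashboard):

// Accepted vs. refused spans at the receivers, per collector instance
timeseries {
    accepted = sum(otelcol_receiver_accepted_spans),
    refused  = sum(otelcol_receiver_refused_spans)
  }, by:{service.instance.id}

A rising refused series is the signal described above and usually means it's time to scale out.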

The Memory and CPU time section displays your collector’s resource consumption, enabling effective capacity planning for future workloads and proper system scaling. These metrics help optimize infrastructure costs by preventing over-provisioning and revealing performance anomalies, like sudden spikes or drops, that prove crucial during troubleshooting and root cause analysis.
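
As a sketch, the underlying resource metrics can be combined into a single query for capacity planning. The otelcol_process_cpu_seconds metric name is an assumption based on the collector's default internal telemetry; adjust it if your distribution reports CPU time differently:

// Resource consumption per collector instance
timeseries {
    memory_rss  = avg(otelcol_process_memory_rss),
    cpu_seconds = max(otelcol_process_cpu_seconds)
  }, by:{service.instance.id}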

The Queue size metrics section shows the exporter’s current queue size and lets you compare it against the queue’s maximum capacity. These metrics are vital for understanding the efficiency of data export processes. A steadily growing queue indicates that there are not enough workers available to send the data or that the backend receiving the data is too slow. This is another critical signal that scaling is necessary to maintain optimal performance.
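
Here's a sketch of how that comparison looks as a query, assuming the default otelcol_exporter_queue_size and otelcol_exporter_queue_capacity metrics are exported:

// Exporter queue fill level vs. configured capacity
timeseries {
    queue_size     = avg(otelcol_exporter_queue_size),
    queue_capacity = avg(otelcol_exporter_queue_capacity)
  }, by:{service.instance.id}

When queue_size trends toward queue_capacity, the exporter is falling behind and newly arriving data will eventually fail to enqueue.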

Configure alerts using Davis anomaly detection

While dashboards offer a great overview of your collectors, they still rely on manual oversight. What if you could automate this process and get notified the moment something unusual happens? That’s where the Davis Anomaly Detector comes in. With just a few simple steps, you can set up intelligent alerts that automatically notify you when collector behavior deviates from expected patterns, helping you catch issues early and respond proactively.

Step 1. Create a Davis Anomaly Detector

Start by creating a new Davis Anomaly Detector. In this example, we’ll monitor refused spans, which indicate that a Collector is overwhelmed and unable to ingest incoming telemetry.

Use the following metric expression to define your scope:


timeseries {sum(otelcol_receiver_refused_spans)}, by:{service.instance.id}

This configuration tracks refused spans per individual collector, identified by their unique service.instance.id.
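
If you want to scope the detector to one particular collector instead, you can narrow the same expression with a pipeline filter; the instance ID below is a placeholder:

// Refused spans for a single collector instance (placeholder ID)
timeseries {sum(otelcol_receiver_refused_spans)}, by:{service.instance.id}
| filter service.instance.id == "<your-collector-instance-id>"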

Step 2. Define alert condition

Davis offers three powerful anomaly detection methods, each suited to different use cases:

Figure 3. Create a new Davis anomaly detector

For our example, let’s use the Static Threshold method. Set the alert to trigger when refused spans exceed a certain limit, say, 100 per minute, so you’re alerted soon after refusals start. This helps you catch overload situations before they lead to significant data loss.
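
To sanity-check a candidate threshold before saving the detector, you can chart the metric at one-minute resolution, for example in a Notebook. A sketch using the interval parameter of the timeseries command:

// Refused spans per minute, to judge whether 100/min fits your traffic
timeseries {sum(otelcol_receiver_refused_spans)}, by:{service.instance.id}, interval:1m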

Use the Preview feature to visualize how the detector would behave with your current settings.

Step 3. Add alert details

Now, give your alert a clear and descriptive title, such as Refused Spans. Then, define the event message that will be sent when the alert triggers, for example:

Too many spans refused. Collector overloaded.

In the Advanced section, you can add more context or instructions for your team, making the alert even more actionable.

Once everything looks good, select Save. Your anomaly detector is now ready to go!

Configure alerts with auto-adaptive thresholds

In real-world environments, not all collectors behave the same. For example, memory usage can vary significantly between instances depending on their workload. A static threshold might catch extreme cases (maximum usage across all collectors), but it won’t detect sudden spikes in memory usage on a single collector, which can be an early sign of performance degradation.

This is where the Auto-Adaptive Analyzer becomes invaluable. Instead of applying a one-size-fits-all threshold, it dynamically creates a baseline for each individual collector, using historical data from the past seven days. These baselines are updated daily, allowing the system to adapt to evolving usage patterns and detect anomalies more precisely.

To set this up:

  1. Create a new Davis anomaly detector with the following scope:
    
    timeseries {avg(otelcol_process_memory_rss)}, by:{service.instance.id}
    

    This configuration monitors the total physical memory (resident set size) per Collector instance.

  2. Select the auto-adaptive threshold. This will automatically generate a tailored threshold for each collector.
  3. Select Preview to visualize how the adaptive thresholds behave across all your collectors.
Figure 4. Auto-adaptive thresholds for multiple collectors

Monitor alerts and take action

Once your anomaly detectors are in place, head over to the Problems app in Dynatrace. Here, you can:

  • Track triggered alerts in real time.
  • Configure notifications to ensure the right teams are informed immediately—via email, Slack, or your preferred incident management tool.
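
If you prefer Notebooks or automations, the triggered problems can also be queried with DQL. A sketch, assuming the dt.davis.problems data object and the alert title defined earlier:

// Problems raised by the "Refused Spans" anomaly detector
fetch dt.davis.problems
| filter contains(event.name, "Refused Spans")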

This proactive alerting system helps you stay ahead of issues, reducing downtime and improving system reliability.

Try them out

Ready to explore the full potential of OpenTelemetry collector self-monitoring?

  • Enable internal telemetry export from your collectors to Dynatrace.
  • Install OpenTelemetry Dashboards and experiment with different metrics and anomaly detection strategies.
  • Consult our documentation for step-by-step guidance, example configurations, and best practices.

We’d love to hear your feedback as you explore these features!

For more insights into OpenTelemetry Collector scaling and resiliency, check out our scaling guide.