Service Foundry
Young Gyu Kim <credemol@gmail.com>

Converting Traces to Metrics Using OpenTelemetry Collector for Grafana Dashboards and Alerting


Overview

This guide outlines how to convert distributed trace data into Prometheus metrics using the spanmetrics connector in the OpenTelemetry Collector. The resulting metrics can be visualized in Grafana dashboards and used to configure real-time alerting.

Key topics include:

  • Configuring the spanmetrics connector in OpenTelemetry Collector

  • Creating Grafana dashboards for trace visualization

  • Defining PromQL-based alert rules for traces

  • Validating alerts using a sample Spring Boot application

Trace-to-Metrics Conversion in OpenTelemetry Collector

OpenTelemetry Collector’s spanmetrics connector converts trace spans into Prometheus-compatible metrics. This enables comprehensive observability by aligning trace data with traditional metric-based monitoring.

Deprecation Notice

The legacy spanmetrics processor is deprecated. Use the spanmetrics connector instead.

Configuration Example

Below is a configuration snippet for enabling the spanmetrics connector in OpenTelemetry Collector (version 0.127.0+):

otel-collector.yaml - spanmetrics connector configuration
spec:
  image: otel/opentelemetry-collector-contrib:latest # 0.127.0 or later
  config:

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"


    connectors:
      spanmetrics:
        histogram:
          explicit:
            #(1)
            buckets: [100ms, 300ms, 500ms, 1s, 2s, 5s, 10s, 30s, 60s, 120s]
        #(2)
        dimensions:
          - name: http.request.method
          - name: http.response.status_code
        exemplars:
          enabled: true
        exclude_dimensions: ['status.code']
        dimensions_cache_size: 1000
        aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
        metrics_flush_interval: 15s
        #(3)
        events:
          enabled: true
          dimensions:
            - name: exception.type
            - name: exception.message
        #(4)
        resource_metrics_key_attributes:
          - service.name
          - telemetry.sdk.language
          - telemetry.sdk.name


    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          #(5)
          exporters: [otlp/jaeger, spanmetrics]

        metrics/spanmetrics:
          #(6)
          receivers: [spanmetrics]
          processors: []
          exporters: [prometheus]
1 Configure the explicit histogram buckets for span durations. These buckets define the latency boundaries used to build the duration histogram.
2 Define additional dimensions to include in the generated metrics. These dimensions can be used to filter and group metrics in Prometheus.
3 Enable metrics for span events and record the exception type and message as dimensions, so exceptions captured as span events show up in the generated metrics.
4 Specify the resource attributes to attach to the generated metrics. These attributes provide metadata about the service and the instrumentation SDK.
5 Add the spanmetrics connector as an exporter of the traces pipeline so that incoming spans are forwarded to the connector (in addition to Jaeger) for metric generation.
6 Configure a separate metrics pipeline that receives the generated metrics from the spanmetrics connector and exports them to Prometheus.

Metrics Exposed by Spanmetrics Connector

The following Prometheus metrics are generated from trace spans:

  • traces_span_metrics_calls_total

  • traces_span_metrics_duration_milliseconds_bucket

  • traces_span_metrics_duration_milliseconds_sum

  • traces_span_metrics_duration_milliseconds_count

These metrics provide insight into span volume, latency, and error distribution.

traces_span_metrics_calls_total

This metric counts the total number of trace spans processed by the spanmetrics connector. It can be used to monitor the volume of trace data being processed.
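
For example, a query along the following lines charts the per-endpoint call rate over the last five minutes (the service_name and span_name labels assume the default spanmetrics dimensions):

sum by (service_name, span_name) (
  rate(traces_span_metrics_calls_total[5m])
)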

traces_span_metrics_duration_milliseconds_bucket

This metric provides a histogram of the duration of trace spans in milliseconds. It allows you to analyze the distribution of span durations and identify performance bottlenecks.
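
For instance, the 95th-percentile span latency per service can be derived from the bucket metric with histogram_quantile; this is a sketch assuming the default service_name dimension:

histogram_quantile(0.95,
  sum by (service_name, le) (
    rate(traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)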

traces_span_metrics_duration_milliseconds_sum

This metric provides the total duration of all trace spans processed by the spanmetrics connector. It can be used to calculate average span durations and monitor overall performance.
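
As an example, dividing the rate of the sum by the rate of the count yields the average span duration in milliseconds per service (again assuming the default dimensions):

sum by (service_name) (rate(traces_span_metrics_duration_milliseconds_sum[5m]))
/
sum by (service_name) (rate(traces_span_metrics_duration_milliseconds_count[5m]))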

traces_span_metrics_duration_milliseconds_count

This metric counts the number of trace spans processed by the spanmetrics connector. It can be used to monitor the volume of trace data and identify trends over time.
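
For longer-term trends, increase() gives the approximate number of spans observed per service over a window, for example one hour:

sum by (service_name) (
  increase(traces_span_metrics_duration_milliseconds_count[1h])
)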

Example PromQL Queries for Dashboards

High Error Rate per Endpoint

This query identifies endpoints with a high error rate, broken down by service and span name. It calculates the rate of spans with non-2xx HTTP response status codes over the last minute.

sum by (service_name, span_name) (
  rate(traces_span_metrics_duration_milliseconds_count{http_response_status_code!~"2.."}[1m])
)
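
If a relative error ratio is more useful than an absolute rate, a variant such as the following (a sketch based on the calls_total metric) divides the non-2xx rate by the total call rate:

sum by (service_name, span_name) (
  rate(traces_span_metrics_calls_total{http_response_status_code!~"2.."}[1m])
)
/
sum by (service_name, span_name) (
  rate(traces_span_metrics_calls_total[1m])
)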

Long Duration Spans (Duration > 10s)

This query identifies spans that took longer than 10 seconds in the last minute. Because the histogram buckets are cumulative, subtracting the rate of spans that completed within 10 seconds (the le="10000.0" bucket) from the overall span rate leaves the rate of spans longer than 10 seconds; a result greater than 0 means such spans were observed.

rate(traces_span_metrics_duration_milliseconds_count[1m])
-
ignoring(le) rate(traces_span_metrics_duration_milliseconds_bucket{le="10000.0"}[1m])

Traces Dashboard

This dashboard visualizes trace metrics from Java applications and includes panels for high error rates and long-duration spans.

traces dashboard
Figure 1. Grafana UI - Java Application Traces Dashboard

Alert Rules Using PromQL

High Error Rate per Endpoint (≥ 3 Errors in 1 Minute)

This query identifies endpoints with a high error rate, defined here as roughly three or more errors in the last minute. It can be used to trigger alerts for endpoints that are experiencing issues.

The threshold of 0.05 comes from the expected number of errors per minute: rate() returns a per-second rate, so 3 errors per minute corresponds to 3/60 = 0.05 errors per second.

sum by (service_name, span_name) (
  rate(traces_span_metrics_duration_milliseconds_count{http_response_status_code!~"2.."}[1m])
) > 0.05

With this query, you can monitor the error rates of your endpoints and take proactive measures to address any issues that arise.

The pending period for this alert rule is set to 0 minutes, meaning that the alert will be triggered immediately when the condition is met.

high error rate alert rule
Figure 2. Grafana UI - High Error Rate Alert Rule

Long Duration Spans (≥ 10s) in the Last Minute

This query identifies spans that have a duration of 10 seconds or more in the last minute. It can be used to monitor performance issues and identify slow spans.

rate(traces_span_metrics_duration_milliseconds_count[1m])
-
ignoring(le) rate(traces_span_metrics_duration_milliseconds_bucket{le="10000.0"}[1m]) > 0

As in the dashboard query above, it subtracts the rate of spans that completed within 10 seconds (the le="10000.0" bucket) from the overall span rate over the last minute; any positive result indicates spans longer than 10 seconds.

long duration spans alert rule
Figure 3. Grafana UI - Long Duration Spans Alert Rule

The pending period for this alert rule is set to 0 minutes, meaning that the alert will be triggered immediately when the condition is met.

Alert Testing with Sample Application

The otel-spring-example application included in the service-foundry-builder project can be used to test your alert configuration.

Accessing the Application

To access the application, forward its service port to your local machine:

$ kubectl port-forward service/otel-spring-example 8080:8080 -n o11y

Simulating High Error Rates

Use ErrorController to trigger controlled errors:

ErrorController.java
@RequestMapping("/error")
@RestController
@RequiredArgsConstructor
@Slf4j
public class ErrorController {
    @GetMapping("/cause/{samplingRate}")
    public Map<String, String> causeError(@PathVariable double samplingRate) {
        // Simulate an error based on the sampling rate
        log.info("cause-error - samplingRate: {}", samplingRate);

        if (Math.random() < samplingRate) {
            log.info("An error occurred for sampling rate: {}", samplingRate);
            throw new ErrorControllerException(samplingRate);
        } else {
            log.info("No error occurred for sampling rate: {}", samplingRate);
            return Map.of("status", "success", "samplingRate", String.valueOf(samplingRate), "message", "No error occurred");
        }

    }
}

Send test requests:

$ for i in {1..100}; do curl -X GET http://localhost:8080/error/cause/0.2; done

This command will send 100 requests to the /error/cause/0.2 endpoint, where approximately 20% of the requests will result in an error. This should trigger the high error rate alert in Grafana.

After running the command, you can check the Grafana dashboard to see if the alert for high error rates has been triggered.
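
You can also confirm that the errors are being counted before the alert fires by running a query such as the following in Grafana Explore or Prometheus; the service_name value is an assumption based on the sample application's name and may differ in your setup:

sum by (span_name) (
  increase(traces_span_metrics_calls_total{service_name="otel-spring-example", http_response_status_code!~"2.."}[5m])
)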

firing high error rate alert
Figure 4. Grafana UI - Firing High Error Rate Alert

A notification is then sent to the configured notification channels, such as email or Slack.

high error rate notification
Figure 5. Email Notification for High Error Rate Alert

Simulating Long Duration Spans

Use SleepController to simulate delays:

SleepController.java
@RestController
@RequestMapping("/sleep")
@Slf4j
public class SleepController {

    // Map both GET and POST to this handler; stacking @GetMapping and @PostMapping
    // on a single method is not supported by Spring MVC, so a single @RequestMapping
    // with both methods is used instead.
    @RequestMapping(value = "/{sleepInSeconds}", method = {RequestMethod.GET, RequestMethod.POST})
    public Map<String, Object> sleep(@PathVariable long sleepInSeconds) {
        log.info("Sleeping for {} seconds", sleepInSeconds);
        try {
            Thread.sleep(sleepInSeconds * 1000);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing the exception
            Thread.currentThread().interrupt();
            log.warn("Sleep interrupted", e);
        }
        return Map.of("status", "success", "message", "Slept for " + sleepInSeconds + " seconds");
    }
}

Send test requests:

$ for i in {1..10}; do curl -X GET "http://localhost:8080/sleep/$(( (RANDOM % 15) + 1))"; done

This command will send 10 requests to the /sleep/{sleepInSeconds} endpoint, where each request will sleep for a random duration between 1 and 15 seconds. This should trigger the long duration spans alert in Grafana.

After running the command, you can check the Grafana dashboard to see if the alert for long duration spans has been triggered.

firing long duration spans
Figure 6. Grafana UI - Firing Long Duration Spans Alert

A notification is then sent to the configured notification channels, such as email or Slack.

long duration spans notification
Figure 7. Email Notification for Long Duration Spans Alert

Conclusion

This document demonstrates how to extend your observability stack by converting traces into metrics using the OpenTelemetry Collector. With the spanmetrics connector, trace spans are transformed into Prometheus metrics, enabling unified visualization and alerting through Grafana. The included PromQL queries and Spring Boot examples allow you to validate your alert rules and proactively monitor application health.