OpenTelemetry And Friends

What is OpenTelemetry?

OpenTelemetry (OTel) is an open-source initiative to provide a standardised approach for the capture and distribution of metric, trace and log data from applications. It defines not just APIs and schemas, but also a set of standard vendor-neutral SDKs and monitoring agents to facilitate collection and export.

The documentation itself is very detailed, if a little overwhelming in parts. At the core of delivering applications that use OTel is the OpenTelemetry Collector, a small standalone executable that can be run as a sidecar in a container environment, or as a layer if used with something like AWS Lambda.

x-ray-diagram

The OTel Collector, at a high level, can be divided into three main parts:

  • Receivers: Accept data pushed from applications via endpoints, for example using otlpreceiver, or can be configured in a pull mode to scrape data from an application, for example using prometheusreceiver.
  • Processors: Bundle or modify the metric/trace/log data as it comes through. The common batchprocessor groups data into batches, which compress better and reduce the number of outgoing requests made by exporters.
  • Exporters: Take the metric/trace/log data and export it to a downstream metric/trace/log solution. For example, you might want to have trace data analysed with AWS X-Ray, but have log data go to Splunk. In the diagram above, we have metric data going to AWS CloudWatch, using the awsemf exporter.
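
To make the three parts concrete, here is a minimal sketch in TypeScript that models a collector configuration as a plain object mirroring the layout of the collector's YAML file. The component names otlp, batch, awsxray and awsemf are the ones I would expect in an AWS-flavoured setup. The CollectorConfig type and the findUndeclaredComponents helper are invented purely for illustration and are not part of any OTel library.

```typescript
// Hypothetical types that mirror the top-level layout of a collector config file.
type ComponentMap = Record<string, Record<string, unknown>>;

interface Pipeline {
  receivers: string[];
  processors: string[];
  exporters: string[];
}

interface CollectorConfig {
  receivers: ComponentMap;
  processors: ComponentMap;
  exporters: ComponentMap;
  service: { pipelines: Record<string, Pipeline> };
}

// Illustrative config: OTLP in, batch, then traces to X-Ray and metrics to CloudWatch (EMF).
const collectorConfig: CollectorConfig = {
  receivers: {
    otlp: { protocols: { grpc: {}, http: {} } },
  },
  processors: {
    batch: {},
  },
  exporters: {
    awsxray: {},
    awsemf: {},
  },
  service: {
    pipelines: {
      traces: { receivers: ["otlp"], processors: ["batch"], exporters: ["awsxray"] },
      metrics: { receivers: ["otlp"], processors: ["batch"], exporters: ["awsemf"] },
    },
  },
};

// Lists every pipeline entry that references a component which was never declared.
function findUndeclaredComponents(config: CollectorConfig): string[] {
  const problems: string[] = [];
  const kinds = ["receivers", "processors", "exporters"] as const;
  for (const [name, pipeline] of Object.entries(config.service.pipelines)) {
    for (const kind of kinds) {
      for (const id of pipeline[kind]) {
        if (!Object.prototype.hasOwnProperty.call(config[kind], id)) {
          problems.push(`${name}.${kind}: ${id}`);
        }
      }
    }
  }
  return problems;
}
```

The point of the sketch is the shape: components are declared once at the top level, and each pipeline under service wires a selection of them together by name. A pipeline that names an exporter you forgot to declare is a common mistake, which is what the helper checks for.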

Arise the OTel Distro

The OTel project released their own standard collector, which includes a set of sensible default receivers and exporters. However, a number of vendors have since released expanded distributions of the collector, with additional bundled receivers and exporters to align with their target products. For example, AWS has their AWS Distro for OpenTelemetry, which includes receivers for various AWS container technologies, and exporters for X-Ray and CloudWatch. SumoLogic has a distribution which supports exporting to SumoLogic (for both metrics and tracing). These specific distros are great for getting going quickly within a given ecosystem, but can cause issues if you want to route data to multiple providers.

Other providers, such as New Relic and Honeycomb.io, provide HTTP/gRPC endpoints that can directly receive data from the standard otlpexporter, so you can just use the standard collector.
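
As a rough sketch of what that looks like, the otlp exporter mostly needs an endpoint and whatever authentication header the vendor asks for. The TypeScript below builds that section of the collector config as a plain object. The endpoint otlp.example.com:4317, the header name x-api-key and the buildOtlpExporter helper are all hypothetical placeholders, so check your vendor's documentation for the real values.

```typescript
// Shape of the `exporters.otlp` section, limited to the two fields used here.
interface OtlpExporterConfig {
  endpoint: string;
  headers: Record<string, string>;
}

// Hypothetical helper: builds an otlp exporter section for a vendor endpoint.
function buildOtlpExporter(endpoint: string, headerName: string, apiKey: string): OtlpExporterConfig {
  if (apiKey.trim() === "") {
    throw new Error("an API key is required");
  }
  return { endpoint, headers: { [headerName]: apiKey } };
}

// Placeholder values only; substitute the endpoint and header your vendor documents.
const exporters = {
  otlp: buildOtlpExporter("otlp.example.com:4317", "x-api-key", "REPLACE_ME"),
};
```

Because the exporter is vendor-neutral, switching providers should in principle come down to changing these two values rather than swapping collector distributions.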

Agents, SDKs, Existing Libraries Oh-My

Now that you understand how the collector works, the next step is getting your valuable application data into the collector. This is going to very much depend on the language your application is written in. In Java land, besides a manual SDK, there is support for an agent-based automatic connector, which out of the box has a lot of support for popular Java libraries. If you would rather not use an agent-based approach, the OpenTelemetry project has an alternative mechanism for injecting trace behaviour specifically for SpringBoot.

SpringBoot doesn't yet support OpenTelemetry natively out of the box, but does support a number of metric export mechanisms, including Prometheus. In this case, the collector can be configured with a prometheus receiver, which will scrape metrics from SpringBoot at a defined interval.
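
A sketch of that receiver section follows, again as a TypeScript object that mirrors the collector's YAML. It assumes the application exposes the Actuator Prometheus endpoint at its default path of /actuator/prometheus and listens on localhost:8080 next to the collector. The job name, the 15 second interval and the springBootScrapeConfig helper are illustrative choices of mine, not anything mandated by the collector.

```typescript
// One entry of the Prometheus-style scrape_configs list, limited to the fields used here.
interface ScrapeConfig {
  job_name: string;
  scrape_interval: string;
  metrics_path: string;
  static_configs: { targets: string[] }[];
}

// Hypothetical helper: builds a scrape job for a SpringBoot app exposing Actuator metrics.
function springBootScrapeConfig(jobName: string, target: string, intervalSeconds: number): ScrapeConfig {
  if (!Number.isInteger(intervalSeconds) || intervalSeconds <= 0) {
    throw new Error("intervalSeconds must be a positive whole number");
  }
  return {
    job_name: jobName,
    scrape_interval: `${intervalSeconds}s`,
    // Assumed default Actuator path; adjust if the management base path was changed.
    metrics_path: "/actuator/prometheus",
    static_configs: [{ targets: [target] }],
  };
}

// The `receivers` section of the collector config, with a single scrape job.
const receivers = {
  prometheus: {
    config: {
      scrape_configs: [springBootScrapeConfig("spring-boot-app", "localhost:8080", 15)],
    },
  },
};
```

The scrape interval is the main thing to tune: a shorter interval gives finer-grained metrics at the cost of more load on both the application and the downstream exporter.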

If you're looking into Quarkus, it currently supports only tracing.

How do I use it with AWS?

AWS has some detailed documentation around integrating with their various services, with X-Ray being the target for tracing information. A common deployment pattern is using the collector as an ECS sidecar, which will accept tracing data generated from a primary app, and then disseminate it to X-Ray for evaluation.

The following is a sample CDK pattern, which deploys a Quarkus app (as described in their documentation) into ECS.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import {aws_ec2, aws_ecr_assets, aws_ecs, aws_ecs_patterns, aws_iam} from "aws-cdk-lib";
import {DockerImageAsset, NetworkMode} from "aws-cdk-lib/aws-ecr-assets";
import * as path from "path";
import {ContainerImage} from "aws-cdk-lib/aws-ecs";

export class OtelDemoStack extends cdk.Stack {
    constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        const defaultVPC = aws_ec2.Vpc.fromLookup(this, 'ImportVPC',{isDefault: true});

        const cluster = new aws_ecs.Cluster(this, "OtelCluster", {
            vpc: defaultVPC
        });

        // Build the Quarkus service image from the local Dockerfile and publish it as a CDK asset
        const dockerImageServiceB = new DockerImageAsset(this, 'OtelBuildServiceB', {
            directory: path.join(__dirname, '..', '..', 'service_b'),
            networkMode: NetworkMode.HOST,
            file: 'src/main/docker/Dockerfile.jvm',
            platform: aws_ecr_assets.Platform.LINUX_AMD64,
        });

        const otelPolicy = new aws_iam.PolicyDocument({
            statements: [new aws_iam.PolicyStatement({
                actions: [
                    'logs:PutLogEvents',
                    'logs:CreateLogGroup',
                    'logs:CreateLogStream',
                    'logs:DescribeLogStreams',
                    'logs:DescribeLogGroups',
                    'xray:PutTraceSegments',
                    'xray:PutTelemetryRecords',
                    'xray:GetSamplingRules',
                    'xray:GetSamplingTargets',
                    'xray:GetSamplingStatisticSummaries'
                ],
                resources: ['*'],
            })],
        });

        const taskRole = new aws_iam.Role(this, 'Role', {
            assumedBy: new aws_iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
            inlinePolicies: {
                otelPolicy
            }
        });

        const fargateTaskDefinition = new aws_ecs.FargateTaskDefinition(this, 'TaskDef', {
            memoryLimitMiB: 512,
            cpu: 256,
            taskRole: taskRole
        });
        // Primary application container (the Quarkus service)
        const webcontainer = fargateTaskDefinition.addContainer("WebContainer", {
            image: ContainerImage.fromDockerImageAsset(dockerImageServiceB),
            logging: new aws_ecs.AwsLogDriver({ streamPrefix: 'ServiceB', mode: aws_ecs.AwsLogDriverMode.NON_BLOCKING })
        });
        webcontainer.addPortMappings({
            containerPort: 8080,
        });
        // Collector sidecar, started with the default ECS config bundled in the image
        fargateTaskDefinition.addContainer("Otel", {
            image: aws_ecs.ContainerImage.fromRegistry('amazon/aws-otel-collector:latest'),
            logging: new aws_ecs.AwsLogDriver({ streamPrefix: 'ServiceBOtel', mode: aws_ecs.AwsLogDriverMode.NON_BLOCKING }),
            command: ['--config=/etc/ecs/ecs-default-config.yaml', '--set=service.telemetry.logs.level=DEBUG']
        });

        // Create a load-balanced Fargate service and make it public
        const serviceB = new aws_ecs_patterns.ApplicationLoadBalancedFargateService(this, "SampleOtelService", {
            cluster: cluster,
            cpu: 512,
            desiredCount: 1,
            taskDefinition: fargateTaskDefinition,
            assignPublicIp: true, // this is purely to not require a NAT option.
            memoryLimitMiB: 1024,
            publicLoadBalancer: true,
            healthCheckGracePeriod: cdk.Duration.seconds(10),
        });
        dockerImageServiceB.repository.grantPull(serviceB.taskDefinition.obtainExecutionRole());

    }
}

It's important to note that X-Ray currently requires a specific time-based trace ID format, which Quarkus discusses in its documentation. You effectively need to use the X-Ray IDGenerator, or nothing will appear in the X-Ray console.
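
To show what "time-based" means here, the following TypeScript sketch builds a 32 character hex trace ID whose first 8 characters encode the current Unix time in seconds, with the remaining 24 characters random, and then renders it in the 1-time-random form that X-Ray displays. Both function names are my own, and Math.random is used only to keep the example short. In a real application you would rely on the X-Ray ID generator shipped for your SDK rather than rolling your own.

```typescript
// Illustration only: builds a 32-hex-character trace ID whose first 8 characters
// are the Unix time in seconds and whose remaining 24 characters are random.
function generateXRayCompatibleTraceId(nowMs: number = Date.now()): string {
  const epochSeconds = Math.floor(nowMs / 1000);
  const timePart = epochSeconds.toString(16).padStart(8, "0");
  let randomPart = "";
  for (let i = 0; i < 24; i++) {
    // Math.random is not suitable for production IDs; it keeps the sketch dependency-free.
    randomPart += Math.floor(Math.random() * 16).toString(16);
  }
  return timePart + randomPart;
}

// Renders a 32-hex-character trace ID in the X-Ray form: 1-<time>-<random>.
function toXRayFormat(traceId: string): string {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error("trace ID must be 32 lowercase hex characters");
  }
  return `1-${traceId.slice(0, 8)}-${traceId.slice(8)}`;
}

// Example with a fixed clock: 1465510280 seconds since the epoch is 5759e988 in hex.
const exampleTraceId = generateXRayCompatibleTraceId(1465510280000);
console.log(toXRayFormat(exampleTraceId)); // starts with "1-5759e988-"
```

A fully random trace ID, which is what the default generator produces, would decode to a nonsensical timestamp in those first 8 characters, which is why such traces do not show up in the console.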

The X-Ray console looks like the following. This example demonstrates a request to a sample /hello endpoint, via the deployed load-balancer, reaching the Quarkus application, which then makes an HTTP GET call to a Star Wars test endpoint: https://www.swapi.tech/api/starships/3.

x-ray-console

Why/When Would I Use It?

It's a great way of integrating metric and trace monitoring into new applications, or applications that don't have existing solutions in this space. I've excluded logging for the moment as the specification is still in draft status at the time of writing, but it is certainly an exciting development. This is a pretty fast-moving space, which will mature and solidify over time, and I expect it to become the way of integrating vendor-agnostic tracing, metric and logging functionality into your application moving forward.

Conclusion

If you're starting a new project, or retrofitting metric and tracing support to an existing application, definitely look into OpenTelemetry. It provides a solid, standards-compliant mechanism to surface application telemetry, and works well even if you haven't yet decided on your downstream vendors. Stop waiting and start exporting!

Head of Technology for Ippon Australia. Java/JavaScript full-stack engineer, with a love for anything serverless.