Monitoring cloud-native systems is hard. You’ve got highly distributed apps spanning tens and hundreds of nodes, services and instances. You’ve got additional layers and dimensions—not just bare metal and OS, but also node, pod, namespace, deployment version, Kubernetes’ control plane and more.
To make things more interesting, any typical system these days uses many third-party frameworks, whether open source or cloud services. We didn’t write them but we need to monitor them, nonetheless.
The monitoring challenge comes up often in my discussions with users and customers, as well as in industry surveys. Nearly half (44%) of respondents to the 2020 DevOps Pulse survey say that monitoring/troubleshooting is where they find the most difficulties when running Kubernetes in production.
Observability as a Data Analytics Problem
The way to address the monitoring challenge is with observability. But what is observability in IT systems, anyway? Simply put (and formal definitions aside), observability is the capability to ask and answer questions based on telemetry data. The reason I like this definition is that it makes it clear that observability is essentially a data analytics problem. You bring together telemetry signals of different types and from different sources into one conceptual data lake, and then ask and answer questions to understand your system.
Observability is typically built on three pillars: metrics, logs and traces. Let’s see how each pillar answers a different question about your system: metrics tell us what happened, logs tell us why and traces tell us where.
Metrics help detect the issues and tell what happened: Is the service down? Was the endpoint slow to respond? Metrics are essentially numerical data that is efficient to collect, process, store, aggregate and manipulate. On the other hand, this numerical data doesn’t contain much context. Once the system emits metrics, the backend collects them, aggregates them, stores them in a time-series database and exposes a designated query language for time-series data.
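To make the collect, store and query flow concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory store (TinyTimeSeriesStore) and a made-up metric name and labels; a real backend such as a time-series database does far more, but the shape of the data (a named series of timestamped numbers, identified by labels) is the same.

```python
from collections import defaultdict


class TinyTimeSeriesStore:
    """Hypothetical in-memory stand-in for a time-series database."""

    def __init__(self):
        # A series is identified by the metric name plus its sorted labels.
        self._series = defaultdict(list)

    def record(self, name, labels, timestamp, value):
        key = (name, tuple(sorted(labels.items())))
        self._series[key].append((timestamp, value))

    def average(self, name, labels, start, end):
        """Average of samples with start <= timestamp <= end, or None if empty."""
        key = (name, tuple(sorted(labels.items())))
        values = [v for t, v in self._series[key] if start <= t <= end]
        return sum(values) / len(values) if values else None


store = TinyTimeSeriesStore()
labels = {"service": "checkout", "endpoint": "/pay"}  # assumed example labels

# Four latency samples, in milliseconds, at timestamps 0, 15, 30 and 45.
for ts, latency_ms in [(0, 100.0), (15, 120.0), (30, 500.0), (45, 480.0)]:
    store.record("request_latency_ms", labels, ts, latency_ms)

# "Was the endpoint slow to respond?" becomes a query over a time window.
early = store.average("request_latency_ms", labels, 0, 15)
late = store.average("request_latency_ms", labels, 30, 45)
print(early, late)
```

The jump between the two windows tells us that something happened, but nothing in the numbers themselves explains why.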
Next, logs help diagnose the issues and tell why they happened. Logs are perfect for that job; the developer who writes application code outputs all the relevant context for that code into logs. These logs, however, are textual and verbose and take up lots of storage space. They also require parsing and full-text indexing to support efficient ad-hoc queries on any field in the logs.
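The parsing and indexing step can be sketched in a few lines of Python. The log line layout, the field names and the sample messages below are assumptions made up for illustration, and the index is a toy inverted index rather than what any particular log backend builds.

```python
import re
from collections import defaultdict

# Assumed line layout: "<timestamp> <LEVEL> [<service>] <free-text message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) \[(?P<service>[^\]]+)\] (?P<message>.*)"
)


def parse_line(line):
    """Turn one raw log line into a dict of fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def build_index(records):
    """Map (field, token) pairs to the positions of the records containing them."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        for field in ("level", "service"):
            index[(field, record[field])].add(position)
        # Full-text part: index every lowercased word of the message.
        for word in re.findall(r"\w+", record["message"].lower()):
            index[("message", word)].add(position)
    return index


raw_lines = [
    "2021-10-11T09:00:00Z INFO [checkout] payment accepted for order 1001",
    "2021-10-11T09:00:02Z ERROR [checkout] payment gateway timeout for order 1002",
    "2021-10-11T09:00:03Z ERROR [inventory] stock lookup failed for order 1002",
]
records = [r for r in (parse_line(line) for line in raw_lines) if r is not None]
index = build_index(records)

# Ad-hoc query: error-level lines whose message mentions "timeout".
hits = index[("level", "ERROR")] & index[("message", "timeout")]
matching_messages = [records[i]["message"] for i in sorted(hits)]
print(matching_messages)
```

Even in this toy version the index holds an entry for every word of every message, which hints at why full-text indexing gets expensive at scale.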
Finally, traces help isolate issues and tell where they happened. As a request comes into the system, it flows through a chain of interacting microservices, which we can trace using distributed tracing. Each call in the chain creates and emits a span for that service and operation (think of it as a structured log), which includes context such as start time, duration and parent span. This context is propagated through the call chain. A tracing backend then collects the emitted spans and reconstructs the trace according to causality. It then visualizes the trace, typically in a timeline view similar to a Gantt chart, for further analysis.
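Reconstructing a trace from spans can be sketched as follows. The Span fields mirror the context mentioned above (start time, duration, parent span); the service names, operations and timings are invented, and a real tracing backend also has to cope with clock skew, missing spans and multiple traces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: int
    duration_ms: int


def build_children(spans):
    """Group spans by parent so the causal tree can be walked from the root."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return children


def render_timeline(spans):
    """Return one indented line per span, parents before their children."""
    children = build_children(spans)
    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(
                f"{'  ' * depth}{span.service}:{span.operation} "
                f"[{span.start_ms}ms +{span.duration_ms}ms]"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans arrive at the backend in no particular order.
spans = [
    Span("c", "a", "payments", "charge", 40, 50),
    Span("a", None, "frontend", "GET /checkout", 0, 120),
    Span("b", "a", "cart", "load", 5, 30),
]
timeline = render_timeline(spans)
print("\n".join(timeline))
```

The indentation plays the role of the Gantt-style view: it shows at a glance which downstream call accounts for most of the parent’s duration.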
Role of Open Source: Success and Challenges
Open source is the new norm, with 60% of organizations using open source monitoring tools, according to 451 Research. According to the Cloud Native Computing Foundation (CNCF), the most commonly adopted observability tools are open source, as shown in the End User Technology Radar. In fact, Gartner predicts that by 2025, 70% of new cloud-native application monitoring will use open source instrumentation rather than vendor-specific agents for improved interoperability.
Tool sprawl is a serious challenge
But the wealth of available observability tools creates a consolidation issue. Half of companies are using five or more tools, while a third are using ten or more, according to the CNCF. Tool sprawl is a challenge not just for operating and managing the tools, but also for observability itself; observability is, after all, a data analytics problem, and tools create additional data silos.
Relicensing is changing the OSS landscape
Another new challenge we’re seeing is OSS project relicensing. In the past year alone, we’ve witnessed several relicensing moves for leading OSS projects, whether it was moving to a more restrictive license, a copyleft license (such as GNU AGPL) or even to a non-open source license (non-OSI-compliant, such as SSPL). Typically this happens when a vendor controls the project, not a foundation. It could mean that source code is available, but use and/or modification is restricted or it may mean developers need to open source their own code, in some cases.
This pushes some users to look for alternatives. Among them are other OSS projects whose own licenses are incompatible with the new ones, and even commercial companies, such as Google, which ban the use of AGPL and similar licenses. Google’s open source guidance says of AGPL that “the risks heavily outweigh the benefits.”
Open Source Tools for Logs, Metrics and Traces
The open source landscape for observability is quite dynamic. Many of the OSS projects emerged only in the past couple of years, and many are called OpenSomething, which adds quite a bit of confusion to the mix. Here’s a quick primer on the open source projects, grouped by signal type.
Open Source Software for Metrics
- Prometheus, a CNCF graduated project, is a monitoring system with a dimensional data model, the flexible PromQL query language, an efficient time-series database and a modern alerting approach with Alertmanager.
- OpenMetrics, another CNCF project, offers a format for exposing metrics, which has become a de-facto standard across the industry.
- Grafana, a project by Grafana Labs, offers a powerful analytics and visualization tool that’s exceptionally popular in combination with Prometheus.
Relicensing update: In April 2021, Grafana Labs relicensed the Grafana project from Apache 2.0 to AGPLv3.
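To give a feel for what a text format for exposing metrics looks like, here is a simplified sketch that renders a single counter in the OpenMetrics style. The render_counter helper, the metric family name and the label values are hypothetical, and the sketch skips details a real exposition needs, such as label value escaping, timestamps and other metric types.

```python
def render_counter(family, help_text, samples):
    """Render one counter family as OpenMetrics-style text lines.

    samples is a list of (labels_dict, value) pairs. Label values are
    assumed to need no escaping in this simplified sketch.
    """
    lines = [f"# TYPE {family} counter", f"# HELP {family} {help_text}"]
    for labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        # Counter samples carry the _total suffix on the family name.
        lines.append(f"{family}_total{{{label_text}}} {value}")
    return lines


lines = render_counter(
    "http_requests",
    "Total HTTP requests handled.",
    [
        ({"method": "post", "code": "200"}, 1027),
        ({"method": "post", "code": "500"}, 3),
    ],
)
# An OpenMetrics exposition ends with an explicit end-of-file marker.
exposition = "\n".join(lines + ["# EOF"]) + "\n"
print(exposition, end="")
```

A scraper such as Prometheus pulls text like this over HTTP at a regular interval and turns each line into a sample of a labeled time series.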
Open Source Software for Logs
- ELK Stack, led by Elastic B.V., has been the leading open source choice for a few years. It comprises the Elasticsearch distributed search and data store, the Logstash data collection and processing engine and the Kibana visualization tool.
Relicensing update: In February 2021, Elastic B.V. relicensed the Elasticsearch and Kibana projects from Apache 2.0 to a non-OSS dual license (SSPL and Elastic License).
- OpenSearch is a fork of the Elasticsearch and Kibana OSS projects that aims to keep these popular projects open source. The project is led by AWS, which also contributed Open Distro for Elasticsearch, a set of open source plugins for Elasticsearch.
- Loki, led by Grafana Labs, is a log aggregation system designed for interoperability with Prometheus. Loki doesn’t perform full-text indexing; instead, it indexes only the labels attached to each log stream, in the same style as Prometheus labels.
Relicensing update: In April 2021, Grafana Labs relicensed the Loki project from Apache 2.0 to AGPLv3.
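The difference between label-only indexing and full-text indexing can be illustrated with a toy store. This is not Loki’s implementation: the LabelIndexedLogStore class, its methods and the sample labels are hypothetical, and the sketch only shows the general idea of selecting streams by label first and scanning their lines second.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy store that indexes log streams by label set, not by line content."""

    def __init__(self):
        # One stream (a list of lines) per unique set of labels.
        self._streams = defaultdict(list)

    def push(self, labels, line):
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Pick streams whose labels include the selector, then scan their lines."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:
                # Line content is filtered by brute force; it is never indexed.
                results.extend(line for line in lines if contains in line)
        return results


store = LabelIndexedLogStore()
store.push({"app": "checkout", "env": "prod"}, "payment accepted")
store.push({"app": "checkout", "env": "prod"}, "gateway timeout")
store.push({"app": "inventory", "env": "prod"}, "lookup timeout")

checkout_timeouts = store.query({"app": "checkout"}, contains="timeout")
all_timeouts = store.query({"env": "prod"}, contains="timeout")
print(checkout_timeouts, all_timeouts)
```

The trade-off is a much smaller index in exchange for scanning line content at query time, which works best when labels narrow the search to a few streams.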
Open Source Software for Traces
- Jaeger is a distributed tracing system that was released as open source by Uber Technologies and is now a CNCF graduated project.
- Zipkin is an older Java-based distributed tracing system for collecting and looking up trace data from distributed systems.
- SkyWalking is an open source APM system that includes monitoring, tracing and diagnostic capabilities for distributed systems in cloud-native architectures.
Unified telemetry collection with OpenTelemetry
Having a variety of tools to choose from also creates a challenge in telemetry data collection. Organizations find themselves juggling multiple libraries for logs, metrics and traces, with each vendor having its own APIs, SDKs, agents and collectors.
OpenTelemetry is a novel project under the CNCF that offers a unified set of vendor-agnostic APIs, SDKs and tools for generating and collecting telemetry data, and then exporting it to a variety of analysis tools. The beauty of OpenTelemetry is that it offers an observability framework that works across metrics, traces and logs. You get one API and SDK per programming language for extracting all of your application’s observability data, together with a standard collector, a transmission protocol (OTLP) and more.
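The idea of one vendor-agnostic API feeding many backends can be sketched in plain Python. To be clear, this is not the OpenTelemetry API: the Telemetry, Exporter and InMemoryExporter names are hypothetical and exist only to show the design, in which application code calls a single interface and the choice of backend is a matter of which exporters are plugged in.

```python
import time
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive a telemetry record (hypothetical interface)."""

    def export(self, record: dict) -> None: ...


class InMemoryExporter:
    """Stand-in for a real backend; it just keeps the records it receives."""

    def __init__(self):
        self.records = []

    def export(self, record):
        self.records.append(record)


class Telemetry:
    """Single entry point the application uses for all three signal types."""

    def __init__(self, service, exporters):
        self._service = service
        self._exporters = list(exporters)

    def _emit(self, signal, body):
        record = {"signal": signal, "service": self._service,
                  "time": time.time(), **body}
        for exporter in self._exporters:
            exporter.export(record)

    def counter(self, name, value=1):
        self._emit("metric", {"name": name, "value": value})

    def log(self, message):
        self._emit("log", {"message": message})

    def span(self, operation, duration_ms):
        self._emit("trace", {"operation": operation, "duration_ms": duration_ms})


# Two backends receive the same data without any change to application code.
backend_a, backend_b = InMemoryExporter(), InMemoryExporter()
telemetry = Telemetry("checkout", [backend_a, backend_b])

telemetry.counter("orders_placed")
telemetry.log("order 1001 placed")
telemetry.span("POST /orders", duration_ms=42)

signals = [record["signal"] for record in backend_a.records]
print(signals)
```

Swapping or adding a backend here means changing the exporter list and nothing else, which is the property that makes a shared standard attractive for avoiding vendor lock-in.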
OpenTelemetry (or OTel, as it’s commonly nicknamed) was created under the CNCF through the merger of the OpenCensus and OpenTracing projects and was officially accepted to CNCF incubation in August 2021. More importantly, the project is widely adopted by the major vendors, monitoring tools, cloud providers and many others. As such, it’s well-positioned to become the go-to platform for generating and collecting observability data.
Open source standards such as OpenTelemetry and OpenMetrics are converging in the industry, preventing vendor lock-in and bringing us a step closer to unified observability. I expect we’ll be seeing these projects becoming de-facto standards, as well as seeing additional efforts for unified observability to address the data storage, querying, correlation and other aspects.
The future looks bright for open source-based observability. Join us in the community effort and together we can make it happen.
To hear more about cloud-native topics, join the Cloud Native Computing Foundation and cloud-native community at KubeCon+CloudNativeCon North America 2021 – October 11-15, 2021