Travis Van
Contributing Writer

Grafana: Shining a light into Kubernetes clusters

Grafana creator Torkel Ödegaard traces the open-source project’s journey to help developers visualize what’s going on inside distributed cloud-native infrastructure.

Back in 2014, when the wave of containers, Kubernetes, and distributed computing was breaking over the technology industry, Torkel Ödegaard was working as a platform engineer at eBay Sweden. Like other devops pioneers, Ödegaard was grappling with the new form factor of microservices and containers and struggling to climb the steep Kubernetes operations and troubleshooting learning curve.

As an engineer striving to make continuous delivery both safe and easy for developers, Ödegaard needed a way to visualize the production state of the Kubernetes system and the behavior of users. Unfortunately, there was no specific playbook for how to extract, aggregate, and visualize the telemetry data from these systems. Ödegaard’s search eventually led him to a nascent monitoring tool called Graphite, and to another tool called Kibana that simplified the experience of creating visualizations.

“With Graphite you could with very little effort send metrics from your application detailing its internal behaviors, and for me, that was so empowering as a developer to actually see real-time insight into what the applications and services were doing and behaving, and what the impact of a code change or new deployment was,” Ödegaard told InfoWorld. “That was so visually exciting and rewarding and made us feel so much more confident about how things were behaving.”
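To give a sense of how little effort that takes, here is a minimal sketch of sending one data point using Graphite's plaintext protocol, in which each metric is a single newline-terminated line of the form "path value timestamp". The function names, the metric path, and the host are hypothetical, chosen only for illustration; port 2003 is the usual default for Carbon's plaintext listener, but a given installation may differ.

```python
import socket
import time


def format_metric(path: str, value: float, timestamp: int) -> bytes:
    """Build one plaintext-protocol line: path, value, and Unix timestamp, newline-terminated."""
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric to a Carbon listener.

    The host and port defaults are assumptions, not values from the article.
    """
    if timestamp is None:
        timestamp = int(time.time())
    line = format_metric(path, value, timestamp)
    # Open a short-lived TCP connection and write the one-line payload.
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line)
    return line
```

An application might call send_metric("checkout.deploy.latency_ms", 42) after handling a request, which is roughly the kind of low-friction instrumentation Ödegaard describes.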

What prompted Ödegaard to start his own side project was that, despite the power of Graphite, it was very difficult to use. It required learning a complicated query language and relied on clunky processes for building out frameworks. But Ödegaard realized that, if you could combine the monitoring power of Graphite with the ease of Kibana, you could make visualizations for distributed systems much more accessible and useful for developers.

And that’s how the vision for Grafana was born. Today Grafana and other observability tools fill not a niche in the monitoring landscape but a gaping chasm that traditional network and systems monitoring tools never anticipated.

A cloud operating system

Recent decades have seen two major jumps in infrastructure evolution. First, we went from beefy “scale-up” servers to “scale-out” fleets of commodity Linux servers running in data centers. Then we made another leap to even higher levels of abstraction, approaching our infrastructure as an aggregation of cloud resources that are accessed through APIs.

Throughout this distributed systems evolution driven by aggregations, abstractions, and automation, the “operating system” analogy has been repeatedly invoked. Sun Microsystems had the slogan, “The network is the computer.” UC Berkeley AMPLab’s Matei Zaharia, creator of Apache Spark, co-creator of Apache Mesos, and now CTO and co-founder at Databricks, said “the data center needs an operating system.” And today, Kubernetes is increasingly referred to as a “cloud operating system.”

Calling Kubernetes an operating system draws quibbles from some, who are quick to point out the differences between Kubernetes and actual operating systems.

But the analogy is reasonable. You do not need to tell your laptop which core to fire up when you launch an application. You do not need to tell your server which resources to use every time an API request is made. Those processes are automated through operating system primitives. Similarly, Kubernetes (and the ecosystem of cloud-native infrastructure software in its orbit) provides OS-like abstractions that make distributed systems possible by masking low-level operations from the user.

The flip side to all this wonderful abstraction and automation is that understanding what’s going on under the hood of Kubernetes and distributed systems requires a ton of coordination that falls back to the user. Kubernetes never shipped with a pretty GUI that automagically rolls up system performance metrics, and traditional monitoring tools were never designed to aggregate all of the telemetry data being emitted by these vastly complicated systems.

From zero to 20 million users in 10 years

Dashboard creation and visualization are the common associations that developers draw when they think of Grafana. Its power as a visualization tool and its ability to work with just about any type of data made it a hugely popular open-source project, well beyond distributed computing and cloud-native use cases.

Hobbyists use Grafana for everything from visualizing bee colony activity inside the hive to tracking carbon footprints in scientific research. Grafana was used in the SpaceX control center for the Falcon 9 launch in 2015, and later by the Japan Aerospace Exploration Agency for its lunar landing. This is a technology that turns up almost everywhere you find visualization use cases.

But the real story is Grafana’s impact on an observability domain that, prior to its arrival, was defined by proprietary back-end databases and query languages that locked users into specific vendor offerings, major switching costs for users who wanted to migrate to other vendors, and walled gardens of supported data sources.

Ödegaard attributes much of the early success of Grafana to the plugin system that he created in its early days. After he personally wrote the InfluxDB and Elasticsearch data sources for Grafana, community members contributed integrations with Prometheus and OpenTSDB, setting off a wave of community plugins to Grafana. Today the project supports more than 160 external data sources, which it calls a “big tent” approach to observability.

The Grafana project continues to work with other open-source projects like OpenTelemetry to provide simple standard semantic models to all telemetry data types and to unify the “pillars” of observability telemetry data (logs, metrics, traces, profiling). The Grafana community is connected by an “own your own data” philosophy that continues to attract connectors and integrations with every possible database and telemetry data type.

Grafana futures: New visualizations and telemetry sources

Ödegaard says that Grafana’s visualization capabilities have been a big personal focus for the evolution of the project. “There’s been a long journey of creating a new React application architecture where third-party developers can build dashboard-like applications in Grafana,” Ödegaard said.

But beyond enriching the ways that third parties can create visualizations on top of this application architecture, the dashboards themselves are getting a big boost in intelligence.

“One big trend is that dashboard creation should eventually be made obsolete,” said Ödegaard. “Developers shouldn’t have to build them manually, they should be intelligent enough to generate automatically based on data types, team relationships, and other criteria. By knowing the query language, libraries detected, the programming languages you are writing with, and more. We are working to make the experience much more dynamic, reusable and composable.”

Ödegaard also sees Grafana visualization capabilities evolving towards new de-aggregation methods: being able to go backward from charts to how graphs are composed and break down the data into component dimensions and root causes.

The cloud infrastructure observability journey will continue to see new layers of abstraction and telemetry data. Kernel-level abstraction eBPF is rewriting the rules for how kernel primitives become programmable to platform engineers. Cilium, a project that recently graduated from Cloud Native Computing Foundation incubation, has created a network abstraction layer that allows for even more aggregations and abstractions across multi-cloud environments.

This is only the beginning. Artificial intelligence is introducing new considerations every day for the intersection of programming language primitives, specialized hardware, and the need for humans to understand what’s happening inside the highly dynamic AI workloads that are so computationally expensive to run.

You write it, you monitor it

As Kubernetes and related projects continue to stabilize the cloud operating model, Ödegaard believes that the health monitoring and observability considerations will continue to fall to human operators to instrument, and that observability will be one of the superpowers that distinguish the most sought-after talent.

“If you write it, you run it, and you should be on call for the software you write; that’s a very important philosophy,” Ödegaard said. “And in that vein, when you write software you should be thinking about how to monitor it, how to measure its behavior, not only from a performance and stability perspective but from a business impact perspective.”
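As a rough illustration of that philosophy, the sketch below instruments one function to record both a performance signal (how long the call takes) and business signals (orders placed and revenue). Everything here is hypothetical: the in-memory counters and timings dictionaries stand in for whatever metrics library or back end a team actually uses, and place_order is an invented example function.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-ins for a real metrics back end (an assumption for illustration).
counters = defaultdict(int)
timings = defaultdict(list)


@contextmanager
def timed(name):
    """Record the wall-clock duration of the enclosed block under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append(time.perf_counter() - start)


def place_order(amount_cents):
    """Hypothetical business operation, instrumented as it is written."""
    with timed("orders.place.seconds"):       # performance perspective
        counters["orders.placed"] += 1        # business perspective: volume
        counters["orders.revenue_cents"] += amount_cents  # business perspective: revenue
    return amount_cents
```

Calling place_order(1250) and then place_order(750) leaves two duration samples alongside an order count and a revenue total, so a dashboard could chart latency and business impact side by side.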

For a cloud operating system that’s evolving at breakneck speed, who better than Ödegaard to champion humans’ need to reason with underlying systems? Besides loving to program, he has a passion for natural history and evolution, and reads every book he can get his hands on about natural history and evolutionary psychology.

“If you don’t think evolution is amazing, something’s wrong with you. It’s the way nature programs. How much more awesome can it get?”

Travis Van has been following open source and distributed computing for more than 20 years, with a particular focus on cloud and network infrastructure, programming languages, developer frameworks, and platform engineering trends. He is the founder of information technology news aggregation service TechNews.io. As an InfoWorld contributor, he tells the stories of open source creators and maintainers who are tackling the hardest problems of distributed computing and laying the foundations for the next wave of enterprise computing.
