September 4, 2024

The Problem with Generic Observability Tools: Why We're Paying Too Much

Anton Weiss

Paying the NewRelics, DataDogs, Elastics or Splunks of this world? Getting shocked by the bill at the end of each month?

Finding it hard to justify the thousands of dollars spent? Wasting precious engineering cycles on improving your observability ROI? You’re not alone. As an industry, it’s time we admit it: outsourcing generic observability to a vendor is a costly myth we’ve bought into. Let’s stop digging that hole.

A Story

At a recent observability conference we had a conversation with Daniel, who manages Kubernetes at a large SaaS company. We were discussing making their infrastructure more horizontally scalable by switching to a larger number of smaller nodes. Daniel said he had tried that but had to go back to a few huge nodes.

The reason: having dozens of ephemeral nodes sending logs and metrics inflated his observability bill so much that it cancelled out all the benefits of having a more scalable environment.

Basically, his observability vendor now gets to dictate his architectural decisions.

Sounds pretty upside down, doesn't it?


The Rise of Generic Observability

About a decade ago the #monitoringsucks hashtag was popular on the internets. It signified the growing frustration of IT professionals with the existing monitoring stack. Information systems were getting ever more complex, but monitoring (and logging) wasn’t keeping up. Existing tools were fragmented, hard to set up, and required a lot of scripting to actually be useful.


This frustration gradually led to an understanding that we were dealing with a cross-cutting set of concerns: that in order to properly discover, analyze and troubleshoot problems in information systems, we need to be able to observe their behaviour. And that requires us to correlate the metrics, the logs and the traces. That’s how the observability buzz was born. It even received a numeronym of its own - o11y. But that didn’t catch on as well as k8s or even l10n.


In parallel, new standards were being established for the observability building blocks: Prometheus and InfluxDB redefined scalability for time-series databases, the ELK stack set a new standard for log aggregation and analysis, and Zipkin and Jaeger taught us how to trace our distributed architectures.


A lot of very smart folks invested a lot of effort in building observability solutions and arrived at the following conclusions:

  • Every modern information system needs observability
  • Observability is a generic concern
  • Building observability requires expertise
  • Therefore - it’s cheaper to buy generic observability than build it.
And this led to a plethora of observability vendors competing for our money, each of them building out “the most comprehensive and complete” observability solution for all our needs.


So What's The Problem Now?

So far so good. We needed observability, it was hard to build, and now we have a lot of offerings to choose from!
It’s just that (personal biases aside) all the available platforms suffer from the same annoying drawbacks, and it only keeps getting worse:

  • Offering too much: they all try to support every possible stack - VMs, Kubernetes, serverless, databases, APM, traces, profiles, frontend, backend, you name it. It’s very easy to get lost and to turn on something you don’t really need - only to realize you’ve been paying for it a couple of months later.
  • Not offering enough: with all the abundance of data, it’s very hard to get the data you really need. The ready-made dashboards and alerts rarely fit your specific use case. Operators often find themselves working really hard to configure the alerts that will save them from downtime or the dashboards that reliably reflect the health of their system.
  • Being a cost center rather than a value generation system: observability systems are measured by the amount of data they store, and that is also how we are billed for them. So the vendor will always try to lure you into storing as much data as possible with them. But that data only becomes valuable if we take the next step and implement a feedback loop. Without that, it’s just a cost - one we see organizations investing multiple man-hours trying to reduce.
  • No feedback loop: observability stops at collecting and presenting the data. But as we just said, data per se isn’t important - it’s the actions we take based on that data that define how long our system will be down or slow. Or in business terms - how much money we waste or lose. Taking these actions manually stopped being scalable 10 years ago, when the observability buzz was on the rise. So in order to have any observability ROI, we need to automate data analysis and issue resolution - and today that automation has to live in separate tooling (a minimal sketch of such a loop follows below).
Bottom line: we’re paying too much for observability and then have to work very hard to justify the bill.
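
To make the feedback loop point concrete, here is a deliberately naive Python sketch of what "acting on the data" could look like: watch one metric, take one remediation action. Everything in it is an assumption for illustration - the Prometheus endpoint, the "checkout" deployment in a "shop" namespace, the latency SLO and the "just add a replica" policy - it is not anyone's production setup:

```python
# Naive feedback loop: query a metric, act on it automatically.
# All names, thresholds and the scaling policy are made up for illustration.
import time
import requests
from kubernetes import client, config

PROM_URL = "http://prometheus.monitoring:9090/api/v1/query"
QUERY = ('histogram_quantile(0.95, sum(rate('
         'http_request_duration_seconds_bucket{app="checkout"}[5m])) by (le))')
LATENCY_SLO_SECONDS = 0.5
MAX_REPLICAS = 10

def p95_latency() -> float:
    """Fetch the current p95 latency from the Prometheus HTTP API."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def scale_up(apps: client.AppsV1Api, name: str, namespace: str) -> None:
    """Add one replica, capped at MAX_REPLICAS."""
    scale = apps.read_namespaced_deployment_scale(name, namespace)
    replicas = min(scale.spec.replicas + 1, MAX_REPLICAS)
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": replicas}})

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() when running in a pod
    apps = client.AppsV1Api()
    while True:
        if p95_latency() > LATENCY_SLO_SECONDS:
            scale_up(apps, "checkout", "shop")  # the "action" half of the loop
        time.sleep(60)
```

Crude as this is, it already does something most dashboards never will: it acts on the data instead of just displaying it.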


So What Do We Do?

So generic observability sucks.

Now what do we do? Do we roll our own from the available OSS components? It’s always an option, but that’s not what I’m suggesting.

We all know that building and running our own comes with significant maintenance costs. All I’m suggesting is that the time has come to switch from generic observability backends to purpose-built, opinionated automation tools that provide just the data we need and allow us to easily act on that data without leaving the tool.

How PerfectScale is Different

I’m not writing all this just because I have an opinion. I’m writing this because this is the understanding that led us to build our solution. At its core, PerfectScale can be seen as a kind of observability tool - it consumes data from the Kubernetes metrics pipeline, augments it with cloud provider data, shows you insights and sends alerts.

In fact, the first version of PerfectScale was even built on top of an optimized Prometheus instance. But very soon we realized that generic observability tooling was only getting in the way. The actual value of PerfectScale doesn’t lie in collecting or visualizing the data - it lies in providing practical recommendations for right-sizing your Kubernetes workloads and, going further, automating that for autonomous cluster optimization. Instead of being a cost center, PerfectScale saves you money by reducing waste and improving the reliability of your systems - something none of the generic observability systems can do.
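
To illustrate the kind of data we’re talking about - and only as an illustration, this is not how PerfectScale actually computes its recommendations - here is a toy Python sketch that reads live container CPU usage from the Kubernetes metrics pipeline (metrics.k8s.io, i.e. metrics-server) and prints a naive right-sizing hint. The namespace name and the headroom factor are assumptions:

```python
# Toy right-sizing hint from the Kubernetes metrics pipeline.
# NOT a real algorithm: proper right-sizing needs usage history, percentiles,
# requests/limits context and cost data. Requires metrics-server to be installed.
from kubernetes import client, config

def cpu_millicores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('250m', '1', '123456n') to millicores."""
    if quantity.endswith("n"):
        return float(quantity[:-1]) / 1_000_000
    if quantity.endswith("u"):
        return float(quantity[:-1]) / 1_000
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000

def suggest_requests(namespace: str, headroom: float = 1.3) -> None:
    """Print a naive CPU request suggestion: current usage plus some headroom."""
    metrics = client.CustomObjectsApi().list_namespaced_custom_object(
        group="metrics.k8s.io", version="v1beta1",
        namespace=namespace, plural="pods")
    for pod in metrics["items"]:
        for container in pod["containers"]:
            used = cpu_millicores(container["usage"]["cpu"])
            print(f'{pod["metadata"]["name"]}/{container["name"]}: '
                  f"using ~{used:.0f}m CPU, "
                  f"a request of ~{used * headroom:.0f}m might be a sane start")

if __name__ == "__main__":
    config.load_kube_config()
    suggest_requests("shop")  # hypothetical namespace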



Observability 2.0 - Automated Feedback Loops

It’s time we stop the meaningless accumulation of data and start integrating tools that act on that data.


It’s time we say “yes” to generic observability on the collection side (did someone say OpenTelemetry?) and say “no” to generic observability on the backend.
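
"Generic on the collection side" is exactly what OpenTelemetry gives us. Here is a minimal Python sketch of that idea - instrument once, ship over OTLP, and stay free to point the data at whatever opinionated backend actually closes the loop. The collector endpoint, metric name and attributes are assumptions for illustration:

```python
# Generic collection with OpenTelemetry: the OTLP exporter doesn't care which
# backend receives the data - that choice stays with you.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics every 15 seconds to a (hypothetical) local collector endpoint.
exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=15_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout")
requests_total = meter.create_counter("http.server.requests")

# Somewhere in request handling:
requests_total.add(1, {"route": "/pay", "status": "200"})
```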

With the evolution of AI and ML we are going to see more and more tools doing what PerfectScale does: collecting just the data they need and creating the feedback loop that acts on that data to generate actual value for the end user - not just aggregating terabytes of logs and metrics for the heck of it.
