
Beyond Monitoring: The Hidden Cost of Observability at Scale

Cardinality, not traffic, is the real driver of observability cost explosions.

[Chart: observability costs at scale]

TL;DR: Observability bills don’t explode because of traffic; they explode because of unchecked cardinality.

This article is for engineers operating cloud-native systems at scale — where observability is no longer free.

When observability costs spike, the first assumption is that traffic growth is the cause. More users, more requests, more data.

In reality, that’s rarely the root cause.

In large-scale environments — thousands of containers spinning up and down daily across EKS clusters or bare-metal nodes — observability costs usually grow for a much quieter reason:

You’re storing far more metrics and logs than the business will ever use.

Most teams focus on coverage:

✔ everything emits metrics

✔ everything logs

✔ everything is centralized

✔ alerts reach the right teams

That’s necessary — but incomplete.

The question that often gets ignored until finance asks is:

Are these metrics intentionally designed, or just accidentally collected?

The Cardinality Trap

Every label, dimension, or high-cardinality field you send to an observability backend directly affects cost.

Platforms like Grafana Cloud, Datadog, and New Relic bill based on active series, custom metrics, or similar units. Each unique combination of labels becomes billable.

A single unbounded label — like request_id, user_id, or collection_id — can quietly explode your series count.

For example:

  • A metric with 5 labels
  • One label has 200k unique values per day
  • Multiplied across pods and nodes

That’s not “a little extra metadata.”

That’s hundreds of thousands of new active series — per metric.
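
You can measure a label’s blast radius before it ships. A minimal PromQL check, assuming a hypothetical metric http_requests_total that carries the label:

count(count by (collection_id) (http_requests_total))

The inner count by collapses the metric to one series per distinct collection_id value; the outer count returns how many values, and therefore how many series, that label multiplies into the metric.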

Histograms make this worse. A histogram with:

  • 10 buckets
  • 5 labels

already creates 12 time series for every unique combination of those label values: 10 _bucket series plus _sum and _count. Multiply that across label combinations, containers, nodes, and environments, and the cost scales multiplicatively, not linearly.
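
Concretely, each unique label combination of a hypothetical latency histogram materializes a whole block of series:

http_request_duration_seconds_bucket{le="0.05", pod="api-7f9c", ...}
http_request_duration_seconds_bucket{le="0.1", pod="api-7f9c", ...}
...
http_request_duration_seconds_bucket{le="+Inf", pod="api-7f9c", ...}
http_request_duration_seconds_sum{pod="api-7f9c", ...}
http_request_duration_seconds_count{pod="api-7f9c", ...}

Every new pod, node, or label value repeats the entire block.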

The most dangerous part?

You won’t see this in dashboards.

You’ll see it in the month-end invoice.
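
The data is there if you go looking. A standard inspection query, run sparingly because it touches every series, ranks metric names by active series count:

topk(10, count by (__name__) ({__name__=~".+"}))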

We learned this the hard way:

A single high-cardinality label led to an 81% increase in our Grafana Cloud costs in a single billing cycle.

No traffic spike. No incident. Just silent cardinality growth.

Monitor Smarter, Not More

The solution isn’t to monitor less — it’s to monitor intentionally.

In practice, that means treating observability data with the same discipline as production code.

What worked for us:

• Implementing cardinality alerts on the observability platform itself (a sketch follows this list)

• Using relabeling rules to drop unbounded labels before ingestion

• Regularly auditing which metrics actually drive decisions
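
For the first of those, here is a minimal sketch of a cardinality alert, assuming a self-hosted Prometheus. It watches Prometheus’s own prometheus_tsdb_head_series gauge; the 30% day-over-day threshold is illustrative, and managed platforms expose equivalent usage metrics:

groups:
  - name: cardinality-guardrails
    rules:
      - alert: ActiveSeriesGrowth
        # Fires when active series grow more than 30% day-over-day.
        expr: prometheus_tsdb_head_series > 1.3 * (prometheus_tsdb_head_series offset 1d)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Active series grew over 30% in 24h; possible cardinality explosion"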

In our case, a single relabeling rule eliminated most of the cost increase:

# Remove the unbounded label from every series before ingestion
- action: labeldrop
  regex: collection_id

We also filtered out unnecessary histogram buckets before they reached the backend using Alloy’s prometheus.relabel.
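
A sketch of that filter in Alloy’s configuration language; the component label, downstream receiver, metric name, and bucket boundaries are all illustrative:

prometheus.relabel "drop_fine_buckets" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    // Drop the finest-grained buckets of a hypothetical latency
    // histogram; the coarser buckets still answer our SLO questions.
    source_labels = ["__name__", "le"]
    separator     = ";"
    regex         = "http_request_duration_seconds_bucket;(0.005|0.01|0.025)"
    action        = "drop"
  }
}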

Think of this as observability hygiene.

Just as you wouldn’t ship debug logs indefinitely in production, you shouldn’t ship every possible metric dimension to a paid observability platform.

The Goal: Intentional Observability

Intentional observability means:

  • Collecting what drives decisions
  • Dropping what doesn’t
  • Detecting cardinality explosions automatically
  • Preventing cost surprises before finance escalates them

Observability systems scale automatically.

Costs don’t — unless you design guardrails.

If your metrics pipeline has autoscaling but no cardinality controls, observability itself becomes an ungoverned production dependency.

And that’s how teams end up flying blind — not operationally, but financially.


In mature systems, observability isn’t just about visibility — it’s about responsibility. What you choose to measure should be as intentional as the systems you run.

Originally published on Medium.