Fri Feb 06

The Reliability Paradox in Observability

You've built redundancy into everything except the system that warns you when redundancy fails.

Diagram: the reliability paradox in observability

TL;DR: Your applications have HA, failover, and circuit breakers — but your monitoring pipeline probably doesn’t. When Prometheus goes down, your absent() alerts can’t fire. When AlertManager crashes, no one gets paged. Meta-monitoring — monitoring your monitoring — is the missing reliability layer most teams only discover after a real incident.

This article is for engineers who already invest in observability — and assume it will always be there when they need it.

In the first article of this series, we talked about the financial cost of uncontrolled observability — how cardinality explosions silently drain budgets.

In the second article of this series, we covered the operational risk of missing data — how absent() helps you detect when critical metrics disappear.

This third article addresses a deeper question:

What happens when the system responsible for detecting failures is itself the thing that fails?

Scope note: this article assumes a standard single-Prometheus + AlertManager setup without a separate monitoring plane (no Thanos/Mimir/Cortex/managed HA layer). If you already run a fully redundant, externalized monitoring stack, the failure modes change — but the meta-monitoring principles still apply.

If you’re already running multi-region Thanos/Mimir or a managed HA observability platform, you can skim to Layer 4 for the notification-path guidance.

The Unspoken Assumption

Every observability setup has an implicit contract:

  • Prometheus will always be scraping
  • AlertManager will always be routing
  • Grafana will always be rendering
  • Agents will always be collecting

Teams build multi-region failover for their databases. They implement circuit breakers for their APIs. They run chaos engineering against their microservices.

But the monitoring stack? It runs as a single instance on a node somewhere, with no redundancy, no health checks, and no alerting on its own health.

The irony is hard to miss:

You’ve engineered resilience into everything — except the system that tells you when resilience fails.

Observability as an Unmanaged Dependency

Let’s be specific about the failure surfaces most teams overlook.

Prometheus Is Stateful — and Fragile by Default

Prometheus stores its TSDB on local disk. It doesn’t natively replicate. It doesn’t cluster.

If the node running Prometheus faces memory pressure, Prometheus is often the first thing the OOM killer takes out. When it restarts, there’s a gap in your time series. If the disk fills up, TSDB corruption can make historical data unrecoverable.

Running two Prometheus instances scraping the same targets helps, but it introduces its own complexity: duplicate alerts (unless a distinguishing replica label is stripped before alerts reach AlertManager), diverging label sets, and no built-in query deduplication without adding Thanos or Cortex.
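The standard mitigation, sketched below with illustrative instance and label names: give each replica a distinguishing external label for storage and queries, then strip that label before alerts leave Prometheus, so both replicas emit identical alerts that AlertManager deduplicates into a single notification.

```yaml
# prometheus-a config (prometheus-b is identical except replica: b)
global:
  external_labels:
    cluster: prod
    replica: a           # distinguishes the two replicas' time series
alerting:
  alert_relabel_configs:
    # Drop the replica label so both instances produce identical alerts
    # and AlertManager deduplicates them.
    - regex: replica
      action: labeldrop
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

Query-side deduplication of the duplicated series still requires something like Thanos Querier; this only fixes the alerting path.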

AlertManager Clustering Has Edge Cases

AlertManager supports clustering via gossip protocol, and on paper it looks straightforward. In practice:

  • Silences can fail to propagate across cluster members during network partitions
  • Inhibition rules may behave inconsistently if cluster state diverges
  • A misconfigured peer can silently drop alerts without any visible error

Most teams discover these issues during incidents — exactly when they need AlertManager to work flawlessly.

Reference: AlertManager clustering and HA behavior.

Agents Add Another Failure Surface

If you run Grafana Alloy, Grafana Agent, or any telemetry collector as a DaemonSet, you’ve added another component that can fail:

  • An agent OOMKilled on a node means that node goes dark
  • A bad config pushed via GitOps can break collection across the entire fleet
  • Agent crashes don’t trigger alerts — because the agent is the thing that feeds the alerting system
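One partial mitigation, assuming the central Prometheus scrapes each agent's own /metrics endpoint directly (job name illustrative): an agent crash then shows up as an ordinary failed scrape, which Prometheus can alert on — as long as Prometheus itself is alive.

```yaml
- alert: TelemetryAgentDown
  # up is synthesized by Prometheus on every scrape attempt, so it
  # exists even when the agent's own metrics are gone.
  expr: up{job="grafana-alloy"} == 0
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Agent on {{ $labels.instance }} stopped reporting - that node is dark"
```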

The absent() Paradox

In the previous article, we discussed using absent() to detect missing metrics. But here’s the catch:

absent() is evaluated by Prometheus. If Prometheus is down, absent() never fires.

And even if Prometheus fires the alert, it’s delivered through AlertManager. If AlertManager is down, nobody gets paged.

This creates a dangerous chain:

Prometheus down → absent() can't evaluate → AlertManager irrelevant → Team is blind

Your carefully crafted alerting rules become useless at the exact moment they matter most.
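To make the paradox concrete, here is a rule that looks like meta-monitoring but is evaluated by the very instance it watches — a deliberately self-defeating sketch, not a recommendation:

```yaml
# If Prometheus dies, this rule dies with it and can never fire.
- alert: PrometheusSelfAbsent
  expr: absent(up{job="prometheus"})
  for: 1m
  labels:
    severity: critical
```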

A Real Failure Scenario

Let me walk you through a failure pattern we actually experienced.

Friday, 11:47 PM. A noisy neighbor workload on a shared EKS node causes memory pressure. The kubelet starts evicting pods based on QoS class. Prometheus, running as a Deployment with no guaranteed resources, gets OOMKilled.

11:48 PM. Prometheus restarts. But the TSDB WAL replay takes 4 minutes. During this window, no scrapes happen. No alerts are evaluated.

11:49 PM. Meanwhile, a separate issue: a deployment rollout in the payment service introduces a bug. Transaction error rates spike to 15%.

11:52 PM. Prometheus finishes WAL replay and resumes scraping. The error rate alert has a for: 5m clause (e.g., rate(http_requests_total{status=~"5.."}[5m]) > 0.1 for 5m), so the evaluation window restarts. It needs 5 more minutes of sustained errors before firing.
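The rule in question looked roughly like this (service and metric names illustrative). The key detail: the pending state behind for: lives in Prometheus's memory, so the restart wiped it out and the clock started over.

```yaml
- alert: PaymentErrorRateHigh
  expr: rate(http_requests_total{status=~"5..", service="payment"}[5m]) > 0.1
  for: 5m   # pending state is lost on restart; five more minutes before firing
  labels:
    severity: critical
```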

11:57 PM. The alert finally fires, but AlertManager has its own problem — a rotated TLS cert breaks the webhook receiver. The webhook returns a 503. AlertManager retries, and backoff delays the notification by another 8 minutes.

12:05 AM Saturday. An engineer finally sees a Slack message in a channel they don’t have notifications enabled for. By the time they respond, the payment service has been silently failing for 18 minutes.

No single component catastrophically failed. The monitoring pipeline degraded at every layer, and nothing was monitoring the monitors.

The result: 18 minutes of blind flight during an active incident.

Proving the Blind Spot

Theory is one thing. Seeing it happen is another.

We built a local lab environment — Prometheus, AlertManager, Blackbox Exporter, Grafana, and a sample payment service — and deliberately broke things. Here’s what we found.

Baseline: Everything Healthy

With all services running, the monitoring pipeline looks exactly how you’d expect:

Scrape Targets:
  prometheus                 health=up
  alertmanager               health=up
  payment-service            health=up
  blackbox-exporter          health=up
  monitoring-health (x4)     health=up
Blackbox Health Probes:
  http://prometheus:9090/-/healthy       probe=OK
  http://alertmanager:9093/-/healthy     probe=OK
  http://grafana:3000/api/health         probe=OK
  http://payment-service:8000/health     probe=OK
Alert Rules:
  Watchdog                               state=firing  ← always on (heartbeat)
  TargetDown                             state=inactive
  MonitoringComponentDown                state=inactive
  PaymentServiceAbsent                   state=inactive
  TransactionMetricsAbsent               state=inactive
AlertManager Active Alerts:
  Watchdog                               status=active

Everything is green. The Watchdog heartbeat is firing. Probes are passing. Metrics are flowing — now let’s break it.

Scenario 1: Kill Prometheus

We stop Prometheus — simulating an OOMKill or node eviction.

Immediate effect:

Prometheus Health Endpoint:
  CONNECTION REFUSED (Prometheus is down)
Can we query metrics?
  CONNECTION REFUSED (no metrics available)
Payment Service:
  OK (app is fine, but nobody is watching it)
Grafana:
  Cannot reach Prometheus — dashboards are frozen

Dashboards freeze. No new data points. But here’s the critical part — what happens to alerting?

AlertManager state:

AlertManager Active Alerts:
  Watchdog                   status=active (stale — will expire)

The Watchdog alert is still listed, but it’s stale. No new heartbeats are arriving. It will expire based on AlertManager’s resolve_timeout, and then — silence. Zero active alerts. An empty AlertManager UI.
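The expiry window is AlertManager's resolve_timeout, a global setting (5m is the documented default):

```yaml
# alertmanager.yml
global:
  # An alert that isn't re-sent within this window is treated as
  # resolved and disappears from the UI - including the Watchdog.
  resolve_timeout: 5m
```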

The blackbox-exporter tells the truth:

Blackbox probe for Prometheus:
  probe_http_status_code = 0
  probe_success = 0 (FAIL — Prometheus unreachable)

The blackbox-exporter detected that Prometheus is down. But who reads blackbox results? Prometheus does — and Prometheus is dead.

That’s the blind spot.

Scenario 2: Kill AlertManager

We restore Prometheus, then stop AlertManager.

Prometheus immediately detects the problem:

Prometheus Firing Alerts:
  [none    ] Watchdog
  [critical] TargetDown                  target=alertmanager:9093
  [critical] MonitoringComponentDown     target=http://alertmanager:9093/-/healthy

Prometheus knows. It sees AlertManager is down via scrape failure (up == 0). It sees the blackbox probe failing (probe_success == 0). It fires two critical alerts.

But can it tell anyone?

Prometheus Notification Delivery:
  notifications_dropped_total = 10
  notifications_errors_total  = 9
Blackbox probe for AlertManager:
  probe_http_status_code = 0
  probe_success = 0 (FAIL — AlertManager unreachable)

Ten dropped notifications. Nine delivery errors. Prometheus is generating alerts and trying to send them, but the destination — AlertManager — is the very thing that’s broken.

Prometheus screams into the void. Nobody hears it.
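Those counters are Prometheus's own self-metrics (prometheus_notifications_errors_total and prometheus_notifications_dropped_total), so you can write a rule against them — just remember the resulting alert has the same delivery problem unless it is read by a second Prometheus or routed through an alternate path.

```yaml
- alert: AlertDeliveryFailing
  expr: rate(prometheus_notifications_errors_total[5m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Prometheus cannot deliver alerts to AlertManager"
```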

What This Proves

These aren’t theoretical risks. In a standard Prometheus + AlertManager setup:

  1. Killing Prometheus → zero alerts fire, absent() can’t evaluate, dashboards freeze, Watchdog expires silently
  2. Killing AlertManager → Prometheus detects the failure and fires alerts, but can’t deliver them to anyone

In both cases, the team is blind. The failure is detectable but undeliverable.

This is exactly why meta-monitoring through an independent path isn’t optional — it’s the only thing that catches these failures in practice.

Want to reproduce this? The companion lab environment includes all the configs and chaos scripts. Run docker compose up, break things, see it yourself.

The Solution: Meta-Monitoring and Layered Reliability

Meta-monitoring isn’t complicated. But it requires a deliberate architectural decision:

The system that monitors your monitoring must not share failure domains with the monitoring itself.

Here’s a layered approach that works in practice.

Regular path:  Targets → Prometheus → AlertManager → Slack
Meta path:     Watchdog/Blackbox → External heartbeat → PagerDuty/SMS

Layer 1: The Watchdog Alert (DeadMansSwitch)

The simplest and most effective pattern. Create an alert that always fires:

groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Watchdog alert - monitoring pipeline is alive"
          description: >
            This alert is always firing. If you stop receiving it,
            that means Prometheus or AlertManager is down.

This alert is always active. Its purpose isn’t to notify you about a problem — it’s a heartbeat. If you stop receiving it, something in your pipeline is broken.

The key is where you send it: not through your regular notification channels.

Route it to an external service that expects periodic check-ins:

  • PagerDuty (via Events API v2 heartbeat monitoring)
  • Healthchecks.io (free tier available)
  • Deadman’s Snitch
  • OpsGenie heartbeat

If the heartbeat stops arriving, the external service alerts you through a completely independent path.

Layer 2: External Health Probes

Use blackbox-exporter (or an external monitoring service) to probe the health endpoints of your monitoring stack:

# Prometheus health
- targets:
    - http://prometheus:9090/-/healthy
    - http://alertmanager:9093/-/healthy
    - http://grafana:3000/api/health

This catches scenarios where the service is running but unhealthy — a loaded Prometheus that’s falling behind on scrapes, an AlertManager that can’t reach its notification backends, or a Grafana instance with a broken data source.
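For completeness, the http_2xx module referenced by these probes lives in the blackbox-exporter's own config; a minimal sketch:

```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: []   # empty means "accept any 2xx"
```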

Critically, the blackbox-exporter results should feed into a different Prometheus instance or an external monitoring system — not back into the same pipeline you’re trying to protect.

Layer 3: Cross-Instance Monitoring

If you run multiple Prometheus instances (which you should for any serious production setup), have them monitor each other:

# prometheus-a scrape config
scrape_configs:
  - job_name: 'prometheus-b'
    static_configs:
      - targets: ['prometheus-b:9090']
# prometheus-b scrape config
scrape_configs:
  - job_name: 'prometheus-a'
    static_configs:
      - targets: ['prometheus-a:9090']

Each instance can alert on the other’s health:

- alert: PeerPrometheusDown
  expr: up{job="prometheus-b"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Peer Prometheus instance is down"

This creates mutual supervision — neither instance can fail silently as long as the other is running.

Layer 4: Separate Notification Path

This is the part most teams skip, and it’s the most important:

Meta-alerts must not travel through the same notification path as regular alerts.

If your regular alerts go: Prometheus → AlertManager → Slack webhook → #alerts channel

Your meta-alerts should go: External service → PagerDuty/OpsGenie direct → phone call/SMS

The moment your meta-alert shares infrastructure with your regular alerts, you’ve reintroduced the single point of failure you were trying to eliminate.

Practical Patterns That Actually Work

The examples below assume you route meta-monitoring signals through an independent path. If you point blackbox results into the same Prometheus you’re trying to protect, treat it as a baseline only — not true failure-domain isolation.

AlertManager DeadMansSwitch Routing

route:
  receiver: 'default'
  routes:
    - match:
        alertname: Watchdog
      receiver: 'deadmansswitch'
      repeat_interval: 1m
      group_wait: 0s
      group_interval: 1m
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'deadmansswitch'
    webhook_configs:
      - url: 'https://hc-ping.com/your-uuid-here'
        send_resolved: false

The Watchdog alert fires every minute to an external heartbeat service. If it stops, you know your pipeline is broken — regardless of what your dashboards say.

Blackbox Probing Your Own Stack

# In prometheus.yml
scrape_configs:
  - job_name: 'monitoring-health'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://prometheus:9090/-/healthy
          - http://alertmanager:9093/-/healthy
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Combined with an alert rule:

- alert: MonitoringComponentDown
  expr: probe_success{job="monitoring-health"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Monitoring component {{ $labels.instance }} is unhealthy"

Resource Guarantees for Monitoring Pods

Don’t let the kubelet decide your monitoring is expendable:

resources:
  requests:
    memory: "2Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "1000m"

Set PriorityClass to ensure monitoring pods survive node pressure:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: monitoring-critical
value: 1000000
globalDefault: false
description: "Priority class for monitoring components"

Monitoring should be among the last things evicted, not the first.
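A PriorityClass only takes effect when a pod spec references it. A minimal Deployment sketch tying the pieces together (image tag and names are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels: {app: prometheus}
  template:
    metadata:
      labels: {app: prometheus}
    spec:
      priorityClassName: monitoring-critical  # references the class above
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          resources:
            requests: {memory: "2Gi", cpu: "500m"}
            limits: {memory: "4Gi", cpu: "1000m"}
```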

Tying It All Together

Across this three-part series, we’ve covered three dimensions of observability maturity:

Article 1 — Cardinality → Financial discipline: Are we paying for metrics that drive decisions?

Article 2 — Absent data → Data reliability: Do we know when measurement itself disappears?

Article 3 — Meta-monitoring → System reliability: Can our monitoring survive its own failures?

Each layer builds on the previous:

  • Cardinality controls keep your meta-monitoring cost-effective (you don’t need high-cardinality labels on watchdog alerts)
  • absent() rules detect data gaps — but only if the pipeline evaluating them is alive
  • Meta-monitoring ensures the entire chain is operational, from scrape to notification

Quick Checklist (This Week)

  • Create a Watchdog alert and route it to an external heartbeat service
  • Probe Prometheus/AlertManager/Grafana health from an independent path
  • Add a meta-alert channel that does not share the normal alerting pipeline

The Real Takeaway

In mature engineering organizations, observability isn’t treated as tooling — it’s treated as critical infrastructure.

It deserves:

  • Its own SLOs (scrape success rate, alert evaluation latency, notification delivery time)
  • Its own incident runbooks
  • Its own failure domains, separate from the systems it monitors
  • Its own monitoring — yes, monitors all the way down

The paradox is real: the thing that tells you everything is fine can itself silently fail.

The solution isn’t infinite recursion. It’s layered reliability with independent failure domains.

A Watchdog alert. An external heartbeat. A separate notification path.

Three simple patterns that turn your monitoring from a single point of failure into a system worthy of the infrastructure it protects.

If your monitoring goes down and nobody notices, was it ever monitoring?

Originally published on Medium.