The Reliability Paradox in Observability
You built redundancy into everything except the system that warns you when redundancy fails.
TL;DR: Your applications have HA, failover, and circuit breakers — but your monitoring pipeline probably doesn’t. When Prometheus goes down, your absent() alerts can’t fire. When AlertManager crashes, no one gets paged. Meta-monitoring — monitoring your monitoring — is the missing reliability layer most teams only discover after a real incident.
This article is for engineers who already invest in observability — and assume it will always be there when they need it.
In the first article of this series, we talked about the financial cost of uncontrolled observability — how cardinality explosions silently drain budgets.
In the second article of this series, we covered the operational risk of missing data — how absent() helps you detect when critical metrics disappear.
This third article addresses a deeper question:
What happens when the system responsible for detecting failures is itself the thing that fails?
Scope note: this article assumes a standard single-Prometheus + AlertManager setup without a separate monitoring plane (no Thanos/Mimir/Cortex/managed HA layer). If you already run a fully redundant, externalized monitoring stack, the failure modes change — but the meta-monitoring principles still apply.
If you’re already running multi-region Thanos/Mimir or a managed HA observability platform, you can skim to Layer 4 for the notification-path guidance.
The Unspoken Assumption
Every observability setup has an implicit contract:
- Prometheus will always be scraping
- AlertManager will always be routing
- Grafana will always be rendering
- Agents will always be collecting
Teams build multi-region failover for their databases. They implement circuit breakers for their APIs. They run chaos engineering against their microservices.
But the monitoring stack? It runs as a single instance on a node somewhere, with no redundancy, no health checks, and no alerting on its own health.
The irony is hard to miss:
You’ve engineered resilience into everything — except the system that tells you when resilience fails.
Observability as an Unmanaged Dependency
Let’s be specific about the failure surfaces most teams overlook.
Prometheus Is Stateful — and Fragile by Default
Prometheus stores its TSDB on local disk. It doesn’t natively replicate. It doesn’t cluster.
If the node running Prometheus faces memory pressure, Prometheus is often the first thing the OOM killer takes out. When it restarts, there’s a gap in your time series. If the disk fills up, TSDB corruption can make historical data unrecoverable.
Running two Prometheus instances scraping the same targets helps — but introduces its own complexity: duplicate alerts, diverging label sets, and no built-in deduplication without adding Thanos or Cortex.
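If you do run a pair, the usual convention is to give each instance a distinct external label, which is what a Thanos- or Cortex-style query layer later deduplicates on. A minimal sketch (the replica label name is a convention, not a requirement):

# prometheus-a.yml (scrape config identical on both instances)
global:
  external_labels:
    cluster: prod
    replica: a

# prometheus-b.yml
global:
  external_labels:
    cluster: prod
    replica: b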
AlertManager Clustering Has Edge Cases
AlertManager supports clustering via gossip protocol, and on paper it looks straightforward. In practice:
- Silences can fail to propagate across cluster members during network partitions
- Inhibition rules may behave inconsistently if cluster state diverges
- A misconfigured peer can silently drop alerts without any visible error
Most teams discover these issues during incidents — exactly when they need AlertManager to work flawlessly.
Reference: AlertManager clustering and HA behavior.
Agents Add Another Failure Surface
If you run Grafana Alloy, Grafana Agent, or any telemetry collector as a DaemonSet, you’ve added another component that can fail:
- An agent OOMKilled on a node means that node goes dark
- A bad config pushed via GitOps can break collection across the entire fleet
- Agent crashes don’t trigger alerts — because the agent is the thing that feeds the alerting system
The absent() Paradox
In the previous article, we discussed using absent() to detect missing metrics. But here’s the catch:
absent() is evaluated by Prometheus. If Prometheus is down, absent() never fires.
And even if Prometheus fires the alert, it’s delivered through AlertManager. If AlertManager is down, nobody gets paged.
This creates a dangerous chain:
Prometheus down → absent() can't evaluate → AlertManager irrelevant → Team is blind
Your carefully crafted alerting rules become useless at the exact moment they matter most.
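To make that concrete, here is the shape of the kind of rule the previous article proposed (the metric name is illustrative). Every line of it depends on a live Prometheus to evaluate it:

- alert: TransactionMetricsAbsent
  # Fires when the payment service's transaction counter disappears entirely.
  # absent() returns 1 only while Prometheus itself is up and evaluating rules.
  expr: absent(payment_transactions_total)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Payment transaction metrics are missing"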
A Real Failure Scenario
Let me walk you through a failure pattern we actually experienced.
Friday, 11:47 PM. A noisy neighbor workload on a shared EKS node causes memory pressure. The kubelet starts evicting pods based on QoS class. Prometheus, running as a Deployment with no guaranteed resources, gets OOMKilled.
11:48 PM. Prometheus restarts. But the TSDB WAL replay takes 4 minutes. During this window, no scrapes happen. No alerts are evaluated.
11:49 PM. Meanwhile, a separate issue: a deployment rollout in the payment service introduces a bug. Transaction error rates spike to 15%.
11:52 PM. Prometheus finishes WAL replay and resumes scraping. The error rate alert has a for: 5m clause (e.g., rate(http_requests_total{status=~"5.."}[5m]) > 0.1 for 5m), so the evaluation window restarts. It needs 5 more minutes of sustained errors before firing.
11:57 PM. The alert finally fires, but AlertManager has its own problem — a rotated TLS cert breaks the webhook receiver. The webhook returns a 503. AlertManager retries, and backoff delays the notification by another 8 minutes.
12:05 AM Saturday. An engineer finally sees a Slack message in a channel they don’t have notifications enabled for. By the time they respond, the payment service has been silently failing for 18 minutes.
No single component catastrophically failed. The monitoring pipeline degraded at every layer, and nothing was monitoring the monitors.
The result: 18 minutes of blind flight during an active incident.
Proving the Blind Spot
Theory is one thing. Seeing it happen is another.
We built a local lab environment — Prometheus, AlertManager, Blackbox Exporter, Grafana, and a sample payment service — and deliberately broke things. Here’s what we found.
Baseline: Everything Healthy
With all services running, the monitoring pipeline looks exactly how you’d expect:
Scrape Targets:
prometheus health=up
alertmanager health=up
payment-service health=up
blackbox-exporter health=up
monitoring-health (x4) health=up
Blackbox Health Probes:
http://prometheus:9090/-/healthy probe=OK
http://alertmanager:9093/-/healthy probe=OK
http://grafana:3000/api/health probe=OK
http://payment-service:8000/health probe=OK
Alert Rules:
Watchdog state=firing ← always on (heartbeat)
TargetDown state=inactive
MonitoringComponentDown state=inactive
PaymentServiceAbsent state=inactive
TransactionMetricsAbsent state=inactive
AlertManager Active Alerts:
Watchdog status=active
Everything is green. The Watchdog heartbeat is firing. Probes are passing. Metrics are flowing — now let’s break it.
Scenario 1: Kill Prometheus
We stop Prometheus — simulating an OOMKill or node eviction.
Immediate effect:
Prometheus Health Endpoint:
CONNECTION REFUSED (Prometheus is down)
Can we query metrics?
CONNECTION REFUSED (no metrics available)
Payment Service:
OK (app is fine, but nobody is watching it)
Grafana:
Cannot reach Prometheus — dashboards are frozen
Dashboards freeze. No new data points. But here’s the critical part — what happens to alerting?
AlertManager state:
AlertManager Active Alerts:
Watchdog status=active (stale — will expire)
The Watchdog alert is still listed, but it’s stale. No new heartbeats are arriving. It will expire based on AlertManager’s resolve_timeout, and then — silence. Zero active alerts. An empty AlertManager UI.
The blackbox-exporter tells the truth:
Blackbox probe for Prometheus:
probe_http_status_code = 0
probe_success = 0 (FAIL — Prometheus unreachable)
The blackbox-exporter detected that Prometheus is down. But who reads blackbox results? Prometheus does — and Prometheus is dead.
That’s the blind spot.
Scenario 2: Kill AlertManager
We restore Prometheus, then stop AlertManager.
Prometheus immediately detects the problem:
Prometheus Firing Alerts:
[none ] Watchdog
[critical] TargetDown target=alertmanager:9093
[critical] MonitoringComponentDown target=http://alertmanager:9093/-/healthy
Prometheus knows. It sees AlertManager is down via scrape failure (up == 0). It sees the blackbox probe failing (probe_success == 0). It fires two critical alerts.
But can it tell anyone?
Prometheus Notification Delivery:
notifications_dropped_total = 10
notifications_errors_total = 9
Blackbox probe for AlertManager:
probe_http_status_code = 0
probe_success = 0 (FAIL — AlertManager unreachable)
Ten dropped notifications. Nine delivery errors. Prometheus is generating alerts and trying to send them, but the destination — AlertManager — is the very thing that’s broken.
Prometheus screams into the void. Nobody hears it.
What This Proves
These aren’t theoretical risks. In a standard Prometheus + AlertManager setup:
- Killing Prometheus → zero alerts fire, absent() can't evaluate, dashboards freeze, Watchdog expires silently
- Killing AlertManager → Prometheus detects the failure and fires alerts, but can't deliver them to anyone
In both cases, the team is blind. The failure is detectable but undeliverable.
This is exactly why meta-monitoring through an independent path isn’t optional — it’s the only thing that catches these failures in practice.
Want to reproduce this? The companion lab environment includes all the configs and chaos scripts. Run docker compose up, break things, and see it for yourself.
The Solution: Meta-Monitoring and Layered Reliability
Meta-monitoring isn’t complicated. But it requires a deliberate architectural decision:
The system that monitors your monitoring must not share failure domains with the monitoring itself.
Here’s a layered approach that works in practice.
Regular path: Targets → Prometheus → AlertManager → Slack
Meta path: Watchdog/Blackbox → External heartbeat → PagerDuty/SMS
Layer 1: The Watchdog Alert (DeadMansSwitch)
The simplest and most effective pattern. Create an alert that always fires:
groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Watchdog alert - monitoring pipeline is alive"
          description: >
            This alert is always firing. If you stop receiving it,
            that means Prometheus or AlertManager is down.
This alert is always active. Its purpose isn’t to notify you about a problem — it’s a heartbeat. If you stop receiving it, something in your pipeline is broken.
The key is where you send it: not through your regular notification channels.
Route it to an external service that expects periodic check-ins:
- PagerDuty (via Events API v2 heartbeat monitoring)
- Healthchecks.io (free tier available)
- Deadman’s Snitch
- OpsGenie heartbeat
If the heartbeat stops arriving, the external service alerts you through a completely independent path.
Layer 2: External Health Probes
Use blackbox-exporter (or an external monitoring service) to probe the health endpoints of your monitoring stack:
# Monitoring stack health endpoints (blackbox-exporter targets)
- targets:
    - http://prometheus:9090/-/healthy
    - http://alertmanager:9093/-/healthy
    - http://grafana:3000/api/health
This catches scenarios where the service is running but unhealthy — a loaded Prometheus that’s falling behind on scrapes, an AlertManager that can’t reach its notification backends, or a Grafana instance with a broken data source.
Critically, the blackbox-exporter results should feed into a different Prometheus instance or an external monitoring system — not back into the same pipeline you’re trying to protect.
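The probes can also be complemented with the stack's own delivery counters, which Prometheus and AlertManager already expose. A sketch (thresholds are illustrative, and like any rule these only help if the pipeline evaluating them is the one that is still alive):

- alert: PrometheusNotificationFailures
  # Prometheus is generating alerts but failing to hand them to AlertManager
  expr: rate(prometheus_notifications_errors_total[5m]) > 0
  for: 5m
  labels:
    severity: critical

- alert: AlertmanagerNotificationFailures
  # AlertManager accepted alerts but its receivers (Slack, webhook, ...) are failing
  expr: rate(alertmanager_notifications_failed_total[5m]) > 0
  for: 5m
  labels:
    severity: critical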
Layer 3: Cross-Instance Monitoring
If you run multiple Prometheus instances (which you should for any serious production setup), have them monitor each other:
# prometheus-a scrape config
scrape_configs:
  - job_name: 'prometheus-b'
    static_configs:
      - targets: ['prometheus-b:9090']

# prometheus-b scrape config
scrape_configs:
  - job_name: 'prometheus-a'
    static_configs:
      - targets: ['prometheus-a:9090']
Each instance can alert on the other’s health:
- alert: PeerPrometheusDown
  expr: up{job="prometheus-b"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Peer Prometheus instance is down"
This creates mutual supervision — neither instance can fail silently as long as the other is running.
Layer 4: Separate Notification Path
This is the part most teams skip, and it’s the most important:
Meta-alerts must not travel through the same notification path as regular alerts.
If your regular alerts go: Prometheus → AlertManager → Slack webhook → #alerts channel
Your meta-alerts should go: External service → PagerDuty/OpsGenie direct → phone call/SMS
The moment your meta-alert shares infrastructure with your regular alerts, you’ve reintroduced the single point of failure you were trying to eliminate.
Practical Patterns That Actually Work
The examples below assume you route meta-monitoring signals through an independent path. If you point blackbox results into the same Prometheus you’re trying to protect, treat it as a baseline only — not true failure-domain isolation.
AlertManager DeadMansSwitch Routing
route:
  receiver: 'default'
  routes:
    - match:
        alertname: Watchdog
      receiver: 'deadmansswitch'
      repeat_interval: 1m
      group_wait: 0s
      group_interval: 1m

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: 'deadmansswitch'
    webhook_configs:
      - url: 'https://hc-ping.com/your-uuid-here'
        send_resolved: false
The Watchdog alert fires every minute to an external heartbeat service. If it stops, you know your pipeline is broken — regardless of what your dashboards say.
Blackbox Probing Your Own Stack
# In prometheus.yml
scrape_configs:
  - job_name: 'monitoring-health'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://prometheus:9090/-/healthy
          - http://alertmanager:9093/-/healthy
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Combined with an alert rule:
- alert: MonitoringComponentDown
  expr: probe_success{job="monitoring-health"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Monitoring component {{ $labels.instance }} is unhealthy"
Resource Guarantees for Monitoring Pods
Don’t let the kubelet decide your monitoring is expendable:
resources:
  requests:
    memory: "2Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "1000m"
If you want the Guaranteed QoS class, which the kubelet evicts last, set requests equal to limits. Then add a PriorityClass to ensure monitoring pods survive node pressure:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: monitoring-critical
value: 1000000
globalDefault: false
description: "Priority class for monitoring components"
Monitoring should be among the last things evicted, not the first.
Tying It All Together
Across this three-part series, we’ve covered three dimensions of observability maturity:
- Article 1 — Cardinality → Financial discipline: Are we paying for metrics that drive decisions?
- Article 2 — Absent data → Data reliability: Do we know when measurement itself disappears?
- Article 3 — Meta-monitoring → System reliability: Can our monitoring survive its own failures?
Each layer builds on the previous:
- Cardinality controls keep your meta-monitoring cost-effective (you don’t need high-cardinality labels on watchdog alerts)
- absent() rules detect data gaps — but only if the pipeline evaluating them is alive
- Meta-monitoring ensures the entire chain is operational, from scrape to notification
Quick Checklist (This Week)
- Create a Watchdog alert and route it to an external heartbeat service
- Probe Prometheus/AlertManager/Grafana health from an independent path
- Add a meta-alert channel that does not share the normal alerting pipeline
The Real Takeaway
In mature engineering organizations, observability isn’t treated as tooling — it’s treated as critical infrastructure.
It deserves:
- Its own SLOs (scrape success rate, alert evaluation latency, notification delivery time; a sketch of the underlying signals follows this list)
- Its own incident runbooks
- Its own failure domains, separate from the systems it monitors
- Its own monitoring — yes, monitors all the way down
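The raw signals for those SLOs are already exported by the stack itself. A sketch of recording rules built on that assumption (the sli: naming is just a convention):

groups:
  - name: observability-slis
    rules:
      # Fraction of scrape targets currently up
      - record: sli:scrape_success:ratio
        expr: sum(up) / count(up)
      # 99th percentile rule evaluation latency (Prometheus self-metric, a summary)
      - record: sli:rule_evaluation_latency_seconds:p99
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"}
      # 99th percentile latency of alert delivery to AlertManager
      - record: sli:notification_latency_seconds:p99
        expr: prometheus_notifications_latency_seconds{quantile="0.99"}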
The paradox is real: the thing that tells you everything is fine can itself silently fail.
The solution isn’t infinite recursion. It’s layered reliability with independent failure domains.
A Watchdog alert. An external heartbeat. A separate notification path.
Three simple patterns that turn your monitoring from a single point of failure into a system worthy of the infrastructure it protects.
If your monitoring goes down and nobody notices, was it ever monitoring?
Originally published on Medium.