Observability

We Had 400 Alerts and Missed the One That Mattered

The on-call engineer received 400+ alerts per week. When a real incident started — a slow memory leak that would eventually OOM-kill the primary database — the alert was buried in noise. The outage lasted 4 hours. The alert had fired 90 minutes earlier.

What's at stake
  • Primary database heading toward OOM failure with customer data at risk
  • On-call engineer averaging 400+ alerts per week, most non-actionable
  • Alert fired 90 minutes before the outage but was never seen

The monitoring system worked. It detected the problem with 90 minutes of lead time — more than enough to prevent the outage. But the alerting system had trained the on-call team to ignore it. The failure wasn't detection. It was attention.
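
To make that lead time concrete, here is a minimal sketch of how a memory-leak alert can fire well before the OOM: fit a linear trend to recent memory-usage samples and extrapolate to the limit. The sample format, window, and numbers are illustrative assumptions; the case does not describe the team's actual monitoring stack.

    # Estimate time-to-OOM by fitting a straight line to recent memory samples.
    # Everything here (sample shape, limit, leak rate) is hypothetical.
    from __future__ import annotations
    from dataclasses import dataclass

    @dataclass
    class Sample:
        ts: float          # seconds since epoch
        used_bytes: float  # memory in use at that timestamp

    def seconds_until_oom(samples: list[Sample], limit_bytes: float) -> float | None:
        """Least-squares slope over the window; None if usage is flat or falling."""
        n = len(samples)
        if n < 2:
            return None
        mean_t = sum(s.ts for s in samples) / n
        mean_u = sum(s.used_bytes for s in samples) / n
        var = sum((s.ts - mean_t) ** 2 for s in samples)
        if var == 0:
            return None
        slope = sum((s.ts - mean_t) * (s.used_bytes - mean_u) for s in samples) / var
        if slope <= 0:
            return None  # no leak visible in this window
        return (limit_bytes - samples[-1].used_bytes) / slope

    # A ~10 MB/min leak with ~900 MB of headroom left gives roughly 91 minutes of warning.
    samples = [Sample(ts=60 * i, used_bytes=6e9 + 1e7 * i) for i in range(30)]
    print(seconds_until_oom(samples, limit_bytes=7.2e9) / 60)  # ~91.0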

The Scenario

You're the engineering manager responsible for platform reliability. After a 4-hour database outage, the post-mortem reveals: the alert fired 90 minutes before the OOM. The on-call engineer had 47 unread alerts at the time. Average weekly alert volume is 400+. Leadership wants a plan to make sure this doesn't happen again. What do you propose?

No hints. Just judgment.

The common mistake

Better alerting platforms can group and deduplicate alerts, which reduces the volume the on-call engineer sees. But if the underlying alerts aren't actionable, you've organized the noise without improving the signal. Grouped non-actionable alerts are still non-actionable. The investment in tooling feels productive but doesn't change the fundamental ratio of signal to noise.
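
A hypothetical illustration of why: grouping shrinks the on-call inbox, but every non-actionable rule is still present and still firing. The figures below are invented for the sketch, not the team's real data.

    # Grouping reduces notifications, not the number of rules worth paging on.
    # All counts are made up for illustration.
    from collections import defaultdict

    alerts = (
        [{"rule": "disk_io_spike", "actionable": False}] * 300
        + [{"rule": "cache_miss_ratio", "actionable": False}] * 90
        + [{"rule": "db_memory_trend", "actionable": True}] * 1  # the one that mattered
    )

    groups = defaultdict(list)
    for a in alerts:
        groups[a["rule"]].append(a)

    pages_before = len(alerts)  # 391 individual pages
    pages_after = len(groups)   # 3 grouped notifications
    rules_worth_paging_on = sum(any(a["actionable"] for a in g) for g in groups.values())

    print(f"pages per week: {pages_before} -> {pages_after}")
    print(f"rules worth paging on: {rules_worth_paging_on} of {len(groups)}")
    # The inbox is smaller, but two of the three rules still never require action.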

Lessons
  • Every post-mortem adds alerts — build a process that also removes them
  • Alert fatigue is a system design problem, not a willpower problem
  • The metric that matters is not alert volume but signal-to-noise ratio
  • Monitoring that doesn't require immediate action should be a dashboard, not a page (see the routing sketch after this list)
  • On-call engineers will always optimize for their own sanity — design the system so that optimization aligns with reliability
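
The last two lessons reduce to a routing rule: only conditions that are both actionable and urgent page a human; actionable but not urgent becomes a ticket; everything else lives on a dashboard. A minimal sketch, using an illustrative severity model rather than the team's actual tooling:

    # Route each alert condition by whether it needs a human, and how soon.
    # The model and the examples in the asserts are hypothetical.
    from enum import Enum

    class Route(str, Enum):
        PAGE = "page"            # wake someone up now
        TICKET = "ticket"        # handle during working hours
        DASHBOARD = "dashboard"  # observe; never notify

    def route_alert(actionable: bool, urgent: bool) -> Route:
        if actionable and urgent:
            return Route.PAGE
        if actionable:
            return Route.TICKET
        return Route.DASHBOARD

    assert route_alert(actionable=True, urgent=True) is Route.PAGE         # db trending toward OOM
    assert route_alert(actionable=True, urgent=False) is Route.TICKET      # cert expires in 20 days
    assert route_alert(actionable=False, urgent=False) is Route.DASHBOARD  # cache hit ratio dipped
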
Impact
  • Weekly alert volume reduced by over 90%
  • Mean time to acknowledge real incidents dropped from 47 minutes to under 5 minutes
  • Zero missed real alerts in the quarter following the audit
  • Quarterly alert review process adopted by three additional teams (a sketch of the review query follows)
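
The quarterly review can be run as a simple query over the quarter's pages, assuming a log that records whether each page led to remediation work. The log format and the 50% threshold below are assumptions, not details from the case.

    # Flag rules that paged this quarter but rarely required action.
    # Log shape and threshold are hypothetical.
    from collections import Counter

    def review(pages: list[dict], min_action_rate: float = 0.5) -> list[str]:
        """Return rules whose pages led to action less than min_action_rate of the time."""
        fired = Counter(p["rule"] for p in pages)
        acted = Counter(p["rule"] for p in pages if p["led_to_action"])
        return [rule for rule, n in fired.items() if acted[rule] / n < min_action_rate]

    pages = [
        {"rule": "disk_io_spike", "led_to_action": False},
        {"rule": "disk_io_spike", "led_to_action": False},
        {"rule": "db_memory_trend", "led_to_action": True},
    ]
    print(review(pages))  # ['disk_io_spike'] -> demote to a dashboard or delete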