Observability

We Had 400 Alerts and Missed the One That Mattered

The on-call engineer received 400+ alerts per week. When a real incident started — a slow memory leak that would eventually OOM-kill the primary database — the alert was buried in noise. The outage lasted 4 hours. The alert had fired 90 minutes earlier.

What's at stake
  • Primary database heading toward OOM failure with customer data at risk
  • On-call engineer averaging 400+ alerts per week, most non-actionable
  • Alert fired 90 minutes before the outage but was never seen

The monitoring system worked. It detected the problem with 90 minutes of lead time — more than enough to prevent the outage. But the alerting system had trained the on-call team to ignore it. The failure wasn't detection. It was attention.
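
To make that lead time concrete, here is a minimal sketch of how a memory-leak alert can fire well before the OOM: fit a linear trend to recent memory-usage samples and extrapolate to the limit. The sample format, window, and numbers are illustrative assumptions; the case does not describe the team's actual monitoring stack.

    # Estimate time-to-OOM by fitting a straight line to recent memory samples.
    # Everything here (sample shape, limit, leak rate) is hypothetical.
    from __future__ import annotations
    from dataclasses import dataclass

    @dataclass
    class Sample:
        ts: float          # seconds since epoch
        used_bytes: float  # memory in use at that timestamp

    def seconds_until_oom(samples: list[Sample], limit_bytes: float) -> float | None:
        """Least-squares slope over the window; None if usage is flat or falling."""
        n = len(samples)
        if n < 2:
            return None
        mean_t = sum(s.ts for s in samples) / n
        mean_u = sum(s.used_bytes for s in samples) / n
        var = sum((s.ts - mean_t) ** 2 for s in samples)
        if var == 0:
            return None
        slope = sum((s.ts - mean_t) * (s.used_bytes - mean_u) for s in samples) / var
        if slope <= 0:
            return None  # no leak visible in this window
        return (limit_bytes - samples[-1].used_bytes) / slope

    # A ~10 MB/min leak with ~900 MB of headroom left gives roughly 91 minutes of warning.
    samples = [Sample(ts=60 * i, used_bytes=6e9 + 1e7 * i) for i in range(30)]
    print(seconds_until_oom(samples, limit_bytes=7.2e9) / 60)  # ~91.0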

The Scenario

You're the engineering manager responsible for platform reliability. After a 4-hour database outage, the post-mortem reveals: the alert fired 90 minutes before the OOM. The on-call engineer had 47 unread alerts at the time. Average weekly alert volume is 400+. Leadership wants a plan to make sure this doesn't happen again. What do you propose?

No hints. Just judgment.

The common mistake

Better alerting platforms can group and deduplicate alerts, which reduces the volume the on-call engineer sees. But if the underlying alerts aren't actionable, you've organized the noise without improving the signal. Grouped non-actionable alerts are still non-actionable. The investment in tooling feels productive but doesn't change the fundamental ratio of signal to noise.
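
A hypothetical illustration of why: grouping shrinks the on-call inbox, but every non-actionable rule is still present and still firing. The figures below are invented for the sketch, not the team's real data.

    # Grouping reduces notifications, not the number of rules worth paging on.
    # All counts are made up for illustration.
    from collections import defaultdict

    alerts = (
        [{"rule": "disk_io_spike", "actionable": False}] * 300
        + [{"rule": "cache_miss_ratio", "actionable": False}] * 90
        + [{"rule": "db_memory_trend", "actionable": True}] * 1  # the one that mattered
    )

    groups = defaultdict(list)
    for a in alerts:
        groups[a["rule"]].append(a)

    pages_before = len(alerts)  # 391 individual pages
    pages_after = len(groups)   # 3 grouped notifications
    rules_worth_paging_on = sum(any(a["actionable"] for a in g) for g in groups.values())

    print(f"pages per week: {pages_before} -> {pages_after}")
    print(f"rules worth paging on: {rules_worth_paging_on} of {len(groups)}")
    # The inbox is smaller, but two of the three rules still never require action.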

Lessons
  • Every post-mortem adds alerts — build a process that also removes them
  • Alert fatigue is a system design problem, not a willpower problem
  • The metric that matters is not alert volume but signal-to-noise ratio
  • Monitoring that doesn't require immediate action should be a dashboard, not a page (see the routing sketch after this list)
  • On-call engineers will always optimize for their own sanity — design the system so that optimization aligns with reliability
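
The last two lessons reduce to a routing rule: only conditions that are both actionable and urgent page a human; actionable but not urgent becomes a ticket; everything else lives on a dashboard. A minimal sketch, using an illustrative severity model rather than the team's actual tooling:

    # Route each alert condition by whether it needs a human, and how soon.
    # The model and the examples in the asserts are hypothetical.
    from enum import Enum

    class Route(str, Enum):
        PAGE = "page"            # wake someone up now
        TICKET = "ticket"        # handle during working hours
        DASHBOARD = "dashboard"  # observe; never notify

    def route_alert(actionable: bool, urgent: bool) -> Route:
        if actionable and urgent:
            return Route.PAGE
        if actionable:
            return Route.TICKET
        return Route.DASHBOARD

    assert route_alert(actionable=True, urgent=True) is Route.PAGE         # db trending toward OOM
    assert route_alert(actionable=True, urgent=False) is Route.TICKET      # cert expires in 20 days
    assert route_alert(actionable=False, urgent=False) is Route.DASHBOARD  # cache hit ratio dipped
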
Impact
  • Weekly alert volume reduced by over 90%
  • Mean time to acknowledge real incidents dropped from 47 minutes to under 5 minutes
  • Zero missed real alerts in the quarter following the audit
  • Quarterly alert review process adopted by three additional teams (a sketch of the review query follows)
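
The quarterly review can be run as a simple query over the quarter's pages, assuming a log that records whether each page led to remediation work. The log format and the 50% threshold below are assumptions, not details from the case.

    # Flag rules that paged this quarter but rarely required action.
    # Log shape and threshold are hypothetical.
    from collections import Counter

    def review(pages: list[dict], min_action_rate: float = 0.5) -> list[str]:
        """Return rules whose pages led to action less than min_action_rate of the time."""
        fired = Counter(p["rule"] for p in pages)
        acted = Counter(p["rule"] for p in pages if p["led_to_action"])
        return [rule for rule, n in fired.items() if acted[rule] / n < min_action_rate]

    pages = [
        {"rule": "disk_io_spike", "led_to_action": False},
        {"rule": "disk_io_spike", "led_to_action": False},
        {"rule": "db_memory_trend", "led_to_action": True},
    ]
    print(review(pages))  # ['disk_io_spike'] -> demote to a dashboard or delete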