Open to Engineering Manager / Director rolesLet's connect
Cases/A Feature Flag We Forgot About Caused a Production Incident
Incident Response

A Feature Flag We Forgot About Caused a Production Incident

A feature flag created 18 months ago was still in the codebase. When the flag provider had a timeout, the flag evaluated to its default value — which no longer matched the state of the system. The result was a data corruption bug that took three days to fully remediate.

What's at stake
  • Stale feature flag controlling a write path to a financial data pipeline
  • Flag provider outage caused the flag to evaluate to a default nobody remembered setting
  • Data inconsistency propagated for 6 hours before detection

Feature flags are treated as temporary by design but permanent by neglect. This flag was supposed to be removed after a two-week rollout 18 months ago. Instead, it became invisible load-bearing infrastructure — and when it failed, it failed silently into a state that corrupted downstream data.

The Scenario

You're the tech lead at a fintech company. An alert fires: financial transaction records don't match between two systems. Investigation reveals a feature flag that's been in the codebase for 18 months is suddenly evaluating to its default value because the flag provider is timing out. The default routes transactions through a deprecated code path that writes data in an old format. It's been running this way for 6 hours. What do you do?

No hints. Just judgment.

The common mistake

Hardcoding the flag and deploying immediately is the right instinct — stop the bleeding first. But deploying without establishing the corruption boundary first means the remediation team has no clean scope for recovery. They end up diffing entire datasets instead of querying a precise time window. Two minutes of scoping before the fix saves days of recovery work after.

Lessons
  • Feature flags are temporary by design but permanent by neglect — enforce expiration
  • A flag's default value is a silent contract with the future — it must remain valid or the flag must be removed
  • Establish the corruption boundary before deploying the fix — two minutes of scoping saves days of recovery
  • Flag provider failures should fail safe, not fail to a default that may no longer be correct
  • Stale flags are invisible technical debt until they become visible incidents
Impact
  • Data remediation completed in 8 hours instead of an estimated 3 days
  • 23 stale feature flags identified and removed across the codebase
  • Flag expiration policy implemented — all new flags require a review date
  • Flag provider SDK configured with fallback cache to prevent timeout-driven defaults
← Back to all cases