Open to Engineering Manager / Director roles · Let's connect
Case Studies

Decisions under pressure.

Real engineering scenarios. Choose a path. See the consequences. Understand the reasoning.

Incident Response

A Minor Dependency Update Broke Production for 12 Hours

A routine patch update to a date-formatting library changed its locale handling. The change was semver-compliant. Tests passed. The bug shipped to production and silently corrupted date-sensitive financial reports for 12 hours.
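One defense against this failure mode is a characterization test that pins the exact output of the third-party call, so a semver-compliant patch that changes behavior fails CI instead of shipping. A minimal sketch, with a stand-in function in place of the real library (names and the golden value are illustrative, not from the case):

```python
# Characterization test sketch: pin the exact output of a
# formatting call so a behavioral change in a dependency's
# patch release fails in CI, not in production.
from datetime import date

def format_report_date(d: date) -> str:
    # Stand-in for the third-party library call whose locale
    # handling changed in a patch release.
    return d.strftime("%d/%m/%Y")

def test_report_date_is_locale_stable():
    # Golden value: if a dependency update changes this output,
    # the test fails before the change reaches reports.
    assert format_report_date(date(2024, 3, 1)) == "01/03/2024"

test_report_date_is_locale_stable()
```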

Enter this case
Observability

We Had 400 Alerts and Missed the One That Mattered

The on-call engineer received 400+ alerts per week. When a real incident started — a slow memory leak that would eventually OOM-kill the primary database — the alert was buried in noise. The outage lasted 4 hours. The alert had fired 90 minutes earlier.
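The underlying fix for alert fatigue is usually routing, not deletion: only page when an alert is both severe and actionable, and keep everything else visible without interrupting anyone. A minimal sketch, assuming a simple severity/actionability model (names and thresholds are hypothetical):

```python
# Triage sketch: page only on critical, actionable alerts;
# everything else is recorded but never wakes anyone up.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str      # "critical" | "warning" | "info"
    actionable: bool   # does a human need to act right now?

def route(alert: Alert) -> str:
    if alert.severity == "critical" and alert.actionable:
        return "page"       # interrupts the on-call engineer
    if alert.severity == "critical":
        return "ticket"     # real, but can wait for review
    return "dashboard"      # visible, never interrupts

alerts = [
    Alert("db_memory_rss_growing", "critical", True),
    Alert("disk_70_percent", "warning", False),
]
print([route(a) for a in alerts])  # ['page', 'dashboard']
```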

Enter this case
Performance

We Built a Cache That Made the System Slower

A team added a Redis caching layer to speed up a slow API endpoint. Response times got worse. The cache was working perfectly — it was caching the wrong thing.
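The shape of this failure is worth seeing in miniature: a cache keyed on something unique per request never gets a hit, so every call pays cache overhead on top of the original work. An illustrative sketch (the key scheme and names are hypothetical, not the actual implementation from the case):

```python
# "Caching the wrong thing" in miniature: per-request keys
# guarantee a 0% hit rate, so the cache is pure overhead.
cache: dict = {}
hits = misses = 0

def lookup(key, compute):
    global hits, misses
    if key in cache:
        hits += 1
        return cache[key]
    misses += 1
    cache[key] = compute()
    return cache[key]

# Keyed per request id: no two requests ever share a key.
for request_id in range(100):
    lookup("report:req-%d" % request_id, lambda: "rendered report")

print("hit rate: %d%%" % (100 * hits // (hits + misses)))  # hit rate: 0%
```

Measuring the hit rate before and after keying decisions is the cheap way to catch this: a cache with a near-zero hit rate can only slow the system down.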

Enter this case
Incident Response · Featured

Production Pods Were Restarting Randomly

A production incident involving intermittent connection failures and unstable recovery behavior. The fix was surgical. The real lesson was about what not to do first.

Enter this case
Architecture

A Database Migration Took Down the Entire Platform

A routine schema migration brought down a multi-tenant SaaS platform for 47 minutes during business hours. The migration itself was correct. The deployment strategy was the failure.

Enter this case
Architecture

We Split the Monolith and Made Everything Worse

A team extracted a billing service from a monolith to improve deploy velocity. Deploys got faster. Everything else got slower, harder to debug, and more fragile. The architecture was right. The boundary was wrong.

Enter this case
Incident Response

A Feature Flag We Forgot About Caused a Production Incident

A feature flag created 18 months ago was still in the codebase. When the flag provider had a timeout, the flag evaluated to its default value — which no longer matched the state of the system. The result was a data corruption bug that took three days to fully remediate.
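The mechanism generalizes: a hardcoded fallback is frozen at flag-creation time, while the system keeps moving. A minimal sketch of the failure, assuming a provider that returns `None` on timeout (the flag values are illustrative, not from the case):

```python
# Sketch of the stale-default failure: the provider times out,
# evaluation falls back to a default frozen in code 18 months
# ago, and behavior silently flips.
def evaluate_flag(provider_value, default):
    # provider_value is None when the flag service times out
    return provider_value if provider_value is not None else default

LIVE_VALUE = "new_billing_path"        # what production actually runs
STALE_DEFAULT = "legacy_billing_path"  # frozen at flag creation

# Normal operation: the provider's value wins.
assert evaluate_flag(LIVE_VALUE, STALE_DEFAULT) == "new_billing_path"
# Provider timeout: the 18-month-old default takes over.
assert evaluate_flag(None, STALE_DEFAULT) == "legacy_billing_path"
```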

Enter this case
Performance

GraphQL Performance Was Deteriorating

API response times were climbing. The database looked guilty. The real culprit was an N+1 query pattern hiding in plain sight — and the instinct to scale made it worse.
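The N+1 shape is easy to show in miniature: resolving a field one item at a time issues a query per item, while batching the lookups issues one. An illustrative sketch with in-memory stand-ins for the database (all names are hypothetical):

```python
# N+1 in miniature: one query per post vs. one batched query.
queries_issued = 0

AUTHORS = {1: "ada", 2: "lin"}
POSTS = [{"id": i, "author_id": 1 + i % 2} for i in range(10)]

def fetch_author(author_id):            # N+1: one query per call
    global queries_issued
    queries_issued += 1
    return AUTHORS[author_id]

def fetch_authors(author_ids):          # batched: one query total
    global queries_issued
    queries_issued += 1
    return {i: AUTHORS[i] for i in set(author_ids)}

for post in POSTS:                      # naive resolver loop
    fetch_author(post["author_id"])
n_plus_one = queries_issued

queries_issued = 0
fetch_authors([p["author_id"] for p in POSTS])
print(n_plus_one, queries_issued)       # 10 1
```

Scaling the database makes each of the N queries a little faster; batching removes N-1 of them, which is why the scaling instinct made things worse.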

Enter this case
Security

Security Vulnerabilities Were Accumulating in Our GraphQL Stack

Our GraphQL libraries were 2+ years outdated with active CVEs. The fast fix was obvious. The right fix was harder to justify — until you see what the fast fix actually leaves behind.

Enter this case
Architecture · Featured

The Cloud Migration That Almost Broke Our Export Service

We had three weeks to migrate file storage from AWS S3 to Azure before our AWS contract renewed. The codebase had a clean storage abstraction built for exactly this scenario. We almost shipped without checking whether everything actually used it.
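The check we almost skipped can be mechanical: scan the codebase for code that imports the cloud SDK directly instead of going through the abstraction. A minimal sketch, with in-memory file contents standing in for the repository (module and path names are illustrative):

```python
# Audit sketch: flag any source file that bypasses the storage
# abstraction by importing the cloud SDK (boto3) directly.
import re

SOURCE_FILES = {
    "exports/service.py": "import boto3\nclient = boto3.client('s3')",
    "uploads/service.py": "from storage import Storage",
}

def find_bypasses(files):
    pattern = re.compile(r"^\s*(import|from)\s+boto3\b", re.MULTILINE)
    return [path for path, src in files.items() if pattern.search(src)]

print(find_bypasses(SOURCE_FILES))  # ['exports/service.py']
```

Run as a CI lint, a check like this turns "we assume everything uses the abstraction" into a verified invariant before the migration starts.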

Enter this case