Daily Note: Debugging a Production API Flow
Integration test green, staging green, prod failing for 1% of callers. The bug was in a place I would never have guessed.
Today was one of those.
API returns 200 in staging, 500 for ~1% of prod traffic. CloudWatch shows the Lambda timing out at 29.9s. Every integration test green. Every load test green.
Spent two hours blaming the database. It wasn't the database.
Turned out: one of our upstream SaaS integrations started returning responses with a specific header (Retry-After, going by the fix) that our HTTP client silently honored before retrying. Three retries × 10s wait each = 30s = Lambda timeout. The 1% was the subset of customers whose data happened to route through a specific vendor edge.
# imports used by both versions
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# before
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=1)))

# after
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=3,
    backoff_factor=1,
    respect_retry_after_header=False,  # the vendor lies
    allowed_methods=frozenset(["GET"]),  # never retry POSTs
)))
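A quick way to see what the change does is to ask each policy directly whether it would retry. This is a rough sketch, not what I ran during the incident. It assumes urllib3 1.26 or newer (for allowed_methods) and a 503 response carrying Retry-After; the note above does not record which status the vendor actually sent, and urllib3 only honors Retry-After on 413, 429 and 503. The helper name retries_on is made up for illustration.

```python
from urllib3.util.retry import Retry

# The two policies from above, rebuilt outside the session.
before = Retry(total=3, backoff_factor=1)
after = Retry(
    total=3,
    backoff_factor=1,
    respect_retry_after_header=False,
    allowed_methods=frozenset(["GET"]),
)


def retries_on(policy: Retry, method: str, status: int) -> bool:
    """Would this policy retry a response with this status and a Retry-After header?"""
    return policy.is_retry(method, status, has_retry_after=True)


# Assumed scenario: vendor answers 503 with a Retry-After header.
before_get_503 = retries_on(before, "GET", 503)
after_get_503 = retries_on(after, "GET", 503)
after_post_503 = retries_on(after, "POST", 503)

print(before_get_503, after_get_503, after_post_503)
```

Worth noticing: with no status_forcelist set, the "after" policy does not retry on status codes at all. As far as I can tell it still retries connection and read errors, for GET only.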
Lesson I keep re-learning: retry policies are a system boundary. Default values are almost always wrong for your system.
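One way to keep that boundary visible is to check the worst-case retry wait against the caller's timeout before shipping a policy. A minimal sketch using the numbers from this incident (3 retries, about 10s wait each, 30s Lambda timeout). The function names and the 1s headroom are assumptions for illustration, and the calculation ignores the time the requests themselves take.

```python
def worst_case_retry_wait(retries: int, wait_per_retry_s: float) -> float:
    """Total time spent sleeping between attempts, ignoring request time."""
    return retries * wait_per_retry_s


def fits_in_budget(
    retries: int,
    wait_per_retry_s: float,
    caller_timeout_s: float,
    headroom_s: float = 1.0,
) -> bool:
    """True if the sleeps alone still leave `headroom_s` for real work."""
    total_wait = worst_case_retry_wait(retries, wait_per_retry_s)
    return total_wait + headroom_s <= caller_timeout_s


# The incident: three retries, each waiting ~10s, inside a 30s Lambda.
incident_wait = worst_case_retry_wait(3, 10)
incident_ok = fits_in_budget(3, 10, caller_timeout_s=30)

# Same retry count with a 1s wait per retry.
short_wait_ok = fits_in_budget(3, 1, caller_timeout_s=30)

print(incident_wait, incident_ok, short_wait_ok)
```

A check like this could sit in a unit test next to the session setup, so a changed default or a vendor-controlled wait shows up as a failing assertion instead of a timeout in prod.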
Going to bed with a headache but a green dashboard.