Observability · medium

How do you investigate a latency regression in production?

Tags
#latency #incident #tracing

Answer

Start with metrics to confirm scope (p95/p99, affected endpoints, regions), then use traces to locate slow spans and logs to pinpoint the exact errors or queries. Finally, correlate the regression with recent deploys and config changes.
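
For instance, confirming scope can be as simple as asking the metrics backend for tail latency per route. The sketch below assumes Prometheus and a conventional http_request_duration_seconds histogram with a route label; the URL, metric, and label names are placeholders for whatever your stack actually exposes.

```python
# Hedged sketch: pull p99 latency per route from Prometheus to confirm
# which endpoints regressed. Endpoint URL, metric name, and the "route"
# label are assumptions; adjust to your own setup.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # assumed address
QUERY = (
    "histogram_quantile(0.99, "
    "sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))"
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

# Print p99 per route, worst first, to see where the regression lives.
results = resp.json()["data"]["result"]
for series in sorted(results, key=lambda s: float(s["value"][1]), reverse=True):
    route = series["metric"].get("route", "<unknown>")
    p99_seconds = float(series["value"][1])
    print(f"{route}: p99 = {p99_seconds * 1000:.0f} ms")
```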

Advanced answer

Deep dive

A structured workflow saves time:

  • Confirm impact: user-facing vs internal, % of traffic.
  • Slice by dimension: endpoint, region, tenant, version.
  • Trace bottlenecks: DB, cache, downstream, queue (see the span-grouping sketch after this list).
  • Correlate with deploys, feature flags, or traffic shifts.
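
To make the "trace bottlenecks" step concrete, a minimal sketch: given spans exported from a slow trace (the field names here are assumptions, not any particular backend's export format), total the time spent per service and operation to see where the trace actually goes slow.

```python
# Minimal sketch: group exported spans from one slow trace by
# service/operation and sum their durations to find the bottleneck.
# Field names (service, operation, duration_ms) are illustrative.
from collections import defaultdict

spans = [
    {"service": "api", "operation": "GET /checkout", "duration_ms": 940},
    {"service": "db", "operation": "SELECT orders", "duration_ms": 610},
    {"service": "cache", "operation": "GET cart", "duration_ms": 8},
    {"service": "payments", "operation": "POST /charge", "duration_ms": 290},
]

totals = defaultdict(float)
for span in spans:
    totals[(span["service"], span["operation"])] += span["duration_ms"]

# Largest contributors first: the DB query dominates this example trace.
for (service, operation), ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{service:10s} {operation:20s} {ms:6.0f} ms")
```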

Examples

Regression checklist:

1) p95/p99 up?
2) Which routes?
3) Which version?
4) Trace slow spans
5) Check DB: slow query log / locks / cache hit ratio (see the sketch below)
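
For checklist item 5, if the cache happens to be Redis, the hit ratio can be read straight from INFO stats. A hedged sketch with redis-py; the host, port, and the 90% threshold are illustrative assumptions, not fixed rules.

```python
# Hedged sketch: a falling cache hit ratio can explain a latency
# regression by pushing more reads onto the database.
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # assumed host/port
stats = r.info("stats")

hits = stats["keyspace_hits"]
misses = stats["keyspace_misses"]
total = hits + misses
hit_ratio = hits / total if total else 0.0

print(f"cache hit ratio: {hit_ratio:.1%} ({hits} hits / {misses} misses)")
if hit_ratio < 0.90:  # threshold is illustrative, not a universal rule
    print("low hit ratio -- expect more load (and latency) on the database")
```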

Common pitfalls

  • Looking only at average latency (hides tail issues; illustrated below).
  • Blaming the last deploy without evidence.
  • Ignoring downstream dependency status.
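
The first pitfall is easy to demonstrate with made-up numbers: a handful of pathologically slow requests barely moves the mean, while p99 is clearly broken.

```python
# Why averages hide tail issues: synthetic latencies where 5% of
# requests are pathologically slow.
from statistics import mean, quantiles

latencies_ms = [40] * 95 + [2500] * 5

cuts = quantiles(latencies_ms, n=100)          # 99 percentile cut points
print(f"mean = {mean(latencies_ms):.0f} ms")   # ~163 ms, looks survivable
print(f"p50  = {cuts[49]:.0f} ms")             # 40 ms
print(f"p99  = {cuts[98]:.0f} ms")             # 2500 ms, clearly broken
```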

Interview follow-ups

  • How do you decide between rollback vs fix-forward?
  • How do you test for latency regressions pre-prod? (see the sketch after this list)
  • What if traces are missing for the affected traffic?
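
For the pre-prod question, one common pattern (sketched here with invented numbers and a 10% budget) is a latency gate in CI: run a load test, compute a percentile, and fail the build if it drifts past the last known-good baseline.

```python
# Hedged sketch of a CI latency gate: compare a load-test percentile
# against a stored baseline. Baseline, samples, and the 10% budget
# are made-up values.
import sys
from statistics import quantiles

BASELINE_P95_MS = 180.0   # assumed value recorded from the last good release
TOLERANCE = 0.10          # allow 10% drift before failing the pipeline

def p95(samples_ms: list[float]) -> float:
    return quantiles(samples_ms, n=100)[94]

# In a real pipeline these samples would come from a load-test run.
candidate_samples_ms = [150, 160, 170, 175, 210, 220, 230, 400, 165, 172]

candidate_p95 = p95(candidate_samples_ms)
limit = BASELINE_P95_MS * (1 + TOLERANCE)
print(f"candidate p95 = {candidate_p95:.0f} ms (limit {limit:.0f} ms)")

if candidate_p95 > limit:
    print("latency regression detected -- failing the build")
    sys.exit(1)
```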

Related questions

Observability
What is sampling in tracing and what are the trade-offs?
#tracing #sampling #cost
Observability
What is distributed tracing and how do you propagate context?
#tracing #context #distributed-systems
Observability
Logs vs metrics vs traces — when do you use each?
#observability #logs #metrics