Logs show discrete events and context, metrics show aggregated trends over time, and traces show end-to-end request paths across services. Use logs for details, metrics for health/alerting, and traces for latency and dependency analysis.
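As an illustrative sketch (all names and values here are invented, not from any real system), the same failed request looks different through each signal:

```python
# Hypothetical example: one failed request seen through all three signals.
request = {"trace_id": "abc123", "route": "/checkout", "status": 500, "ms": 240}

# Log: a single discrete event carrying full context for this one request.
log_line = (
    f'trace={request["trace_id"]} route={request["route"]} '
    f'status={request["status"]} msg="payment gateway timeout"'
)

# Metric: an aggregated counter keyed by a bounded label, no per-request detail.
errors_total = {("route", "/checkout"): 1}

# Trace: spans sharing a trace_id, showing where the 240 ms actually went.
spans = [
    {"trace_id": "abc123", "name": "checkout", "ms": 240},
    {"trace_id": "abc123", "name": "payment-gateway", "ms": 230},
]
```

The log answers "what exactly happened", the metric answers "how often is this happening", and the spans answer "where in the call path the time and failure occurred".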
An SLI (Service Level Indicator) is a measurable signal of service health, like latency, error rate, or availability. Define it based on user outcomes with clear measurement windows and thresholds.
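A minimal sketch of an availability SLI computed over a measurement window (the function name and data shape are illustrative, not from any specific library):

```python
def availability_sli(outcomes):
    """Availability SLI: fraction of successful requests in the window.

    Convention assumed here: an empty window counts as fully available.
    """
    if not outcomes:
        return 1.0
    good = sum(1 for ok in outcomes if ok)
    return good / len(outcomes)

# 998 successes and 2 failures in the window -> 99.8% availability
window = [True] * 998 + [False] * 2
print(availability_sli(window))  # 0.998
```

The same pattern works for any good/total ratio SLI: define what counts as a "good" event from the user's perspective, then divide by all valid events in the window.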
Alert on user-visible symptoms tied to SLOs, use multi-window burn-rate alerts, and ensure every alert has an owner and runbook. Deduplicate and route alerts to the right team.
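A sketch of a multi-window burn-rate check. The 14.4x threshold is one commonly cited value for a fast-burn page against a 99.9% SLO (it exhausts roughly a month's budget in about two days); treat it as an assumed parameter, not a universal constant:

```python
def burn_rate(error_rate, error_budget):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return error_rate / error_budget

def should_page(short_window_rate, long_window_rate, error_budget, threshold=14.4):
    # Page only when BOTH windows burn fast: the long window shows the
    # problem is sustained, the short window shows it is still happening.
    return (burn_rate(short_window_rate, error_budget) >= threshold
            and burn_rate(long_window_rate, error_budget) >= threshold)

# 99.9% SLO -> 0.1% error budget; a sustained 2% error rate burns 20x too fast.
print(should_page(0.02, 0.02, 0.001))   # pages
print(should_page(0.02, 0.0005, 0.001)) # brief spike already over: no page
```

Requiring both windows suppresses pages for short spikes that have already recovered while still catching sustained burns quickly.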
Distributed tracing tracks a request across services using trace/span IDs. Context is propagated via headers (e.g., W3C traceparent) or messaging metadata so every service can attach spans to the same trace.
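A minimal sketch of building and parsing a W3C `traceparent` header (`version-traceid-spanid-flags`); real services would use an instrumentation library rather than hand-rolling this:

```python
import re
import secrets

def make_traceparent(trace_id=None, parent_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)   # 16 bytes -> 32 hex chars
    parent_id = parent_id or secrets.token_hex(8)  # 8 bytes  -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"

def parse_traceparent(header):
    """Extract trace context from an incoming request header."""
    m = re.fullmatch(
        r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header
    )
    if not m:
        raise ValueError("malformed traceparent")
    _version, trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}
```

Each downstream service parses the header, keeps the same `trace_id`, and emits its spans with a fresh span ID whose parent is the caller's span, so every hop lands in the same trace.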
High-cardinality labels (e.g., userId) explode metric series. Avoid them in metrics, use aggregation or bucketing, and move per-entity details to logs or traces.
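A sketch of the bucketing idea: replace an unbounded label (per-user latency) with a small fixed set of histogram-style buckets, and leave the per-entity detail to logs. Bucket bounds and names here are illustrative:

```python
from collections import Counter

def latency_bucket(ms):
    """Map a raw latency to one of a small, fixed set of buckets."""
    for bound in (10, 50, 100, 500, 1000):
        if ms <= bound:
            return f"le_{bound}"
    return "le_inf"

# Bounded label set: at most (routes x 6 buckets) series, regardless of users.
series = Counter()
for user_id, ms in [("u1", 7), ("u2", 42), ("u3", 900), ("u4", 3000)]:
    series[("checkout", latency_bucket(ms))] += 1  # user_id goes to logs, not labels
```

With `userId` as a label, series count grows with the user base; with buckets, it stays constant no matter how much traffic arrives.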
Sampling keeps only a subset of traces to control cost. It reduces storage and overhead but can hide rare failures or edge cases, so sampling strategy matters.
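One common strategy is deterministic head-based sampling keyed on the trace ID, so every service makes the same keep/drop decision for the same trace without coordination. A sketch (the hashing scheme is illustrative):

```python
def sampled(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: derive a value in [0, 1] from the
    trace ID so all services agree on whether to keep this trace."""
    bucket = int(trace_id[:8], 16) / 0xFFFFFFFF
    return bucket < rate

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
keep = sampled(trace_id, 0.1)  # same answer on every service, every time
```

Because unsampled traces are never recorded, a 1-in-1000 failure mode may simply not appear at a 10% rate; tail-based sampling (deciding after the trace completes, e.g. always keeping errors) is the usual mitigation.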
Start with metrics to confirm scope (p95/p99, endpoints, regions), then use traces to locate slow spans and logs to identify exact errors or queries. Compare recent deploys and config changes.
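The p95/p99 figures in that first step can be computed with a simple nearest-rank percentile over raw latency samples (a sketch; monitoring backends typically estimate this from histograms instead):

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    k = max(0, -(-len(ordered) * p // 100) - 1)  # ceil(n * p / 100) - 1
    return ordered[int(k)]

latencies = list(range(1, 101))  # 1..100 ms
print(percentile(latencies, 95))  # 95
print(percentile(latencies, 99))  # 99
```

A p99 far above p95 is itself a clue: the problem affects a small slice of traffic, which points toward a specific dependency, shard, or request shape rather than a global slowdown.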
At minimum: RED metrics (rate, errors, duration), saturation (CPU/memory), dependency health, and SLO burn-rate. Include breakdowns by route, region, and version.
MTTR (Mean Time To Recovery) measures how fast you restore service after an incident. Improve it with clear runbooks, faster detection, better rollback tooling, and well-practiced incident response.
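The metric itself is a simple average over incident durations; a sketch with invented incident data:

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean time to recovery: average (resolved - detected) per incident."""
    durations = [end - start for start, end in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 45)),  # 45 min
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 15)),  # 15 min
]
print(mttr(incidents))  # 0:30:00
```

Note that what counts as "detected" and "resolved" must be defined consistently (alert fired vs. user impact started; mitigated vs. fully fixed), or the trend line is meaningless.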
RED (Rate, Errors, Duration) is best for request-driven services. USE (Utilization, Saturation, Errors) is best for resources like CPUs, disks, or queues. Together they cover service health and resource bottlenecks.
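A sketch of computing both sets side by side (the record shapes and field names are illustrative):

```python
def red(requests, window_s):
    """RED for a request-driven service: rate, error ratio, mean duration."""
    n = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "duration_ms": sum(r["ms"] for r in requests) / n if n else 0.0,
    }

def use(capacity, in_use, queued, errors):
    """USE for a resource: utilization, saturation (queued work), errors."""
    return {
        "utilization": in_use / capacity,
        "saturation": queued,   # e.g. run-queue length or queue depth
        "errors": errors,       # e.g. disk I/O errors, dropped packets
    }

reqs = [{"status": 200, "ms": 10}, {"status": 500, "ms": 30}]
print(red(reqs, window_s=2))                          # service view
print(use(capacity=8, in_use=6, queued=3, errors=0))  # resource view
```

In practice a RED dashboard tells you *that* users are affected, and the USE view of the underlying hosts and queues tells you *why*.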