Logs show discrete events and context, metrics show aggregated trends over time, and traces show end-to-end request paths across services. Use logs for details, metrics for health/alerting, and traces for latency and dependency analysis.
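As an illustrative sketch (all names and values here are invented, not from any real system), the same failed request looks different through each signal:

```python
# Hypothetical example: one failed request seen through all three signals.
request = {"trace_id": "abc123", "route": "/checkout", "status": 500, "ms": 240}

# Log: a single discrete event carrying full context for this one request.
log_line = (
    f'trace={request["trace_id"]} route={request["route"]} '
    f'status={request["status"]} msg="payment gateway timeout"'
)

# Metric: an aggregated counter keyed by a bounded label, no per-request detail.
errors_total = {("route", "/checkout"): 1}

# Trace: spans sharing a trace_id, showing where the 240 ms actually went.
spans = [
    {"trace_id": "abc123", "name": "checkout", "ms": 240},
    {"trace_id": "abc123", "name": "payment-gateway", "ms": 230},
]
```

The log answers "what exactly happened", the metric answers "how often is this happening", and the spans answer "where in the call path the time and failure occurred".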
An SLI (Service Level Indicator) is a measurable signal of service health, like latency, error rate, or availability. Define it based on user outcomes with clear measurement windows and thresholds.
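A minimal sketch of an availability SLI computed over a measurement window (the function name and data shape are illustrative, not from any specific library):

```python
def availability_sli(outcomes):
    """Availability SLI: fraction of successful requests in the window.

    Convention assumed here: an empty window counts as fully available.
    """
    if not outcomes:
        return 1.0
    good = sum(1 for ok in outcomes if ok)
    return good / len(outcomes)

# 998 successes and 2 failures in the window -> 99.8% availability
window = [True] * 998 + [False] * 2
print(availability_sli(window))  # 0.998
```

The same pattern works for any good/total ratio SLI: define what counts as a "good" event from the user's perspective, then divide by all valid events in the window.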
Alert on user-visible symptoms tied to SLOs, use multi-window burn-rate alerts, and ensure every alert has an owner and runbook. Deduplicate and route alerts to the right team.
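A sketch of a multi-window burn-rate check. The 14.4x threshold is one commonly cited value for a fast-burn page against a 99.9% SLO (it exhausts roughly a month's budget in about two days); treat it as an assumed parameter, not a universal constant:

```python
def burn_rate(error_rate, error_budget):
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return error_rate / error_budget

def should_page(short_window_rate, long_window_rate, error_budget, threshold=14.4):
    # Page only when BOTH windows burn fast: the long window shows the
    # problem is sustained, the short window shows it is still happening.
    return (burn_rate(short_window_rate, error_budget) >= threshold
            and burn_rate(long_window_rate, error_budget) >= threshold)

# 99.9% SLO -> 0.1% error budget; a sustained 2% error rate burns 20x too fast.
print(should_page(0.02, 0.02, 0.001))   # pages
print(should_page(0.02, 0.0005, 0.001)) # brief spike already over: no page
```

Requiring both windows suppresses pages for short spikes that have already recovered while still catching sustained burns quickly.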
Distributed tracing tracks a request across services using trace/span IDs. Context is propagated via headers (e.g., W3C traceparent) or messaging metadata so every service can attach spans to the same trace.
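A minimal sketch of building and parsing a W3C `traceparent` header (`version-traceid-spanid-flags`); real services would use an instrumentation library rather than hand-rolling this:

```python
import re
import secrets

def make_traceparent(trace_id=None, parent_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)   # 16 bytes -> 32 hex chars
    parent_id = parent_id or secrets.token_hex(8)  # 8 bytes  -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"

def parse_traceparent(header):
    """Extract trace context from an incoming request header."""
    m = re.fullmatch(
        r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header
    )
    if not m:
        raise ValueError("malformed traceparent")
    _version, trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}
```

Each downstream service parses the header, keeps the same `trace_id`, and emits its spans with a fresh span ID whose parent is the caller's span, so every hop lands in the same trace.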
High-cardinality labels (e.g., userId) explode metric series. Avoid them in metrics, use aggregation or bucketing, and move per-entity details to logs or traces.
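A sketch of the bucketing idea: replace an unbounded label (per-user latency) with a small fixed set of histogram-style buckets, and leave the per-entity detail to logs. Bucket bounds and names here are illustrative:

```python
from collections import Counter

def latency_bucket(ms):
    """Map a raw latency to one of a small, fixed set of buckets."""
    for bound in (10, 50, 100, 500, 1000):
        if ms <= bound:
            return f"le_{bound}"
    return "le_inf"

# Bounded label set: at most (routes x 6 buckets) series, regardless of users.
series = Counter()
for user_id, ms in [("u1", 7), ("u2", 42), ("u3", 900), ("u4", 3000)]:
    series[("checkout", latency_bucket(ms))] += 1  # user_id goes to logs, not labels
```

With `userId` as a label, series count grows with the user base; with buckets, it stays constant no matter how much traffic arrives.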
Sampling keeps only a subset of traces to control cost. It reduces storage and overhead but can hide rare failures or edge cases, so sampling strategy matters.
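One common strategy is deterministic head-based sampling keyed on the trace ID, so every service makes the same keep/drop decision for the same trace without coordination. A sketch (the hashing scheme is illustrative):

```python
def sampled(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: derive a value in [0, 1] from the
    trace ID so all services agree on whether to keep this trace."""
    bucket = int(trace_id[:8], 16) / 0xFFFFFFFF
    return bucket < rate

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
keep = sampled(trace_id, 0.1)  # same answer on every service, every time
```

Because unsampled traces are never recorded, a 1-in-1000 failure mode may simply not appear at a 10% rate; tail-based sampling (deciding after the trace completes, e.g. always keeping errors) is the usual mitigation.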
Start with metrics to confirm scope (p95/p99, endpoints, regions), then use traces to locate slow spans and logs to identify exact errors or queries. Compare recent deploys and config changes.
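The p95/p99 figures in that first step can be computed with a simple nearest-rank percentile over raw latency samples (a sketch; monitoring backends typically estimate this from histograms instead):

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    k = max(0, -(-len(ordered) * p // 100) - 1)  # ceil(n * p / 100) - 1
    return ordered[int(k)]

latencies = list(range(1, 101))  # 1..100 ms
print(percentile(latencies, 95))  # 95
print(percentile(latencies, 99))  # 99
```

A p99 far above p95 is itself a clue: the problem affects a small slice of traffic, which points toward a specific dependency, shard, or request shape rather than a global slowdown.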
At minimum: RED metrics (rate, errors, duration), saturation (CPU/memory), dependency health, and SLO burn-rate. Include breakdowns by route, region, and version.
MTTR (Mean Time To Recovery) measures how fast you restore service after an incident. Improve it with clear runbooks, faster detection, better rollback tooling, and well-practiced incident response.
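The metric itself is a simple average over incident durations; a sketch with invented incident data:

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean time to recovery: average (resolved - detected) per incident."""
    durations = [end - start for start, end in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 45)),  # 45 min
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 15)),  # 15 min
]
print(mttr(incidents))  # 0:30:00
```

Note that what counts as "detected" and "resolved" must be defined consistently (alert fired vs. user impact started; mitigated vs. fully fixed), or the trend line is meaningless.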
RED (Rate, Errors, Duration) is best for request-driven services. USE (Utilization, Saturation, Errors) is best for resources like CPUs, disks, or queues. Together they cover service health and resource bottlenecks.
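A sketch of computing both sets side by side (the record shapes and field names are illustrative):

```python
def red(requests, window_s):
    """RED for a request-driven service: rate, error ratio, mean duration."""
    n = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "duration_ms": sum(r["ms"] for r in requests) / n if n else 0.0,
    }

def use(capacity, in_use, queued, errors):
    """USE for a resource: utilization, saturation (queued work), errors."""
    return {
        "utilization": in_use / capacity,
        "saturation": queued,   # e.g. run-queue length or queue depth
        "errors": errors,       # e.g. disk I/O errors, dropped packets
    }

reqs = [{"status": 200, "ms": 10}, {"status": 500, "ms": 30}]
print(red(reqs, window_s=2))                          # service view
print(use(capacity=8, in_use=6, queued=3, errors=0))  # resource view
```

In practice a RED dashboard tells you *that* users are affected, and the USE view of the underlying hosts and queues tells you *why*.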