Architecture (hard)

What makes a good alert and how do you avoid alert fatigue?

Answer

Good alerts are actionable and focused on user impact (symptom-based), with a clear severity and a runbook link. Avoid alert fatigue by removing noisy alerts, tuning thresholds, grouping related alerts, and paging only on real incidents (for example, when the error budget is burning too fast).
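The checklist in the answer (severity, runbook link, user impact, actionability) can be enforced mechanically. The following is a minimal sketch, assuming a hypothetical alert object; the field names (`severity`, `runbookUrl`, `userImpact`, `action`), the severity levels, and the example URL are illustrative assumptions, not the schema of any real alerting tool.

```javascript
// Hypothetical severity levels, for illustration only.
const SEVERITIES = ["page", "ticket", "info"];

// Returns a list of problems; an empty list means the alert passes the checklist.
function lintAlert(alert) {
  const problems = [];
  if (!SEVERITIES.includes(alert.severity)) {
    problems.push("severity must be one of: " + SEVERITIES.join(", "));
  }
  if (!alert.runbookUrl) {
    problems.push("missing runbook link");
  }
  if (!alert.userImpact) {
    problems.push("no stated user impact (is this symptom-based?)");
  }
  if (alert.severity === "page" && !alert.action) {
    problems.push("paging alert has no first action for the responder");
  }
  return problems;
}

// A symptom-based alert that states impact, action, and runbook.
const good = {
  name: "CheckoutErrorRateHigh",
  severity: "page",
  userImpact: "Customers cannot complete checkout",
  action: "Roll back the latest checkout deploy",
  runbookUrl: "https://runbooks.example.com/checkout-errors", // made-up URL
};

// A cause-based alert that pages without saying why anyone should care.
const noisy = { name: "HostCpuHigh", severity: "page" };

console.log(lintAlert(good));  // []
console.log(lintAlert(noisy)); // three problems: runbook, user impact, action
```

Running a check like this when alert rules are reviewed is one way to keep unactionable alerts from reaching the pager in the first place.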

Advanced answer

Deep dive

Expanding on the short answer, here is what usually matters in practice:

Context (tags): alerting, runbook, observability, SRE
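Scaling: alert volume grows with the number of services and instances, so per-instance alerts become the bottleneck for on-call attention first; prefer aggregated, service-level symptoms.
Reliability: retries and circuit breakers absorb transient failures, so alert on sustained user-facing symptoms and keep logs/metrics/traces for diagnosis rather than paging.
Evolution: keep alert changes cheap (version the rules, review them, test them, and retire alerts nobody acts on).
Explain the "why", not just the "what": for each alert, state the user impact it signals and the consequence of ignoring it.
Trade-offs: tighter thresholds detect problems sooner but produce more false positives; looser thresholds are quieter but delay or miss detection.
Edge cases: missing data, low-traffic periods where ratios are noisy, flapping signals, and alert storms during a large outage.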
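To make the error-budget point from the short answer concrete, here is a minimal sketch of a burn-rate paging decision. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names, the parameter names, and the threshold of 14.4 are assumptions for illustration; 14.4 is the rate at which 2% of a 30-day budget is spent in one hour (0.02 × 720 hours), and the right value depends on your own SLO and windows.

```javascript
// Burn rate: how fast the error budget is being spent relative to plan.
// A burn rate of 1 spends exactly the whole budget over the SLO period.
function burnRate(errorRatio, sloTarget) {
  const budget = 1 - sloTarget; // e.g. about 0.001 for a 99.9% SLO
  return errorRatio / budget;
}

// Page only if both a long and a short window exceed the threshold:
// the long window shows the problem is significant,
// the short window shows it is still happening right now.
function shouldPage({ longWindowErrorRatio, shortWindowErrorRatio, sloTarget, threshold }) {
  return (
    burnRate(longWindowErrorRatio, sloTarget) >= threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) >= threshold
  );
}

const slo = { sloTarget: 0.999, threshold: 14.4 }; // assumed values

// Ongoing incident: 2% errors over the long window, 3% over the short one.
const ongoing = shouldPage({ ...slo, longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.03 });

// Already recovered: the long window is still elevated, the short one is healthy.
const recovered = shouldPage({ ...slo, longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.0005 });

// Brief blip: the short window spikes, but the long window is barely affected.
const blip = shouldPage({ ...slo, longWindowErrorRatio: 0.001, shortWindowErrorRatio: 0.05 });

console.log({ ongoing, recovered, blip });
```

Requiring both windows is what cuts the noise: a short spike alone does not page, and an incident that has already ended stops paging quickly.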

Examples

A tiny example (an explanation template):

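// Example: discuss trade-offs for "what-makes-a-good-alert-and-how-do-you-avoid-ale"
function explain() {
  // Start from the core idea, then name a trade-off:
  return [
    "Good alerts are actionable and user-impact focused (symptom-based), with clear severity and a runbook link.",
    "Trade-off: tighter thresholds catch problems sooner but page more often on noise.",
  ].join("\n");
}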
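A second example, this time for the grouping mentioned in the answer: when many instances of one service fire the same alert, the responder should get one notification, not one per instance. This is a minimal in-memory sketch; `groupAlerts`, the alert fields, and the grouping key are hypothetical and do not represent the behavior of any particular alert manager.

```javascript
// Group raw alerts by a caller-supplied key and emit one notification per group.
function groupAlerts(alerts, keyFn) {
  const groups = new Map(); // Map keeps insertion order, so output order is stable
  for (const alert of alerts) {
    const key = keyFn(alert);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(alert);
  }
  return [...groups.entries()].map(([key, members]) => ({
    key,
    count: members.length,
    instances: members.map((a) => a.instance),
  }));
}

// Made-up alerts: three instances of one service failing the same way, plus one other.
const raw = [
  { name: "HighErrorRate", service: "checkout", instance: "checkout-1" },
  { name: "HighErrorRate", service: "checkout", instance: "checkout-2" },
  { name: "HighErrorRate", service: "checkout", instance: "checkout-3" },
  { name: "HighLatency", service: "search", instance: "search-1" },
];

const notifications = groupAlerts(raw, (a) => `${a.service}/${a.name}`);
console.log(notifications.length); // 2
```

The choice of key is the design decision: grouping by service and alert name collapses per-instance duplicates, while a key that is too broad can hide a second, unrelated problem inside one notification.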

Common pitfalls

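Too generic: no concrete trade-offs or examples (e.g., saying "tune thresholds" without naming a signal, a threshold, or a time window).
Mixing average-case and worst-case (e.g., alerting on average latency when users feel the tail latency).
Ignoring constraints: on-call capacity and the cost of each page, as well as memory, concurrency, and network/disk costs of the monitoring itself.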

Interview follow-ups

When would you choose an alternative and why?
What production issues show up and how do you diagnose them?
How would you test edge cases?