Operating a Production System
Once your system is running in production, how do you know if it's healthy? How do you debug a problem when it occurs? Observability is the practice of instrumenting your system to provide data that allows you to understand its internal state from the outside.
A system is observable if you can answer any question about what's happening on the inside just by observing the data it emits, without having to ship new code. Observability is often described as having three main pillars: Metrics, Logging, and Tracing.
What they are: Metrics are numerical measurements of your system's behavior over time. They are typically collected at regular intervals and stored in a time-series database. Metrics are aggregated and give you a high-level overview of the system's health.
Key Types of Metrics: The most common are counters (monotonically increasing values, such as total requests served), gauges (point-in-time values that can go up or down, such as memory usage or queue depth), and histograms (distributions of observed values, such as request latency).
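To make the three metric types concrete, here is a minimal in-memory sketch in Python. It is not a production metrics client (tools like Prometheus client libraries handle labels, exposition, and concurrency); the class and bucket choices are illustrative assumptions.

```python
import bisect

class Counter:
    """A monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        self.value += amount

class Gauge:
    """A point-in-time value that can go up or down, e.g. queue depth."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into buckets, e.g. request durations in seconds."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, 5.0)):
        self.buckets = sorted(buckets)
        # One count per bucket, plus a final overflow (+Inf) bucket.
        self.counts = [0] * (len(self.buckets) + 1)
    def observe(self, value):
        self.counts[bisect.bisect_left(self.buckets, value)] += 1

requests_total = Counter()
requests_total.inc()
request_latency = Histogram()
request_latency.observe(0.3)  # lands in the <= 0.5s bucket
```

A monitoring agent would periodically scrape or push these values into a time-series database, which is what makes dashboards and alerts possible.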
How they are used: Metrics are primarily used for monitoring and alerting. You can create dashboards to visualize the health of your system at a glance and set up alerts to notify you automatically when a metric crosses a critical threshold (e.g., "alert me if the error rate for the payment service goes above 1% for 5 minutes").
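The alert condition quoted above ("error rate above 1% for 5 minutes") can be sketched as a sliding-window check. Real systems evaluate such rules server-side (e.g., Prometheus alerting rules); this simplified class, with its names and the 90%-of-window heuristic, is an illustrative assumption.

```python
import time
from collections import deque

class ErrorRateAlert:
    """Fires only when every sample in a (nearly) full window breaches
    the threshold, so a single spike does not page anyone."""
    def __init__(self, threshold=0.01, window_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.samples = deque()  # (timestamp, error_rate) pairs

    def record(self, error_rate, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, error_rate))
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()

    def should_fire(self):
        if not self.samples:
            return False
        oldest, newest = self.samples[0][0], self.samples[-1][0]
        if newest - oldest < self.window * 0.9:
            return False  # not enough history for a full window yet
        return all(rate > self.threshold for _, rate in self.samples)
```

Requiring the whole window to breach the threshold is the key design choice: it trades a few minutes of detection latency for far fewer false alarms.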
Popular Tools: Prometheus, Grafana, StatsD, Datadog.
What they are: A log is an immutable, timestamped record of a discrete event that happened at a point in time. While metrics tell you what is happening (e.g., the error rate is high), logs tell you why it's happening. They provide the detailed, event-specific context needed for debugging.
Best Practices for Logging: Use structured logging, where each entry is machine-parseable rather than a free-form string. A typical structured log line includes a timestamp, a severity level (e.g., INFO, WARN, ERROR), the service name, and a JSON payload with the relevant context.
How they are used: Logs are used for debugging and root cause analysis. When an alert fires, you can dive into the logs for the affected service and time period to find the specific error messages and stack traces that will help you understand the cause of the problem.
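A structured log line of that shape can be produced with Python's standard logging module and a small JSON formatter. The service name and context fields here are hypothetical examples.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line, easy to index and query."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Attach any structured context passed via the `extra=` argument.
        if hasattr(record, "context"):
            entry["context"] = record.context
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("charge declined",
               extra={"context": {"order_id": "A-123", "code": 402}})
```

Because every line is valid JSON with consistent fields, a log aggregator can filter by `level`, `service`, or any context key instead of grepping raw text.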
Popular Tools: The ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Graylog.
What it is: In a microservices architecture, a single user request might travel through many different services before a final response is returned. If the request is slow or fails, how do you know where the bottleneck or error occurred? This is the problem that distributed tracing solves.
How it works: When a request first enters the system, it is assigned a unique trace ID. As the request flows from service to service, that trace ID is propagated along with it, typically in request headers. Each service records one or more timed units of work, called spans, tagged with the trace ID and the ID of the parent span that invoked it.
The result is a complete, end-to-end view of the entire request, visualized as a "flame graph" or a waterfall diagram. This allows you to see exactly how long the request spent in each service and where any errors occurred.
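The trace/span relationship can be sketched in a few lines of Python. Real tracing libraries (e.g., OpenTelemetry SDKs) propagate the context over the wire automatically; this toy model, including its field names, is an illustrative assumption.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed unit of work in one service; all spans for a single
    request share the same trace_id, linked into a tree by parent_id."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    start: float = field(default_factory=time.monotonic)
    duration: float = 0.0

    def finish(self):
        self.duration = time.monotonic() - self.start

# The edge service starts the trace; a downstream service continues it by
# reusing the trace_id and recording the caller's span_id as its parent.
root = Span("api-gateway", trace_id=uuid.uuid4().hex)
child = Span("payment-service", trace_id=root.trace_id,
             parent_id=root.span_id)
child.finish()
root.finish()
```

A tracing backend reconstructs the waterfall view described above simply by grouping spans on `trace_id` and nesting them by `parent_id`, with each span's `start` and `duration` giving the bar's position and width.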
How it is used: Tracing is used for performance optimization and debugging complex, multi-service workflows. It's invaluable for identifying latency bottlenecks in a distributed system.
Popular Tools: Jaeger, Zipkin, OpenTelemetry (a standard for generating telemetry data), Datadog APM.
In a system design interview, you don't need to design these observability systems from scratch. However, you should be able to articulate a clear strategy for how you would monitor your system. A good answer would be: "For observability, I would implement the three pillars: I would use Prometheus and Grafana for metrics and alerting, centralize structured logs to an ELK stack for debugging, and use OpenTelemetry and Jaeger for distributed tracing to understand latency in our microservices environment."