
Alert noise is one of the biggest problems facing teams that manage large-scale distributed systems. Precise signals when something goes wrong in production are critical to stability: they shorten the time to mitigate customer-impacting issues and help uphold the SLAs promised to customers. In the era of AI, where anyone can write and ship code, reliability becomes a key differentiator for companies, and effective alerting is one of the most important parts of improving and maintaining it. In this article, I describe a set of tools and processes that can improve alerting effectiveness for large-scale distributed systems.

Treat alerting configuration as production code

Teams typically apply rigorous engineering practices to production code, including unit testing, integration testing, formal code reviews, version control, and pull requests. Applying the same discipline to alerting configurations can significantly improve alert quality, reduce false positives, and make your monitoring system far more reliable.

Use automation as the first line of defense

Traditional alerting follows a familiar pattern: when a metric crosses a predefined threshold, an alert is triggered and the on-call engineer is paged. The engineer then follows a runbook to diagnose the issue and apply the appropriate mitigation. A more effective approach is to make automation the first line of defense. Automated remediation can resolve many common issues before a human is ever notified, reducing both time to mitigation and the operational burden on on-call engineers. The on-call engineer should be paged only if the automated actions fail to restore the service or the issue requires human intervention.

Every alert should be actionable

An alert is only as good as its actionability.
When an alert fires, the on-call engineer should be able to understand what is broken, why it is broken, what immediate action to take, and who is responsible. To provide this information, every alert should carry proper context: links to relevant dashboards, clear remediation steps, escalation procedures, and contact information for the team it should be escalated to. The runbooks linked from alerts should be reviewed regularly so they do not become outdated.

Review & tune your alerts periodically

Every alert in the production environment was created for a specific reason, such as a new feature or a repair item from a production incident. However, production systems evolve over time through deployments, configuration changes, new features, and deprecated features, and the assumptions behind an alert may no longer hold. It is critical to review alerts periodically to assess whether each alert and its underlying conditions are still valid, and to decide whether its thresholds need adjusting, its scope needs changing, or the alert should be deprecated. This review can happen weekly or monthly.

Have mechanisms to suppress alerts

Even with a perfectly noise-free alerting system, you will sometimes get valid alerts that you need to ignore for a period of time. For example, a planned production change may trigger alerts for expected conditions. Another example is alerts caused by a known bug whose fix will take a while to reach production. A mechanism to suppress these alerts reduces the expected noise and lets your on-call engineers focus on genuine issues.

Combine multiple metrics to make alerting effective

Individual alerts can be misleading in certain cases.
Combining multiple signals into a single alert can be more effective. For example, a spike in system resources such as CPU or memory might be normal during a surge in traffic. However, if error rates or latency increase at the same time, the CPU or memory spike is likely a problem for the service's health. Alerting on the combination of these signals is more accurate than alerting on either one alone.

Gain a deeper understanding of your system

Understanding the service you manage more deeply can be invaluable in creating effective alerts. SREs often focus on non-functional aspects of the service. However, spending time on the functional aspects of the service can help you create an effective alerting system for it. For example, suppose you are managing a WebRTC system, where latency is paramount to the user experience. Understanding how the various components interact with each other, and where latency bottlenecks can arise, helps you devise alerts at the individual subsystem level that catch these bottlenecks when they occur.

Find gaps in your alerting

While dealing with false positives is extremely important, it is equally important to address false negatives. Failing to catch issues in production before your customers experience degradation or unavailability of the service can put your SLAs at risk.

Use Chaos Engineering to find gaps

Netflix pioneered the concept of Chaos Engineering. Chaos Engineering is the practice of proactively injecting controlled failures, such as server crashes and network failures, into production systems to identify systemic weaknesses before they turn into outages. We can use chaos engineering tests as an opportunity to validate that all the alerts that were supposed to fire during the test actually fired. Any gaps, such as improper thresholds or missing alerts, can then be fixed before a real outage occurs.
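
As a rough illustration of that validation step, here is a minimal Python sketch. It assumes you can export the alerts that fired during a chaos experiment as name and timestamp pairs, and that you have written down which alerts the injected failure was supposed to trigger. The names FiredAlert, find_alert_gaps, and the alert names in the example are hypothetical and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FiredAlert:
    """One alert observed during the experiment (hypothetical record shape)."""
    name: str
    fired_at: datetime


def find_alert_gaps(expected, fired, injected_at, max_delay):
    """Return {alert_name: 'missing' | 'late'} for expected alerts with gaps.

    expected    -- names of alerts the injected failure should trigger
    fired       -- FiredAlert records collected during the experiment
    injected_at -- when the failure was injected
    max_delay   -- longest acceptable time between injection and alert
    """
    # Record the earliest time each alert fired after the fault was injected.
    first_fired = {}
    for alert in fired:
        if alert.fired_at < injected_at:
            continue  # fired before the fault, so unrelated to this test
        previous = first_fired.get(alert.name)
        if previous is None or alert.fired_at < previous:
            first_fired[alert.name] = alert.fired_at

    gaps = {}
    for name in expected:
        fired_at = first_fired.get(name)
        if fired_at is None:
            gaps[name] = "missing"  # the alert never fired
        elif fired_at - injected_at > max_delay:
            gaps[name] = "late"     # fired, but too slowly to be useful
    return gaps


# Example run with made-up data.
injected_at = datetime(2024, 1, 1, 12, 0, 0)
fired = [
    FiredAlert("HighErrorRate", injected_at + timedelta(minutes=2)),
    FiredAlert("HighLatency", injected_at + timedelta(minutes=12)),
]
gaps = find_alert_gaps(
    expected=["HighErrorRate", "HighLatency", "InstanceDown"],
    fired=fired,
    injected_at=injected_at,
    max_delay=timedelta(minutes=5),
)
print(gaps)  # {'HighLatency': 'late', 'InstanceDown': 'missing'}
```

A "missing" result points to an alert that does not exist or whose threshold was never crossed, while a "late" result suggests the threshold or evaluation window is too lenient. Both are candidates for the fixes described above.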

Never let an incident go to waste

It’s a best practice to hold blame-free post-incident reviews after an incident occurs in production. Use these reviews to check whether the alerts fired effectively and in a timely manner, whether the alerts that fired were actionable, and whether they carried enough context. Any issues or gaps in the alerts should be fixed promptly so that future incidents are caught in time and addressed effectively.

Conclusion

There is no silver bullet for creating effective alerts, but following this set of best practices will help reduce noise and produce precise signals, which in turn helps you catch production issues in a timely manner, reduce time to mitigation, and uphold SLOs.
