The paper examines the impact of Silent Data Corruption (SDC) on large-scale infrastructure services. These errors are not traceable at the hardware level and instead manifest as application-level problems. The authors describe common defect types in silicon manufacturing that lead to SDCs and present a real-world example within a datacenter application. They detail the debug flow used to identify faulty instructions within a CPU, along with mitigations to reduce the risk of SDCs. The authors ran a large library of silent error test scenarios across hundreds of thousands of machines and detected hundreds of CPUs exhibiting these errors. They conclude that reducing SDCs requires hardware resiliency, production detection mechanisms, and fault-tolerant software architectures.
https://arxiv.org/abs/2102.11245
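
The idea of a production detection mechanism can be illustrated with a minimal sketch. It is not the authors' test library. It assumes a simple scheme: a deterministic workload is run on a machine and its result is compared with a "golden" digest recorded on a machine believed to be healthy. The names `workload` and `check_for_silent_corruption`, the arithmetic inside the workload, and the repeat count are all hypothetical choices made for illustration.

```python
import hashlib
import struct


def workload(seed: int, rounds: int = 1000) -> bytes:
    """Run a deterministic mix of float and integer arithmetic and
    return a SHA-256 digest of every intermediate value.

    The arithmetic is arbitrary (an assumption for this sketch); what
    matters is that a healthy CPU always produces the same digest.
    """
    digest = hashlib.sha256()
    x = float(seed) + 1.1
    acc = seed
    for i in range(1, rounds + 1):
        x = (x * 1.0000001 + i) % 1e6
        acc = (acc * 6364136223846793005 + i) & 0xFFFFFFFFFFFFFFFF
        # Hash both values so a single wrong intermediate changes the digest.
        digest.update(struct.pack("<dQ", x, acc))
    return digest.digest()


def check_for_silent_corruption(seed: int, golden: bytes, repeats: int = 3) -> bool:
    """Return True if any of `repeats` runs disagrees with `golden`.

    `golden` is assumed to come from a trusted reference machine.
    """
    return any(workload(seed) != golden for _ in range(repeats))


if __name__ == "__main__":
    # Here the golden value is computed locally, which only demonstrates
    # the mechanics; in practice it would come from a reference machine.
    golden = workload(42)
    print("corruption detected:", check_for_silent_corruption(42, golden))
```

A mismatch flags the machine for further triage without raising any hardware error, which matches the way SDCs surface only at the application level. A check like this only covers the instructions the workload happens to exercise, so a real fleet would need a much broader set of scenarios.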