Summary:
Slack has migrated its critical user-facing services from a monolithic to a cell-based architecture to increase redundancy and limit the impact of site failures. During a recent outage caused by a network disruption in one availability zone, the company realized the complexity of detecting failures in distributed systems. To mitigate such failures, Slack developed a solution called AZ draining, which redirects traffic away from an affected availability zone. This approach allows systems within an availability zone to quiesce naturally when no new requests are received. The implementation of AZ draining involves the use of Envoy and weighted clusters.
Controversial information: None.
Surprising/unique/clever content: Slack’s solution to mitigating availability zone failures involves tapping into human judgment by allowing engineers to redirect traffic away from the affected zone. This approach, called AZ draining, eases the complexity of automatic remediation and enables quick recovery.
https://slack.engineering/slacks-migration-to-a-cellular-architecture/