ICSE 2019
Sat 25 - Fri 31 May 2019 Montreal, QC, Canada

Distributed systems often face transient errors and localized component degradation and failure. Verifying that the overall system remains healthy in the face of such failures is challenging. At Netflix, we have built a platform for automatically generating and executing chaos experiments, which check how well the production system can handle component failures and slowdowns. This paper describes the platform and our experiences operating it.

Wed 29 May
11:00 - 12:30: Papers - Controlled Experiments of Production Software
Chair(s): Yvonne Dittrich (IT University of Copenhagen, Denmark)
Aleksander Fabijan (Microsoft), Pavel, Helena Holmström Olsson (Malmö University), Jan Bosch (Chalmers University of Technology, Sweden), Lukas, Dylan Lewis (Intuit)
Tong Xia (Microsoft), Sumit Bhardwaj (Microsoft), Pavel, Aleksander Fabijan (Microsoft)
Paul Luo Li (Microsoft), Pavel, Huibin Mary Hu (Microsoft), Xiaoyu Chai (Microsoft), Zoran Dimov (Microsoft), Brandon Paddock (Microsoft), Ying Li (Microsoft), Alex Kirshenbaum (Microsoft), Irina Niculescu (Microsoft), Taj Thoresen (Microsoft)
Ali Basiri, Lorin Hochstein, Nora Jones, Haley Tucker
