ICSE 2019
Sat 25 - Fri 31 May 2019 Montreal, QC, Canada

Distributed systems often face transient errors and localized component degradation and failure. Verifying that the overall system remains healthy in the face of such failures is challenging. At Netflix, we have built a platform for automatically generating and executing chaos experiments, which check how well the production system can handle component failures and slowdowns. This paper describes the platform and our experiences operating it.

Wed 29 May

Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30
Controlled Experiments of Production Software
Software Engineering in Practice / Papers at St-Denis / Notre-Dame
Chair(s): Yvonne Dittrich (IT University of Copenhagen, Denmark)
11:00
20m
Talk
Three Key Checklists and Remedies for Trustworthy Analysis of Online Controlled Experiments at Scale
SEIP, Industry Program
Software Engineering in Practice
Aleksander Fabijan (Microsoft), Pavel Dmitriev (Outreach.io), Helena Holmström Olsson (Malmö University), Jan Bosch (Chalmers University of Technology, Sweden), Lukas Vermeer (Booking.com), Dylan Lewis (Intuit)
11:20
20m
Talk
Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout
SEIP, Industry Program
Software Engineering in Practice
Tong Xia (Microsoft), Sumit Bhardwaj (Microsoft), Pavel Dmitriev (Outreach.io), Aleksander Fabijan (Microsoft)
11:40
20m
Talk
Experimentation in the Operating System: The Windows Experimentation Platform
SEIP, Industry Program
Software Engineering in Practice
Paul Luo Li (Microsoft), Pavel Dmitriev (Outreach.io), Huibin Mary Hu (Microsoft), Xiaoyu Chai (Microsoft), Zoran Dimov (Microsoft), Brandon Paddock (Microsoft), Ying Li (Microsoft), Alex Kirshenbaum (Microsoft), Irina Niculescu (Microsoft), Taj Thoresen (Microsoft)
12:00
20m
Talk
Automating chaos experiments in production
SEIP, Industry Program
Software Engineering in Practice
Ali Basiri (Netflix), Lorin Hochstein (Netflix), Nora Jones (Netflix), Haley Tucker (Netflix)
Pre-print
12:20
10m
Talk
Discussion Period
Papers