Reliability in Modern Cloud Systems

Cloud systems power a large fraction of the computing world today. Ensuring that these systems are correct and performant remains a key challenge for developers. In this seminar, we will explore the different forms of reliability in modern cloud systems and learn about state-of-the-art strategies for mitigating incidents and understanding issues in these systems.

Pre-requisites: Programming 2, Software Engineering Lab (Praktikum)

Recommended: Distributed Systems

Places: 20

Lecture Time: Wednesdays, 2:15pm-3:45pm

Room: TBD

Format

Each lecture will be divided into two parts:

- Lecture Part: The instructors will give a lecture on a specific topic in reliability.

- Discussion Part: We will discuss the assigned reading and the previous week's lecture.

Assignments

All assignments will be based on Blueprint, a toolchain for generating microservice implementations and for exploring the design space of microservices.

Grading

- Assignment 1: 10%

- Assignment 2: 20%

- Assignment 3: 25%

- Assignment 4: 40%

- Participation in Discussion: 5%

Course Schedule

| Date | Lecture Details | Readings | Assignment | Slides |
| --- | --- | --- | --- | --- |
| 09.04.25 | Part 1: Kickoff Meeting; Part 2: From Monoliths to Microservices | N/A | | |
| 16.04.25 | **No seminar** | | | |
| 23.04.25 | Part 1: Paper Discussion; Part 2: The Tail at Scale | Blueprint: A Toolchain for Highly Reconfigurable Microservices | Assignment 1 released | |
| 30.04.25 | Part 1: Paper Discussion; Part 2: Availability (Retries, Timeouts, Replication, Redundancy via Hedging, Sharding) | Tales of The Tail: Past and the Future | | |
| 07.05.25 | Part 1: Paper Discussion; Part 2: The Pillars of Observability | If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems | Assignment 1 due; Assignment 2 released | |
| 14.05.25 | **No seminar** | | | |
| 21.05.25 | Part 1: Discussion; Part 2: Of Failures and Incidents | Dapper, a Large-Scale Distributed Systems Tracing Infrastructure | | |
| 28.05.25 | Part 1: Discussion; Part 2: Cross System Interaction Failures | What bugs cause production cloud incidents? | Assignment 2 due; Assignment 3 released | |
| 04.06.25 | Part 1: Discussion; Part 2: Dealing with Metastability (Load Shedding Techniques) | Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems | | |
| 11.06.25 | Part 1: Discussion; Part 2: Root Cause Analysis | Metastable Failures in the Wild | | |
| 18.06.25 | Part 1: Discussion; Part 2: Formal Methods | How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service | Assignment 3 due; Assignment 4 released | |
| 25.06.25 | Part 1: Discussion; Part 2: Predicting and Handling Workloads | Executing microservice applications on serverless, correctly; Building Reliable Cloud Services Using P# (Experience Report) | | |
| 02.07.25 | Part 1: Discussion; Part 2: Hardware Reliability | Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms | | |
| 09.07.25 | Part 1: Data Center Design; Part 2: Discussion | Characterizing Cloud Computing Hardware Reliability; RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation | | |
| 16.07.25 | Demos and Presentations | | Assignment 4 due | |
