Reliability in Modern Cloud Systems
Cloud systems power a large fraction of today's computing. Ensuring that these systems are correct and performant remains a key challenge for developers. In this seminar, we will explore the different forms of reliability in modern cloud systems and learn about state-of-the-art strategies for mitigating incidents and diagnosing issues in these systems.
Prerequisites: Programming 2, Software Engineering Lab (Praktikum)
Recommended: Distributed Systems
Places: 20
Lecture Time: Wednesdays, 2:15pm-3:45pm
Room: TBD
Format
Each session is divided into two parts:
- Lecture Part: The instructors give a lecture on a specific topic in reliability.
- Discussion Part: We discuss the assigned reading and the previous week's lecture.
Assignments
All assignments will be based on Blueprint, a toolchain for generating microservice implementations and for exploring the design space of microservices.
Grading
- Assignment 1: 10%
- Assignment 2: 20%
- Assignment 3: 25%
- Assignment 4: 40%
- Participation in Discussion: 5%
Course Schedule
| Date | Lecture Details | Readings | Assignment | Slides |
|---|---|---|---|---|
| 09.04.25 | Part 1: Kickoff Meeting; Part 2: From Monoliths to Microservices | N/A | | |
| 16.04.25 | **No seminar** | | | |
| 23.04.25 | Part 1: Paper Discussion; Part 2: The Tail at Scale | Blueprint: A Toolchain for Highly Reconfigurable Microservices | Assignment 1 released | |
| 30.04.25 | Part 1: Paper Discussion; Part 2: Availability (Retries, Timeouts, Replication, Redundancy via Hedging, Sharding) | | | |
| 07.05.25 | Part 1: Paper Discussion; Part 2: The Pillars of Observability | | Assignment 1 due; Assignment 2 released | |
| 14.05.25 | **No seminar** | | | |
| 21.05.25 | Part 1: Discussion; Part 2: Of Failures and Incidents | Dapper, a Large-Scale Distributed Systems Tracing Infrastructure | | |
| 28.05.25 | Part 1: Discussion; Part 2: Cross System Interaction Failures | | Assignment 2 due; Assignment 3 released | |
| 04.06.25 | Part 1: Discussion; Part 2: Dealing with Metastability (Load Shedding Techniques) | Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems | | |
| 11.06.25 | Part 1: Discussion; Part 2: Root Cause Analysis | | | |
| 18.06.25 | Part 1: Discussion; Part 2: Formal Methods | How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service | Assignment 3 due; Assignment 4 released | |
| 25.06.25 | Part 1: Discussion; Part 2: Predicting and Handling Workloads | Executing microservice applications on serverless, correctly; Building Reliable Cloud Services Using P# (Experience Report) | | |
| 02.07.25 | Part 1: Discussion; Part 2: Hardware Reliability | | | |
| 09.07.25 | Part 1: Data Center Design; Part 2: Discussion | Characterizing Cloud Computing Hardware Reliability; RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation | | |
| 16.07.25 | Demos and Presentations | | Assignment 4 due | |