Sorry, but the limit for this course is reached (30 students)!
You cannot register for this course anymore.

News

Lecture 6 slides posted

Written on 29.05.25 by Vaastav Anand

Lecture 6 slides have been posted

Lecture 5 slides posted

Written on 24.05.25 by Vaastav Anand

Lecture 5 slides have been posted

Assignment 2 Deadline change

Written on 24.05.25 by Vaastav Anand

The Assignment 2 deadline has been moved back to Friday, May 30, 5pm.

No Seminar Today

Written on 14.05.25 by Vaastav Anand

Hi everyone,

This is just a reminder email that there is no seminar today. Our next meeting will be next week on May 21st, 2025.

Assignment 1 grades released

Written on 12.05.25 by Vaastav Anand

Hi everyone,

Assignment 1 grades have been pushed to your private forks in a file called assn1_grade.txt in the luggagsehare folder.

Overall, everyone did a very good job implementing Assignment 1.

Office Hours Today in Room 105

Written on 12.05.25 by Vaastav Anand

Office Hours today have been shifted to Room 105 due to an ongoing event in 005.

Assignment 2 released

Written on 12.05.25 by Vaastav Anand

Assignment 2 has now been released on gitlab.

You can find the instructions here: Assignment 2

Assignment 1 Deadline and Assignment 2 release

Written on 10.05.25 by Vaastav Anand

Assignment 1 deadline has now passed and your submissions for the assignment have now been locked in.

Assignment 2 will be released Monday morning.

Happy Weekend!

Lecture 4 Slides posted

Written on 08.05.25 by Vaastav Anand

Lecture 4 slides are now posted on CMS

LSF Registration Deadline

Written on 06.05.25 by Vaastav Anand

Hi all,

It was brought to my attention that the LSF registration deadline is today. If you are taking this course for credit, then please register in the LSF.

You won't be able to receive credit for the course if you miss the registration deadline.

Lecture 3 slides posted

Written on 03.05.25 by Vaastav Anand

Lecture 3 slides have now been posted on CMS

Office Hours Timings and Location

Written on 28.04.25 by Vaastav Anand

Office Hours Timing: Mondays 2pm - 3pm

Location: Room 005, E1 5 (all Mondays except 12th May, 2025)

Location on May 12th, 2025: Room 029, E1 5

 

Lecture 2 slides posted

Written on 25.04.25 by Vaastav Anand

Lecture 2 slides are now posted on CMS

Assignment 1 released

Written on 24.04.25 by Vaastav Anand

Assignment 1 is now released at: https://gitlab.cs.uni-saarland.de/os/cldrel-25ss/assignments/-/tree/assn1

Each student should have received an invite to join their own fork of the assignments repository. 

If you did not get an invitation to your own fork of the assignments repo, then it means we were unable to find your username in the gitlab system. Please ensure that you have an active gitlab account and then contact the instructors with your account details to get access to your own fork.

Assignment Due Date: 10th May, 2025. 5pm CEST.

Lecture 1 Slides posted

Written on 14.04.25 by Vaastav Anand

Lecture 1 Slides are now posted on the CMS website

Reliability in Modern Cloud Systems

Cloud systems power a large fraction of the computing world today. Ensuring that these systems are correct and performant remains a key challenge that continues to bedevil developers. In this seminar, we will explore themes around the different forms of reliability in modern cloud systems and learn about state-of-the-art strategies for mitigating incidents and understanding issues in these systems.

Pre-requisites: Programming 2, Software Engineering Lab (Praktikum)

Recommended: Distributed Systems

Places: 20

Kickoff Meeting: 14.04.25, Monday 2:15pm-3:45pm

Lecture Time (23.04.25 onwards): Wednesdays, 2:15pm-3:45pm

Lecture Room: 005, E1 5

Office Hours (28.04.25 onwards): Mondays, 2pm-3pm

Office Hours Room: 005, E1 5 on all days except May 12th (Room 029)

Format

Each lecture will be divided into 2 parts:

- Lecture Part: In this part, the instructors will give a lecture on a specific topic in reliability.

- Discussion Part: In this part we will discuss the assigned reading and the previous week's lecture.

Assignments

All assignments will be based on Blueprint, a toolchain for generating microservice implementations and for exploring the design space of microservices.

Grading

- Assignment 1 - Implementing a basic Microservice Application using Blueprint: 10%

- Assignment 2 - Adding Observability to the Application and collecting traces from a workload: 20%

- Assignment 3 - Reproducing a Retry Storm: 25%

- Assignment 4 - Open Ended Project: 40%

- Participation in Discussion: 5%
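
As a rough illustration of how the weights above combine, the following Python sketch computes a weighted total. The weights are taken from the list above; the 0-100 scale per component, the linear combination, and the example scores are assumptions made for illustration, not the course's stated grading procedure.

```python
# Weights (in percent) taken from the grading list above.
WEIGHTS = {
    "assignment_1": 10,
    "assignment_2": 20,
    "assignment_3": 25,
    "assignment_4": 40,
    "participation": 5,
}


def weighted_total(scores):
    """Combine per-component scores (assumed 0-100) into a 0-100 total.

    The linear combination is an assumption for illustration only.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, not real course data.
example_scores = {
    "assignment_1": 90,
    "assignment_2": 80,
    "assignment_3": 70,
    "assignment_4": 85,
    "participation": 100,
}

# The listed weights cover the whole grade.
assert sum(WEIGHTS.values()) == 100
print(weighted_total(example_scores))
```

Under these assumptions, the open-ended project (Assignment 4) alone contributes as much as Assignments 1 and 2 plus half of Assignment 3 combined.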

Course Schedule

| Date | Lecture Details | Readings | Assignment | Slides |
|---|---|---|---|---|
| 09.04.25 | ** No seminar ** | N/A | | |
| 14.04.25 | Part 1: Kickoff Meeting; Part 2: From Monoliths to Microservices | | | Kickoff Logistics, Lecture 1 |
| 23.04.25 | Part 1: Paper Discussion; Part 2: The Tail at Scale | Blueprint: A Toolchain for Highly Reconfigurable Microservices | Assignment 1 released | Lecture 2 |
| 30.04.25 | Part 1: Paper Discussion; Part 2: Reliability Basics | Tales of The Tail: Past and the Future | | Lecture 3 |
| 07.05.25 | Part 1: Paper Discussion; Part 2: The Pillars of Observability | If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems | | Lecture 4 |
| 10.05.25 | Assignment 1 Submission Deadline | | Assignment 1 Deadline: 5pm CEST | |
| 12.05.25 | Assignment 2 released | | Assignment 2 | |
| 14.05.25 | ** No seminar ** | | | |
| 21.05.25 | Part 1: Discussion; Part 2: Of Failures and Incidents | Dapper, a Large-Scale Distributed Systems Tracing Infrastructure | | Lecture 5 |
| 28.05.25 | Part 1: Discussion; Part 2: Cross System Interaction Failures | What bugs cause production cloud incidents? | | Lecture 6 |
| 30.05.25 | Assignment 2 Due; Assignment 3 released | | | |
| 04.06.25 | Part 1: Discussion; Part 2: Dealing with Metastability (Load Shedding Techniques) | Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems; Metastable Failures in the Wild | | |
| 11.06.25 | Part 1: Discussion; Part 2: Root Cause Analysis | TBD | | |
| 18.06.25 | Part 1: Discussion; Part 2: Testing & Formal Methods | How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service | Assignment 3 due; Assignment 4 released | |
| 25.06.25 | Part 1: Discussion; Part 2: Predicting and handling workloads | Executing microservice applications on serverless, correctly; Building Reliable Cloud Services Using P# (Experience Report) | | |
| 02.07.25 | Part 1: Discussion; Part 2: Hardware Reliability | Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms | | |
| 09.07.25 | Part 1: Data Center Design; Part 2: Discussion | Characterizing Cloud Computing Hardware Reliability; RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation | | |
| 16.07.25 | Demos and Presentations | | Assignment 4 due | |
