Topics

Topic 1: Hybrid ML models for the segmentation of materials science image data

When analyzing material microstructures, datasets often exhibit high variability due to differences in chemical composition, manufacturing process parameters, and microscope settings. This variability poses a significant challenge for machine learning (ML) methods, particularly in tasks such as image classification and segmentation. While metadata associated with microstructure images—such as composition, processing conditions, and imaging parameters—can explain a substantial portion of this variability, it is typically not incorporated into ML models. Leveraging this additional information has the potential to improve model robustness and performance in various applications. The objective of this student project is to develop a hybrid segmentation model based on a U-Net architecture that integrates both image data and tabular metadata. Different strategies for fusing image features with metadata will be implemented and systematically evaluated. The performance of these hybrid approaches will be compared against a baseline U-Net model that relies solely on image data.

Data is available and will be made accessible.

Project Donor: Martin Müller & Marie Stiefel (Materials Design – Data-Driven Materials Design)

-------------------------------------------------------------------------------------------------------------

Topic 2: Enhancing Dataset Diversity in Microscopy Images via Generative Augmentation

This project investigates how to improve the diversity of microscopy image datasets using a combination of feature-space analysis and generative models. The provided dataset consists of images of the same sample region acquired with different microscopes and imaging settings, enabling controlled analysis of variability introduced by acquisition conditions.

First, images shall be embedded into a latent feature space using pretrained or self-supervised models. Based on this representation, the students shall analyze dataset coverage and identify underrepresented regions (“empty spaces”) using distance- or density-based methods.

Second, a one-to-many image translation model shall be developed to generate realistic variations of a given image that mimic different imaging conditions. The goal is to assess whether such generative augmentations provide greater diversity than classical techniques such as rotation or flipping, and whether they help fill gaps in feature space. Finally, the impact of these augmentation strategies on downstream tasks such as segmentation or classification will be evaluated.

Data is available and will be made accessible.

Project Donor: Björn Bachmann & Marie Stiefel (Materials Design – Data-Driven Materials Design)

-------------------------------------------------------------------------------------------------------------

Topic 3: The impact of recent gas price surges and policy responses on mobility patterns in Germany

The student team will investigate whether and how consumers adapt their refuelling behavior to the planned legal reform according to which gas stations may adjust their prices only once per day. Based on Google Live Busyness data for gas stations, the project will analyze whether visits become more concentrated at certain times of day, whether the intra-daily and weekly seasonal pattern changes, and whether there are differences across station or location types. Depending on the final data coverage, potential substitution effects with other mobility-related locations could also be explored.

Google Live Busyness data is available and will be made accessible. Gas prices must be collected by the students.

Project Donor: Fabian Hollstein & Jan Piontek (Economics – Quantitative Methods)

-------------------------------------------------------------------------------------------------------------

Topic 4: Best of both worlds? Integrating Computational Social Science and Political Survey Research

The contemporary public sphere is increasingly structured by digital communication on platforms such as social media. However, political science research still largely relies either on traditional survey-based approaches or on large-scale digital trace data, rarely combining both in a systematic way. This creates a gap between observable political behaviour online and self-reported political attitudes. This project explores how digital trace data (such as interactions, engagement patterns or network relations on online platforms) could be integrated with traditional survey-based measures of political attitudes and values.

Rather than collecting individual-level behavioural data, the project will focus on the conceptual and methodological challenges of linking these different data sources. Possible tasks include developing a research design for combining digital trace data with survey measures, analysing existing examples from computational social science, and exploring methodological approaches such as network analysis or machine learning–based content classification.

A central component of the project may be the design of a detailed research protocol addressing the ethical and legal requirements of such studies, including the preparation of a hypothetical ethics (ERB) application for research involving digital trace data and survey data integration. Because the collection of individual-level digital trace data requires consent procedures and ethics approval, the project will focus on conceptual design and methodological exploration rather than the implementation of a full data donation setup.

Project Donor: Philipp König (Political Science)

-------------------------------------------------------------------------------------------------------------

Topic 5: Between “L-takes” and “Löwengebrüll”: Mapping Ideological Structures in the Network of German Streamers and Influencers

Social media platforms such as YouTube or Twitch have produced a new class of influential public actors: streamers and influencers who reach large audiences and increasingly comment on political topics. While political communication research traditionally focuses on parties, journalists or political elites, much less is known about how these digital content creators shape political discourse and how they relate to each other within platform ecosystems.

This project aims to map the political communication and network structures of influencers and streamers using publicly available platform data. One possible approach is to focus on creators who explicitly produce political content and analyse their ideological positioning through content analysis and Natural Language Processing (NLP). Students could collect platform data (e.g. through APIs or scraping), analyse engagement metrics such as views, likes and follower counts, and construct collaboration or interaction networks based on reactions, joint content, or references between creators. Methods such as clustering, centrality analysis and topic modelling may be used to identify ideological clusters, discursive topics, and patterns of cooperation.

Relevant data is largely publicly accessible via platform APIs (e.g. YouTube) or other web-based data collection methods, making the project technically feasible if the scope is limited to one platform.

Project Donor: Philipp König (Political Science)

-------------------------------------------------------------------------------------------------------------

Topic 6: Microtentacle-positive cells detection -- A quantification

Scientific Problem: Following the formation of a primary tumor, cancer cells in the outer regions can detach and undergo an epithelial-to-mesenchymal (EMT) transition, enabling them to migrate and colonize new tissues, ultimately leading to metastasis. Circulating tumor cells (CTCs) play a key role during invasion. A crucial step towards invasion is the extravasation of CTCs from the bloodstream, which occurs only after the cells have attached to the blood vessel walls. Recent work suggests that such adhesion is mediated by microtubule (MT)-based membrane protrusions, known as microtentacles (McTNs). However, it remains unclear how McTNs protrude from the cell and how they facilitate cell adhesion. In our group, we combine different types of microscopies to analyze McTNs formation. One of the main limitations of our research is processing large amounts of data in a short time, and the inability to automate time-consuming tasks. McTNs can be easily identifiedusing fluorescence microscopy. However, this technique is time-consuming. We require a tool to identify and quantify McTN-positive cells in large cell populations using only bright-field microscopy. This represents a major challenge, as bright field offers little contrast compared to fluorescence microscopy, and McTNs are small and thin structures.

We have a significant amount of already available data, and more will be produced regularly. All data is available to students.

Commitment to supervision: Enrique Colina, alongside two other members of the project, is committed to meeting with the students and assisting them.

Project Donor: Prof. Dr. Franziska Lautenschläger & Enrique Colina Araujo (Experimental Biophysics)

-------------------------------------------------------------------------------------------------------------

Topic 7: Cost-Aware Machine Learning for Accelerated Drug Discovery

In the early stages of drug discovery, scientists use a process called Virtual Screening (VS) to search through massive digital libraries of billions of molecules to find "hits"-chemicals that might bind to a target protein and treat a disease. To predict how well a molecule fits into a protein, researchers use Docking Workflows: combinations of different physical simulations (docking tools) and statistical models (scoring functions).

However, a major bottleneck exists: some simulations are highly accurate but computationally "expensive" (taking minutes per molecule), while others are fast but less reliable. In a compound library (containing billions of compounds), running every simulation on every molecule is impossible. This project implements a regularized model to solve this optimization problem. By treating docking tools and scoring functions as interacting components in a matrix, the model learns which specific combinations are most predictive of actual biological activity. We apply a unique Cost-Weighted L1 Penalty during training, which forces the model to "prune" the workflow, automatically identifying the Pareto Optimal pipeline-the specific subset of tools that provides the highest accuracy for the lowest computational cost.

Once the optimal, high-efficiency pipeline is identified, the project extends into active learning. The model acts as an "intelligent agent" to explore the Enamine Representative Set. Instead of brute-force searching, the system iteratively selects the most promising molecules to simulate, uses those results to refine its internal logic, and rapidly homes in on potential drug hits. This approach demonstrates how ML can transform drug discovery from an exhaustive search into a targeted, budget-conscious strategy, drastically reducing the time and capital required to find the next generation of medicines.

Data is available and will be made accessible.

Project Donor: Andrea Volkamer & Michael Backenköhler (Data-Driven Drug Design)

-------------------------------------------------------------------------------------------------------------

Topic 8: Investigating Non-i.i.d. Effects in Tennis

Tennis professionals align on the believe that the sport is highly affected by psychological phenomena such as clutch performance (better players perform well in important situations) or choking under pressure (some players perform worse in important situations). This narrative gets underpinned by the fact that the best tennis players won on average only little more than 50% of their points (top 10 ATP player = 53.4%, top 10 WTA players = 54.1%) but over 75% of their career matches (top 10 ATP player = 78.2%, top 10 WTA players = 78.3%). This behavior gives rise to the notion that the top players must differentiate themselves from the average through other factors such as winning the important points. In opposition to this observation, a lot of studies assume that points are independent and identically distributed (i.i.d.) to predict tennis matches. Although, some studies take these psychological side- effects as self-evident, we demonstrated that by simulating through thousands of player careers that the simulated match-win percentages (based on an i.i.d. model) mirrored the observed match-win percentages relatively stable. We also showed that up to 95.8% of match-win percentage variance could be explained by the players career point win percentage using linear regression. However, we also showed that the relationship between a players serve and return rate and the match-win percentage follows a rather sigmoid relationship. Therefore, the aim of this project is to rebuilt the models in and implement a non-linear modeling approach.

Data is available from Jeff Sackmanns Github repository.

Project Donor: Pascal Bauer & Luis Holzhauer (Sports Analytics)

-------------------------------------------------------------------------------------------------------------

Topic 9: Quantify the relative role of skill and chance in determining match outcome of badminton and table tennis competitions

Sports feature compelling suspense—a blend of skill and luck (chance). A purely skill-determined competition would lack dramatic tension, whereas a luck-dominated league would feel like meaningless randomness. This delicate skill–luck balance shapes competition attractiveness and motivates governing bodies to optimize competition formats that reward stronger teams while pre- serving enough variance to keep spectators engaged. A variance decomposition method that separates match-outcome variance into skill and chance components was previously established to measure this balance. Because this established approach splits a season chronologically (e.g., first and second half), it assumes a balanced schedule with opponent strength evenly distributed over time. This assumption may be violated in unbalanced schedules or certain competition formats, such as single-elimination (knockout) tournaments (e.g., Olympic badminton and table tennis competitions). To extend the method to general sports, we have explored an adaptation that incorporates competitor-strength models, enabling opponent-strength-adjusted decomposition under different competition formats. Nevertheless, the relative role of skill and chance in racket sports, such as badminton and table tennis, remains largely unknown. Altogether, the objective of this project is to apply or adapt the updated variance decomposition method to badminton and table tennis competitions.

Data is available through the Sportsradar API for both badminton and table tennis. In particular, the API provides access to season summaries, match meta- data (e.g., date and location), athlete information (e.g., rankings and points), as well as match results (e.g., final results and set-by-set scores). In terms of coverage, the currently accessible dataset includes more than 470 competitions, more than 620 seasons, and more than 23,000 matches for badminton and table tennis. Exported data can also be shared with students in a structured format for further analysis.

Project Donor: Pascal Bauer, Minho Lee & Guangze Zhang (Sports Analytics)

-------------------------------------------------------------------------------------------------------------

Topic 10: Surface-Specific Prediction of Serve and Return Performance in Tennis

Player performance in tennis varies substantially across various court surfaces due to differences in ball speed, bounce characteristics, and surface dynamics. Hard courts, clay courts, and grass courts produce different match conditions, which influence serve effectiveness, rally length, and player movement patterns. As a result, modeling player performance at the level of serve and return rates can provide an intuitive representation of player’s ability. Measuring serve and return rates —in general or per-surface —can be estimated from historical match data and serve as key inputs for higher-level models such as match outcome prediction or in-game win probability models. Previous work has shown that incorporating contextual factors such as court surface can improve the estimation of these performance measures [1, 5, 10]. The objective of this task is to predict serve and return win probabilities for a given match between player A and player B on a specific surface. You will be provided with a dataset containing match-level information, including player identifiers, match outcome, serve and return statistics, and court surface type. The goal is to evaluate whether incorporating surface information improves predictive performance compared to surface-agnostic approaches. To guide the implementation, you may start with a simple baseline approach inspired by [2], where player-specific serve and return rates are adjusted relative to tournament level averages on a given surface. For example, a player’s expected serve performance can be estimated by combining their historical serve win rate with the average return performance of their opponent, normalized by the tournament average on that surface. As an extension, the prediction task can also be formulated as a supervised learning problem, where the target variable is the observed serve or return point-win rate in a match, and input features may include player-specific historical averages (global and surface-specific), opponent statistics, and contextual variables such as court surface type.

The task is structured into two steps. In the first step, you should estimate player-specific serve and return performance using historical averages, ignoring opponent-specific information. In the second stage, you should extend this baseline by incorporating opponent information from historical matches. The main objectives you should cover in this task are:

• Compute historical serve and return rates for each player, both globally and per-surface type

• Predict serve and return performance for unseen matches using player-specific computed historical averages

• Evaluate predictive performance with and without surface information

• Extend the baseline by incorporating opponent strength (Hint: See Elo-based methods)

• Compare the added value of opponent aware models relative to simple historical baseline

Data is available from Jeff Sackmanns Github repository.

Project Donor: Pascal Bauer & Abdelrahman Abdelsamad (Sports Analytics)

-------------------------------------------------------------------------------------------------------------

Topic 11: Multi-Aspect Network Performance Analysis of Multi-Parameter Measurements in a 5G campus network

5G offers a myriad of parameters that can be tweaked, which makes configuring networks complex. We are exploring the effects of 5G parameters on network performance and want to make the measurement results accessible for human examination to increase understanding, enable judgment of parameter effects, and conclude beneficial configurations.

The starting point for the project is a data set from end-to-end network performance measurement series in which multiple parameters of the underlying 5G network have been varied, and the task is to make the results from this measurement series visually accessible with respect to multiple performance metrics, e.g., delay and loss. The project can be extended to both the data acquisition and analysis sides. The second could be to expand the set of metrics or enhance individual measurement analysis.

Results of end-to-end network performance measurement series in which multiple 5G campus network configurations have been changed can be made available. The results comprise output from a custom measurement tool. A tool for basic statistical evaluation and visualization of individual measurements also exists

Project Donors: Thorsten Herfet, Marlene Böhmer & tbd (Telecommunications)

-------------------------------------------------------------------------------------------------------------

Topic 12: Auditing platform algorithms using the pro-choice / pro-life posts on Instagram

In March 2025, Meta announced that an AI system was being used to personalize comments shown to Instagram users. This system takes different input signals, such as the likelihood of a user to report, delete, reply, or scroll past a comment, to decide how they should be ranked. With extensive research showing how feed personalization systems lead to societal problems, we can assume that this new ranking system would also increase the presence of echo chambers and polarization. For example, on a post about abortion rights, Republicans might see comments supporting conservative pro-life beliefs, while Democrats might see more comments supporting pro-choice. Or the algorithm could do the opposite to increase engagement on the platform. To explore this, the project aims to look at the differences in comments seen by users of different demographics on posts discussing contentious topics such as abortion rights.

Tasks for the group project include collecting around 500 posts containing hashtags related to abortion rights and using existing scrapers to collect top comments on each post from various dummy Instagram accounts. To simulate different users, these dummy Instagram accounts would be created with different gender and political orientations. Post the data collection, the comments should then be analysed to understand the extent of difference between users, and if this variation can be systematically explained by the differences in demographics of the user accounts, using statistical and regression models.

Setup for data collection is tested and will be provided.

Project Donor: Ingmar Weber & Brahmani Nutakki (Societal Computing)

-------------------------------------------------------------------------------------------------------------

Topic 13: Quantum Machine Learning for Small Classification Tasks

This project explores whether concepts from quantum machine learning can be meaningfully applied to small, supervised learning problems. The goal is to implement and evaluate simple quantum-inspired or hybrid quantum–classical models on benchmark classification datasets. Students will first familiarize themselves with the basic ideas behind quantum computing that are relevant for machine learning, including qubits, parameterized quantum circuits, and the distinction between generative and discriminative learning tasks. They will then use existing Python frameworks to build and test a small prototype (for example a variational quantum classifier) and compare its performance to a classical baseline on a toy dataset.

Data is available and will be made accessible.

Project Donor: Giovanna Morigi & Maximilian Kiefer (Theoretical Quantum Physics)

-------------------------------------------------------------------------------------------------------------

Topic 14: Tensor-Network-Based Adaptation Methods for Efficient AI Models

This project investigates tensor-network methods, highly successful and established in Physics, as parameter-efficient tool for adapting machine learning models to new tasks. Recent work suggests that tensor-structured adaptations of high-rank fine-tuning, can surpass the limitations of Low-Rank Adaptation (LoRA), especially when downstream tasks require more expressive updates. In this project, students will study the idea of tensor-based adaptation in a simplified and accessible setting and implement a small prototype in Python, where code is available https://github.com/quanta-fine-tuning/quanta. The practical goal is to compare a tensor-network-inspired adaptation method against a standard low-rank baseline on a manageable task, such as text classification, sentiment analysis, or another lightweight benchmark.

Data is available and will be made accessible.

Project Donor: Giovanna Morigi & Maximilian Kiefer (Theoretical Quantum Physics)

-------------------------------------------------------------------------------------------------------------

Topic 15: Analysing open survey questions - How computer science can help to use their full potential

Surveys in social sciences mostly focus on questions with fixed options for answers. While this reduces complexity and cognitive load for respondents, it also hides a lot of information. Open survey questions can remedy this problem, but they are often time-consuming to analyse, which is why not many surveys include them.

This project aims to discover new ways to analyse respondents' answers to open survey questions. One could first try to replicate classical analytic methods (e.g. word clouds), before they may try to provide new ways of gaining insight from responses to open survey questions with methods from computer science. This could include grouping respondents based on their responses, try to extract different degrees of hostility in responses, or find new ways to gain valuable insights from open survey questions.

This project will use data from the Austrian National Election Study (AUTNES), a publicly available survey, containing respondents' answers to several open survey questions.

Project Donor: Martin Ulrich (Political Science)

-------------------------------------------------------------------------------------------------------------

Topic 16: Investigating Inequality Arguments from Historic Newspapers

Is it fair that some have more than others? Some would say yes, arguing that social inequality can be based on merit. Yet others find that social inequality is excessive or unfair. We aim to trace how such social justice views have changed over decades and centuries, by analyzing how newspapershistorically frame capitalism and the inequality it creates.

The project therefore aims to build and evaluate an efficient, scalable pipeline for identifying and counting arguments about fair and factual inequality in German newspaper articles using Large Language Models. The sociological target is defined by an existing codebook specifying justice dimensions, sub-dimensions, and examples.

From a data science perspective, the main challenge is reliable large-scale processing: students should implement a batch analysis workflow (e.g. via the OpenAI API or via open-source models from Hugging Face) that can handle long articles, minimize cost and latency, and produce structured, machine-readable outputs (e.g. JSON with argument counts per codebook category). Key engineering tasks include efficient text preprocessing, document segmentation strategies for long inputs, parallelization and rate-limit-aware batching, caching and deduplication, robust error handling (retries, timeouts), and experiment tracking for reproducibility. The pipeline can be evaluated on a manually coded subset and with consistency checks (repeated runs, agreement metrics) to assess reliability of LLM-based coding relative to the codebook.

Basic data is available and will be provided. Additional data needs to be scraped from newspapers.

Project Donors: Martin Schröder & Sharif el-Masry (Sociology)

Project Seminar Data Science and Artificial Intelligence

Topics