Projects

Below are some projects that our group is actively working on. If you are a Ph.D. student, email Sanidhya.

Our group also has some specific semester and optional projects for students at EPFL.

Transient Operating System Design

The main goal of this broad project is to dynamically modify various OS subsystems to cater to heterogeneous hardware and varying application requirements. Most prior work focuses on IO, while our focus is mostly on the concurrency aspect. In particular, we are exploring how applications can fine-tune concurrency control mechanisms and the underlying stack to improve their performance. Some of the projects are as follows:

  1. A concurrency control runtime to efficiently switch between locks at various granularities.
  2. A new low-level language to support lock design while ensuring lock properties, such as mutual exclusion, starvation avoidance, and fairness.
  3. A lightweight hypervisor that caters to various forms of virtualization, from bare-metal to serverless.
  4. Re-architecting the OS for microsecond-scale IO.

We will further extend this project to reason about the concurrency and consistency of data structures.

Scalable Storage Stack

IO devices are now blazingly fast, and saturating them has become a difficult task. Unfortunately, the current OS stack, still designed for the hardware of the 2000s, is the major bottleneck. As part of this broad project, we are looking at ways to redesign the OS stack to support fast storage devices, including new file system designs. Some of the projects are as follows:

  1. Designing new techniques to saturate and scale operations for various storage media.
  2. Understanding the implications of storage-class memory compared with traditional storage media, such as SSDs.
  3. Designing new storage engines for upcoming storage media, such as ZNS SSDs.
  4. Offloading file system stack to computational SSDs.

Concurrency Primitives and Frameworks

Given our particular interest in designing new synchronization primitives and concurrency frameworks, we are designing primitives that squeeze more performance out of hardware in two scenarios: heterogeneous hardware (such as big.LITTLE architectures and high-bandwidth memory) and rack-scale systems. We are also revisiting some existing primitives and trying to reason about their practicality. Some of the ongoing projects are as follows:

  1. Revisiting the design of locking primitives for very large multicore machines.
  2. Redesigning concurrency primitives for microsecond-scale applications in a rack-scale environment.
  3. Reasoning about various bugs in a concurrent environment.

Projects for Bachelor's and Master's students

[Performance/eBPF] Performance Modeling of eBPF Programs (Tao Lyu)

eBPF enables 1) customizing the policies of kernel subsystems (e.g., scheduling policy, page eviction policy) and 2) offloading application logic to the kernel as a fast path. In both scenarios, eBPF programs (also known as kernel extensions) reside at the performance-critical paths. Therefore, they should be as performant as possible. Otherwise, eBPF programs can be counterproductive: instead of improving kernel performance, they actually degrade it.

Unfortunately, performance issues are common because (1) eBPF applications are updated frequently, often without rigorous performance review, making them more error-prone; and (2) many eBPF developers aren’t kernel experts and lack familiarity with the performance characteristics and micro-optimizations of hot kernel paths. Unlike safety, which the eBPF verifier strictly guarantees, performance is an equally important concern that the ecosystem overlooks, overshadowed by the focus on safety.

With this project, we call on the community to also focus on the performance clarity of eBPF programs. Concretely, we aim to design a scalable and accurate performance model for eBPF programs. Acting as a building block, this model can then be used in various scenarios, including:

  • Performance quantification: quantify the performance of each line of a program, with explanations that help developers understand its cost.
  • Performance debugging: based on the quantification, adjust programs manually or semi-automatically to achieve better performance.
  • Performance checking: check a new program’s performance against functionally equivalent baselines that serve as oracles. Unlike safety verification, this check is optional for developers rather than mandatory.

This project targets students who want to do a thesis or long-term research.

[Memory Management] Memory allocators in tiered memory (Musa Unal)

Memory management is a critical component of operating systems. Modern data centers rely on different types of memory (with different latencies and bandwidths) to manage data, which is also referred to as tiered memory. This project aims to understand how profiling can help identify optimal allocation sites in tiered memory.

In this project, you will:

  • Learn how PGO (profile-guided optimization) can help an application improve its performance.
  • Learn how memory allocators work and manage data.
  • Understand how CXL (Compute Express Link) works.

Prerequisites:

  • A basic understanding of how operating systems work.
  • Proficiency in C/C++.

[Memory Management] Energy consumption in tiered memory (Musa Unal)

Memory management is a critical component of operating systems. Modern data centers rely on different types of memory (with different latencies and bandwidths) to manage data, which is also referred to as tiered memory. This project aims to understand the trade-offs between energy consumption and performance in tiered memory.

In this project, you will:

  • Understand how Linux's memory tiering system works.
  • Understand how to measure energy consumption in memory tiering.
  • Understand how CXL (Compute Express Link) works.

Prerequisites:

  • A basic understanding of how operating systems work.
  • Proficiency in C/C++.

[Scalable OS] Linux Kernel Data Structure Switching (Vishal Gupta)

The Linux kernel uses multiple data structures across different subsystems to store data, including linked lists, hash tables, red-black trees, and maple trees. However, depending on the access pattern, a given data structure may not be optimal.

In this project, you will:

  • Implement mechanisms to switch between different kernel data structures.
  • Implement policies to decide when to switch.

Prerequisite: an understanding of kernel data structures.

[Scalable OS] Faster Uprobes using User Mode eBPF (Kumar Kartikeya Dwivedi)

BPF-based tracing executes programs and collects data when certain functions are triggered in the kernel. The same is possible in user space via the ‘user probes’ (uprobes) feature, where programs are executed when USDT probes are triggered within user-space applications. However, the current uprobe implementation traps into the kernel whenever an event occurs, slowing applications down; a uprobe hit can be up to 2x slower than a system-call context switch. This project explores whether introducing a new ‘user mode eBPF’ program type, and making uprobes execute such programs in user space, can be faster while offering the same level of usability. The ideal end goal is transparent, 100% compatibility with the current uprobe mechanism.

In this project, you will:

  • Develop a deep understanding of the eBPF verifier’s static analysis process.
  • Create a new ‘user mode eBPF’ program type for eBPF.
  • Measure and benchmark usability and performance differences between the current uprobe mechanism and the one based on user mode eBPF.

Prerequisites:

  • A basic level of understanding of eBPF.
  • Proficiency in C and Python.

[Systems for ML] Improving the performance of ML workloads (Yueyang Pan)

ML workloads have taken center stage in 21st-century computing. However, the software that runs them is not entirely efficient. Thus, to efficiently utilize current hardware, whether for inference or training, within a single machine or across machines, we need to understand and redesign the current software stack.

In this project, you will:

  • Analyze and understand the overhead of the current software stack.
  • Produce a complete breakdown of the costs within a single machine and across machines, for both inference and training.
  • Propose a set of optimizations that improve the performance of such systems.

Prerequisites:

  • Experience using existing ML software.
  • Basic knowledge of ML algorithms.

You will learn:

  • How to systematically understand the performance of software systems.

In case you have project ideas that are not mentioned above but fall under the purview of our group's interests, feel free to contact us.