Below are some projects that our group is actively working on. If you are a Ph.D. student, email Sanidhya.
Our group also has some specific semester and optional projects for students at EPFL.
Transient Operating System Design
The main goal of this large project is to dynamically modify various OS subsystems to cater to heterogeneous hardware and varying application requirements. Most prior work focuses on IO, while our focus is mostly on concurrency. In particular, we are exploring how applications can fine-tune the concurrency control mechanisms and the underlying stack to improve their performance. Some of the projects are as follows:
- A concurrency control runtime to efficiently switch between locks at various granularities.
- A new low-level language to support lock design while ensuring lock properties, such as mutual exclusion, starvation avoidance, and fairness.
- A lightweight hypervisor that caters to various forms of virtualization, from bare metal to serverless.
- Re-architecting the OS for microsecond-scale IO.
We will further extend this project to reason about data structures' concurrency and consistency.
Scalable Storage Stack
As IO devices become blazing fast, saturating them is becoming difficult. Unfortunately, the major bottleneck is the current OS stack, whose design still dates from the 2000s. As part of this large project, we are looking at ways to redesign the OS stack to support fast storage devices. We are also working on new ways to improve the design of file systems. Some of the projects are as follows:
- Designing new techniques to saturate and scale operations for various storage media.
- Understanding the implications of storage class memory compared with traditional storage media, such as SSDs.
- Designing new storage engines for upcoming storage media, such as ZNS SSDs.
- Offloading the file system stack to computational SSDs.
Concurrency Primitives and Frameworks
We are particularly interested in synchronization primitives and concurrency frameworks. We are designing new primitives that squeeze further performance out of hardware in two scenarios: heterogeneous hardware (such as big.LITTLE architectures and high-bandwidth memory) and rack-scale systems. We are also revisiting some existing primitives and reasoning about their practicality. Some of the ongoing projects are as follows:
- Revisiting the design of locking primitives for very large multicore machines.
- Redesigning concurrency primitives for microsecond-scale applications in a rack-scale environment.
- Reasoning about various bugs in a concurrent environment.
Projects for Bachelor's and Master's students
[Scalable OS] Admission control for system calls (Vishal Gupta)
Multi-threaded applications increasingly use system calls to access shared resources (network, IO, CPU, memory). In the current design, all system calls issued in parallel are admitted and then contend for these resources. Beyond a certain point, however, increasing the parallelism of system calls yields diminishing returns. This project aims to characterize this threshold and implement an admission control mechanism for system calls.
In this project, you will:
- Determine, for a set of system calls, the parallelism thresholds beyond which returns diminish.
- Implement an admission control mechanism and use it to enforce a threshold-based admission control policy.
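To make the two steps above concrete, here is a minimal user-space sketch in Python. It is only an illustration of the idea, not the kernel mechanism the project would build. The names `find_threshold`, `AdmissionController`, and `run_callers` are hypothetical, the throughput numbers are made up, and the "first level where the relative gain drops below `min_gain`" rule is just one assumed way to define the threshold.

```python
import threading
from contextlib import contextmanager


def find_threshold(throughput, min_gain=0.05):
    """Return the lowest measured parallelism level after which the next
    measured level improves throughput by less than `min_gain` (relative).

    `throughput` maps parallelism level -> measured throughput (> 0).
    """
    levels = sorted(throughput)
    for current, nxt in zip(levels, levels[1:]):
        gain = (throughput[nxt] - throughput[current]) / throughput[current]
        if gain < min_gain:
            return current
    return levels[-1]  # no diminishing returns seen in the measured range


class AdmissionController:
    """Admit at most `limit` callers at once; later callers block."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._slots = threading.Semaphore(limit)
        self._stats_lock = threading.Lock()
        self.in_flight = 0  # callers currently admitted
        self.peak = 0       # highest concurrency ever admitted

    @contextmanager
    def admit(self):
        self._slots.acquire()  # wait for a free slot
        with self._stats_lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        try:
            yield
        finally:
            with self._stats_lock:
                self.in_flight -= 1
            self._slots.release()


def run_callers(controller, n_callers, operation):
    """Run `operation` from n_callers threads, each gated by the controller."""
    def caller():
        with controller.admit():
            operation()

    threads = [threading.Thread(target=caller) for _ in range(n_callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Made-up measurements: parallelism level -> operations per second.
SAMPLE_THROUGHPUT = {1: 100.0, 2: 190.0, 4: 340.0, 8: 350.0, 16: 345.0}
```

With these sample numbers, `find_threshold(SAMPLE_THROUGHPUT)` picks 4 as the limit, and an `AdmissionController(4)` then never lets more than four callers run the guarded operation at the same time. A real policy would likely need per-system-call thresholds and measurements taken on the target hardware.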
Prerequisites:
- Comfortable exploring large codebases like the Linux kernel.
- Have a basic understanding of how operating systems work.
References:
[Scalable OS] Continuous lock switching across Userspace and Kernel space (Vishal Gupta)
Applications are becoming increasingly complex and are being deployed on heterogeneous hardware (NUMA, AMP, etc.). However, in current systems, the lock mechanism remains static. SynCord is the first framework to implement dynamic lock switching for the kernel. This work proposes to extend the framework to userspace applications. The end goal is a holistic framework that can switch any lock across userspace and kernel space.
In this project, you will:
- Implement dynamic lock switching for userspace applications.
- Implement an interface for specifying lock policies across userspace and kernel space.
- Implement a mechanism to enforce lock policies across the stack.
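As a rough illustration of what dynamic lock switching means in userspace, the Python sketch below replaces the implementation behind a lock while threads are using it. This is not how SynCord works internally. The `SwitchableLock` and `SpinLock` classes are hypothetical, and the protocol shown (re-check the active implementation after acquiring it, and switch only while holding the old one) is one simple assumed design.

```python
import threading
import time


class SpinLock:
    """Polling lock built on a non-blocking try-acquire."""

    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # yield so the current holder can make progress

    def release(self):
        self._flag.release()


class SwitchableLock:
    """Lock whose underlying implementation can be replaced at run time."""

    def __init__(self, impl):
        self._impl = impl  # any object with acquire() and release()

    def acquire(self):
        while True:
            impl = self._impl
            impl.acquire()
            if impl is self._impl:
                return      # still the active implementation: lock is held
            impl.release()  # a switch happened while we waited; retry

    def release(self):
        # No switch can complete while the lock is held, so this is the
        # same implementation that acquire() locked.
        self._impl.release()

    def switch(self, new_impl):
        """Install `new_impl` (must be unlocked) once the lock is free."""
        self.acquire()        # wait until no thread is in a critical section
        old = self._impl
        self._impl = new_impl
        old.release()         # waiters on `old` wake up, notice the switch, retry

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()
        return False
```

A caller would write `with lock:` as usual and some policy component would call `lock.switch(SpinLock())` when it decides a different algorithm suits the workload. The sketch ignores fairness during a switch and the cost of the extra check on every acquire, which a real framework would have to address.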
Prerequisites:
- Comfortable exploring large codebases like the Linux kernel.
- Have a basic understanding of how operating systems work.
References:
[Scalable OS] Which lock is the best? (Vishal Gupta)
Applications are becoming increasingly complex and are being deployed on heterogeneous hardware (NUMA, AMP, etc.). A single lock design is not optimal in all cases. A wide range of lock algorithms have been proposed. This work proposes to analyze a set of algorithms and figure out which algorithms work best for which scenarios.
In this project, you will:
- Analyze a set of lock algorithms across different workloads and different hardware (Intel / AMD / ARM / NUMA / AMP).
- Create a set of static or dynamic rules to determine which algorithm is optimal for a given workload on given hardware.
Prerequisite:
- Comfortable exploring performance anomalies across different hardware/software.
References:
[Scalable OS] Faster Uprobes using User Mode eBPF (Kumar Kartikeya Dwivedi)
BPF-based tracing is used to execute programs and collect data when certain functions are triggered in the kernel. The same is possible in user space using the ‘user probes’ (uprobes) feature, where programs are executed when USDT probes are triggered within user space applications. However, the current implementation of uprobes traps into the kernel whenever an event occurs, which slows applications down and can be up to 2x slower than a system call context switch. This project explores whether introducing a new ‘user mode eBPF’ mode, and having uprobes execute such programs in user space, is faster while retaining the same level of usability. The ideal end goal is transparent, 100% compatibility with the current uprobe mechanism.
In this project, you will:
- Develop a deep understanding of the eBPF verifier’s static analysis process.
- Create a new ‘user mode eBPF’ program type for eBPF.
- Measure and benchmark usability and performance differences between the current uprobe mechanism and the one based on user mode eBPF.
Prerequisites:
- A basic understanding of eBPF.
- Proficiency in C and Python.
References:
[Systems for ML] Improving the performance of ML workloads (Yueyang Pan)
ML workloads are at center stage in the evolution of 21st-century computing. However, the software that runs them is not entirely efficient. Thus, to utilize current hardware efficiently, whether for inference or training, within a single machine or across machines, we need to understand and redesign the current software stack.
In this project, you will:
- Analyze and understand the overhead of the current software stack.
- Produce a complete breakdown of the costs incurred within a single machine and across machines, for both inference and training.
- Propose a set of optimizations that improve the performance of such systems.
Prerequisites:
- Experience using existing ML software.
- Basic knowledge of ML algorithms.
You will learn:
- How to understand the performance of software systematically.
[Robustness] Understanding semantic bugs in distributed applications (Tao Lyu)
Distributed applications, such as ZooKeeper, HDFS, and Redis, are prone to semantic bugs because of their complex program logic. Detecting these bugs is challenging because they occur silently, without causing any process crashes. One approach is to manually review commits, patches, or bug reports to identify the patterns of specific semantic bug types across a group of applications. This information can then be used to design checkers that detect these bugs.
In this project, you will:
- Collect semantic bugs from popular distributed applications.
- Analyze their root causes and patterns.
- If time permits, develop the corresponding checkers.
Prerequisites:
- Bug analysis (static and dynamic analysis).
- Basic knowledge of distributed applications.
References:
- https://dl.acm.org/doi/pdf/10.1145/2872362.2872374
- https://dl.acm.org/doi/pdf/10.1145/3296957.3177161
- https://dl.acm.org/doi/pdf/10.1145/3093337.3037735
If you have a project idea that is not listed above but falls within our group's interests, feel free to contact us.