Below are some projects that our group is actively working on. If you are a Ph.D. student, email Sanidhya.

Our group also has some specific semester and optional projects for students at EPFL.

Transient Operating System Design

The main goal of this project is to dynamically modify various OS subsystems to cater to heterogeneous hardware and varying application requirements. Most prior work focuses on IO, while our focus is mostly on the concurrency aspect. In particular, we are exploring how applications can fine-tune the concurrency control mechanisms and the underlying stack to improve their performance. Some of the projects are as follows:

  1. A concurrency control runtime to efficiently switch between locks at various granularities.
  2. A new low-level language for designing locks while ensuring lock properties, such as mutual exclusion, starvation avoidance, and fairness.
  3. A lightweight hypervisor that caters to various forms of virtualization, from bare-metal to serverless.

We will further extend this project to reason about data structures' concurrency and consistency.

Scalable Storage Stack

IO devices have become so fast that saturating them is a difficult task. Unfortunately, the current OS storage stack, still largely designed in the era of the 2000s, is the major bottleneck. As part of this project, we are looking at ways to redesign the OS stack to support fast storage devices, including new ways to improve the design of file systems. Some of the projects are as follows:

  1. Designing new techniques to saturate and scale operations for various storage media.
  2. Understanding the implications of persistent memory (PM) compared with traditional storage media, such as SSDs.
  3. Offloading the file system stack to computational SSDs.
  4. Re-architecting the OS for microsecond-scale IO.

Concurrency Primitives and Frameworks

Given our particular interest in designing new synchronization primitives and concurrency frameworks, we are designing primitives that further squeeze performance out of hardware in two scenarios: heterogeneous hardware (such as ARM big.LITTLE architectures and high-bandwidth memory) and rack-scale systems. We are also revisiting some existing primitives to reason about their practicality. Some of the ongoing projects are as follows:

  1. Making advanced synchronization primitives (combining/delegation) a reality.
  2. Revisiting the design of locking primitives for very large multicore machines.
  3. Reasoning about various bugs in a concurrent environment.

Projects for Bachelors and Masters students

[Concurrency] Scalable Range Locks

Range locks are a concurrency primitive that allows threads accessing disjoint ranges of shared data to proceed in parallel. Because of this, they are becoming part of file system and memory-management operations in the Linux kernel. The aim of this project is to design and implement a new range lock that provides the best of both worlds: a low memory footprint and high throughput under both low and high contention.

In this project, you will:

  • Profile the file system and memory-management subsystems to understand the bottlenecks in the current range-lock implementation.
  • Implement a range lock and compare the performance with other state-of-the-art range lock designs.

You will learn about:

  • Synchronization primitives.
  • Benchmarking using software and hardware performance counters.
  • The file system and memory-management subsystems.

[Robust Systems] Developing benchmarks for metastable failures

Metastable failures are a class of (mostly) distributed-systems failures that arise from the interaction of the various microservices in an application, even when the individual components do not exhibit these failures on their own. This class of failures mostly affects datacenter-scale applications. A key aspect is that a system that enters a metastable failure state usually cannot recover from it without human intervention or a system reboot. Moreover, these failures can be cascading in nature, causing outages that last for hours. In this project, we would like to develop benchmarks to study these bugs better and come up with resolution mechanisms.

In this project, you will:

  • Study open reports of major outages at companies running applications at datacenter scale.
  • Write a set of benchmarks, implemented as microservices, with configurations that can trigger metastable failure states.

You will learn:

  • Distributed systems.
  • Docker, Kubernetes (k8s), microservice architecture, and cloud application design.
  • Bug detection, testing and debugging.

[Robust Systems] Checking semantic correctness of concurrent file system operations

Semantic bugs in file systems (e.g., specification violations, logic bugs, and crash-consistency bugs) can lead to data loss, file system unavailability, and more, so it is critical to discover and fix them. There is existing work on semantic-correctness checking. For example, SibylFS checks whether the results of a sequence of file operations are allowed by a mathematically rigorous model of the POSIX specification, and Hydra emulates the allowed file system states after a crash to discover crash-consistency bugs. However, these tools only consider single (sequential) executions, leaving concurrent executions unexplored.

In this project, you will:

  • Collect existing semantic bugs caused by concurrent file operations.
  • Analyze how semantic bugs differ between sequential and concurrent execution scenarios.
  • Propose and implement a method to emulate file system states in the concurrent execution scenario.
  • Compare actual execution results with the emulated states to determine correctness.

You will learn:

  • Semantic bug analysis and detection.
  • Distributed file systems.
  • Fuzzing.

[Storage] Understanding new storage devices

Storage devices come in various form factors. For instance, today's machines are already equipped with flash-based storage devices, such as SSDs. However, the stack supporting current SSDs has several performance issues, mostly because of how SSDs are designed. To avoid some of these problems, new devices are coming to market, such as Zoned Namespace (ZNS) SSDs, SmartSSDs, and persistent memory (PM). Hence, this project aims to understand the performance of the storage stack for these new storage media.

In this project, you will:

  • Evaluate the performance characteristics of any of these devices.
  • Design a set of benchmarks that specifically target such devices for various scenarios.
  • Contrast their performance with SSDs.

You will learn:

  • File system design.
  • Scalability of storage stacks.
  • Behavior of existing storage hardware.

[Storage] Efficient memory copy for PM file systems

Persistent memory (PM) is a high-performance storage medium that is faster than other storage media. This project will investigate a simple yet very important memory operation: memory copying. In particular, it will examine how developers can use advanced hardware features, such as SIMD instruction sets and I/OAT DMA, to copy data efficiently.

In this project, you will:

  • Study the performance characteristics of PM.
  • Develop mechanisms based on existing hardware features to perform efficient memory copying with PM.

You will learn:

  • Advanced hardware features for memory copy.

[Storage] NVMe dummy device backed by memory for Linux NVMe driver

One branch of our research focuses on improving the performance of the Linux kernel storage stack. However, the stack's overhead is not significant with traditional NAND-based SSDs, because their latency is too high to expose it. Hence, we must use the very expensive and now-discontinued Intel Optane DC SSDs (based on 3D XPoint) to actually benchmark our improvements.

This means we cannot realistically measure the bottlenecks in the storage stack until new, faster storage devices exist. But we don't want to wait for new hardware to appear: we want fast devices before they exist, so that we can adapt the software stack ahead of time.

An NVMe dummy device backed by main memory would give us a virtual SSD with extremely low latency and extremely high throughput. With such a device, we can build the storage stack of the future.

In this project, you will:

  • Study Linux kernel driver for NVMe devices.
  • Implement a virtual NVMe device backed by main memory.
  • Optimize it for the lowest possible latency/highest possible throughput.
  • Benchmark its performance.

You will learn:

  • How to read the Linux kernel codebase.
  • How to work with high-quality, real-world code.
  • The tools and practices needed for Linux kernel development.
  • How to write high-quality, high-performance code.
  • How to contribute to the Linux kernel.

In case you have project ideas that are not mentioned above but fall under the purview of our group's interests, feel free to contact us.