2025
@inproceedings{yujie:polystore, title = {{PolyStore: Exploiting Combined Capabilities of Heterogeneous Storage}}, author = {Yujie Ren and David Domingo and Jian Zhang and Paul John and Rekha Pitchumani and Sanidhya Kashyap and Sudarsun Kannan}, booktitle = {Proceedings of the 23rd USENIX Conference on File and Storage Technologies (FAST)}, month = feb, year = 2025, address = {Santa Clara, CA}, }
2024
The ability to safely extend OS kernel functionality is a long-standing goal in OS design, with the widespread use of the eBPF framework in Linux and Windows demonstrating the benefits of such extensibility. However, existing solutions for kernel extensibility (including eBPF) are limited and constrain users either in the extent of functionality they can offload to the kernel or in the performance overheads their extensions incur.
We present KFlex: a new approach to kernel extensibility that strikes an improved balance between the expressivity and performance of kernel extensions. To do so, KFlex separates the safety of kernel-owned resources (e.g., kernel memory) from the safety of extension-specific resources (e.g., extension memory). This separation allows KFlex to enforce each safety property with a distinct, bespoke mechanism (automated verification and lightweight runtime checks, respectively), enabling the offload of diverse functionality at low runtime overhead.
We realize KFlex in the context of Linux. We demonstrate that KFlex enables users to offload functionality that cannot be offloaded today and provides significant end-to-end performance benefits for applications. Several of KFlex’s proposed mechanisms have been upstreamed into the Linux kernel mainline, with efforts ongoing for full integration.
@inproceedings{dwivedi:kflex, title = {{Fast, Flexible, and Practical Kernel Extensions}}, author = {Kumar Kartikeya Dwivedi and Rishabh Iyer and Sanidhya Kashyap}, booktitle = {Proceedings of the 30th ACM Symposium on Operating Systems Principles (SOSP)}, month = nov, year = 2024, address = {Austin, TX}, }
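As a rough userspace illustration of the split that KFlex exploits, the sketch below confines extension-memory accesses with a cheap runtime mask; the names and the masking scheme are ours, not KFlex's. The idea is that such lightweight checks can police extension-owned state at runtime, while kernel-owned pointers would remain under static verification.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical extension arena: power-of-two sized so that a cheap
     * mask keeps every offset inside extension-owned memory. */
    #define ARENA_SIZE (1u << 20)        /* 1 MiB, must be a power of two */

    static uint8_t arena[ARENA_SIZE];

    /* Runtime check for extension-specific memory: instead of statically
     * verifying every pointer, clamp the offset into the arena. Kernel-
     * owned pointers would stay under the static verifier instead. */
    static inline uint8_t *ext_access(uint64_t off)
    {
        return &arena[off & (ARENA_SIZE - 1)];
    }

    int main(void)
    {
        *ext_access(42) = 7;               /* in-bounds store */
        *ext_access(ARENA_SIZE + 42) = 9;  /* wild offset, safely wrapped */
        printf("%d\n", *ext_access(42));
        return 0;
    }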
Distributed file systems (DFSes) are prone to bugs. Although numerous bug-finding techniques have been applied to DFSes, static analysis does not scale well with the sheer complexity of DFS codebases, while dynamic methods (e.g., regression testing) are limited by the quality of test cases. Although both can be improved by investing manual effort, they are less practical when facing a diverse set of real-world DFSes. Fuzzing, on the other hand, has shown great success on local systems. However, several problems arise when applying existing fuzzers to DFSes: they 1) cannot test the multiple components of a DFS holistically; 2) miss critical testing aspects of DFSes (e.g., distributed faults); 3) have not yet explored practical state representations as fuzzing feedback; and 4) lack checkers for asserting semantic bugs unique to DFSes.
In this paper, we introduce Monarch, a multi-node fuzzing framework that tests all POSIX-compliant DFSes under one umbrella. Monarch pioneers push-button fuzzing for DFSes with a new set of building blocks for the fuzzing toolbox: 1) a multi-node fuzzing architecture for testing diverse DFSes from a holistic perspective; 2) a two-step mutator for testing DFSes with syscalls and faults; 3) practical execution-state representations with a unified coverage-collection scheme across execution contexts; and 4) SYMSC, a new semantic checker for DFSes. We applied Monarch to six DFSes and uncovered a total of 48 bugs, including a bug whose existence can be traced back to the affected file system's initial release.
@inproceedings{tao:monarch, title = {{Monarch: A Fuzzing Framework for Distributed File Systems}}, author = {Tao Lyu and Liyi Zhang and Zhiyao Feng and Yueyang Pan and Yujie Ren and Meng Xu and Mathias Payer and Sanidhya Kashyap}, booktitle = {Proceedings of the 2024 USENIX Annual Technical Conference (ATC)}, month = jul, year = 2024, address = {Santa Clara, CA}, }
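For intuition, here is a minimal sketch, with invented types and a trivially random strategy, of what a two-step mutation could look like: one step mutates the syscall sequence, the other schedules a distributed fault (a node failure) at some point in that sequence. Monarch's actual mutator is considerably richer.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical simplification of what a DFS fuzzer might track. */
    enum op { OP_OPEN, OP_WRITE, OP_FSYNC, OP_RENAME, OP_MAX };

    struct testcase {
        enum op seq[16];
        int     len;
        int     fault_at;   /* inject a fault before this op, -1 = none */
        int     fault_node; /* which DFS node to fail */
    };

    static void mutate(struct testcase *tc, int nr_nodes)
    {
        /* Step 1: syscall mutation, here a single random substitution. */
        tc->seq[rand() % tc->len] = (enum op)(rand() % OP_MAX);
        /* Step 2: fault mutation, picking a point and a victim node. */
        tc->fault_at   = rand() % (tc->len + 1) - 1;
        tc->fault_node = rand() % nr_nodes;
    }

    int main(void)
    {
        struct testcase tc = { { OP_OPEN, OP_WRITE, OP_FSYNC }, 3, -1, 0 };
        mutate(&tc, 3);
        printf("fault before op %d on node %d\n", tc.fault_at, tc.fault_node);
        return 0;
    }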
@inproceedings{yan:nfos, title = {{Transparent Scaling of Single-Threaded Software Network Functions}}, author = {Lei Yan and Yueyang Pan and Diyu Zhou and George Candea and Sanidhya Kashyap}, booktitle = {Proceedings of the 19th European Conference on Computer Systems (EuroSys)}, month = apr, year = 2024, address = {Athens, Greece}, }
2023
Userspace library file systems (LibFSes) promise to unleash the performance potential of non-volatile memory (NVM) by directly accessing it and enabling unprivileged applications to customize their LibFSes to their workloads. Unfortunately, such benefits pose a significant challenge to ensuring metadata integrity. Existing works either underutilize NVM’s performance or forgo critical file system security guarantees.
We present Trio, a userspace NVM file system architecture that resolves this inherent tension with a clean decoupling among file system design, access control, and metadata integrity enforcement. Our key insight is that other state (i.e., auxiliary state) in a file system can be regenerated from its "ground truth" state (i.e., core state). Thus, Trio explicitly defines the data structure of a single core state and shares it as common knowledge among its LibFSes and the trusted entity. Enabled by this, a LibFS can directly access NVM without involving the trusted entity and can be customized with its private auxiliary state. The trusted entity enforces metadata integrity by verifying the core state of a file when its write access is transferred from one LibFS to another. We design a generic POSIX-like file system called ArckFS and two customized file systems based on the Trio architecture. Our evaluation shows that ArckFS outperforms existing NVM file systems by 3.1× to 17× on LevelDB while the customized file systems further outperform ArckFS by 1.3×.
@inproceedings{zhou:trio, title = {{Enabling High-Performance and Secure Userspace NVM File Systems with the Trio Architecture}}, author = {Diyu Zhou and Vojtech Aschenbrenner and Tao Lyu and Jian Zhang and Sudarsun Kannan and Sanidhya Kashyap}, booktitle = {Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2023, address = {Koblenz, Germany}, }
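A toy illustration of the core/auxiliary split behind Trio, with invented field names: the core state is the single shared source of truth, while auxiliary state is private to a LibFS and regenerable, so only the core state needs verification when write access moves between LibFSes.

    #include <inttypes.h>
    #include <stdio.h>

    #define MAX_EXTENTS 8

    /* "Ground truth" per-file state: the only structure whose layout is
     * shared between LibFSes and the trusted entity, and the only thing
     * verified when write ownership is transferred. */
    struct core_inode {
        uint64_t size;
        uint64_t extent_start[MAX_EXTENTS];
        uint64_t extent_len[MAX_EXTENTS];
        int      nr_extents;
    };

    /* Private, per-LibFS auxiliary state: never verified, because it can
     * always be rebuilt from the core state (here, a lookup hint). */
    struct aux_index {
        uint64_t first_block;
    };

    static void rebuild_aux(const struct core_inode *c, struct aux_index *a)
    {
        a->first_block = c->nr_extents ? c->extent_start[0] : 0;
    }

    int main(void)
    {
        struct core_inode ci = { 4096, { 100 }, { 8 }, 1 };
        struct aux_index ai;
        rebuild_aux(&ci, &ai);    /* regenerate auxiliary from core */
        printf("first block: %" PRIu64 "\n", ai.first_block);
        return 0;
    }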
Today's high-performance applications rely heavily on various synchronization mechanisms, such as locks. While locks ensure mutual exclusion of shared data, their design impacts application scalability. Locks, as used in practice, move the lock-guarded shared data to the core holding the lock, which leads to shared-data transfer among cores. This design adds unavoidable critical-path latency, leading to performance scalability issues. Meanwhile, some locks avoid this shared-data movement by localizing access to the shared data on one core and shipping the critical section to that specific core. However, such locks require modifying applications to explicitly package the critical section, which makes them virtually infeasible for complicated applications with large code bases, such as the Linux kernel.
We propose transparent delegation, in which a waiter automatically encodes its critical section information on its stack and notifies the combiner (lock holder). The combiner executes the shipped critical section on the waiter's behalf using a lightweight context switch. Using transparent delegation, we design a family of locking protocols, called TCLocks, that requires zero modification to applications' logic. The evaluation shows that TCLocks provide up to 5.2x performance improvement compared with recent locking algorithms.
@inproceedings{gupta:tclocks, title = {{Ship your Critical Section, Not Your Data: Enabling Transparent Delegation with TCLocks}}, author = {Vishal Gupta and Kumar Kartikeya Dwivedi and Yugesh Kothari and Yueyang Pan and Diyu Zhou and Sanidhya Kashyap}, booktitle = {Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI)}, month = jul, year = 2023, address = {Boston, MA}, }
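The sketch below shows the classic delegation pattern that TCLocks builds on, in userspace C with an explicit closure; TCLocks' contribution is making this step transparent by capturing the waiter's stack and register context instead of requiring hand-written request structures.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Each waiter publishes its critical section; the combiner executes
     * pending sections on the waiters' behalf. */
    struct request {
        void (*cs)(void *);                /* critical section */
        void *arg;
        atomic_int done;
    };

    #define NREQ 4
    static struct request *_Atomic slots[NREQ];

    static long counter;                   /* lock-guarded shared data */
    static void inc(void *arg) { counter += (long)arg; }

    static void combiner_pass(void)
    {
        for (int i = 0; i < NREQ; i++) {
            struct request *r = atomic_exchange(&slots[i], NULL);
            if (r) {
                r->cs(r->arg);             /* run CS on waiter's behalf */
                atomic_store(&r->done, 1); /* notify the waiter */
            }
        }
    }

    int main(void)
    {
        struct request r = { inc, (void *)5, 0 };
        atomic_store(&slots[0], &r);       /* waiter ships its CS */
        combiner_pass();                   /* combiner executes it */
        printf("%ld\n", counter);          /* prints 5 */
        return 0;
    }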
Data-intensive systems are the backbone of today’s computing and are responsible for shaping data centers. Over the years, cloud providers have relied on three principles to maintain cost-effective data systems: use disaggregation to decouple scaling, use domain-specific computing to battle waning laws, and use serverless to lower costs. Although they work well individually, they fail to work in harmony: an issue amplified by emerging data system workloads.
In this paper, we envision a distributed runtime to mitigate current shortcomings. The distributed runtime has a tiered access layer exposing declarative APIs, underpinned by a stateful serverless runtime with a distributed task execution model. It will be the narrow waist between data systems and hardware. Users are oblivious to data location, concurrency, disaggregation style, or even the hardware to do the computing. The underlying stateful serverless runtime transparently evolves with novel data-center architectures, such as disaggregation and tightly-coupled clusters. We prototype Skadi to showcase that the distributed runtime is practical.
@inproceedings{hu:skadi, title = {{Skadi: Building a Distributed Runtime for Data Systems in Disaggregated Data Centers}}, author = {Cunchen Hu and Chenxi Wang and Sa Wang and Ninghui Sun and Yungang Bao and Jieru Zhao and Sanidhya Kashyap and Xiaoyang Deng and Pengfei Zuo and Rongfeng He and Xushen Chen and Liangliang Xu and Qin Zhang and Hao Feng and Yizhou Shan}, booktitle = {Proceedings of the 19th USENIX Workshop on Hot Topics in Operating Systems (HotOS XIX)}, month = jun, year = 2023, address = {Providence, RI}, }
Byte-addressable non-volatile memory (NVM) allows programs to directly access storage through a memory interface, without going through the expensive conventional storage stack. However, direct access to NVM makes NVM data vulnerable to software bugs and hardware errors. This issue is critical because, unlike DRAM, corrupted data can persist forever, even after a system restart. Despite the plethora of research on NVM programs and systems, little attention has been paid to protecting NVM data from software bugs and hardware errors.
In this paper, we propose TENET, a new NVM programming framework that guarantees memory safety and fault tolerance to protect NVM data against software bugs and hardware errors. TENET provides the popular persistent transactional memory (PTM) programming model and leverages the concurrency guarantees (i.e., ACID properties) of PTM to provide performant and cost-efficient memory safety and fault tolerance. Our evaluations show that TENET offers an enhanced protection scope at a modest performance overhead and storage cost compared to other PTMs with partial or no memory safety and fault tolerance support.
@inproceedings{krishnan:tenet, title = {{TENET: Memory Safe and Fault Tolerant Persistent Transactional Memory}}, author = {R. Madhava Krishnan and Diyu Zhou and Wook-Hee Kim and Sudarsun Kannan and Sanidhya Kashyap and Changwoo Min}, booktitle = {Proceedings of the 21st USENIX Conference on File and Storage Technologies (FAST)}, month = feb, year = 2023, address = {Santa Clara, CA}, }
2022
Existing file systems for persistent memory (PM) exploit its byte-addressable, non-volatile access with low latency and high bandwidth. However, they do not effectively exploit two unique PM properties. The first is contention awareness: a small number of threads cannot fully saturate PM bandwidth, while many concurrent accesses lead to significant PM performance degradation. The second is NUMA awareness: accessing remote PM naively leads to significant performance degradation.
We present Odinfs, a NUMA-aware, scalable-datapath PM file system that addresses these two challenges using a novel opportunistic delegation scheme. Under this scheme, Odinfs decouples PM accesses from application threads with the help of background threads that access PM on the application's behalf. Because of this decoupling, Odinfs automatically parallelizes PM access across NUMA nodes in a controlled and localized manner. Our evaluation shows that Odinfs outperforms existing PM file systems by up to 32.7× on real-world workloads.
@inproceedings{zhou:odinfs, title = {{Odinfs: Scaling PM performance with Opportunistic Delegation}}, author = {Diyu Zhou and Yuchen Qian and Vishal Gupta and Zhifei Yang and Changwoo Min and Sanidhya Kashyap}, booktitle = {Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)}, month = jul, year = 2022, address = {Carlsbad, CA}, }
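The following userspace sketch (a one-slot queue, invented names) models the delegation scheme: an application thread publishes a write request and a background worker, which in Odinfs would be pinned near the PM it serves, performs the actual PM access, bounding the number of threads that touch each device.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    struct dreq {                      /* one delegated PM write */
        const void *src;
        size_t      off, len;
        int         ready, done;
    };

    static char pm[1 << 20];           /* stand-in for a PM region */
    static struct dreq req;            /* one-slot "ring" for brevity */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    static void *delegate(void *arg)   /* per-NUMA-node worker */
    {
        (void)arg;
        pthread_mutex_lock(&m);
        while (!req.ready)
            pthread_cond_wait(&cv, &m);
        memcpy(pm + req.off, req.src, req.len);  /* PM access happens here */
        req.done = 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, delegate, NULL);
        pthread_mutex_lock(&m);
        req = (struct dreq){ "hello", 0, 5, 1, 0 };  /* publish request */
        pthread_cond_broadcast(&cv);
        while (!req.done)                             /* wait for worker */
            pthread_cond_wait(&cv, &m);
        pthread_mutex_unlock(&m);
        pthread_join(t, NULL);
        printf("%.5s\n", pm);
        return 0;
    }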
Kernel synchronization primitives are the backbone of any OS design. Kernel locks, for instance, are crucial for both application performance and correctness. However, unlike application locks, kernel locks are far from the reach of application developers, who have little insight into the kernel's behavior and cannot control or influence the policies that govern kernel synchronization. This disconnect between the kernel and applications can lead to pathological scenarios in which optimizing kernel synchronization primitives for one context, such as high contention, leads to adversarial effects in a context with no lock contention. In addition, rapidly evolving heterogeneous hardware makes kernel lock development too slow for modern applications with stringent performance requirements and frequent deployment timelines.
This paper addresses the above issues with application-informed kernel synchronization primitives. We allow application developers to deploy workload-specific and hardware-aware kernel lock policies to boost application performance, resolve pathological usage of kernel locks, and even enable dynamic profiling of locks of interest. To showcase this idea, we design SynCord, a framework for modifying kernel locks without recompiling or rebooting the kernel. SynCord abstracts key behaviors of kernel locks and exposes them as APIs for designing user-defined kernel locks. SynCord provides the mechanisms to customize kernel locks safely and correctly from user space. We design five lock policies specialized for new heterogeneous hardware and specific software requirements. Our evaluation shows that SynCord incurs minimal runtime overhead and generates kernel locks with performance comparable to that of state-of-the-art locks.
@inproceedings{park:syncord, title = {{Application-Informed Kernel Synchronization Primitives}}, author = {Sujin Park and Diyu Zhou and Yuchen Qian and Irina Calciu and Taesoo Kim and Sanidhya Kashyap}, booktitle = {Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)}, month = jul, year = 2022, address = {Carlsbad, CA}, }
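To make the idea concrete, here is a hypothetical userspace skeleton of "user-defined code at lock hook points": a profiling policy is attached to acquire/release hooks without recompiling the lock itself. SynCord's real API, which injects verified user code into kernel locks, differs in its details; the hook names below are invented.

    #include <stdio.h>
    #include <time.h>

    typedef void (*hook_t)(void);
    static hook_t on_acquire, on_release;     /* pluggable hook points */

    static struct timespec t0;
    static long long total_ns, acquisitions;

    static void prof_acquire(void) { clock_gettime(CLOCK_MONOTONIC, &t0); }
    static void prof_release(void)
    {
        struct timespec t1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_ns += (t1.tv_sec - t0.tv_sec) * 1000000000LL
                  + (t1.tv_nsec - t0.tv_nsec);
        acquisitions++;
    }

    /* The lock calls the hooks if installed (actual locking elided). */
    static void lock(void)   { if (on_acquire) on_acquire(); }
    static void unlock(void) { if (on_release) on_release(); }

    int main(void)
    {
        on_acquire = prof_acquire;            /* "inject" the user policy */
        on_release = prof_release;
        lock();
        unlock();
        printf("avg hold: %lld ns over %lld acquisitions\n",
               acquisitions ? total_ns / acquisitions : 0, acquisitions);
        return 0;
    }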
2021
RDMA-capable networks are gaining traction in datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features, such as remote memory operations. However, efficiently utilizing RDMA capability in the common setting of a high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability, as connections do not scale with increasing cluster size. To address this, several works forgo some RDMA features and focus only on conventional RPC APIs.
In this work, we strive to exploit the full capability of RDMA while scaling the number of connections regardless of cluster size. We present Flock, a communication framework for RDMA networks that uses hardware-provided reliable connections. Using a partially shared model, Flock departs from conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements, contrary to the widely held belief that connection sharing deteriorates performance. At its core, Flock uses a connection-handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing while ensuring fair utilization of network connections. We demonstrate Flock's benefits on a distributed transaction processing system and an in-memory index, where it outperforms other RPC systems by up to 88% and 50%, respectively, with significant reductions in median and tail latency.
@inproceedings{monga:flock, title = {{Birds of a Feather Flock Together: Scaling RDMA RPCs with FLOCK}}, author = {Sumit Kumar Monga and Sanidhya Kashyap and Changwoo Min}, booktitle = {Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2021, address = {Koblenz, Germany}, }
Non-Volatile Memory (NVM), which provides relatively fast and byte-addressable persistence, is now commercially available. However, we cannot equate a real NVM with a slow DRAM, as it is much more complicated than we expect. In this work, we revisit and analyze both NVM and NVM-specific persistent memory indexes. We find that there is still a lot of room for improvement if we consider NVM hardware, its software stack, persistent index design, and concurrency control. Based on our analysis, we propose Packed Asynchronous Concurrency (PAC) guidelines for designing high-performance persistent index structures. The key idea behind the guidelines is to 1) access NVM hardware in a packed manner to minimize its bandwidth utilization and 2) exploit asynchronous concurrency control to decouple the long NVM latency from the critical path of the index.
We develop PACTree, a high-performance persistent range index following the PAC guidelines. PACTree is a hybrid index that employs a trie index for its internal nodes and B+-tree-like leaf nodes. The trie index structure packs partial keys in internal nodes. Moreover, we decouple the trie index and the B+-tree-like leaf nodes, which allows us to avoid blocking concurrent accesses by updating internal nodes asynchronously. Our evaluation shows that PACTree outperforms state-of-the-art persistent range indexes by 7x in performance and 20x in 99.99th-percentile tail latency.
@inproceedings{kim:pactree, title = {{PACTree: A High Performance Persistent Range Index Using PAC Guidelines}}, author = {Wook-Hee Kim and R. Madhava Krishnan and Xinwei Fu and Sanidhya Kashyap and Changwoo Min}, booktitle = {Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2021, address = {Koblenz, Germany}, }
Writing a correct operating system kernel is notoriously hard. Kernel code involves manual memory management and type-unsafe code, and it must efficiently handle complex, asynchronous events. In addition, increasing CPU core counts further complicate kernel development. Typically, monolithic kernels share state across cores and rely on one-off synchronization patterns that are specialized for each kernel structure or subsystem. Hence, kernel developers are constantly refining synchronization within OS kernels to improve scalability, at the risk of introducing subtle bugs.
We present NrOS, a new OS kernel with a safer approach to synchronization that runs many POSIX programs. NrOS is primarily constructed as a simple, sequential kernel with no concurrency, making it easier to develop and reason about its correctness. This kernel is scaled across NUMA nodes using node replication, a scheme inspired by state machine replication in distributed systems. NrOS replicates kernel state on each NUMA node and uses operation logs to maintain strong consistency between replicas. Cores can safely and concurrently read from their local kernel replica, eliminating remote NUMA accesses.
Our evaluation shows that NrOS scales to 96 cores with performance that nearly always dominates Linux at scale, in some cases by orders of magnitude, while retaining much of the simplicity of a sequential kernel.
@inproceedings{bhardwaj:nros, title = {{NrOS: Effective Replication and Sharing in an Operating System}}, author = {Ankit Bhardwaj and Chinmay Kulkarni and Reto Achermann and Irina Calciu and Sanidhya Kashyap and Ryan Stutsman and Amy Tai and Gerd Zellweger}, booktitle = {Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI)}, month = jul, year = 2021, address = {Virtual}, }
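A minimal single-threaded model of node replication, omitting NR's batching, flat combining, concurrency control, and log reclamation: updates append to a shared operation log, and each per-node replica replays the log before serving a read, so reads stay NUMA-local yet strongly consistent.

    #include <stdatomic.h>
    #include <stdio.h>

    #define LOG_SZ 1024
    enum { OP_ADD };                       /* our toy "kernel" has one op */

    struct entry { int op; int arg; };
    static struct entry log_[LOG_SZ];
    static atomic_int log_tail;

    struct replica { long state; int applied; };  /* one per NUMA node */

    static void apply(struct replica *r)
    {
        int tail = atomic_load(&log_tail);
        for (; r->applied < tail; r->applied++)
            if (log_[r->applied].op == OP_ADD)
                r->state += log_[r->applied].arg;  /* replay sequentially */
    }

    static void do_update(int arg)
    {
        int slot = atomic_fetch_add(&log_tail, 1); /* append to shared log */
        log_[slot] = (struct entry){ OP_ADD, arg };
    }

    static long do_read(struct replica *r)
    {
        apply(r);                          /* catch up, then read locally */
        return r->state;
    }

    int main(void)
    {
        struct replica node0 = {0, 0}, node1 = {0, 0};
        do_update(3);
        do_update(4);
        printf("%ld %ld\n", do_read(&node0), do_read(&node1)); /* 7 7 */
        return 0;
    }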
Memory-unsafe languages are widely used to implement critical systems like kernels and browsers, leading to thousands of memory safety issues every year. A use-after-free bug is a temporal memory error in which the program accidentally visits a freed memory location. Recent studies show that use-after-free is one of the most exploited memory vulnerabilities. Unfortunately, previous efforts to mitigate use-after-free bugs are not widely deployed in real-world programs due to either inadequate accuracy or high performance overhead.
In this paper, we propose to resurrect the idea of one-time allocation (OTA) and provide a practical implementation with efficient execution and moderate memory overhead. With one-time allocation, the memory manager always returns a distinct memory address for each request. Since memory locations are not reused, attackers cannot reclaim freed objects and thus cannot exploit use-after-free bugs. We utilize two techniques to render OTA practical: batch page management and the fusion of bump-pointer and fixed-size-bin memory allocation styles. Batch page management helps reduce the number of system calls, which negatively impact performance, while blending the two allocation methods mitigates the memory overhead and fragmentation issues. We implemented a prototype, called FFmalloc, to demonstrate our techniques. We evaluated FFmalloc on widely used benchmarks and real-world large programs. FFmalloc successfully blocked all tested use-after-free attacks while introducing moderate overhead. The results show that OTA can be a strong and practical solution to thwart use-after-free threats.
@inproceedings{wickman:ffmalloc, title = {{Preventing Use-After-Free Attacks with Fast Forward Allocation}}, author = {Brian Wickman and Hong Hu and Insu Yun and Daehee Jang and JungWon Lim and Sanidhya Kashyap and Taesoo Kim}, booktitle = {Proceedings of the 30th USENIX Security Symposium (Security)}, month = aug, year = 2021, address = {Vancouver, B.C., Canada}, }
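A stripped-down sketch of one-time allocation: addresses only move forward, so a dangling pointer can never alias a newly allocated object. FFmalloc layers fixed-size bins and batched page release on top of this; the constants and names here are illustrative.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static uint8_t *cursor, *end;

    static void *ota_malloc(size_t n)
    {
        n = (n + 15) & ~(size_t)15;        /* 16-byte alignment */
        if (!cursor || cursor + n > end) { /* grab a fresh batch of pages */
            size_t batch = 1 << 22;        /* 4 MiB: fewer mmap syscalls */
            cursor = mmap(NULL, batch, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (cursor == MAP_FAILED) return NULL;
            end = cursor + batch;
        }
        void *p = cursor;
        cursor += n;                       /* addresses only move forward */
        return p;
    }

    static void ota_free(void *p)
    {
        (void)p;  /* never reused; the real FFmalloc returns spans of
                     fully freed pages to the OS in batches */
    }

    int main(void)
    {
        void *a = ota_malloc(32);
        ota_free(a);
        void *b = ota_malloc(32);
        printf("distinct: %d\n", a != b);  /* prints 1: no address reuse */
        return 0;
    }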
Kernel synchronization primitives are of paramount importance to achieving good performance and scalability for applications. However, they are usually invisible and out of the reach of application developers. Instead, kernel developers and synchronization experts make all the decisions regarding kernel lock design.
In this paper, we propose contextual concurrency control (C3), a new paradigm that enables userspace applications to tune concurrency control in the kernel. C3 allows developers to change the behavior and parameters of kernel locks, to switch between different lock implementations and to dynamically profile one or multiple locks for a specific scenario of interest.
To showcase this idea, we designed and implemented CONCORD, a framework that allows a privileged userspace process to modify kernel locks on the fly without re-compiling the existing code base. We performed a preliminary evaluation on two locks showing that CONCORD allows userspace tuning of kernel locks without incurring significant overhead.
@inproceedings{park:c3, title = {{Contextual Concurrency Control}}, author = {Sujin Park and Irina Calciu and Taesoo Kim and Sanidhya Kashyap}, booktitle = {18th USENIX Workshop on Hot Topics in Operating Systems (HotOS) (HotOS XVIII)}, month = may, year = 2021, address = {Virtual}, }
Disaggregated memory can address resource provisioning inefficiencies in current datacenters. Multiple software runtimes for disaggregated memory have been proposed in an attempt to make disaggregated memory practical. These systems rely on the virtual memory subsystem to transparently offer disaggregated memory to applications using a local memory abstraction. Unfortunately, using virtual memory for disaggregation has multiple limitations, including high overhead that comes from the use of page faults to identify what data to fetch and cache locally, and high dirty data amplification that comes from the use of page-granularity for tracking changes to the cached data (4KB or higher). In this paper, we propose a fundamentally new approach to designing software runtimes for disaggregated memory that addresses these limitations. Our main observation is that we can use cache coherence instead of virtual memory for tracking applications’ memory accesses transparently, at cache-line granularity. This simple idea (1) eliminates page faults from the application critical path when accessing remote data, and (2) decouples the application memory access tracking from the virtual memory page size, enabling cache-line granularity dirty data tracking and eviction. Using this observation, we implemented a new software runtime for disaggregated memory that improves average memory access time by 1.7-5X and reduces dirty data amplification by 2-10X, compared to state-of-the-art systems.
@inproceedings{calciu:kona, title = {{Rethinking Software Runtimes for Disaggregated Memory}}, author = {Irina Calciu and M. Talha Imran and Ivan Puddu and Sanidhya Kashyap and Hasan Al Maruf and Onur Mutlu and Aasheesh Kolli}, booktitle = {Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)}, month = apr, year = 2021, address = {Virtual}, }
2020
Data races occur when two threads fail to use proper synchronization when accessing shared data. In kernel file systems, which are highly concurrent by design, data races are common mistakes that often wreak havoc on users, causing inconsistent states or data loss. Prior fuzzing practices on file systems have been effective in uncovering hundreds of bugs, but they mostly focus on the sequential aspect of file system execution and do not comprehensively explore the concurrency dimension, thereby forgoing the opportunity to catch data races.
In this paper, we bring coverage-guided fuzzing to the concurrency dimension with three new constructs: 1) a new coverage-tracking metric, alias coverage, specially designed to capture exploration progress in the concurrency dimension; 2) an evolution algorithm for generating, mutating, and merging multi-threaded syscall sequences as inputs for concurrency fuzzing; and 3) comprehensive lockset and happens-before modeling of kernel synchronization primitives for precise data race detection. These components are integrated into KRACE, an end-to-end fuzzing framework that has so far discovered 23 data races in ext4, btrfs, and the VFS layer, of which 9 are confirmed to be harmful.
@inproceedings{xu:krace, title = {{Krace: Data Race Fuzzing for Kernel File Systems}}, author = {Meng Xu and Sanidhya Kashyap and Hanqing Zhao and Taesoo Kim}, booktitle = {Proceedings of the 41st IEEE Symposium on Security and Privacy (Oakland)}, month = may, year = 2020, address = {San Francisco, CA}, }
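The alias-coverage idea can be sketched in a few lines (hash function and map size invented here): instead of branch edges, the feedback map records pairs of instruction addresses from different threads that touched the same memory location, so inputs that exercise new interleavings are kept.

    #include <stdint.h>
    #include <stdio.h>

    #define MAP_BITS (1 << 16)
    static uint8_t alias_map[MAP_BITS / 8];
    static long new_pairs;

    /* Called (conceptually from instrumentation) on each shared-memory
     * access: ip = current instruction, last_ip = the last conflicting
     * instruction on the same address from another thread. */
    static void alias_hit(uint64_t ip, uint64_t last_ip)
    {
        uint32_t h = (uint32_t)(ip * 0x9e3779b1u ^ last_ip) % MAP_BITS;
        if (!(alias_map[h / 8] & (1u << (h % 8)))) {
            alias_map[h / 8] |= 1u << (h % 8);
            new_pairs++;                 /* new interleaving: keep input */
        }
    }

    int main(void)
    {
        alias_hit(0x401000, 0x402000);   /* thread A write, then B read */
        alias_hit(0x401000, 0x402000);   /* same pair: no new coverage  */
        printf("new alias pairs: %ld\n", new_pairs);  /* prints 1 */
        return 0;
    }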
2019
File systems are too large to be bug free. Although handwritten test suites have been widely used to stress file systems, they can hardly keep up with the rapid increase in file system size and complexity, leading to new bugs being introduced and reported regularly. These bugs come in various flavors, ranging from simple buffer overflows to sophisticated semantic bugs. Although bug-specific checkers exist, they generally lack a way to explore file system states thoroughly. More importantly, no turnkey solution exists that unifies the checking effort of various aspects of a file system under one umbrella.
In this paper, we highlight the potential of applying fuzzing to find not just memory errors but, in theory, any type of file system bug, with an extensible fuzzing framework: Hydra. Hydra provides building blocks for file system fuzzing, including input mutators, feedback engines, a libOS-based executor, and a bug reproducer with test-case minimization. As a result, developers only need to focus on building the core logic for finding bugs of their own interest. We showcase the effectiveness of Hydra with four checkers that hunt crash inconsistency, POSIX violations, logic assertion failures, and memory errors. So far, Hydra has discovered 91 new bugs in Linux file systems, including one in a verified file system (FSCQ), as well as four POSIX violations.
@inproceedings{kim:hydra, title = {{Finding Semantic Bugs in File Systems with an Extensible Fuzzing Framework}}, author = {Seulbae Kim and Meng Xu and Sanidhya Kashyap and Jungyeon Yoon and Wen Xu and Taesoo Kim}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2019, address = {Ontario, Canada}, }
Locks are an essential building block for high-performance multicore system software. To meet performance goals, lock algorithms have evolved towards specialized solutions for architectural characteristics (e.g., NUMA). However, in practice, applications run on different server platforms and exhibit widely diverse behaviors that evolve with time (e.g., number of threads, number of locks). This creates performance and scalability problems for locks optimized for a single scenario and platform. For example, popular spinlocks suffer from excessive cache-line bouncing in NUMA systems, while scalable, NUMA-aware locks exhibit sub-par single-thread performance.
In this paper, we identify four dominating factors that impact the performance of lock algorithms. We then propose a new technique, shuffling, that can dynamically accommodate all these factors without slowing down the critical path of the lock. The key idea of shuffling is to re-order the queue of threads waiting to acquire the lock in accordance with some pre-established policy. For best performance, this work is done off the critical path, by the waiter threads. Using shuffling, we demonstrate how to achieve NUMA-awareness and implement an efficient parking/wake-up strategy, without any auxiliary data structure, mostly off the critical path. The evaluation shows that our family of locks based on shuffling improves the throughput of real-world applications by up to 12.5x, with impressive memory-footprint reduction compared with recent lock algorithms.
@inproceedings{kashyap:shfllock, title = {{Scalable and Practical Locking With Shuffling}}, author = {Sanidhya Kashyap and Irina Calciu and Xiaohe Cheng and Changwoo Min and Taesoo Kim}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2019, address = {Ontario, Canada}, }
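A simplified sketch of shuffling over a linked wait queue: a waiter partitions the queue so that same-socket waiters run back-to-back, preserving relative order otherwise. In ShflLocks this reordering is performed by waiting threads off the critical path; the policy shown (group by one NUMA node) is just one example.

    #include <stdio.h>

    struct qnode { int tid, numa_node; struct qnode *next; };

    /* Stable partition: waiters on 'node' first, everyone else after. */
    static struct qnode *shuffle(struct qnode *q, int node)
    {
        struct qnode *same = NULL, **st = &same;  /* same-socket list */
        struct qnode *rest = NULL, **rt = &rest;  /* all the others   */
        for (struct qnode *p = q, *nx; p; p = nx) {
            nx = p->next;
            if (p->numa_node == node) { *st = p; st = &p->next; }
            else                      { *rt = p; rt = &p->next; }
        }
        *st = rest;                               /* splice: same first */
        *rt = NULL;
        return same;
    }

    int main(void)
    {
        struct qnode c = {3, 0, NULL}, b = {2, 1, &c}, a = {1, 0, &b};
        for (struct qnode *p = shuffle(&a, 0); p; p = p->next)
            printf("tid %d (node %d)\n", p->tid, p->numa_node);
        return 0;
    }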
We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM). The main insight behind Recipe is that the isolation provided by a certain class of concurrent in-memory indexes can be translated, with small changes, to crash consistency when the same index is used in PM. We present a set of conditions that enable the identification of this class of DRAM indexes, and the actions to be taken to convert each index to be persistent. Based on these conditions and conversion actions, we modify five different DRAM indexes based on B+ trees, tries, radix trees, and hash tables into their crash-consistent PM counterparts. The effort involved in this conversion is minimal, requiring 30--200 lines of code. We evaluated the converted PM indexes on Intel DC Persistent Memory and found that they outperform state-of-the-art, hand-crafted PM indexes in multi-threaded workloads by up to 5.2x. For example, we built P-CLHT, our PM implementation of the CLHT hash table, by modifying only 30 LOC. When running YCSB workloads, P-CLHT performs up to 2.4x better than Cacheline-Conscious Extendible Hashing (CCEH), the state-of-the-art PM hash table.
@inproceedings{lee:recipe, title = {{RECIPE: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes}}, author = {Se Kwon Lee and Jayashree Mohan and Sanidhya Kashyap and Taesoo Kim and Vijay Chidambaram}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2019, address = {Ontario, Canada}, }
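The flavor of such a conversion can be shown on a lock-free list insert; this is not the paper's code, and the flush primitive is a stand-in. The new node is persisted before it is published with an atomic store, and the publish itself is then persisted; Recipe's conditions characterize when localized changes like this suffice.

    #include <stdatomic.h>
    #include <stdint.h>

    static inline void pm_flush(const void *p)
    {
    #if defined(__x86_64__)
        __builtin_ia32_clflush(p);      /* clwb/clflushopt on real PM */
    #else
        (void)p;
    #endif
    }
    static inline void pm_fence(void)
    {
        atomic_thread_fence(memory_order_seq_cst);  /* sfence stand-in */
    }

    struct node { uint64_t key, val; _Atomic(struct node *) next; };

    static void insert(_Atomic(struct node *) *head, struct node *n)
    {
        n->next = atomic_load(head);
        pm_flush(n);  pm_fence();       /* persist node before publish */
        atomic_store(head, n);          /* 8-byte atomic commit point  */
        pm_flush(head); pm_fence();     /* persist the publish itself  */
    }

    int main(void)
    {
        static _Atomic(struct node *) head;  /* pretend this lives in PM */
        static struct node n = { 1, 42, NULL };
        insert(&head, &n);
        return 0;
    }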
We present SplitFS, a file system for persistent memory (PM) that reduces software overhead significantly compared to state-of-the-art PM file systems. SplitFS presents a novel split of responsibilities between a user-space library file system and an existing kernel PM file system. The user-space library file system handles data operations by intercepting POSIX calls, memory-mapping the underlying file, and serving reads and overwrites using processor loads and stores. Metadata operations are handled by the kernel PM file system (ext4 DAX). SplitFS introduces a new primitive termed relink to efficiently support file appends and atomic data operations. SplitFS provides three consistency modes, which different applications can choose from without interfering with each other. SplitFS reduces software overhead by up to 4x compared to the NOVA PM file system, and by 17x compared to ext4 DAX. On a number of micro-benchmarks and applications such as the LevelDB key-value store running the YCSB benchmark, SplitFS increases application performance by up to 2x compared to ext4 DAX and NOVA, while providing similar consistency guarantees.
@inproceedings{kadekodi:splitfs, title = {{SplitFS: Reducing Software Overhead in File Systems for Persistent Memory}}, author = {Rohan Kadekodi and Se Kwon Lee and Sanidhya Kashyap and Taesoo Kim and Aasheesh Kolli and Vijay Chidambaram}, booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2019, address = {Ontario, Canada}, }
File systems, a basic building block of an OS, are too big and too complex to be bug free. Nevertheless, file systems rely on regular stress-testing tools and formal checkers to find bugs, which are limited due to the ever-increasing complexity of both file systems and OSes. Thus, fuzzing, proven to be an effective and practical approach, becomes a preferable choice, as it does not need much knowledge about the target. However, three main challenges exist in fuzzing file systems: mutating a large image blob without degrading overall performance, generating image-dependent file operations, and reproducing found bugs, all of which are difficult for existing OS fuzzers.
Hence, we present JANUS, the first feedback-driven fuzzer that explores the two-dimensional input space of a file system, i.e., mutating metadata on a large image while emitting image-directed file operations. In addition, JANUS relies on a library OS rather than traditional VMs for fuzzing, which enables JANUS to load a fresh copy of the OS, thereby leading to better reproducibility of bugs. We evaluated JANUS on eight file systems and found 90 bugs in the upstream Linux kernel, 62 of which have been acknowledged. Forty-three bugs have been fixed, with 32 CVEs assigned. In addition, JANUS achieves higher code coverage on all the file systems after fuzzing for 12 hours when compared with the state-of-the-art fuzzer Syzkaller. JANUS visits 4.19× and 2.01× more code paths in Btrfs and ext4, respectively. Moreover, JANUS is able to reproduce 88–100% of the crashes, while Syzkaller fails on all of them.
@inproceedings{xu:janus, title = {{Fuzzing File Systems via Two-Dimensional Input Space Exploration}}, author = {Wen Xu and Hyungon Moon and Sanidhya Kashyap and Po-Ning Tseng and Taesoo Kim}, booktitle = {Proceedings of the 40th IEEE Symposium on Security and Privacy (Oakland)}, month = may, year = 2019, address = {San Francisco, CA}, }
This paper presents multi-version read-log-update (MV-RLU), an extension of the read-log-update (RLU) synchronization mechanism. While RLU has many merits, including an intuitive programming model and excellent performance for read-mostly workloads, we observed that the performance of RLU drops significantly in workloads with more write operations. The core problem is that RLU manages only two versions. To overcome this limitation, we extend RLU to support multi-versioning and propose new techniques to make multi-versioning efficient. At the core of the MV-RLU design is concurrent autonomous garbage collection, which prevents the reclamation of invisible versions from becoming a bottleneck and reduces the version-traversal overhead, the main overhead of a multi-version design. We extensively evaluate MV-RLU against state-of-the-art synchronization mechanisms, including RCU, RLU, software transactional memory (STM), and lock-free approaches, on concurrent data structures and real-world applications (database concurrency control and an in-memory key-value store). Our evaluation shows that MV-RLU significantly outperforms other techniques for a wide range of workloads with varying contention levels and dataset sizes.
@inproceedings{kim:mvrlu, title = {{MV-RLU: Scaling Read-Log-Update with Multi-Versioning}}, author = {Jaeho Kim and Ajit Mathew and Sanidhya Kashyap and Madhava Krishnan Ramanathan and Changwoo Min}, booktitle = {Proceedings of the 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)}, month = apr, year = 2019, address = {Providence, RI}, }
2018
Multi-core virtual machines (VMs) are now the norm in data center environments. However, one of the well-known problems that VMs suffer from is the vCPU scheduling problem, which causes poor scalability. More specifically, the symptoms of this problem appear as preemption problems in both under- and over-committed scenarios. Although prior research efforts attempted to alleviate these symptoms separately, they fail to address the common root cause: the semantic gap that arises when a guest OS is preempted while executing its own critical section, thereby degrading application scalability.
In this work, we strive to address all preemption problems together by bridging the semantic gap between guest OSes and the hypervisor: the hypervisor now knows whether guest OSes are running in critical sections, and a guest OS has the hypervisor's scheduling context. We annotate critical sections using lightweight para-virtualized APIs, which we call enlightened critical sections (eCS), that provide scheduling hints to both the hypervisor and VMs. The hypervisor uses the hint to reschedule a vCPU to fundamentally overcome the double-scheduling problem for these annotated critical sections, and VMs use the hypervisor-provided hints to further mitigate the blocked-waiter wake-up problem. Our evaluation results show that eCS guarantees the forward progress of a guest OS by 1) decreasing preemption counts by 85--100% while 2) improving the throughput of applications by up to 2.5X in an over-committed scenario and 1.6X in an under-committed scenario for various real-world workloads on an 80-core machine.
@inproceedings{kashyap:ecs, title = {{Scaling Guest OS Critical Sections with eCS}}, author = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim}, booktitle = {Proceedings of the 2018 USENIX Annual Technical Conference (ATC)}, month = jul, year = 2018, address = {Boston, MA}, }
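A userspace cartoon of the hint that eCS communicates, with invented names: the guest bumps a per-vCPU counter on a memory region shared with the hypervisor around critical sections, and the scheduler consults it to defer preemption. The real mechanism uses para-virtualized APIs and also shares the hypervisor's scheduling context back with the guest.

    #include <stdatomic.h>
    #include <stdio.h>

    static struct vcpu_hint {          /* one shared line per vCPU */
        atomic_int cs_depth;           /* >0: inside a critical section */
    } hints[64];

    static void ecs_enter(int vcpu) { atomic_fetch_add(&hints[vcpu].cs_depth, 1); }
    static void ecs_exit(int vcpu)  { atomic_fetch_sub(&hints[vcpu].cs_depth, 1); }

    /* Hypervisor side: give one more tick instead of preempting a vCPU
     * that announced it is inside a critical section. */
    static int should_preempt(int vcpu)
    {
        return atomic_load(&hints[vcpu].cs_depth) == 0;
    }

    int main(void)
    {
        ecs_enter(3);
        printf("preempt vcpu3? %d\n", should_preempt(3)); /* 0: defer */
        ecs_exit(3);
        printf("preempt vcpu3? %d\n", should_preempt(3)); /* 1: ok    */
        return 0;
    }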
We propose Solros, a new operating system architecture for heterogeneous systems that comprise fast host processors, slow but massively parallel co-processors, and fast I/O devices. There is a general consensus that fully driving such a hardware system requires tight integration among processors and I/O devices. Thus, in the Solros architecture, a co-processor OS (data-plane OS) delegates its services, specifically I/O stacks, to the host OS (control-plane OS). Our observation behind such a design is that global coordination with system-wide knowledge (e.g., PCIe topology, the load on each co-processor) and the best use of heterogeneous processors are critical to achieving high performance. Hence, we fully harness these specialized processors by delegating complex I/O stacks to fast host processors, which leads to efficient global coordination at the level of the control-plane OS.
We developed Solros with Xeon Phi co-processors and implemented three core OS services: transport, file system, and network services. Our experimental results show significant performance improvement compared with the stock Xeon Phi running the Linux kernel. For example, Solros improves the throughput of file system and network operations by 19x and 7x, respectively. Moreover, it improves the performance of two realistic applications: 19x for text indexing and 2x for image search.
@inproceedings{min:solros, title = {{SOLROS: A Data-Centric Operating System Architecture for Heterogeneous Computing}}, author = {Changwoo Min and Woon-Hak Kang and Mohan Kumar and Sanidhya Kashyap and Steffen Maass and Heeseung Jo and Taesoo Kim}, booktitle = {Proceedings of the 13th European Conference on Computer Systems (EuroSys)}, month = apr, year = 2018, address = {Porto, Portugal}, }
Timestamping is an essential building block for designing concurrency control mechanisms and concurrent data structures. Various algorithms either employ physical timestamping, assuming that they have access to synchronized clocks, or maintain a logical clock with the help of atomic instructions. Unfortunately, these approaches have two problems. First, hardware developers do not guarantee that the available hardware clocks are exactly synchronized, which they find difficult to achieve in practice. Second, atomic instructions are a deterrent to scalability because of cache-line contention. This paper addresses these problems by proposing and designing a scalable ordering primitive, called Ordo, that relies on invariant hardware clocks. Ordo not only enables the correct use of these clocks by providing the notion of a global hardware clock, but also frees various logical-timestamp-based algorithms from the burden of a software logical clock, while simplifying their design. We use the Ordo primitive to redesign 1) a concurrent data structure library that we apply to the Linux kernel; 2) a synchronization mechanism for concurrent programming; 3) two database concurrency control mechanisms; and 4) a clock-based software transactional memory algorithm. Our evaluation shows that clocks can indeed be unsynchronized on two architectures (Intel and ARM) and that Ordo generally improves the efficiency of several algorithms by 1.2-39.7x on various architectures.
@inproceedings{kashyap:ordo, title = {{A Scalable Ordering Primitive For Multicore Machines}}, author = {Sanidhya Kashyap and Changwoo Min and Kangnyeon Kim and Taesoo Kim}, booktitle = {Proceedings of the 13th European Conference on Computer Systems (EuroSys)}, month = apr, year = 2018, address = {Porto, Portugal}, }
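On x86, the primitive can be sketched as follows; ORDO_BOUNDARY is a machine-specific uncertainty bound that Ordo measures empirically, and the value below is made up. Timestamps taken from invariant per-core clocks are considered ordered only when they differ by more than that bound.

    #include <stdint.h>
    #include <stdio.h>

    #define ORDO_BOUNDARY 224        /* cycles; machine-specific, measured */

    static inline uint64_t ordo_clock(void)
    {
    #if defined(__x86_64__)
        uint32_t lo, hi, aux;
        __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
        (void)aux;
        return ((uint64_t)hi << 32) | lo;   /* invariant per-core clock */
    #else
        return 0;                    /* ARM would read cntvct_el0 instead */
    #endif
    }

    /* Compare two timestamps from different cores: 1 if a is definitely
     * before b, -1 if definitely after, 0 if within the uncertainty. */
    static int ordo_cmp(uint64_t a, uint64_t b)
    {
        if (a + ORDO_BOUNDARY < b) return 1;
        if (b + ORDO_BOUNDARY < a) return -1;
        return 0;
    }

    int main(void)
    {
        uint64_t t1 = ordo_clock(), t2 = ordo_clock();
        printf("ordered: %d\n", ordo_cmp(t1, t2));
        return 0;
    }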
We propose LATR (lazy TLB coherence), a software-based TLB shootdown mechanism that can alleviate the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling TLB coherence in a lazy fashion, LATR avoids the expensive IPIs required to deliver a shootdown signal to remote cores, as well as the performance overhead of the associated interrupt handlers. Therefore, virtual memory operations, such as free and page-migration operations, can benefit significantly from LATR's mechanism. For example, LATR improves the latency of munmap by 70.8% on a 2-socket machine, a widely used configuration in modern data centers. Real-world, performance-critical applications such as web servers can also benefit from LATR: without any application-level changes, LATR improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.
@inproceedings{kumar:latr, title = {{LATR: Lazy Translation Coherence}}, author = {Mohan Kumar and Steffen Maass and Sanidhya Kashyap and J\'{a}n Vesel\'{y} and Zi Yan and Taesoo Kim and Abhishek Bhattacharjee and Tushar Krishna}, booktitle = {Proceedings of the 23rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)}, month = mar, year = 2018, address = {Williamsburg, VA}, }
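A userspace model of the lazy scheme, with invented names and a single-slot mailbox where the real design tracks lists of pending invalidations: unmap publishes the stale page into per-core mailboxes and returns immediately, and each core applies pending invalidations at its next scheduling boundary, replacing the synchronous IPI round-trip.

    #include <stdatomic.h>
    #include <stdio.h>

    #define NCPU 4
    struct lazy_entry { atomic_ulong page; };     /* 0 = empty */
    static struct lazy_entry mailbox[NCPU];

    static void lazy_unmap(unsigned long page)
    {
        for (int c = 0; c < NCPU; c++)            /* no IPI, just publish */
            atomic_store(&mailbox[c].page, page);
        /* the VA may be reused only after all cores sweep the entry */
    }

    static void local_tlb_flush(unsigned long page)
    {
        printf("cpu flushes page %#lx\n", page);  /* invlpg stand-in */
    }

    static void sched_tick(int cpu)               /* context switch/timer */
    {
        unsigned long p = atomic_exchange(&mailbox[cpu].page, 0);
        if (p)
            local_tlb_flush(p);
    }

    int main(void)
    {
        lazy_unmap(0x7f0000);
        for (int c = 0; c < NCPU; c++)
            sched_tick(c);                        /* lazy, batched flushes */
        return 0;
    }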
2017
In recent years, various organizations and communities have been putting numerous computing resources into automated fuzzing, which has proven to be a highly efficient approach to finding security bugs in complicated software and OS kernels. Thus, the performance of a fuzzer becomes crucial, and a fuzzer that needs less running time to hit more security issues saves significant cost. Existing research focuses on producing input data that is likely to explore more states of a targeted application while ignoring the performance overhead that originates from the operating system side. In fact, the system components that generic fuzzers rely on cause serious performance bottlenecks. When fuzzing on multiple cores in particular, the state-of-the-art fuzzer (AFL) slows down by 24x because it over-exploits the file system as a communication channel, intensively invokes the fork() system call, and interacts heavily with the file system.
In this paper, we design and implement three new operating primitives specialized for fuzzing that solve these performance bottlenecks and achieve scalable performance on multi-core machines. Our experiments show that the proposed primitives speed up AFL and LibFuzzer by 6.1 to 28.9x and 1.1 to 735.7x, respectively, in the overall number of executions per second when targeting Google's fuzzer test suite with 120 cores. In addition, the primitives improve AFL's throughput by up to 7.7x with 30 cores, which is a more common setting in data centers. Our fuzzer-agnostic primitives can easily be applied to any fuzzer, with fundamental performance improvements, and directly benefit large-scale fuzzing and cloud-based fuzzing services.
@inproceedings{xu:os-fuzz, title = {{Designing New Operating Primitives to Improve Fuzzing Performance}}, author = {Wen Xu and Sanidhya Kashyap and Changwoo Min and Taesoo Kim}, booktitle = {Proceedings of the 24th ACM Conference on Computer and Communications Security (CCS)}, month = oct, year = 2017, address = {Dallas, TX}, }
Application scalability is critical to efficiently using NUMA machines with many cores. To achieve that, various techniques ranging from task placement to data sharding are used in practice. However, from the perspective of an operating system, these techniques often do not work as expected because various subsystems in the OS interact and share data structures among themselves, resulting in scalability bottlenecks. Although current OSes attempt to tackle this problem by introducing a wide range of synchronization primitives, such as spinlocks and mutexes, the widely used synchronization mechanisms are not designed to handle both under- and over-subscribed scenarios in a scalable fashion. In particular, the current blocking synchronization primitives that are designed to address both scenarios are NUMA-oblivious, meaning that they suffer from cache-line contention in an under-subscribed situation and, even worse, inherently incur long scheduler intervention, which leads to sub-optimal performance in an over-subscribed situation.
In this work, we present several design choices for implementing scalable blocking synchronization primitives that can address both under- and over-subscribed scenarios. These design decisions include memory-efficient NUMA-aware locks (favorable for deployment) and scheduling-aware, scalable parking and wake-up strategies. To validate our design choices, we implement two new blocking synchronization primitives, which are variants of mutex and read-write semaphore, in the Linux kernel. Our evaluation shows that these locks can scale real-world applications by 1.2--1.6X and some file system operations by up to 4.7X in both under- and over-subscribed scenarios. Moreover, they use 1.5--10X less memory than state-of-the-art NUMA-aware locks on a 120-core machine.
@inproceedings{kashyap:cst, title = {{Scalable NUMA-aware Blocking Synchronization Primitives}}, author = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim}, booktitle = {Proceedings of the 2017 USENIX Annual Technical Conference (ATC)}, month = jul, year = 2017, address = {Santa Clara, CA}, }
Processing a one trillion-edge graph has recently been demonstrated by distributed graph engines running on clusters of tens to hundreds of nodes. In this paper, we employ a single heterogeneous machine with fast storage media (e.g., NVMe SSD) and massively parallel coprocessors (e.g., Xeon Phi) to reach similar dimensions. By fully exploiting the heterogeneous devices, we design a new graph processing engine, named MOSAIC, for a single machine. We propose a new locality-optimizing, space-efficient graph representation - Hilbert-ordered tiles, and a hybrid execution model that enables vertex-centric operations in fast host processors and edge-centric operations in massively parallel coprocessors. Our evaluation shows that for smaller graphs, MOSAIC consistently outperforms other state-of-the-art out-of-core engines by 3.2–58.6x and shows comparable performance to distributed graph engines. Furthermore, MOSAIC can complete one iteration of the Pagerank algorithm on a trillion-edge graph in 21 minutes, outperforming a distributed disk-based engine by 9.2x.
@inproceedings{maass:mosaic, title = {{Mosaic: Processing a Trillion-Edge Graph on a Single Machine}}, author = {Steffen Maass and Changwoo Min and Sanidhya Kashyap and Woonhak Kang and Mohan Kumar and Taesoo Kim}, booktitle = {Proceedings of the 12th European Conference on Computer Systems (EuroSys)}, month = apr, year = 2017, address = {Belgrade, RS}, }
2016
In recent years, operating systems have become increasingly complex and thus prone to security and performance issues. Accordingly, system updates to address these issues have become more frequently available and increasingly important. To complete such updates, users must reboot their systems, resulting in unavoidable downtime and loss of the state of running applications.
We present Kup, a practical OS update mechanism that employs userspace checkpoint-and-restart, using an optimized data structure for checkpointing and a memory-persistence mechanism across the update, combined with a fast in-place kernel switch. This allows for instant kernel updates spanning major kernel versions without any kernel modifications.
Our evaluation shows that Kup can support any type of real kernel patch (e.g., security fixes, minor updates, or even major releases) with large-scale applications such as memcached and mysql, or even in the middle of a Linux kernel compilation, unlike well-known dynamic hot-patching techniques (e.g., ksplice). Moreover, Kup can update a running Linux kernel in 3 seconds (overall downtime).
@inproceedings{kashyap:kup, title = {{Instant OS Updates via Userspace Checkpoint-and-Restart}}, author = {Sanidhya Kashyap and Changwoo Min and Byoungyoung Lee and Taesoo Kim and Pavel Emelyanov}, booktitle = {Proceedings of the 2016 USENIX Annual Technical Conference (ATC)}, month = jun, year = 2016, address = {Denver, CO}, }
We analyze the manycore scalability of five widely deployed file systems, namely ext4, XFS, btrfs, F2FS, and tmpfs, using our open-source benchmark suite, FxMark. FxMark implements 19 microbenchmarks to stress specific components of each file system and includes three application benchmarks to measure the macroscopic scalability behavior. We observe that file systems are hidden scalability bottlenecks in many I/O-intensive applications, even when there is no apparent contention at the application level. We found 25 scalability bottlenecks in file systems, many of which are unexpected or counterintuitive. We draw a set of observations on file system scalability behavior and unveil several core aspects of file system design that systems researchers must address.
@inproceedings{min:fxmark, title = {{Understanding Manycore Scalability of File Systems}}, author = {Changwoo Min and Sanidhya Kashyap and Steffen Maass and Woonhak Kang and Taesoo Kim}, booktitle = {Proceedings of the 2016 USENIX Annual Technical Conference (ATC)}, month = jun, year = 2016, address = {Denver, CO}, }
With increasing demand for big-data processing and faster in-memory databases, cloud providers are moving towards large virtualized instances in addition to focusing on horizontal scalability.
However, our experiments reveal that such instances in popular cloud services (e.g., 32 vCPUs with 208 GB supported by Google Compute Engine) do not achieve the desired scalability with increasing core count, even for a simple, embarrassingly parallel job (e.g., a Linux kernel compile). More seriously, the internal synchronization scheme (e.g., the paravirtualized ticket spinlock) of a virtualized instance on a machine with a higher core count (e.g., 80 cores) dramatically degrades its overall performance. Our finding is different from the previously well-known scalability problem (i.e., the lock contention problem) and occurs because of the sophisticated optimization techniques implemented in the hypervisor, what we call the sleepy spinlock anomaly. To solve this problem, we design and implement OTICKET, a variant of the paravirtualized ticket spinlock that effectively scales virtualized instances in both undersubscribed and oversubscribed environments.
@article{kashyap:oppspinlocks, title = {{Opportunistic Spinlocks: Achieving Virtual Machine Scalability in the Clouds}}, author = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim}, journal = {ACM SIGOPS Operating Systems Review}, year = 2016, month = mar, volume = 50, number = 1, }
2015
Today, systems software is too complex to be bug-free. To find bugs in systems software, developers often rely on code checkers, like Linux's Sparse. However, the capability of existing tools used in commodity, large-scale systems is limited to finding only shallow bugs that tend to be introduced by simple programmer mistakes and thus do not require a deep understanding of the code to find. Unfortunately, the majority of bugs, as well as those that are difficult to find, are semantic ones, which violate high-level rules or invariants (e.g., missing a permission check). Thus, it is difficult for code checkers lacking an understanding of a programmer's true intention to reason about semantic correctness.
To solve this problem, we present Juxta, a tool that automatically infers high-level semantics directly from source code. The key idea in Juxta is to compare and contrast multiple existing implementations that obey latent yet implicit high-level semantics. For example, the implementation of open() at the file system layer expects to handle an out-of-space error from the disk in all file systems. We applied Juxta to 54 file systems in the stock Linux kernel (680K LoC), found 118 previously unknown semantic bugs (one bug per 5.8K LoC), and provided corresponding patches to 39 different file systems, including mature, popular ones like ext4, btrfs, XFS, and NFS. These semantic bugs are not easy to locate: on average, the ones found by Juxta have existed for over 6.2 years. Not only do our empirical results look promising, but the design of Juxta is also generic enough to be extended easily beyond file systems to any software with multiple implementations, like web browsers or protocols at the same layer of a network stack.
@inproceedings{min:juxta, title = {{Cross-checking Semantic Correctness: The Case of Finding File System Bugs}}, author = {Changwoo Min and Sanidhya Kashyap and Byoungyoung Lee and Chengyu Song and Taesoo Kim}, booktitle = {Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP)}, month = oct, year = 2015, address = {Monterey, CA}, }
With increasing demand for big-data processing and faster in-memory databases, cloud providers are gearing towards large virtualized instances rather than horizontal scalability.
However, our experiments reveal that such instances in popular cloud services (e.g., 32 vCPUs with 208 GB supported by Google Compute Engine) do not achieve the desired scalability with increasing core count, even for a simple, embarrassingly parallel job (e.g., a kernel compile). More seriously, the internal synchronization scheme (e.g., the paravirtualized ticket spinlock) of a virtualized instance on a machine with a higher core count (e.g., 80 cores) dramatically degrades its overall performance. Our finding is different from the previously well-known scalability problem (the lock contention problem) and occurs because of the sophisticated optimization techniques implemented in the hypervisor, what we call the sleepy spinlock anomaly. To solve this problem, we design and implement oticket, a variant of the paravirtualized ticket spinlock that effectively scales virtualized instances in both undersubscribed and oversubscribed environments.
@inproceedings{kashyap:oticket, title = {{Scalability In The Clouds! A Myth Or Reality?}}, author = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim}, booktitle = {Proceedings of the 6th Asia-Pacific Workshop on Systems (APSys)}, month = jul, year = 2015, address = {Tokyo, Japan}, }