Publications

2022

Odinfs: Scaling PM performance with Opportunistic Delegation Diyu Zhou, Yuchen Qian, Vishal Gupta, Zhifei Yang, Changwoo Min, and Sanidhya Kashyap. OSDI 2022.

Existing file systems for persistent memory (PM) exploit its byte-addressable non-volatile access with low latency and high bandwidth. However, they do not utilize two unique PM properties effectively. The first one is contention awareness, i.e., a small number of threads cannot thoroughly saturate the PM bandwidth, while many concurrent accesses lead to significant PM performance degradation. The second one is NUMA awareness, i.e., exploiting the remote PM efficiently, as accessing remote PM naively leads to significant performance degradation.

We present Odinfs, a NUMA-aware scalable datapath PM file system that addresses these two challenges using a novel opportunistic delegation scheme. Under this scheme, Odinfs decouples the PM accesses from application threads with the help of background threads that access PM on behalf of the application. Because of PM access decoupling, Odinfs automatically parallelizes the access to PM across NUMA nodes in a controlled and localized manner. Our evaluation shows that Odinfs outperforms existing PM file systems up to 32.7× on real-world workloads.

@inproceedings{zhou:odinfs,
  title        = {{Odinfs: Scaling PM performance with Opportunistic Delegation}},
  author       = {Diyu Zhou and Yuchen Qian and Vishal Gupta and Zhifei Yang and Changwoo Min and Sanidhya Kashyap},
  booktitle    = {Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  month        = jul,
  year         = 2022,
  address      = {Carlsbad, CA},
}
Application-Informed Kernel Synchronization Primitives Sujin Park, Diyu Zhou, Yuchen Qian, Irina Calciu, Taesoo Kim, and Sanidhya Kashyap. OSDI 2022.

Kernel synchronization primitives are the backbone of any OS design. Kernel locks, for instance, are crucial for both application performance and correctness. However, unlike application locks, kernel locks are far from the reach of application developers, who have minimal interpolation of the kernel's behavior and cannot control or influence the policies that govern kernel synchronization behavior. This disconnect between the kernel and applications can lead to pathological scenarios in which optimizing the kernel synchronization primitives under one context, such as high contention, leads to adversarial effects under a context with no lock contention. In addition, rapid-evolving heterogeneous hardware makes kernel lock development too slow for modern applications with stringent performance requirements and frequent deployment timelines.

This paper addresses the above issues with application-informed kernel synchronization primitives. We allow application developers to deploy workload-specific and hardware-aware kernel lock policies to boost application performance, resolve pathological usage of kernel locks, and even enable dynamic profiling of locks of interest. To showcase this idea, we design SynCord, a framework to modify kernel locks without recompiling or rebooting the kernel. SynCord abstracts key behaviors of kernel locks and exposes them as APIs for designing user-defined kernel locks. SynCord provides the mechanisms to customize kernel locks safely and correctly from the user space. We design five lock policies specialized for new heterogeneous hardware and specific software requirements. Our evaluation shows that SynCord incurs minimal runtime overhead and generates kernel locks with performance comparable to that of the state-of-the-art locks.

@inproceedings{park:syncord,
  title        = {{Application-Informed Kernel Synchronization Primitives}},
  author       = {Sujin Park and Diyu Zhou and Yuchen Qian and Irina Calciu and Taesoo Kim and Sanidhya Kashyap},
  booktitle    = {Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  month        = jul,
  year         = 2022,
  address      = {Carlsbad, CA},
}
Beating the I/O bottleneck: A case for log-structured virtual disks Mohammad Hossein Hajkazemi, Vojtech Aschenbrenner, Mania Abdi, Emine Ugur Kaynar, Amin Mossayebzadeh, Orran Krieger, and Peter Desnoyers. EuroSys 2022.

With the increasing dominance of SSDs for local storage, today's network mounted virtual disks can no longer offer competitive performance. We propose a Log-Structured Virtual Disk (LSVD) that couples log-structured approaches at both the cache and storage layer to provide a virtual disk on top of S3-like storage. Both cache and backend store are orderpreserving, enabling LSVD to provide strong consistency guarantees in case of failure. Our prototype demonstrates that the approach preserves all the advantages of virtual disks, while offering dramatic performance improvements over not only commonly used virtual disks, but the same disks combined with inconsistent (i.e. unsafe) local caching.

@inproceedings{hajkazemi-aschenbrenner:lsvd,
  title        = {{Beating the I/O bottleneck: A case for log-structured virtual disks}},
  author       = {Mohammad Hossein Hajkazemi and Vojtech Aschenbrenner and Mania Abdi and Emine Ugur Kaynar and Amin Mossayebzadeh and Orran Krieger
    and Peter Desnoyers},
  booktitle    = {Proceedings of the 17th European Conference on Computer Systems (EuroSys)},
  month        = apr,
  year         = 2022,
  address      = {Rennes, France},
}

2021

Birds of a Feather Flock Together: Scaling RDMA RPCs with FLOCK Sumit Kumar Monga, Sanidhya Kashyap, and Changwoo Min. SOSP 2021.

RDMA-capable networks are gaining traction with datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features, such as remote memory operations. However, efficiently utilizing RDMA capability in a common setting of high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability, which does not scale with increasing cluster size. To address that, several works forgo some RDMA features by only focusing on conventional RPC APIs.

In this work, we strive to exploit the full capability of RDMA, while scaling the number of connections regardless of the cluster size. We present Flock, a communication framework for RDMA networks that uses hardware provided reliable connection. Using a partially shared model, Flock departs from the conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements contrary to the widely held belief that connection sharing deteriorates performance. At its core, Flock uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing along with ensuring fair utilization of network connections. We demonstrate the benefits for a distributed transaction processing system and an in-memory index, where it outperforms other RPC systems by up to 88% and 50%, respectively, with significant reductions in median and tail latency.

@inproceedings{monga:flock,
  title        = {{Birds of a Feather Flock Together: Scaling RDMA RPCs with FLOCK}},
  author       = {Sumit Kumar Monga and Sanidhya Kashyap and Changwoo Min},
  booktitle    = {Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP)},
  month        = oct,
  year         = 2021,
  address      = {Koblenz, Germany},
}
PACTree: A High Performance Persistent Range Index Using PAC Guidelines Wook-Hee Kim, R. Madhava Krishnan, Xinwei Fu, Sanidhya Kashyap, and Changwoo Min. SOSP 2021.

Non-Volatile Memory (NVM), which provides relatively fast and byte-addressable persistence, is now commercially available. However, we cannot equate a real NVM with a slow DRAM, as it is much more complicated than we expect. In this work, we revisit and analyze both NVM and NVM- specific persistent memory indexes. We find that there is still a lot of room for improvement if we consider NVM hardware, its software stack, persistent index design, and concurrency control. Based on our analysis, we propose Packed Asynchronous Concurrency (PAC) guidelines for designing high-performance persistent index structures. The key idea behind the guidelines is to 1) access NVM hardware in a packed manner to minimize its bandwidth utilization and 2) exploit asynchronous concurrency control to decouple the long NVM latency from the critical path of the index.

We develop PACTree, a high-performance persistent range index following the PAC guidelines. PACTree is a hybrid index that employs a trie index for its internal nodes and B+-tree-like leaf nodes. The trie index structure packs partial keys in internal nodes. Moreover, we decouple the trie index and B+-tree-like leaf nodes. The decoupling allows us to prevent blocking concurrent accesses by updating internal nodes asynchronously. Our evaluation shows that PACTree outperforms state-of-the-art persistent range indexes by 7x in performance and 20x in 99.99 percentile tail latency.

@inproceedings{kim:pactree,
  title        = {{PACTree: A High Performance Persistent Range Index Using PAC Guidelines}},
  author       = {Wook-Hee Kim and R. Madhava Krishnan and Xinwei Fu and Sanidhya Kashyap and Changwoo Min},
  booktitle    = {Proceedings of the 28th ACM Symposium on Operating Systems Principles (SOSP)},
  month        = oct,
  year         = 2021,
  address      = {Koblenz, Germany},
}
NrOs: Effective Replication and Sharing in an Operating System Ankit Bhardwaj, Chinmay Kulkarni, Reto Achermann, Irina Calciu, Sanidhya Kashyap, Ryan Stutsman, Amy Tai, and Gerd Zellweger. OSDI 2021.
@inproceedings{bhardwaj:nros,
  title        = {{NrOs: Effective Replication and Sharing in an Operating System}},
  author       = {Ankit Bhardwaj and Chinmay Kulkarni and Reto Achermann and Irina Calciu and Sanidhya Kashyap and Ryan Stutsman and Amy Tai and Gerd
    Zellweger},
  booktitle    = {Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  month        = oct,
  year         = 2021,
  address      = {Virtual},
}
Preventing Use-After-Free Attacks with Fast Forward Allocation Brian Wickman, Hong Hu, Insu Yun, Daehee Jang, JungWon Lim, Sanidhya Kashyap, and Taesoo Kim. USENIX Security 2021.

Memory-unsafe languages are widely used to implement critical systems like kernels and browsers, leading to thousands of memory safety issues every year. A use-after-free bug is a temporal memory error where the program accidentally visits a freed memory location. Recent studies show that useafter-free is one of the most exploited memory vulnerabilities. Unfortunately, previous efforts to mitigate use-after-free bugs are not widely deployed in real-world programs due to either inadequate accuracy or high performance overhead.

In this paper, we propose to resurrect the idea of one-time allocation (OTA) and provide a practical implementation with efficient execution and moderate memory overhead. With onetime allocation, the memory manager always returns a distinct memory address for each request. Since memory locations are not reused, attackers cannot reclaim freed objects, and thus cannot exploit use-after-free bugs. We utilize two techniques to render OTA practical: batch page management and the fusion of bump-pointer and fixed-size bins memory allocation styles. Batch page management helps reduce the number of system calls which negatively impact performance, while blending the two allocation methods mitigates the memory overhead and fragmentation issues. We implemented a prototype, called FFmalloc, to demonstrate our techniques. We evaluated FFmalloc on widely used benchmarks and real-world large programs. FFmalloc successfully blocked all tested useafter-free attacks while introducing moderate overhead. The results show that OTA can be a strong and practical solution to thwart use-after-free threats.

@inproceedings{wickman:ffmalloc,
  title        = {{Preventing Use-After-Free Attacks with Fast Forward Allocation}},
  author       = {Brian Wickman and Hong Hu and Insu Yun and Daehee Jang and JungWon Lim and Sanidhya Kashyap and Taesoo Kim},
  booktitle    = {Proceedings of the 30th USENIX Security Symposium (Security)},
  month        = aug,
  year         = 2021,
  address      = {Vancouver, B.C., Canada},
}
Contextual Concurrency Control Sujin Park, Irina Calciu, Taesoo Kim, and Sanidhya Kashyap. HotOS XVIII 2021.

Kernel synchronization primitives are of paramount importance to achieving good performance and scalability for applications. However, they are usually invisible and out of the reach of application developers. Instead, kernel developers and synchronization experts make all the decisions regarding kernel lock design.

In this paper, we propose contextual concurrency control (C3), a new paradigm that enables userspace applications to tune concurrency control in the kernel. C3 allows developers to change the behavior and parameters of kernel locks, to switch between different lock implementations and to dynamically profile one or multiple locks for a specific scenario of interest.

To showcase this idea, we designed and implemented CONCORD, a framework that allows a privileged userspace process to modify kernel locks on the fly without re-compiling the existing code base. We performed a preliminary evaluation on two locks showing that CONCORD allows userspace tuning of kernel locks without incurring significant overhead.

@inproceedings{park:c3,
  title        = {{Contextual Concurrency Control}},
  author       = {Sujin Park and Irina Calciu and Taesoo Kim and Sanidhya Kashyap},
  booktitle    = {18th USENIX Workshop on Hot Topics in Operating Systems (HotOS) (HotOS XVIII)},
  month        = may,
  year         = 2021,
  address      = {Virtual},
}
Rethinking Software Runtimes for Disaggregated Memory Irina Calciu, M. Talha Imran, Ivan Puddu, Sanidhya Kashyap, Hasan Al Maruf, Onur Mutlu, and Aasheesh Kolli. ASPLOS 2021.

Disaggregated memory can address resource provisioning inefficiencies in current datacenters. Multiple software runtimes for disaggregated memory have been proposed in an attempt to make disaggregated memory practical. These systems rely on the virtual memory subsystem to transparently offer disaggregated memory to applications using a local memory abstraction. Unfortunately, using virtual memory for disaggregation has multiple limitations, including high overhead that comes from the use of page faults to identify what data to fetch and cache locally, and high dirty data amplification that comes from the use of page-granularity for tracking changes to the cached data (4KB or higher). In this paper, we propose a fundamentally new approach to designing software runtimes for disaggregated memory that addresses these limitations. Our main observation is that we can use cache coherence instead of virtual memory for tracking applications’ memory accesses transparently, at cache-line granularity. This simple idea (1) eliminates page faults from the application critical path when accessing remote data, and (2) decouples the application memory access tracking from the virtual memory page size, enabling cache-line granularity dirty data tracking and eviction. Using this observation, we implemented a new software runtime for disaggregated memory that improves average memory access time by 1.7-5X and reduces dirty data amplification by 2-10X, compared to state-of-the-art systems.

@inproceedings{calciu:kona,
  title        = {{Rethinking Software Runtimes for Disaggregated Memory}},
  author       = {Irina Calciu and M. Talha Imran and Ivan Puddu and Sanidhya Kashyap and Hasan Al Maruf and Onur Mutlu and Aasheesh Kolli},
  booktitle    = {Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    (ASPLOS)},
  month        = apr,
  year         = 2021,
  address      = {Virtual},
}

2020

Krace: Data Race Fuzzing for Kernel File Systems Meng Xu, Sanidhya Kashyap, Hanqing Zhao, and Taesoo Kim. IEEE SP 2020.

Data races occur when two threads fail to use proper synchronization when accessing shared data. In kernel file systems, which are highly concurrent by design, data races are common mistakes and often wreak havoc on the users, causing inconsistent states or data losses. Prior fuzzing practices on file systems have been effective in uncovering hundreds of bugs, but they mostly focus on the sequential aspect of file system execution and do not comprehensively explore the concurrency dimension and hence, forgo the opportunity to catch data races.

In this paper, we bring coverage-guided fuzzing to the concurrency dimension with three new constructs: 1) a new coverage tracking metric, alias coverage, specially designed to capture the exploration progress in the concurrency dimension; 2) an evolution algorithm for generating, mutating, and merging multi-threaded syscall sequences as inputs for concurrency fuzzing; and 3) a comprehensive lockset and happens-before modeling for kernel synchronization primitives for precise data race detection. These components are integrated into KRACE, an end-to-end fuzzing framework that has discovered 23 data races in ext4, btrfs, and the VFS layer so far, and 9 are confirmed to be harmful.

@inproceedings{xu:krace,
  title        = {{Krace: Data Race Fuzzing for Kernel File Systems}},
  author       = {Meng Xu and Sanidhya Kashyap and Hanqing Zhao and Taesoo Kim},
  booktitle    = {Proceedings of the 41st IEEE Symposium on Security and Privacy (Oakland)},
  month        = may,
  year         = 2020,
  address      = {San Francisco, CA},
}

2019

Finding Semantic Bugs in File Systems with an Extensible Fuzzing Framework Seulbae Kim, Meng Xu, Sanidhya Kashyap, Jungyeon Yoon, Wen Xu, and Taesoo Kim. SOSP 2019.

File systems are too large to be bug free. Although handwritten test suites have been widely used to stress file systems, they can hardly keep up with the rapid increase in file system size and complexity, leading to new bugs being introduced and reported regularly. These bugs come in various flavors: simple buffer overflows to sophisticated semantic bugs. Although bug-specific checkers exist, they generally lack a way to explore file system states thoroughly. More importantly, no turnkey solution exists that unifies the checking effort of various aspects of a file system under one umbrella.

In this paper, we highlight the potential of applying fuzzing to find not just memory errors but, in theory, any type of file system bugs with an extensible fuzzing framework: Hydra. Hydra provides building blocks for file system fuzzing, including input mutators, feedback engines, a libOS-based executor, and a bug reproducer with test case minimization. As a result, developers only need to focus on building the core logic for finding bugs of their own interests. We showcase the effectiveness of Hydra with four checkers that hunt crash inconsistency, POSIX violations, logic assertion failures, and memory errors. So far, Hydra has discovered 91 new bugs in Linux file systems, including one in a verified file system (FSCQ), as well as four POSIX violations.

@inproceedings{kim:hydra,
  title        = {{Finding Semantic Bugs in File Systems with an Extensible Fuzzing Framework}},
  author       = {Seulbae Kim and Meng Xu and Sanidhya Kashyap and Jungyeon Yoon and Wen Xu and Taesoo Kim},
  booktitle    = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)},
  month        = oct,
  year         = 2019,
  address      = {Ontario, Canada},
}
Scalable and Practical Locking With Shuffling Sanidhya Kashyap, Irina Calciu, Xiaohe Cheng, Changwoo Min, and Taesoo Kim. SOSP 2019.

Locks are an essential building block for high-performance multicore system software. To meet performance goals, lock algorithms have evolved towards specialized solutions for architectural characteristics (e.g., NUMA). However, in practice, applications run on different server platforms and exhibit widely diverse behaviors that evolve with time (e.g., number of threads, number of locks). This creates performance and scalability problems for locks optimized for a single scenario and platform. For example, popular spinlocks suffer from excessive cache-line bouncing in NUMA systems, while scalable, NUMA-aware locks exhibit sub-par single-thread performance.

In this paper, we identify four dominating factors that impact the performance of lock algorithms. We then propose a new technique, shuffling, that can dynamically accommodate all these factors, without slowing down the critical path of the lock. The key idea of shuffling is to re-order the queue of threads waiting to acquire the lock in accordance with some pre-established policy. For best performance, this work is done off the critical path, by the waiter threads. Using shuffling, we demonstrate how to achieve NUMA-awareness and implement an efficient parking/wake-up strategy, without any auxiliary data structure, mostly off the critical path. The evaluation shows that our family of locks based on shuffling improves the throughput of real-world applications up to 12.5x, with impressive memory footprint reduction compared with the recent lock algorithms.

@inproceedings{kashyap:shfllock,
  title        = {{Scalable and Practical Locking With Shuffling}},
  author       = {Sanidhya Kashyap and Irina Calciu and Xiaohe Cheng and Changwoo Min and Taesoo Kim},
  booktitle    = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)},
  month        = oct,
  year         = 2019,
  address      = {Ontario, Canada},
}
RECIPE: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes Se Kwon Lee, Jayashree Mohan, Sanidhya Kashyap, Taesoo Kim, and Vijay Chidambaram. SOSP 2019.

We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM). The main insight behind Recipe is that isolation provided by a certain class of concurrent in-memory indexes can be translated with small changes to crash-consistency when the same index is used in PM. We present a set of conditions that enable the identification of this class of DRAM indexes, and the actions to be taken to convert each index to be persistent. Based on these conditions and conversion actions, we modify five different DRAM indexes based on B+ trees, tries, radix trees, and hash tables to their crash-consistent PM counterparts. The effort involved in this conversion is minimal, requiring 30--200 lines of code. We evaluated the converted PM indexes on Intel DC Persistent Memory, and found that they outperform state-of-the-art, hand-crafted PM indexes in multi-threaded workloads by up-to 5.2x. For example, we built P-CLHT, our PM implementation of the CLHT hash table by modifying only 30 LOC. When running YCSB workloads, P-CLHT performs up to 2.4x better than Cacheline-Conscious Extendible Hashing (CCEH), the state-of-the-art PM hash table.

@inproceedings{lee:recipe,
  title        = {{RECIPE: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes}},
  author       = {Se Kwon Lee and Jayashree Mohan and Sanidhya Kashyap and Taesoo Kim and Vijay Chidambaram},
  booktitle    = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)},
  month        = oct,
  year         = 2019,
  address      = {Ontario, Canada},
}
SplitFS: Reducing Software Overhead in File Systems for Persistent Memory Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. SOSP 2019.

We present SplitFS, a file system for persistent memory (PM) that reduces software overhead significantly compared to state-of-the-art PM file systems. SplitFS presents a novel split of responsibilities between a user-space library file system and an existing kernel PM file system. The user-space library file system handles data operations by intercepting POSIX calls, memory-mapping the underlying file, and serving the read and overwrites using processor loads and stores. Metadata operations are handled by the kernel PM file system (ext4 DAX). SplitFS introduces a new primitive termed relink to efficiently support file appends and atomic data operations. SplitFS provides three consistency modes, which different applications can choose from, without interfering with each other. SplitFS reduces software overhead by up-to 4x compared to the NOVA PM file system, and 17x compared to ext4 DAX. On a number of micro-benchmarks and applications such as the LevelDB key-value store running the YCSB benchmark, SplitFS increases application performance by up to 2x compared to ext4 DAX and NOVA while providing similar consistency guarantees.

@inproceedings{kadekodi:splitfs,
  title        = {{SplitFS: Reducing Software Overhead in File Systems for Persistent Memory}},
  author       = {Rohan Kadekodi and Se Kwon Lee and Sanidhya Kashyap and Taesoo Kim and Aasheesh Kolli and Vijay Chidambaram},
  booktitle    = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP)},
  month        = oct,
  year         = 2019,
  address      = {Ontario, Canada},
}
Fuzzing File Systems via Two-Dimensional Input Space Exploration Wen Xu, Hyungon Moon, Sanidhya Kashyap, Po-Ning Tseng, and Taesoo Kim. IEEE SP 2019.

File systems, a basic building block of an OS, are too big and too complex to be bug free. Nevertheless, file systems rely on regular stress-testing tools and formal checkers to find bugs, which are limited due to the ever-increasing complexity of both file systems and OSes. Thus, fuzzing, proven to be an effective and a practical approach, becomes a preferable choice, as it does not need much knowledge about a target. However, three main challenges exist in fuzzing file systems: mutating a large image blob that degrades overall performance, generating image-dependent file operations, and reproducing found bugs, which is difficult for existing OS fuzzers.

Hence, we present JANUS, the first feedback-driven fuzzer that explores the two-dimensional input space of a file system, i.e., mutating metadata on a large image, while emitting image- directed file operations. In addition, JANUS relies on a library OS rather than on traditional VMs for fuzzing, which enables JANUS to load a fresh copy of the OS, thereby leading to better reproducibility of bugs. We evaluate JANUS on eight file systems and found 90 bugs in the upstream Linux kernel, 62 of which have been acknowledged. Forty-three bugs have been fixed with 32 CVEs assigned. In addition, JANUS achieves higher code coverage on all the file systems after fuzzing 12 hours, when compared with the state-of-the-art fuzzer Syzkaller for fuzzing file systems. JANUS visits 4.19× and 2.01× more code paths in Btrfs and ext4, respectively. Moreover, JANUS is able to reproduce 88–100% of the crashes, while Syzkaller fails on all of them.

@inproceedings{xu:janus,
  title        = {{Fuzzing File Systems via Two-Dimensional Input Space Exploration}},
  author       = {Wen Xu and Hyungon Moon and Sanidhya Kashyap and Po-Ning Tseng and Taesoo Kim},
  booktitle    = {Proceedings of the 40th IEEE Symposium on Security and Privacy (Oakland)},
  month        = may,
  year         = 2019,
  address      = {San Francisco, CA},
}
MV-RLU: Scaling Read-Log-Update with Multi-Versioning Jaeho Kim, Ajit Mathew, Sanidhya Kashyap, Madhava Krishnan Ramanathan, and Changwoo Min. ASPLOS 2019.

This paper presents multi-version read-log-update (MVRLU), an extension of the read-log-update (RLU) synchronization mechanism. While RLU has many merits including an intuitive programming model and excellent performance for read-mostly workloads, we observed that the performance of RLU significantly drops in workloads with more write operations. The core problem is that RLU manages only two versions. % and its log reclamation is synchronous. To overcome such limitation, we extend RLU to support multi-versioning and propose new techniques to make multi-versioning efficient. At the core of MVRLU design is concurrent autonomous garbage collection, which prevents reclaiming invisible versions being a bottleneck, and reduces the version traversal overhead - the main overhead of multi-version design. We extensively evaluate MVRLU with the state-of-the-art synchronization mechanisms, including RCU, RLU, software transactional memory (STM), and lock-free approaches, on concurrent data structures and real-world applications (database concurrency control and in-memory key-value store). Our evaluation shows that MVRLU significantly outperforms other techniques for a wide range of workloads with varying contention levels and data-set size.

@inproceedings{kim:mvrlu,
  title        = {{MV-RLU: Scaling Read-Log-Update with Multi-Versioning}},
  author       = {Jaeho Kim and Ajit Mathew and Sanidhya Kashyap and Madhava Krishnan Ramanathan and Changwoo Min},
  booktitle    = {Proceedings of the 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    (ASPLOS)},
  month        = apr,
  year         = 2019,
  address      = {Providence, RI},
}

2018

Scaling Guest OS Critical Sections with eCS Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. ATC 2018.

Multi-core virtual machines (VMs) are now a norm in data center environments. However, one of the well-known problems that VMs suffer from is the vCPU scheduling problem that causes poor scalability behaviors. More specifically, the symptoms of this problem appear as preemption problems in both under- and over-committed scenarios. Although prior research efforts attempted to alleviate these symptoms separately, they fail to address the common root cause of these problems: the missing semantic gap that occurs when a guest OS is preempted while executing its own critical section, thereby leading to degradation of application scalability.

In this work, we strive to address all preemption problems together by bridging the semantic gap between guest OSes and the hypervisor: the hypervisor now knows whether guest OSes are running in critical sections and a guest OS has hypervisor's scheduling context. We annotate all critical sections by using the lightweight para-virtualized APIs, so we called enlightened critical sections ($e$CS), that provide scheduling hints to both the hypervisor and VMs. The hypervisor uses the hint to reschedule a vCPU to fundamentally overcome the double scheduling problem for these annotated critical sections and VMs use the hypervisor provided hints to further mitigate the blocked-waiter wake-up problem. Our evaluation results show that $e$CS guarantees the forward progress of a guest OS by 1) decreasing preemption counts by 85--100% while 2) improving the throughput of applications up to 2.5X in an over-committed scenario and 1.6X in an under-committed scenario for various real-world workloads on an 80-core machine.

@inproceedings{kashyap:ecs,
  title        = {{Scaling Guest OS Critical Sections with eCS}},
  author       = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim},
  booktitle    = {Proceedings of the 2018 USENIX Annual Technical Conference (ATC)},
  month        = jul,
  year         = 2018,
  address      = {Boston, MA},
}
SOLROS: A Data-Centric Operating System Architecture for Heterogeneous Computing Changwoo Min, Woon-Hak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, and Taesoo Kim. EuroSys 2018.

We propose Solros - a new operating system architecture for heterogeneous systems that comprises fast host processors, slow but massively parallel co-processors, and fast I/O devices. A general consensus to fully drive such a hardware system is to have a tight integration among processors and I/O devices. Thus, in the Solros architecture, a co-processor OS (data-plane OS) delegates its services, specifically I/O stacks, to the host OS (control-plane OS). Our observation for such a design is that global coordination with system-wide knowledge (e.g., PCIe topology, a load of each co-processor) and the best use of heterogeneous processors is critical to achieving high performance. Hence, we fully harness these specialized processors by delegating complex I/O stacks on fast host processors, which leads to an efficient global coordination at the level of the control-plane OS.

We developed Solros with Xeon Phi co-processors and implemented three core OS services: transport, file system, and network services. Our experimental results show significant performance improvement compared with the stock Xeon Phi running the Linux kernel. For example, Solros improves the throughput of file system and network operations by 19x and 7x, respectively. Moreover, it improves the performance of two realistic applications: 19x for text indexing and 2x for image search.

@inproceedings{min:solros,
  title        = {{SOLROS: A Data-Centric Operating System Architecture for Heterogeneous Computing}},
  author       = {Changwoo Min and Woon-Hak Kang and Mohan Kumar and Sanidhya Kashyap and Steffen Maass and Heeseung Jo and Taesoo Kim},
  booktitle    = {Proceedings of the 13th European Conference on Computer Systems (EuroSys)},
  month        = apr,
  year         = 2018,
  address      = {Porto, Portugal},
}
A Scalable Ordering Primitive For Multicore Machines Sanidhya Kashyap, Changwoo Min, Kangnyeon Kim, and Taesoo Kim. EuroSys 2018.

Timestamping is an essential building block for designing concurrency control mechanisms and concurrent data structures. Various algorithms either employ physical timestamping, assuming that they have access to synchronized clocks, or maintain a logical clock with the help of atomic instructions. Unfortunately, these approaches have two problems. First, hardware developers do not guarantee that the available hardware clocks are exactly synchronized, which they find difficult to achieve in practice. Second, the atomic instructions are a deterrent to scalability resulting from cache-line contention. This paper addresses these problems by proposing and designing a scalable ordering primitive, called Ordo, that relies on invariant hardware clocks. Ordo not only enables the correct use of these clocks, by providing a notion of a global hardware clock, but also frees various logical timestamp-based algorithms from the burden of the software logical clock, while trying to simplify their design. We use the Ordo primitive to redesign 1) a concurrent data structure library that we apply on the Linux kernel; 2) a synchronization mechanism for concurrent programming; 3) two database concurrency control mechanisms; and 4) a clock-based software transactional memory algorithm. Our evaluation shows that there is a possibility that the clocks are not synchronized on two architectures ( Intel and ARM) and that Ordo generally improves the efficiency of several algorithms by 1.2-39.7x on various architectures.

@inproceedings{kashyap:ordo,
  title        = {{A Scalable Ordering Primitive For Multicore Machines}},
  author       = {Sanidhya Kashyap and Changwoo Min and Kangnyeon Kim and Taesoo Kim},
  booktitle    = {Proceedings of the 13th European Conference on Computer Systems (EuroSys)},
  month        = apr,
  year         = 2018,
  address      = {Porto, Portugal},
}
LATR: Lazy Translation Coherence Mohan Kumar, Steffen Maass, Sanidhya Kashyap, Ján Veselý, Zi Yan, Taesoo Kim, Abhishek Bhattacharjee, and Tushar Krishna. ASPLOS 2018.

We propose LATR-lazy TLB coherence-a software-based TLB shootdown mechanism that can alleviate the overhead of the synchronous TLB shootdown mechanism in existing operating systems. By handling the TLB coherence in a lazy fashion, LATR can avoid expensive IPIs which are required for delivering a shootdown signal to remote cores, and the performance overhead of associated interrupt handlers. Therefore, virtual memory operations, such as free and page migration operations, can benefit significantly from LATR's mechanism. For example, LATR improves the latency of munmap by 70.8% on a 2-socket machine, a widely used configuration in modern data centers. Real-world, performance-critical applications such as web servers can also benefit from LATR: without any application-level changes, LATR improves Apache by 59.9% compared to Linux, and by 37.9% compared to ABIS, a highly optimized, state-of-the-art TLB coherence technique.

@inproceedings{kumar:latr,
  title        = {{LATR: Lazy Translation Coherence}},
  author       = {Mohan Kumar and Steffen Maass and Sanidhya Kashyap and J\'{a}n Vesel\'{y} and Zi Yan and Taesoo Kim and Abhishek Bhattacharjee and
    Tushar Krishna},
  booktitle    = {Proceedings of the 23rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    (ASPLOS)},
  month        = mar,
  year         = 2018,
  address      = {Williamsburg, VA},
}

2017

Designing New Operating Primitives to Improve Fuzzing Performance Wen Xu, Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. CCS 2017.

In recent years, various organizations and communities have been putting numerous computing resources on automated fuzzing, which has been proved as a highly efficient approach to find security bugs in complicated software and OS kernels. Thus the performance of a fuzzer becomes crucial, and the fuzzer that can use less running time to hit more security issues saves significant cost. Existing research focuses on producing input data that is likely to explore more states of a targeted application while ignoring the performance overhead that originates from the operating system side. In fact, the system components that generic fuzzers rely on cause serious performance bottlenecks. Especially when fuzzing on multiple cores, the scalability of the state-of-the-art fuzzer (AFL) slows down by 24x because of its over exploiting the file system as a communication channel, intensively invoking the fork() system call, and heavily interacting with the file system.

In this paper, we design and implement three new operating primitives specialized for fuzzing that solve these performance bottlenecks and achieve scalable performance on multi-core machines. Our experiment shows that the proposed primitives speed up AFL and LibFuzzer by 6.1 to 28.9x and 1.1 to 735.7x, respectively, on the overall number of executions per second when targeting Google's fuzzer test suite with 120 cores. In addition, the primitives improve AFL's throughput up to 7.7x with 30 cores, which is a more common setting in data centers. Our fuzzer-agnostic primitives can be easily applied to any fuzzer with fundamental performance improvement and directly benefit large-scale fuzzing and cloud-based fuzzing services.

@inproceedings{xu:os-fuzz,
  title        = {{Designing New Operating Primitives to Improve Fuzzing Performance}},
  author       = {Wen Xu and Sanidhya Kashyap and Changwoo Min and Taesoo Kim},
  booktitle    = {Proceedings of the 24th ACM Conference on Computer and Communications Security (CCS)},
  month        = oct,
  year         = 2017,
  address      = {Dallas, TX},
}
Scalable NUMA-aware Blocking Synchronization Primitives Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. ATC 2017.

Application scalability is a critical aspect to efficiently use NUMA machines with many cores. To achieve that, various techniques ranging from task placement to data sharding are used in practice. However, from the perspective of an operating system, these techniques often do not work as expected because various subsystems in the OS interact and share data structures among themselves, resulting in scalability bottlenecks. Although current OSes attempt to tackle this problem by introducing a wide range of synchronization primitives such as spinlock and mutex, the widely used synchronization mechanisms are not designed to handle both under- and over-subscribed scenarios in a scalable fashion. In particular, the current blocking synchronization primitives that are designed to address both scenarios are NUMA oblivious, meaning that they suffer from cache-line contention in an under-subscribed situation, and even worse, inherently spur long scheduler intervention, which leads to sub-optimal performance in an over-subscribed situation.

In this work, we present several design choices to implement scalable blocking synchronization primitives that can address both under- and over-subscribed scenarios. Such design decisions include memory-efficient NUMA-aware locks (favorable for deployment) and scheduling-aware, scalable parking and wake-up strategies. To validate our design choices, we implement two new blocking synchronization primitives, which are variants of mutex and read-write semaphore in the Linux kernel. Our evaluation shows that these locks can scale real-world applications by 1.2--1.6X and some of the file system operations up to 4.7X in both under- and over-subscribed scenarios. Moreover, they use 1.5--10X less memory than the state-of-the-art NUMA-aware locks on a 120-core machine.

@inproceedings{kashyap:cst,
  title        = {{Scalable NUMA-aware Blocking Synchronization Primitives}},
  author       = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim},
  booktitle    = {Proceedings of the 2017 USENIX Annual Technical Conference (ATC)},
  month        = jul,
  year         = 2017,
  address      = {Santa Clara, CA},
}
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, and Taesoo Kim. EuroSys 2017.

Processing a one trillion-edge graph has recently been demonstrated by distributed graph engines running on clusters of tens to hundreds of nodes. In this paper, we employ a single heterogeneous machine with fast storage media (e.g., NVMe SSD) and massively parallel coprocessors (e.g., Xeon Phi) to reach similar dimensions. By fully exploiting the heterogeneous devices, we design a new graph processing engine, named MOSAIC, for a single machine. We propose a new locality-optimizing, space-efficient graph representation - Hilbert-ordered tiles, and a hybrid execution model that enables vertex-centric operations in fast host processors and edge-centric operations in massively parallel coprocessors. Our evaluation shows that for smaller graphs, MOSAIC consistently outperforms other state-of-the-art out-of-core engines by 3.2–58.6x and shows comparable performance to distributed graph engines. Furthermore, MOSAIC can complete one iteration of the Pagerank algorithm on a trillion-edge graph in 21 minutes, outperforming a distributed disk-based engine by 9.2x.

@inproceedings{maass:mosaic,
  title        = {{Mosaic: Processing a Trillion-Edge Graph on a Single Machine}},
  author       = {Steffen Maass and Changwoo Min and Sanidhya Kashyap and Woonhak Kang and Mohan Kumar and Taesoo Kim},
  booktitle    = {Proceedings of the 12th European Conference on Computer Systems (EuroSys)},
  month        = apr,
  year         = 2017,
  address      = {Belgrade, RS},
}

2016

Instant OS Updates via Userspace Checkpoint-and-Restart Sanidhya Kashyap, Changwoo Min, Byoungyoung Lee, Taesoo Kim, and Pavel Emelyanov. ATC 2016.

In recent years, operating systems have become increasingly complex and thus prone to security and performance issues. Accordingly, system updates to address these issues have become more frequently available and increasingly important. To complete such updates, users must reboot their systems, resulting in unavoidable downtime and further loss of the states of running applications.

We present , a practical OS update mechanism that uses a userspace checkpoint-and-restart mechanism, by using an optimized data structure for checkpointing and a memory persistence mechanism across update, combined with a fast in-place kernel switch. This allows for instant kernel updates spanning across major kernel versions without any kernel modifications.

Our evaluation shows that Kup can support any type of real kernel patches (e.g., security, minor or even major releases) with large-scale applications that include memcached, mysql, or in the middle of the Linux kernel compilation, unlike well-known dynamic hot-patching techniques (e.g., ksplice). Not only that, Kup can update a running Linux kernel in 3 seconds (overall downtime).

@inproceedings{kashyap:kup,
  title        = {{Instant OS Updates via Userspace Checkpoint-and-Restart}},
  author       = {Sanidhya Kashyap and Changwoo Min and Byoungyoung Lee and Taesoo Kim and Pavel Emelyanov},
  booktitle    = {Proceedings of the 2016 USENIX Annual Technical Conference (ATC)},
  month        = jun,
  year         = 2016,
  address      = {Denver, CO},
}
Understanding Manycore Scalability of File Systems Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim. ATC 2016.

We analyze the manycore scalability of five widelydeployed file systems, namely, ext4, XFS, btrfs, F2FS, and tmpfs, by using our open source benchmark suite, FxMark. FxMark implements 19 microbenchmarks to stress specific components of each file system and includes three application benchmarks to measure the macroscopic scalability behavior. We observe that file systems are hidden scalability bottlenecks in many I/Ointensive applications even when there is no apparent contention at the application level. We found 25 scalability bottlenecks in file systems, many of which are unexpected or counterintuitive. We draw a set of observations on file system scalability behavior and unveil several core aspects of file system design that systems researchers must address.

@inproceedings{min:fxmark,
  title        = {{Understanding Manycore Scalability of File Systems}},
  author       = {Changwoo Min and Sanidhya Kashyap and Steffen Maass and Woonhak Kang and Taesoo Kim},
  booktitle    = {Proceedings of the 2016 USENIX Annual Technical Conference (ATC)},
  month        = jun,
  year         = 2016,
  address      = {Denver, CO},
}
Opportunistic Spinlocks: Achieving Virtual Machine Scalability in the Clouds Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. OSR 2016.

With increasing demand for big-data processing and faster in-memory databases, cloud providers are moving towards large virtualized instances besides focusing on the horizontal scalability.

However, our experiments reveal that such instances in popular cloud services (e.g., 32 vCPUs with 208 GB supported by Google Compute Engine) do not achieve the desired scalability with increasing core count even with a simple, embarrassingly parallel job (e.g., Linux kernel compile). On a serious note, the internal synchronization scheme (e.g., paravirtualized ticket spinlock) of the virtualized instance on a machine with higher core count (e.g., 80-core) dramatically degrades its overall performance. Our finding is different from the previously well-known scalability problem (i.e., lock contention problem) and occurs because of the sophisticated optimization techniques implemented in the hypervisor—what we call sleepy spinlock anomaly. To solve this problem, we design and implement OTICKET, a variant of paravirtualized ticket spinlock that effectively scales the virtualized instances in both undersubscribed and oversubscribed environments.

@inproceedings{kashyap:oppspinlocks,
  title        = {{Opportunistic Spinlocks: Achieving Virtual Machine Scalability in the Clouds}},
  author       = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim},
  booktitle    = {Proceedings of the ACM SIGOPS Operating System Review},
  journal      = {SIGOPS Oper. Syst. Rev.},
  year         = 2016,
  month        = mar,
  volume       = 50,
  number       = 1,
}

2015

Cross-checking Semantic Correctness: The Case of Finding File System Bugs Changwoo Min, Sanidhya Kashyap, Byoungyoung Lee, Chengyu Song, and Taesoo Kim. SOSP 2015.

Today, systems software is too complex to be bug-free. To find bugs in systems software, developers often rely on code checkers, like Linux's Sparse. However, the capability of existing tools used in commodity, large-scale systems is limited to finding only shallow bugs that tend to be introduced by simple programmer mistakes, and so do not require a deep understanding of code to find them. Unfortunately, the majority of bugs as well as those that are difficult to find are semantic ones, which violate high-level rules or invariants (e.g., missing a permission check). Thus, it is difficult for code checkers lacking the understanding of a programmer's true intention to reason about semantic correctness.

To solve this problem, we present Juxta, a tool that automatically infers high-level semantics directly from source code. The key idea in Juxta is to compare and contrast multiple existing implementations that obey latent yet implicit high-level semantics. For example, the implementation of open() at the file system layer expects to handle an out-of-space error from the disk in all file systems. We applied Juxta to 54 file systems in the stock Linux kernel (680K LoC), found 118 previously unknown semantic bugs (one bug per 5.8K LoC), and provided corresponding patches to 39 different file systems, including mature, popular ones like ext4, btrfs, XFS, and NFS. These semantic bugs are not easy to locate, as all the ones found by Juxta have existed for over 6.2 years on average. Not only do our empirical results look promising, but the design of Juxta is generic enough to be extended easily beyond file systems to any software that has multiple implementations, like Web browsers or protocols at the same layer of a network stack.

@inproceedings{min:juxta,
  title        = {{Cross-checking Semantic Correctness: The Case of Finding File System Bugs}},
  author       = {Changwoo Min and Sanidhya Kashyap and Byoungyoung Lee and Chengyu Song and Taesoo Kim},
  booktitle    = {Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP)},
  month        = oct,
  year         = 2015,
  address      = {Monterey, CA},
}
Scalability In The Clouds! A Myth Or Reality? Sanidhya Kashyap, Changwoo Min, and Taesoo Kim. APSys 2015.

With increasing demand of big-data processing and faster in-memory databases, cloud providers are gearing towards large virtualized instances rather than horizontal scalability.

However, our experiments reveal that such instances in popular cloud services (e.g., 32 vCPUs with 208 GB supported by Google Compute Engine) do not achieve the desired scalability with increasing core count even with a simple, embarrassingly parallel job (e.g., kernel compile). On a serious note, the internal synchronization scheme (e.g., paravirtualized ticket spinlock) of the virtualized instance on a machine with higher core count (e.g., 80-core) dramatically degrades its overall performance. Our finding is different from a previously well-known scalability problem (lock contention problem), and occurs because of the sophisticated optimization techniques implemented in the hypervisor, what we call -- sleepy spinlock anomaly. To solve this problem, we design and implement oticket, a variant of paravirtualized ticket spinlock that effectively scales the virtualized instances in both undersubscribed and oversubscribed environments.

@inproceedings{kashyap:oticket,
  title        = {{Scalability In The Clouds! A Myth Or Reality?}},
  author       = {Sanidhya Kashyap and Changwoo Min and Taesoo Kim},
  booktitle    = {Proceedings of the 6th Asia-Pacific Workshop on Systems (APSys)},
  month        = jul,
  year         = 2015,
  address      = {Tokyo, Japan},
}