One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems

Authors: 

Ruiming Lu, University of Michigan and Shanghai Jiao Tong University; Yunchi Lu and Yuxuan Jiang, University of Michigan; Guangtao Xue, Shanghai Jiao Tong University; Peng Huang, University of Michigan

Abstract: 

Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
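To make the contrast between static thresholds and adaptive handling concrete, the following is a minimal sketch and not ADR's actual API or algorithm. All names in it (`is_slow_static`, `AdaptiveSlowDetector`, `window`, `factor`, `min_samples`) are hypothetical, and the median-based rule is only one assumed way to derive a threshold from recent observations.

```python
from collections import deque
import statistics

# Hypothetical fixed cutoff, as a statically configured system might use.
STATIC_THRESHOLD_MS = 1000.0


def is_slow_static(latency_ms, threshold_ms=STATIC_THRESHOLD_MS):
    """Flag a request as slow using a fixed threshold."""
    return latency_ms > threshold_ms


class AdaptiveSlowDetector:
    """Illustrative detector (an assumption, not ADR): flags a sample as slow
    when it exceeds `factor` times the median of recent healthy samples."""

    def __init__(self, window=100, factor=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # sliding window of healthy latencies
        self.factor = factor
        self.min_samples = min_samples

    def observe(self, latency_ms):
        """Return True if latency_ms is slow relative to recent history."""
        slow = False
        if len(self.samples) >= self.min_samples:
            baseline = statistics.median(self.samples)
            slow = latency_ms > self.factor * baseline
        # Keep slow samples out of the window so a sustained fault
        # does not gradually raise the baseline.
        if not slow:
            self.samples.append(latency_ms)
        return slow


detector = AdaptiveSlowDetector()
for _ in range(20):
    detector.observe(10.0)  # warm up with a 10 ms baseline

static_verdict = is_slow_static(50.0)      # 50 ms is under the fixed 1000 ms cutoff
adaptive_verdict = detector.observe(50.0)  # 50 ms is 5x the recent median
```

In this sketch a 5x slowdown on a fast workload goes unnoticed by the fixed cutoff but is flagged by the history-based check, which is the kind of workload sensitivity the abstract argues static thresholds cannot accommodate.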

NSDI '25 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {306013,
author = {Ruiming Lu and Yunchi Lu and Yuxuan Jiang and Guangtao Xue and Peng Huang},
title = {{One-Size-Fits-None}: Understanding and Enhancing {Slow-Fault} Tolerance in Modern Distributed Systems},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {359--378},
url = {https://www.usenix.org/conference/nsdi25/presentation/lu},
publisher = {USENIX Association},
month = apr
}
