Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao Wang, and Dennis Cai, Alibaba Cloud
Despite the success of diagnosis systems in traditional cloud computing, these systems are not suitable for pinpointing faults in AI model training cloud scenarios, because the computing paradigm of model training differs from that of traditional cloud computing. As one of the largest cloud providers, we present Aegis, a fault diagnosis system specifically designed for AI model training service. We share our experience in the motivation, design, and evolution of Aegis. With ease of deployment as the primary principle, Aegis Phase-1 started by enhancing existing general-purpose diagnosis systems. After several months of evolution, Aegis Phase-2 deliberately chose to customize the collective communication library for sophisticated failure localization at runtime without modifying customer code. Beyond failure localization, we further equipped Aegis with the ability to handle performance degradation and to check for failures before delivery. Aegis has been deployed in our production training cloud service for one year. It reduces the idle time wasted on diagnosis by more than 97%, the training task restart count by 84%, and performance degradation by 71%.
NSDI '25 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

author = {Jianbo Dong and Kun Qian and Pengcheng Zhang and Zhilong Zheng and Liang Chen and Fei Feng and Yichi Xu and Yikai Zhu and Gang Lu and Xue Li and Zhihui Ren and Zhicheng Wang and Bin Luo and Peng Zhang and Yang Liu and Yanqing Chen and Yu Guan and Weicheng Wang and Chaojie Yang and Yang Zhang and Man Yuan and Hanyu Zhao and Yong Li and Zihan Zhao and Shan Li and Xianlong Zeng and Zhiping Yao and Binzhang Fu and Ennan Zhai and Wei Lin and Chao Wang and Dennis Cai},
title = {Evolution of Aegis: Fault Diagnosis for {AI} Model Training Service in Production},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {865--881},
url = {https://www.usenix.org/conference/nsdi25/presentation/dong},
publisher = {USENIX Association},
month = apr
}