GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale

Lingyun Yang; Yongchen Wang; Yinghao Yu; Qizhen Weng; Jianbo Dong; Kan Liu; Chi Zhang; Yanyi Zi; Hao Li; Zechao Zhang; Nan Wang; Yu Dong; Menglei Zheng; Lanlan Xi; Xiaowei Lu; Liang Ye; Guodong Yang; Binzhang Fu; Tao Lan; Liping Zhang; Lin Qu; Wei Wang

Authors:

Lingyun Yang, Hong Kong University of Science and Technology; Yongchen Wang and Yinghao Yu, Alibaba Group; Qizhen Weng, Hong Kong University of Science and Technology; Jianbo Dong, Kan Liu, Chi Zhang, Yanyi Zi, Hao Li, Zechao Zhang, Nan Wang, Yu Dong, Menglei Zheng, Lanlan Xi, Xiaowei Lu, Liang Ye, Guodong Yang, Binzhang Fu, Tao Lan, Liping Zhang, and Lin Qu, Alibaba Group; Wei Wang, Hong Kong University of Science and Technology

Abstract:

Online recommender systems use deep learning recommendation models (DLRMs) to provide accurate, personalized recommendations that improve customer experience. However, efficiently provisioning DLRM services at scale is challenging. DLRMs exhibit distinct resource usage patterns: they require a large number of CPU cores and a tremendous amount of memory, but only a small number of GPUs. Running them on multi-GPU servers quickly exhausts the servers' CPU and memory resources, leaving a large number of unallocated GPUs stranded and unusable by other tasks.

This paper describes Prism, a production DLRM serving system that eliminates GPU fragmentation by means of resource disaggregation. In Prism, a fleet of CPU nodes (CNs) interconnects with a cluster of heterogeneous GPU nodes (HNs) through RDMA, forming two disaggregated resource pools that can scale independently. Prism automatically divides DLRMs into CPU- and GPU-intensive subgraphs and schedules them on CNs and HNs for disaggregated serving. Prism employs various techniques to minimize the latency overhead caused by disaggregation, including optimal graph partitioning, topology-aware resource management, and SLO-aware communication scheduling. Evaluations show that Prism effectively reduces CPU and GPU fragmentation by 53% and 27%, respectively, in a crowded GPU cluster. During seasonal promotion events, it efficiently enables capacity loaning from training clusters, saving over 90% of GPUs. Prism has been deployed in production clusters for over two years and now runs on over 10k GPUs.
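
As a rough illustration of the idea of splitting a model into CPU- and GPU-intensive subgraphs, the following minimal Python sketch places each operator of a toy DLRM graph on a CN or an HN and lists the edges that would cross the network. It is not Prism's partitioning algorithm, which the abstract does not detail. The `Op` class, the `"sparse"`/`"dense"` labels, the `partition` function, and the toy graph are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operator in a toy model graph (hypothetical structure)."""
    name: str
    kind: str            # assumed label: "sparse" (CPU/memory-heavy) or "dense" (GPU-heavy)
    inputs: tuple = ()   # names of the operators that feed this one


def partition(ops):
    """Place sparse ops on CPU nodes (CN) and dense ops on GPU nodes (HN).

    Returns the placement map and the list of edges that cross the
    CN/HN boundary, i.e. tensors that would travel over the network.
    This is a naive label-based split, not an optimal partitioning.
    """
    placement = {op.name: ("CN" if op.kind == "sparse" else "HN") for op in ops}
    cut_edges = [
        (src, op.name)
        for op in ops
        for src in op.inputs
        if placement[src] != placement[op.name]
    ]
    return placement, cut_edges


# Toy DLRM-like graph: embedding lookups feed a dense MLP.
toy_model = [
    Op("user_emb", "sparse"),
    Op("item_emb", "sparse"),
    Op("concat", "sparse", ("user_emb", "item_emb")),
    Op("mlp", "dense", ("concat",)),
    Op("score", "dense", ("mlp",)),
]

placement, cut_edges = partition(toy_model)
print(placement)
print(cut_edges)
```

In this toy graph only the `concat` to `mlp` edge crosses the boundary. A real partitioner would also weigh tensor sizes and latency targets when choosing where to cut, which this sketch ignores.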

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.

BibTeX

@inproceedings {306025,
author = {Lingyun Yang and Yongchen Wang and Yinghao Yu and Qizhen Weng and Jianbo Dong and Kan Liu and Chi Zhang and Yanyi Zi and Hao Li and Zechao Zhang and Nan Wang and Yu Dong and Menglei Zheng and Lanlan Xi and Xiaowei Lu and Liang Ye and Guodong Yang and Binzhang Fu and Tao Lan and Liping Zhang and Lin Qu and Wei Wang},
title = {{GPU-Disaggregated} Serving for Deep Learning Recommendation Models at Scale},
booktitle = {22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25)},
year = {2025},
isbn = {978-1-939133-46-5},
address = {Philadelphia, PA},
pages = {847--863},
url = {https://www.usenix.org/conference/nsdi25/presentation/yang},
publisher = {USENIX Association},
month = apr
}
