Primus: Unified Training System for Large-Scale Deep Learning Recommendation Models

Jixi Shan; Xiuqi Huang; Yang Guo; Hongyue Mao; Ho-Pang Hsu; Hang Cheng; Can Wang; Jun Song; Rui Shi; Xiaofeng Gao; Jingwei Xu; Shiru Ren; Jiaxiao Zheng; Hua Huang; Lele Yu; Peng Xu; Guihai Chen

Authors:

Jixi Shan, ByteDance Inc.; Xiuqi Huang, Zhejiang University; Yang Guo, Hongyue Mao, Ho-Pang Hsu, Hang Cheng, Can Wang, and Jun Song, ByteDance Inc.; Rui Shi, ByteDance Inc.; Xiaofeng Gao, Shanghai Jiao Tong University; Jingwei Xu, Shiru Ren, Jiaxiao Zheng, Hua Huang, Lele Yu, and Peng Xu, ByteDance Inc.; Guihai Chen, Shanghai Jiao Tong University

Abstract:

The scale of deep learning recommendation models (DLRMs) continues to grow, demanding increasingly vast computing and storage resources. In production environments, improving training efficiency and effectiveness has become the primary goal in meeting the needs of numerous model training jobs under resource limitations. We introduce Primus, a training system that unifies training resources, data, and paradigms to support high-performance DLRM training at ByteDance. Specifically, ① Primus provides a unified abstraction of resources and interoperates with multiple scheduling systems, achieving a consistent training experience with horizontal and vertical dynamic scaling strategies across resource pools. ② Primus offers a unified three-tier data definition and employs a data task graph generation approach to support data orchestration of multi-source training samples composed of batch and stream data. ③ Primus devises a new hybrid training paradigm for DLRMs that ensures high model timeliness by controlling parameter updates and applying fine-grained prioritization of mixed batch and stream data.

Primus has demonstrated its efficiency and effectiveness in handling large-scale, enterprise-grade DLRM training over five years of deployment at ByteDance. Evaluations demonstrate the benefits of Primus’s optimizations across resources, data, and paradigms. First, dynamic scaling reduces training cost by 17.1% at the cluster level and increases CPU utilization from 50% to 80% per job. Second, data orchestration accelerates task generation by 23× and achieves higher training throughput. Finally, applying the hybrid training paradigm to 4 different DLRMs increases advertising revenue by 0.4% to 2.4%.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
