Junyi Zhang, Chuanhu Ma, Xiong Wang, and Yuntao Nie, Huazhong University of Science and Technology; Yuqing Li, Wuhan University; Yuedong Xu, Fudan University; Xiaofei Liao, Huazhong University of Science and Technology; Bo Li, Hong Kong University of Science and Technology; Hai Jin, Huazhong University of Science and Technology
Scaling laws indicate that increasing model size enhances performance. The Mixture-of-Experts (MoE) architecture enables scaling model parameters to trillions while requiring only a sub-linear increase in training computation. However, the sparse activation of experts within MoE leads to substantial All-to-All communication and imbalanced computation workloads, which in turn can severely degrade training efficiency. In this paper, we develop PopFetcher, a scalable MoE training system with popularity-aided expert-wise prefetching, to address these communication and computation bottlenecks. Specifically, PopFetcher uncovers skewed and correlated patterns in expert selection, and implements a lightweight sliding-window technique to accurately predict expert popularity. As a result, PopFetcher dynamically identifies high-demand experts and prefetches those of the next layer during the execution of the current non-MoE computations, thereby exploiting idle network links to reduce the number of tokens dispatched in upcoming All-to-All communications. PopFetcher rigorously formulates the end-to-end training latency and develops a tailored pruning strategy to derive the globally optimal prefetching scheme, which restores both communication and computation balance based on the underlying network infrastructure. By prioritizing the All-to-All data stream during the backward pass, PopFetcher significantly alleviates communication blockage. Extensive experiments conducted on GPU clusters demonstrate that PopFetcher outperforms existing state-of-the-art systems, reducing training time by 15%-94.5%.
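To make the sliding-window popularity idea concrete, the following is a minimal illustrative sketch, not PopFetcher's actual implementation: it assumes the gating network's per-step token counts are available, and the names (`SlidingWindowPopularity`, `window_size`, `top_k`) are hypothetical choices for this example only.

```python
# Illustrative sketch: estimate per-expert popularity from a sliding window of
# recent gating decisions and pick the likely "hot" experts for the next layer.
from collections import deque

import numpy as np


class SlidingWindowPopularity:
    """Track how many tokens each expert received over the last few steps and
    predict which experts will be in high demand at the next MoE layer."""

    def __init__(self, num_experts: int, window_size: int = 8):
        self.num_experts = num_experts
        # Each entry is a length-num_experts vector of token counts for one step.
        self.window = deque(maxlen=window_size)

    def update(self, tokens_per_expert: np.ndarray) -> None:
        """Record the token counts produced by the gate for one training step."""
        self.window.append(tokens_per_expert.astype(np.float64))

    def predict(self) -> np.ndarray:
        """Estimate next-step popularity as each expert's share of windowed tokens."""
        counts = np.sum(self.window, axis=0)
        return counts / max(counts.sum(), 1.0)

    def hot_experts(self, top_k: int) -> list[int]:
        """Indices of the top-k most popular experts, i.e. candidates a prefetcher
        could replicate locally before the next All-to-All dispatch."""
        return list(np.argsort(-self.predict())[:top_k])


# Example: 8 experts with routing skewed toward experts 0 and 3.
tracker = SlidingWindowPopularity(num_experts=8, window_size=4)
for _ in range(4):
    tracker.update(np.array([120, 10, 15, 90, 5, 8, 4, 4]))
print(tracker.hot_experts(top_k=2))  # -> [0, 3]
```

In this sketch, prefetching the predicted hot experts' parameters while non-MoE computation runs would let the corresponding tokens be processed locally, shrinking the subsequent All-to-All volume; how PopFetcher selects and schedules these prefetches is governed by its latency formulation and pruning strategy described in the paper.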
