Yaoyu Wang, Xiao Guo, Junmin Xiao, De Chen, and Guangming Tan, SKLP, Institute of Computing Technology, CAS; and University of Chinese Academy of Sciences
The rapid growth of generative model parameters poses challenges for deployment, especially regarding weight storage and inference latency. Weight pruning is an effective technique for reducing the computational and memory overhead of Large Language Models (LLMs) while maintaining accuracy; it transforms the model's dense matrix multiplications into Sparse Matrix Multiplication (SpMM) computations. However, diverse pruning methods introduce varying sparsity patterns that make high-performance SpMM on GPUs challenging. Existing solutions are limited in their adaptability to these patterns, their flexibility across sparsity levels, and their support for efficient optimizations.
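To make the matmul-to-SpMM transformation concrete, the sketch below shows a naive CUDA kernel that multiplies a pruned weight matrix W, stored in CSR form, by a dense activation matrix X. This is a minimal illustration of the computation the abstract refers to, not GeneralSparse's kernel; all names and the launch configuration are hypothetical.

```cuda
// Naive CSR SpMM: Y (M x N) = W (M x K, sparse CSR) * X (K x N, dense).
// One thread computes one element of Y; pruned (zero) weights are skipped
// entirely because only W's nonzeros are stored.
// Hypothetical launch: csr_spmm<<<dim3(M, (N + 127) / 128), 128>>>(...)
__global__ void csr_spmm(int M, int N,
                         const int *rowPtr,   // CSR row offsets, length M+1
                         const int *colIdx,   // column index per nonzero
                         const float *vals,   // value per nonzero of W
                         const float *X,      // dense K x N, row-major
                         float *Y)            // dense M x N, row-major
{
    int row = blockIdx.x;
    int col = blockIdx.y * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            acc += vals[j] * X[colIdx[j] * N + col];
        Y[row * N + col] = acc;
    }
}
```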
In this work, we present GeneralSparse, a novel solution that bridges this gap by leveraging an abstraction of memory-access and reduction spaces. GeneralSparse introduces a box-dividing process that adapts dynamically to diverse pruning patterns, and proposes hierarchical reduction algorithms tailored to GPU hierarchies. In evaluations on pruned LLM weight matrices and the SuiteSparse collection, GeneralSparse achieves up to a 20.82× speedup over the cuSPARSE library. For end-to-end LLM inference, GeneralSparse achieves up to a 2.33× speedup over its counterparts.
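The abstract does not detail the reduction algorithms themselves; as a generic illustration of what "hierarchical reduction tailored to GPU hierarchies" typically builds on, the sketch below reduces partial sums at three levels: warp shuffles, then shared memory across warps, then a grid-level atomic. This is the standard CUDA pattern, not GeneralSparse's algorithm, and all names are hypothetical. It assumes blockDim.x is a multiple of 32 and at most 1024.

```cuda
// Level 1: reduce within a warp using register shuffles (no memory traffic).
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Levels 2 and 3: combine per-warp results through shared memory,
// then accumulate one value per block into global memory.
__global__ void hierarchical_reduce_sum(const float *in, float *out, int n) {
    __shared__ float partial[32];                  // one slot per warp
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];
    v = warp_reduce_sum(v);                        // level 1: within a warp
    if ((threadIdx.x & 31) == 0)
        partial[threadIdx.x >> 5] = v;             // warp leaders write out
    __syncthreads();
    if (threadIdx.x < 32) {                        // level 2: across warps
        int numWarps = blockDim.x >> 5;
        v = (threadIdx.x < numWarps) ? partial[threadIdx.x] : 0.0f;
        v = warp_reduce_sum(v);
        if (threadIdx.x == 0)
            atomicAdd(out, v);                     // level 3: across blocks
    }
}
```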