



### Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Yanhao Chen<sup>1</sup>, Ari B. Hayes<sup>1</sup>, Chi Zhang<sup>2</sup>, Timothy Salmon<sup>1</sup>, Eddy Z. Zhang<sup>1</sup>

Rutgers University
University of Pittsburgh





#### **Sparse Matrix**

- Sparse Linear Systems
  - CG - GMRES
- Physics Based Simulations
  - CFD
- Deep Learning Optimizations
  - Pruned Neural Networks

. . . . . .





# **Sparse Matrix Operation** $\mathbf{y} = \mathbf{A} \times \mathbf{y}$ Input Vector **Output Vector Sparse Matrix** $y_i = reduce \{ A_{ik} \odot x_k \}$ $i \in [1, ..., m], k \in [1, ..., n] \uparrow$ **Binary Operator**





### **Sparse Matrix Vector Multiplication (SpMV)**

 $y_i = reduce\{A_{ik} \odot x_k\}$ 

 $\boldsymbol{x_1}$  $A_{1,1} | A_{1,2} | A_{1,3} | A_{1,4}$  $y_1$ *reduce* = sum  $\boldsymbol{x_2}$  $y_2$ 0 0 0 0 \* | A<sub>3,3</sub> |  $\boldsymbol{x_3}$  $y_3$ *A*<sub>3,1</sub> 0 0  $x_4$ *y*<sub>4</sub>  $A_{4,2}$ 0 0 *A*<sub>4.4</sub>  $y_i = \mathbf{sum}\{A_{i,k} * x_k\}$ X Α У  $i \in [1, ..., m], k \in [1, ..., n]$  $y_3 = A_{3,1} * x_1 + A_{3,3} * x_3$ 



#### Single Source Shortest Path (SSSP)



2018 USENIX Annual Technical Conference

JTGERS





### **Problem with Sparsity on GPUs**

• Low data reuse is always a big problem

$$y_i = \mathbf{sum}\{A_{i,k} * x_k\}$$

- e.g. SpMV
  - The input vector and the output vector can be reused a lot
  - They are usually too large to fit into GPU's cache
  - The **sparsity** of the matrix causes irregular accesses of the vectors
  - This means low reuse of the data in the cache





#### **Existing Methods to Improve Data Reuse on GPUs**

#### Warp Scheduling Policy

- Throttling concurrent threads
- Limits the number of active warps [Rogers+, MICRO'12]
- DYNCTA: controls the number of CTAs [Kayiran+,PACT'13]

#### **Need Hardware Modification!**

#### Computation and Data Layout Transformation

- Reduce irregular memory accesses
- Improve Memory Coalescing [Zhang+, ICS'10]

#### **Only focus on Spatial Data Reuse!**





### Our Throttling framework for GPUs...

- Is the First Software Throttling implementation
- Is focused on Temporal Data Reuse
- Exploits the Trade-off between throttling performance and GPU throughput











Matrix A is bypassing the cache











Matrix A is bypassing the cache

$$\{ < x_1 \ y_1 > < x_2 \ y_1 > < x_3 \ y_1 > < x_4 \ y_1 > \\ < x_1 \ y_3 > < x_3 \ y_3 > < x_2 \ y_4 > < x_4 \ y_4 > \}$$

Running at one time





Matrix A is bypassing the cache



Cache Capacity: 4



$$\{ < x_1 \ y_1 > < x_2 \ y_1 > < x_3 \ y_1 > < x_4 \ y_1 > \\ < x_1 \ y_3 > < x_3 \ y_3 > < x_2 \ y_4 > < x_4 \ y_4 > \}$$

#### Running at one time





Time

### **SpMV with Software Throttling**



Cache Capacity: 4



| $< x_1 y_1 > < x_1 y_3 > < x_3 y_1 >$ |  |
|---------------------------------------|--|
| $< x_2 y_1 > < x_2 y_4 > < x_4 y_1 >$ |  |









Cache Capacity: 4



 $< x_1 y_1 > < x_1 y_3 > < x_3 y_1 > < x_3 y_3 >$  $< x_2 y_1 > < x_2 y_4 > < x_4 y_1 > < x_4 y_4 >$ 

0

 $A_{3.1}$ 

0

#### $A_{1.1} | A_{1.2} | A_{1.3} | A_{1.4}$ $x_1$ *y*<sub>1</sub> *y*<sub>2</sub> $x_2$ 0 0 0 Х = 0 |A<sub>3,3</sub>| $x_3$ $y_3$ 0 0 | A<sub>4,4</sub> | **y**<sub>4</sub> $x_4$ A<sub>4.2</sub> Х Υ Α

Time

#### All Data Can Fit into the Cache





Time

### SpMV with Software Throttling



Cache Capacity: 4

2018 USENIX Annual Technical Conference



0

 $A_{3.1}$ 

0

0

0

 $A_{4.2}$ 

Α







Time

### SpMV with Software Throttling



Cache Capacity: 4



 $< x_1 y_1 > < x_1 y_3 > < x_3 y_1 > < x_3 y_3 >$  $< x_2 y_1 > < x_2 y_4 > < x_4 y_1 > < x_4 y_4 >$ 

# All Data Can Fit into the Cache





- An effective partitioning algorithm
  - All data items in one partition can fit into the cache
  - The interaction between different partitions are minimized
- Applicable scheduling methods
  - Handle the trade-off between throttling and throughput

FGERS



- An effective partitioning algorithm
  - All data items in one partition can fit into the cache
  - The interaction between different partitions are minimized
- Applicable scheduling methods
  - Handle the trade-off between throttling and throughput

TGERS





#### **Graph Representation**

- Graph Edge Partition Model
  - Places an emphasis on Data
  - Node  $\rightarrow$  Data object
  - Edge  $\rightarrow$  Interaction between data









### Why Edge Partition Model?

- 1. Better load balancing
  - PowerGraph [OSDI'12], Streaming Edge Partition [KDD'14], SPAC [SIGMETRICS'17]
    - Balanced vertex partition is sometimes **NOT equal** to balanced workload
- 2. **Quantifying** the communication cost
- 3. Applies to a large class of parallel applications
  - N-body, CFD, Sparse Linear Algebra, Graph Analytics, ...





X

# Nodes: 4

#### Different Edge Partition Models





**Cache-Fit** 

**Partition** 



TGERS

- Recursive **Bisection** Framework
  - 2-way Load Balanced Edge Partition
    - **SPAC** [Li+,SIGMETRICS'17]
  - Minimize vertex replica (data copy)







TGERS

- Recursive **Bisection** Framework
  - 2-way Load Balanced Edge Partition
    - **SPAC** [Li+,SIGMETRICS'17]
  - Minimize vertex replica (data copy)







- Recursive **Bisection** Framework
  - 2-way Load Balanced Edge Partition
    - **SPAC** [Li+,SIGMETRICS'17]
  - Minimize vertex replica (data copy)







- Recursive **Bisection** Framework
  - 2-way Load Balanced Edge Partition
    - **SPAC** [Li+,SIGMETRICS'17]
  - Minimize vertex replica (data copy)







- Recursive **Bisection** Framework
  - 2-way Load Balanced Edge Partition
    - **SPAC** [Li+,SIGMETRICS'17]
  - Minimize vertex replica (data copy)







### What We Need for Software Throttling

- A good partitioning algorithm
  - All data items in one partition can fit into the cache
  - The interaction between different partitions are minimized
- Applicable scheduling methods
  - Handle the trade-off between throttling and throughput





#### **Effective scheduling methods**

- Four different scheduling methods
  - Cache-Fit (CF)
  - Cache-Fit Queue (CF-Q)
  - Split-Join (SJ)
  - Split-Join Queue (SJ-Q)





#### **Effective scheduling methods**

- Four different scheduling methods
  - Cache-Fit (CF)
  - Cache-Fit Queue (CF-Q)
  - Split-Join (SJ)
  - Split-Join Queue (SJ-Q)





### Cache-Fit (CF) Scheduling

- Isolate the computation of different Cache-Fit Partitions
- Run one Cache-Fit Partition at one time

**CUDA** Function

*Original:* Kernel<<<blocknum, blockdim>>(**TL**, **N**);

- *Phase 1:* Kernel<<<blocknum, blockdim>>(**TL'**[0], **P**<sub>0</sub>);
- Phase 2: Kernel<<<blocknum, blockdim>>(TL'[1], P1);
- *Phase k:* Kernel<<<blocknum, blockdim>>(**TL'**[k], **P**<sub>k</sub>);

TL: tuple list N: # of tuples TL': new tuple list P<sub>i</sub>: # of tuples in TL[i]

**Strict** 

**Barriers** 





#### **Low Pipeline Utilization**



#### Cache Capacity: 4

#### 4 Working Threads



#### Kernel 1 -- Partition A





#### Low Throughput



#### 4 Working Threads



Cache Capacity: 4

Kernel 2 -- Partition B





#### **Low Pipeline Utilization**



#### Cache Capacity: 4

4 Working Threads

## low pipeline utilization



#### Kernel 3 -- Partition C





#### **Effective scheduling methods**

- Four different scheduling methods
  - Cache-Fit (CF)
  - Cache-Fit Queue (CF-Q)
  - Split-Join (SJ)
  - Split-Join Queue (SJ-Q)



- Invoke a single kernel call but still enable throttling
- Set up a FIFO queue

TGERS

- Each entry corresponds to a **chunk** 
  - A chunk is part of a cache-fit partition
  - Adjacent chunks are from the same Cache-Fit Partition
- Each warp fetches a chunk from the queue and processes it











#### Cache-Fit Queue (CF-Q) Scheduling cont.







# **Effective scheduling methods**

- Four different scheduling methods
  - Cache-Fit (CF)
  - Cache-Fit Queue (CF-Q)
  - Split-Join (SJ)
  - Split-Join Queue (SJ-Q)





# Split-Join (SJ) Scheduling

- Dynamically merge Cache-Fit Partitions
- Perform an Online Profiling to decide which partitions should be merged
- Use the Tree Representation of the data balanced partition to help the Online Profiling





# Split-Join (SJ) Scheduling



Cache Capacity: 4





# Split-Join (SJ) Scheduling







# Split-Join (SJ) Scheduling All.stime > A.btime + BC.btime





- Four different scheduling methods
  - Cache-Fit (CF)

JTGERS

- Cache-Fit Queue (CF-Q)
- Split-Join (SJ)
- Split-Join Queue (SJ-Q)



- Provide strict barriers between different merged partitions
- No barriers inside a merged partition of SJ
  - No guarantee of the execution order
- Set up one FIFO queue for each merged partition
  - Provide relaxed barriers between cache-fit partitions from the same merged partition

TGERS





## Four Scheduling Methods Summary

| Methods | Pipeline<br>Utilization | Profiling | Barriers         | Queue | Code<br>Change |
|---------|-------------------------|-----------|------------------|-------|----------------|
| CF      | Low                     | No        | Strict           | No    | No             |
| CF-Q    | High                    | No        | Relaxed          | Yes   | Yes            |
| SJ      | High                    | Yes       | Strict           | No    | No             |
| SJ-Q    | High                    | Yes       | Strict / Relaxed | Yes   | Yes            |





# **Software Throttling Performance**

• Experiment Settings

| GPU Model    | Titan X               | GTX 745               |  |
|--------------|-----------------------|-----------------------|--|
| Architecture | Pascal                | Maxwell               |  |
| Core #       | 5376                  | 576                   |  |
| L2 Cache     | 3MB                   | 2MB                   |  |
| CUDA Version | CUDA 8.0              | CUDA 8.0              |  |
| CPU          | Intel Xeon<br>E5-2620 | Intel Core<br>i7-4790 |  |





#### **Benchmarks**

- Sparse Linear Algebra Workloads
  - Sparse Matrix Vector Multiplication (SPMV)
- Graph Processing Workloads
  - Bellman-Ford (BMF)
  - Page Rank (PR)
- Neural Network Optimization
  - Deep Compression: Pruned AlexNet



Baseline: CUSP

TGERS

- Matrices: Florida Sparse Matrix Collection
  - Focus on large matrices: working set cannot fit into L2 cache
  - 228 large matrices on GTX 745
  - 192 large matrices on Titan X





#### **Overall SPMV Performance**







#### **Effect of Cache Hit Rate**







## Effect of Working Set Size







#### **Graph Application Performance**

BMF & PR Performance using SJ w/ overhead



<sup>2018</sup> USENIX Annual Technical Conference





#### **Graph Application Performance cont.**







# **Deep Learning Benchmark**

- Deep Compression [Han+,ICLR'16]
  - Prune AlexNet to remove low weight elements in fully connected layers
  - Deep Compression provide us sparse matrices







#### Conclusion

- We proposed the first locality-aware Software Throttling framework for GPUs
- Our framework can increase data reuse by improving Temporal Locality
- We exploited the **Trade-off** between cache performance and pipeline utilization





# **Questions?**