

# Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems

Vishakha Gupta, Karsten Schwan @ Georgia Tech; Niraj Tolia @ Maginatics; Vanish Talwar, Parthasarathy Ranganathan @ HP Labs

USENIX ATC 2011 – Portland, OR, USA

## **Increasing Popularity of Accelerators**















C-like CUDA-based applications (host portion)

**Proprietary** NVIDIA Driver and CUDA runtime

- Memory management
- Communication with device
- Scheduling logic
- Binary translation














**Design flaw:** The bulk of the logic lives in drivers, which were meant for simple operations such as reads, writes and interrupt handling

**Shortcoming:** Inaccessibility and a one-scheduling-fits-all approach



### 2010

- Amazon EC2 adopts GPUs
- Other cloud offerings by AMD, NVIDIA

### 2011

- Tegras in cellphones
- HPC GPU Cluster (Keeneland)
- Most applications fail to occupy GPUs completely
  - With the exception of extensively tuned (e.g. supercomputing) applications
- Expected utilization of GPUs across applications in some domains "may" follow patterns to allow sharing

**Need for accelerator sharing:** resource sharing is now supported in NVIDIA's Fermi architecture

**Concern:** Can driver scheduling do a good job?



# NVIDIA GPU Sharing – Driver Default



- Xeon quad-core with two NVIDIA 8800 GTX GPUs, driver 169.09, CUDA SDK 1.1
- Coulomb Potential [CP] benchmark from the Parboil benchmark suite
- Result of sharing two GPUs among four instances of the application

**Driver can:** efficiently implement computation and data interactions between host and accelerator

**Limitations:** Call ordering suffers when sharing. Any scheme used is static and cannot adapt to different system expectations



### Accelerators as first class citizens

- Why treat such powerful processing resources as devices?
- How can such heterogeneous resources be managed especially with evolving programming models, evolving hardware and proprietary software?

### Sharing of accelerators

- Are there efficient methods to utilize a heterogeneous pool of resources?
- Can applications share accelerators without a big hit in efficiency?
- Coordination across different processor types
  - How do you deal with multiple scheduling domains?
  - Does coordination obtain any performance gains?





(Demonstrated through x86-NVIDIA GPU-based systems)

It leverages new opportunities presented by increased adoption of virtualization technology in commercial, cloud computing, and even high performance infrastructures.

(Virtualization provided by the Xen hypervisor and the Dom0 management domain)



# ACCELERATORS AS FIRST CLASS CITIZENS





[Figure: platform diagram. Labels: VM, CUDA Runtime + Driver, Hypervisor (Xen), general purpose multicores, compute accelerators (NVIDIA GPUs), traditional devices]





Hence, we define an "accelerator" virtual CPU or aVCPU

## **First Class Citizens**



- The aVCPU has execution context on both the CPU (polling thread, runtime, driver context) and the GPU (CUDA kernel)
- It also holds the data used by these calls

**VCPU:** first class schedulable entity on a physical CPU

**aVCPU:** first class schedulable entity on a GPU (with a CPU component due to the execution model)

Manageable pool of heterogeneous resources
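As a rough illustration of what an aVCPU bundles together, the sketch below models it as a record with a CPU-side context, a GPU-side context and the data used by its calls. All class and field names here are hypothetical and chosen for readability; they are not taken from the Pegasus implementation.

```python
from dataclasses import dataclass, field


@dataclass
class CpuContext:
    """CPU-side state of an aVCPU (hypothetical fields)."""
    polling_thread_id: int          # thread that polls for guest CUDA calls
    driver_context: str = "unset"   # placeholder for runtime/driver state


@dataclass
class GpuContext:
    """GPU-side state of an aVCPU (hypothetical fields)."""
    gpu_id: int
    pending_kernels: list = field(default_factory=list)  # CUDA kernels awaiting launch


@dataclass
class AVCPU:
    """Schedulable entity on a GPU, with a CPU component."""
    domain_id: int
    cpu: CpuContext
    gpu: GpuContext
    call_data: list = field(default_factory=list)  # data used by the queued calls


avcpu = AVCPU(domain_id=1,
              cpu=CpuContext(polling_thread_id=42),
              gpu=GpuContext(gpu_id=0))
avcpu.gpu.pending_kernels.append("vector_add")
```

Treating the record as a single schedulable unit is what lets a scheduler reason about the CPU and GPU halves of a guest's accelerator work together.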



# **SHARING OF ACCELERATORS**






**RR:** Fair share. aVCPUs are given equal time slices and scheduled in a circular fashion
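The round-robin policy above can be sketched in a few lines. This is a minimal illustration, assuming aVCPUs are identified by name and that the time slice is a fixed number of milliseconds; the function name and the 30 ms value are placeholders, not values from the talk.

```python
from collections import deque


def round_robin(avcpus, time_slice_ms, cycles):
    """Return a list of (aVCPU, slice) pairs in circular order.

    Every aVCPU receives the same slice, once per cycle.
    """
    queue = deque(avcpus)
    schedule = []
    for _ in range(cycles * len(queue)):
        current = queue.popleft()      # take the aVCPU at the head
        schedule.append((current, time_slice_ms))
        queue.append(current)          # and move it to the tail
    return schedule


schedule = round_robin(["aVCPU-1", "aVCPU-2", "aVCPU-3"], time_slice_ms=30, cycles=2)
```

Each aVCPU appears once per cycle with the same slice, regardless of how much GPU work it has queued.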




**Too fine:** per-call granularity



**XC:** Adopt Xen credit scheduling for aVCPU scheduling. E.g. if VMs 1, 2 and 3 have 256, 512 and 1024 credits, they get 1, 2 and 4 time ticks respectively in every scheduling cycle
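The credit-to-tick arithmetic in the example above can be written out as follows. This sketch assumes one time tick per 256 credits, which reproduces the 1, 2, 4 split quoted above; the floor of one tick per VM and both function names are illustrative assumptions.

```python
def ticks_per_cycle(cpu_credits, credits_per_tick=256):
    """Map each VM's Xen CPU credits to GPU time ticks per scheduling cycle.

    Assumption: every VM gets at least one tick so that none is starved.
    """
    return {vm: max(1, credits // credits_per_tick)
            for vm, credits in cpu_credits.items()}


def build_cycle(ticks):
    """Expand the per-VM tick counts into one scheduling cycle."""
    cycle = []
    for vm, count in ticks.items():
        cycle.extend([vm] * count)
    return cycle


ticks = ticks_per_cycle({"VM1": 256, "VM2": 512, "VM3": 1024})
cycle = build_cycle(ticks)
```

With these inputs `ticks` is `{"VM1": 1, "VM2": 2, "VM3": 4}`, so one cycle contains seven ticks.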







**AccC:** Instead of using the assigned VCPU credits for scheduling aVCPUs, define new accelerator credits. These could be some fraction of the CPU credits
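A small sketch of deriving accelerator credits as a fraction of CPU credits. The per-VM fractions below are made-up values standing in for whatever a profiling step would supply, and the function name is hypothetical.

```python
def accelerator_credits(cpu_credits, gpu_fraction, default_fraction=1.0):
    """Scale each VM's CPU credits by a per-VM fraction to get accelerator credits.

    VMs without an entry in gpu_fraction keep their full CPU credits.
    """
    return {vm: int(credits * gpu_fraction.get(vm, default_fraction))
            for vm, credits in cpu_credits.items()}


acc = accelerator_credits({"VM1": 256, "VM2": 512, "VM3": 1024},
                          {"VM1": 0.5, "VM3": 0.25})
```

Here VM3 holds the most CPU credits but only a quarter of them count toward GPU time, which decouples accelerator share from CPU share.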



**Too coarse:** per-application granularity





# Performance Improves but Still High Variation



**Still high variation:** due to the hidden driver and runtime

**Coordination:** Can we do better?

# COORDINATION ACROSS SCHEDULING DOMAINS



## Coordinating CPU-GPU Scheduling

### Hypervisor co-schedule [CoSched]

- Hypervisor scheduling determines which domain should run on a GPU depending on the CPU schedule
- Latency reduction by occasional unfairness
- Possible waste of resources e.g. if domain picked for GPU has no work to do

### Augmented credit [AugC]

- Scan the hypervisor CPU schedule to temporarily boost credits of domains selected for CPUs
- Pick domain(s) for GPU(s) based on GPU credits + remaining CPU credits from hypervisor (augmenting)
- Throughput improvement by temporary credit boost
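The difference between the two coordination schemes can be sketched as two selection functions. This is a simplified illustration: the dictionaries, domain names and credit values are hypothetical, and restricting AugC to domains with pending GPU work is an assumption made here, not something the talk specifies.

```python
def cosched_pick(cpu_schedule, num_gpus):
    """CoSched: GPUs follow the CPU schedule.

    Domains currently selected for CPUs get the GPUs, whether or not
    they have GPU work queued.
    """
    return cpu_schedule[:num_gpus]


def augc_pick(gpu_credits, remaining_cpu_credits, cpu_schedule, has_gpu_work, num_gpus):
    """AugC: rank domains by GPU credits, augmented for domains on CPUs."""
    def effective(dom):
        # Temporary boost: only domains selected for CPUs are augmented.
        boost = remaining_cpu_credits.get(dom, 0) if dom in cpu_schedule else 0
        return gpu_credits[dom] + boost

    # Assumption: skip domains with no pending GPU work.
    candidates = [d for d in gpu_credits if has_gpu_work.get(d, False)]
    return sorted(candidates, key=effective, reverse=True)[:num_gpus]


gpu_credits = {"dom1": 256, "dom2": 256, "dom3": 512}
remaining_cpu = {"dom1": 300, "dom2": 100, "dom3": 0}
cpu_schedule = ["dom1", "dom2"]
has_work = {"dom1": True, "dom2": False, "dom3": True}

cosched = cosched_pick(cpu_schedule, num_gpus=2)
augc = augc_pick(gpu_credits, remaining_cpu, cpu_schedule, has_work, num_gpus=2)
```

In this example CoSched hands a GPU to `dom2` even though it has nothing to run, which is the wasted-resource case noted above, while AugC selects `dom1` and `dom3`.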



# Coordination Further Improves Performance



**Coordination:** Aligning the CPU and GPU portions of an application to run almost simultaneously reduces variation and improves performance

## **Pegasus Scheduling Policies**

- No coordination:
  - Default GPU driver-based base case (None)
  - Round Robin (RR)
  - AccCredit (AccC): credits based on static profiling
- Coordination based:
  - XenCredit (XC): use Xen CPU credits
  - SLA feedback based (SLAF)
  - Augmented Credit based (AugC): temporarily augment credits for co-scheduling
- Controlled:
  - HypeControlled or co-scheduled (CoSched)

[Axis labels: Scheduling simplicity; Increasing coordination]



















### **Testbed Details**

- Xeon 4-core @ 3 GHz, 3 GB RAM, 2 NVIDIA G92-450 GPUs
- Xen 3.2.1 stable, Fedora 8 Dom0 and DomU running Linux kernel 2.6.18, NVIDIA driver 169.09, SDK 1.1
- Guest domains mostly given 512 MB memory and 1 core
  - Pinned to different physical cores
  - Launched almost simultaneously: worst-case measurement due to maximum load
- Data currently sampled over 50 runs for statistical significance despite driver/runtime variation
- Scheduling plots report h-spread with min-max over 85% of readings, or total work done over all runs in an experiment
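For readers unfamiliar with the h-spread used in the plots: it is the distance between Tukey's lower and upper hinges, which is close to the interquartile range. The sketch below computes it for a list of run times; the function name is illustrative, and the min-max over 85% of readings is not reproduced here because the talk does not say how that subset is chosen.

```python
from statistics import median


def h_spread(samples):
    """Return (lower hinge, upper hinge, h-spread) using Tukey's hinges.

    The hinges are the medians of the lower and upper halves of the
    sorted data; the overall median belongs to both halves when the
    number of samples is odd.
    """
    data = sorted(samples)
    n = len(data)
    half = (n + 1) // 2
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    return lower_hinge, upper_hinge, upper_hinge - lower_hinge


lower, upper, spread = h_spread([1, 2, 3, 4, 5, 6, 7, 8])
```

Because the hinges ignore the tails, the h-spread is less sensitive to the occasional outlier run than a plain min-max range.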



| Category         | Source      | Benchmarks                                                                   |
|------------------|-------------|------------------------------------------------------------------------------|
| Financial        | SDK         | Binomial (BOp), BlackScholes (BS), MonteCarlo (MC)                           |
| Media processing | SDK/Parboil | ProcessImage (PI) = matrix multiply + DXTC, MRIQ, FastWalshTransform (FWT)   |
| Scientific       | Parboil     | CP, TPACF, RPES                                                              |

- Diverse benchmarks from different application domains show (a) different throughput and latency constraints, (b) varying data and CUDA kernel sizes and (c) different numbers of CUDA calls
- BlackScholes is the worst in the set: throughput and latency sensitive due to its large number of CUDA calls (depending on iteration)
- Latency-sensitive FastWalshTransform: multiple computation kernel launches and large data transfer

### Ability to Achieve Low Virtualization Overhead



### Appropriate Scheduling is Important






Without resource management, calls can be variably delayed due to interference from other application(s)/domain(s), even in the absence of virtualization



#### Pegasus Scheduling: BlackScholes – Latency and throughput sensitive







- Pegasus approach efficiently virtualizes GPUs
- Coordinated scheduling is effective
  - Even basic accelerator request scheduling can improve sharing performance
  - While co-scheduling is really useful [CoSched], other methods can come close [AugC], keep up utilization and give desirable properties
- Scheduling lowers degree of variability caused by uncoordinated use of the NVIDIA driver.

**No single "best" scheduling policy:** there is a clear need for diverse policies geared to match different system goals and to account for different application characteristics



#### Conclusion

- We successfully virtualize GPUs to convert them into first class citizens
- Pegasus approach abstracts accelerator interfaces through CUDA-level virtualization
  - Devise scheduling methods that coordinate accelerator use with that of general purpose host cores
  - Performance evaluated on x86-GPU Xen-based prototype
- Evaluation with a variety of benchmarks shows
  - Need for coordination when sharing accelerator resources, especially for applications with high CPU-GPU coupling
  - Need for diverse policies when coordinating resource management decisions made for general purpose vs. accelerator cores



- Applicability: concepts apply to open as well as closed accelerators due to lack of integration with runtimes
  - Past experience with IBM Cell accelerator [Cellule]
  - Open architecture allows finer grained control of resources
- Toolchains: sophistication through integration
  - Instrumentation support from Ocelot [GTOcelot]
  - Improve admission control, load balancing and scheduling
- Heterogeneous platforms: Scheduling different personalities for a virtual machine [Poster session]
  - More generic problem where even processing resources on the same chip can be asymmetric
- Scale: Extensions to cluster-based systems with Shadowfax [VTDC '11]

### **Related Work**

- Heterogeneous and larger-scale systems [Helios], [MultiKernel]
- Scheduling extension [Cypress], [Xen Credit Scheduling], [QoS Adaptive Communication], [Intel Shared ISA Heterogeneity], [Cellular Disco]
- GPU Virtualization: [OpenGL], [VMWare DirectX], [VMGL], [vCUDA], [gVirtuS]
- Other related work
  - Accelerator Frontend or multi-core programming models: [CUDA],
    [Georgia Tech Harmony], [Georgia Tech Cellule], [OpenCL]
  - Some examples: [Intel Tolapai], [AMD Fusion], [LANL Roadrunner]
  - Application domains: [NSF Keeneland], [Amazon Cloud]
  - Interaction with higher levels: [PerformancePointsOSR]
  - Cluster level: [rCUDA], [Shadowfax]



# Thank you!



