### **Tiara: A Scalable and Efficient Hardware Acceleration Architecture for Stateful Layer-4 Load Balancing**

**Chaoliang Zeng**, Layong Luo, Teng Zhang, Zilong Wang, Luyang Li, Wenchen Han, Nan Chen, Lebing Wan, Lichao Liu, Zhipeng Ding, Xiongfei Geng, Tao Feng, Feng Ning, Kai Chen, Chuanxiong Guo





# L4 LB at datacenter boundary



[Figure omitted; surviving label: "Real Servers"]

# Being stateful



[Figure omitted; surviving label: "Real Servers"]

# Stateful L4 LB requirements

Driven by exponentially growing content-delivery and cloud-computing demand, a typical LB at a large service provider usually needs to support:

- Terabits per second of Internet traffic
- Tens of millions of concurrent flows
- Millions of new connections per second (CPS)

**Existing LBs fail to meet these requirements in a scalable and efficient way** 

# Existing solution: software-based LB



Software-based LBs can scale out to support high throughput, e.g., Ananta [SIGCOMM'13] and Maglev [NSDI'16].

### Low (cost, energy, and space) efficiency

- 10 Gbps/server or 2 Mpps/core
- 100 servers to support 1 Tbps
- → Expensive
- → Sometimes undeployable in resource-constrained PoPs or edge DCs

### High latency and jitter

- 10 us average latency
- Up to ms tail latency/jitter
- → Sometimes comparable to Internet latency when CPU utilization is high
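The scale-out arithmetic on this slide can be checked with a short sketch. The function name `servers_needed` is illustrative only; the 10 Gbps/server and 1 Tbps figures are the ones quoted above.

```python
import math


def servers_needed(target_gbps: float, per_server_gbps: float) -> int:
    """Return how many servers are needed to reach the target throughput."""
    if per_server_gbps <= 0:
        raise ValueError("per_server_gbps must be positive")
    # Round up: a fraction of a server still means one more machine.
    return math.ceil(target_gbps / per_server_gbps)


# 1 Tbps (1000 Gbps) at 10 Gbps per server, as quoted on the slide.
one_tbps = servers_needed(1000, 10)
print(one_tbps)  # 100
```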

# Existing solution: switch-based LB



Leveraging programmable switches can improve efficiency and latency, e.g., SilkRoad [SIGCOMM'17] and Cheetah [NSDI'20].

### Scalability issue on data plane

- 50-100 MB on-chip memory
- → Fails to support a large number of concurrent connections

### Scalability issue on control plane

- 100K entry insertions per second
  - low-end SoC
  - slow PCIe interconnect
  - Cuckoo hash
- → Fails to support high CPS
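A rough back-of-the-envelope sketch of both limits follows. The 16-byte entry size is a hypothetical value chosen only for illustration, and treating all on-chip memory as available to the connection table is an optimistic assumption. The 1M CPS target is the one stated under "System goals" in this deck.

```python
MB = 10**6  # decimal megabytes (assumption)


def max_connections(memory_bytes: int, entry_bytes: int) -> int:
    """Upper bound on table entries if all memory held connection entries."""
    return memory_bytes // entry_bytes


def cps_shortfall(required_cps: int, insert_rate: int) -> float:
    """How many times faster insertion would need to be to meet the target."""
    return required_cps / insert_rate


ENTRY_BYTES = 16  # hypothetical per-connection entry size, not from the slides

low = max_connections(50 * MB, ENTRY_BYTES)    # bound for 50 MB of memory
high = max_connections(100 * MB, ENTRY_BYTES)  # bound for 100 MB of memory
gap = cps_shortfall(1_000_000, 100_000)        # 1M CPS goal vs 100K inserts/s

print(low, high, gap)
```

Even under these optimistic assumptions, both bounds stay below the 10M concurrent connections that the deck sets as a goal, and the insertion rate is 10x short of 1M CPS.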

# Strawman solution: switch-server LB



Leveraging traffic locality can address the scalability issues of switches:

- Serve only a few elephant flows in the switch
- Serve the rest of the traffic in the server

Traffic locality assumption:

- Traffic does not necessarily follow a long-tail distribution.
- It is dynamic and unpredictable!

# Traffic at datacenter boundary

### The flow distribution of individual services varies

- Top 10% connections carry 46.3%, 35.5%, and 19.6% traffic in three traces respectively.
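The "top 10% of connections" metric can be computed from per-connection byte counts roughly as follows. This is a minimal sketch: the function name `top_share` and the input format (a list of byte counts, one per connection) are assumptions, not the authors' tooling.

```python
from typing import List


def top_share(flow_bytes: List[int], top_percent: int = 10) -> float:
    """Fraction of total traffic carried by the largest top_percent% of flows."""
    if not flow_bytes:
        raise ValueError("flow_bytes must not be empty")
    total = sum(flow_bytes)
    if total == 0:
        raise ValueError("total traffic must be positive")
    ordered = sorted(flow_bytes, reverse=True)
    # Number of flows in the top slice, rounded up, at least one flow.
    k = max(1, (len(ordered) * top_percent + 99) // 100)
    return sum(ordered[:k]) / total


# A skewed example: one large flow dominates.
skewed = top_share([100] + [10] * 9)
# A uniform example: the top 10% carries exactly 10% of the traffic.
uniform = top_share([10] * 10)
print(skewed, uniform)
```

A strongly long-tailed trace gives a share close to 1, while a value such as 19.6% means that caching only the largest flows covers little of the traffic.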



**Traffic distribution may not be long-tail!**

→ Limited memory in the switch cannot hold enough connections to serve the majority of traffic.

### The traffic volume of a service can dynamically change

- Tidal traffic in a single day.
- Long-term uncertainty due to changes in users' interests.

### The number of VIPs at a datacenter boundary can change over time

- The number of VIPs in a cluster can grow 3.2x in 6 months.

### No assumption on traffic distribution at datacenter boundary!

# System goals

Scalable – 10M concurrent connections and 1M CPS

Efficient – high cost, energy, and space efficiency

Generic – no assumption on traffic patterns



# Tiara idea

### LB functionalities

- **Real server selection**: stateful, memory-intensive
- **Packet en/decapsulation**: stateless, throughput-intensive
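The split between the two functionalities can be illustrated with a small sketch. It is not Tiara's implementation: the names `select_real_server` and `encapsulate`, the CRC32-based choice for new connections, and the tuple standing in for an outer header are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Tuple

# (src_ip, src_port, dst_ip, dst_port, protocol)
FiveTuple = Tuple[str, int, str, int, str]


def select_real_server(conn_table: Dict[FiveTuple, int],
                       flow: FiveTuple, num_servers: int) -> int:
    """Stateful step: reuse the recorded choice, or pick and record a new one."""
    rs_index = conn_table.get(flow)
    if rs_index is None:
        key = "|".join(map(str, flow)).encode()
        rs_index = zlib.crc32(key) % num_servers  # hypothetical selection policy
        conn_table[flow] = rs_index               # this state is what costs memory
    return rs_index


def encapsulate(payload: bytes, rs_table: List[str],
                rs_index: int) -> Tuple[str, bytes]:
    """Stateless step: wrap the packet with the chosen real server's address."""
    return (rs_table[rs_index], payload)


rs_table = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
conn_table: Dict[FiveTuple, int] = {}
flow = ("198.51.100.7", 40000, "203.0.113.1", 443, "tcp")

first = select_real_server(conn_table, flow, num_servers=4)
# Even if the pool size seen by new flows changes, the existing flow keeps its server.
again = select_real_server(conn_table, flow, num_servers=3)
outer_dst, inner = encapsulate(b"payload", rs_table, first)
```

The per-connection table is the only part that grows with the number of concurrent flows, while encapsulation needs just a small index-to-address table.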



# Tiara three-tier architecture



# Tiara architecture in detail



# Inbound traffic

















# Outbound traffic



# Optimizations

- Efficient hash table structure
  - To enable both fast lookup in T-NIC and fast entry insertion in T-server
  - Optimization for throughput, concurrent flow number, and CPS
- Lock-free offloading approach
  - To enable millions of flow offloading operations per second
  - Optimization for CPS
- Lightweight aging mechanism
  - To recycle outdated entries in FPGA HBM
  - Optimization for efficiency
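To make the aging idea concrete, here is a generic timestamp-based sketch. It is not the mechanism used in Tiara, whose details are not given on this slide. The class name `AgingConnTable`, the idle-timeout policy, and the explicit `now` arguments are assumptions for illustration.

```python
from typing import Dict, Hashable, Optional, Tuple


class AgingConnTable:
    """Connection table that recycles entries idle longer than a timeout."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        # flow key -> (real server index, time the flow was last seen)
        self._entries: Dict[Hashable, Tuple[int, float]] = {}

    def insert(self, flow: Hashable, rs_index: int, now: float) -> None:
        self._entries[flow] = (rs_index, now)

    def lookup(self, flow: Hashable, now: float) -> Optional[int]:
        entry = self._entries.get(flow)
        if entry is None:
            return None
        rs_index, _ = entry
        self._entries[flow] = (rs_index, now)  # a hit refreshes the timestamp
        return rs_index

    def sweep(self, now: float) -> int:
        """Remove entries idle for longer than the timeout; return the count."""
        stale = [flow for flow, (_, seen) in self._entries.items()
                 if now - seen > self.idle_timeout]
        for flow in stale:
            del self._entries[flow]
        return len(stale)


table = AgingConnTable(idle_timeout=10.0)
table.insert("flow-a", 1, now=0.0)
table.insert("flow-b", 2, now=0.0)
hit = table.lookup("flow-a", now=8.0)  # keeps flow-a alive
recycled = table.sweep(now=12.0)       # flow-b has been idle for 12 s
```

Recycling stale entries keeps memory occupancy bounded by the number of active flows rather than by every flow ever seen.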

# Prototype implementation

T-switch: Barefoot Tofino switch

- RS table: 64K entries

T-NIC: Xilinx FPGA-based SmartNIC with two 100GE ports and one HBM stack

- Connection table: 32M entries

T-server: Server with two Intel Xeon Platinum 8260 CPUs running a production SMux

- SMux CPS: 1.8M

Resource utilization:

| Component | Resource | Utilization |
|-----------|----------|-------------|
| T-switch  | SRAM     | 53.85%      |
| T-switch  | TCAM     | 13.19%      |
| T-NIC     | LUT      | 33.22%      |
| T-NIC     | FF       | 28.46%      |
| T-NIC     | BRAM     | 50.93%      |
| T-NIC     | URAM     | 36.72%      |

# System performance



# Latency-bounded throughput



# Tiara vs. existing approaches

|            | Throughput | P99 lat. | CPS  | CT size* | Cost efficiency         | Energy efficiency | Space efficiency |
|------------|------------|----------|------|----------|-------------------------|-------------------|------------------|
| SMux       | 38 Gbps    | 100 us   | 1.8M | ~100 GB  | 4.75 Gbps/(cost unit)   | 76 Mbps/Watt      | 19 Gbps/U        |
| Silkroad** | 1.6 Tbps   | < 2 us   | 200K | 100 MB   | 457.14 Gbps/(cost unit) | 2909.1 Mbps/Watt  | 1600 Gbps/U      |
| Tiara      | 1.6 Tbps   | < 4 us   | 1.8M | 4 GB     | 82.05 Gbps/(cost unit)  | 969.7 Mbps/Watt   | 320 Gbps/U       |

17.4x higher cost efficiency, 12.8x higher energy efficiency, and 16.8x higher space efficiency than the server-based solution (SMux)

9x higher CPS and 40x larger connection table size than the switch-based solution (Silkroad)
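The headline ratios can be reproduced from the table entries with a few divisions. The dictionary layout and key names below are illustrative, and connection table sizes are expressed in decimal MB as an assumption. Computed from the rounded table values, the cost ratio comes out near 17.3 rather than the quoted 17.4x; the small gap may come from rounding in the table entries.

```python
# Values copied from the comparison table; CT size in decimal MB (assumption).
systems = {
    "SMux":     {"cost": 4.75,   "energy": 76.0,   "space": 19.0,
                 "cps": 1_800_000, "ct_mb": 100_000},
    "Silkroad": {"cost": 457.14, "energy": 2909.1, "space": 1600.0,
                 "cps": 200_000,   "ct_mb": 100},
    "Tiara":    {"cost": 82.05,  "energy": 969.7,  "space": 320.0,
                 "cps": 1_800_000, "ct_mb": 4_000},
}


def ratio(metric: str, numerator: str, denominator: str) -> float:
    """Ratio of one system's metric to another's."""
    return systems[numerator][metric] / systems[denominator][metric]


# Tiara versus the server-based solution (SMux).
cost_gain = ratio("cost", "Tiara", "SMux")
energy_gain = ratio("energy", "Tiara", "SMux")
space_gain = ratio("space", "Tiara", "SMux")

# Tiara versus the switch-based solution (Silkroad).
cps_gain = ratio("cps", "Tiara", "Silkroad")
ct_gain = ratio("ct_mb", "Tiara", "Silkroad")

print(round(cost_gain, 2), round(energy_gain, 1), round(space_gain, 1),
      cps_gain, ct_gain)
```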

# Conclusion

Tiara is a three-tier hardware architecture for stateful L4 LB

- T-switch for stateless packet encap./decap.
- T-NIC for stateful real server selection
- T-server as the slow path, making offloading decisions

Tiara meets all design goals with high performance

- Scalable
  - Large HBM and efficient hash table for 10M concurrent flows
  - Fast PCIe DMA and lock-free offloading for 1M CPS
- Efficient
  - Specialized hardware for fast path
- Generic
  - No assumptions on traffic patterns; fully programmable architecture

#### Contact: czengaf@connect.ust.hk, luo@bytedance.com