

### Fast RDMA-based Ordered Key-Value Store using Remote Learned Cache

### Xingda Wei, Rong Chen, Haibo Chen







### KVS: key pillar for distributed systems

- Important building block for
- **9**: Databases, GraphStore
- **9**: Web applications
- **?**: Cloud infrastructures
- **9**: Serverless platforms









### KVS: key pillar for distributed systems



























### Server CPU is becoming the bottleneck

- Increasing **CPU-NIC** gap



[1] Credits: StRoM: Smart Remote Memory @ Eurosys'20



### **Opportunity: one-sided RDMA (Client-direct)** NIC directly reads/writes memory Server KVS B+Tree Offload index traversal to NIC **?** Totally bypass server CPU $(K_1, V_1)$ Key-values $Get(K_1) = V_1$ Network CPU RNIC RNIC





#### **Opportunity: one-sided RDMA (Client-direct)** NIC directly reads/writes memory Server KVS Offload index traversal to NIC B+Tree **?** Totally bypass server CPU $(K_1, V_1)$ Key-values $Get(K_1) = V_1$ Network RNIC CPU CPU RNIC





Challenge: limited NIC abstraction NIC only has *simple* abstractions **9**: e.g., memory *read/write* Works well for simple index structure **9**: e.g. *HashTable*, **0(1)** network RTT<sup>[1]</sup> Inferior for complex index structure **9**: e.g., **B+Tree**, **O(log(n))**<sup>[2]</sup> network RTT

[1] RTT: roundtrip time[2] n:the scale of the KVS





# Challenge: limited NIC abstraction Client Get(K<sub>1</sub>) CPU RNIC









### Challenge: limited NIC abstraction





# Existing systems adopt caching Client Get(K<sub>1</sub>) RNIC CPU ----





### Existing systems adopt caching



### Cache Tree at clients

#### **9:** FaRM@SOSP'15, SIGMOD'19 Cell@ATC'16

Cache hash table

**9:** DrTM@SOSP'15

Key-values

Network ---- RNIC



### Existing systems adopt caching



### Cache Tree at clients

#### **9:** FaRM@SOSP'15, SIGMOD'19 Cell@ATC'16

Cache hash table

#### B+Tree index has huge client memory cost!

Network ---- RNIC



## High cache miss cost for caching tree Tree node size can be *much larger* than the KV **)**: e.g., 1K vs. 8B **Recursive invalidation** under insertions **?** When cache more tree layers



Throughput(Mreqs/sec) (YCSB-D uniform)



## Trade-off of existing KVS Server-centric KVS **9** High CPU utilizations







## Trade-off of existing KVS Server-centric KVS **P**: High CPU utilizations **Client-direct KVS ?** Poor performance







## Trade-off of existing KVS Server-centric KVS **P**: High CPU utilizations **Client-direct KVS ?** Poor performance Client-direct KVS + cache High memory usage





Client memory



### Trade-off of existing KVS Server-centric KVS Server CPU **P**: High CPU utilizations Can we achieve all these properties ? **Client-direct KVS Client-dire Poor performance** Client-direct KVS + cache ient-direct + cache High memory usage





Client memory





## Trade-off of existing KVS Server-centric KVS **?** High CPU utilizations **Client-direct KVS ?** Poor performance Client-direct KVS + cache High memory usage







### **Overview of XSTORE**

- Hybrid architecture<sup>[1]</sup>
- **Sever-centric** updates
- O(1) Client-direct Get, Scan



[1] Similar to existing RDMA-based KVS, e.g., FaRM@SOSP'15, Cell@ATC'16



Our approach: Learned cache Using ML as the cache structure for tree-based index Motivated by the *learned index*<sup>[1]</sup> **?** Replace index traversal with calculation **Solution:** The ML model can be orders of magnitude smaller than tree

Key

#### Machine Learning (ML) models

[1] The case for the learned index @ SIGMOD'18

















## Client-direct Get() using learned cache Client Get(K<sub>1</sub>) [0,1]RNIC CPU ----





## Client-direct Get() using learned cache Client Get(K<sub>1</sub>) [0,1]CPU RNIC - - -

























### Server-side data structure for dynamic workloads

### Client-side learned cache & TT

### Performance evaluation of XSTORE





# Models cannot learn dynamic B+Tree address Can only learn when the addresses are **sorted** Not the case for dynamic B<sup>+</sup>Tree







# Solution: another layer of indirection **Observation:** leaf nodes are logically sorted Assign logical addresses to leaf nodes ML: key — logical

**7:** Translation table (TT): logical — physical

### **Translation Table**

| 0x00 | 0x10 | 0x2 |
|------|------|-----|
| 0    | 1    | -   |





41

# Server-side data structure for dynamic workloads

### Client-side learned cache & TT

### Performance evaluation of XSTORE





# Client-direct Get() using model & TT Client ТТ CPU RNIC





















### Model retraining Model is retrained at server in background threads

Small cost & extra CPU usage at the server



### XSTORE uses a two-layer RMI to organize models<sup>[1]</sup> **9:** Fine-grained model retraining

[1] **R**ecursive **M**odel Inference, following "The case for the learned index @ SIGMOD'18"





# Stale model handling Background update causes stale learned models But stale learned models & TT could correctly find most keys If the key is not moved, a stale Model & TT still maintains correct Key --- Logical --- Physical



Server-side operations Find *non-trained* keys **Optimizations** of speculative execution Dynamic model expansion Fault tolerance of XSTORE Scale-out XSTORE

### Many other design details & optimizations

V and DELETE(K) are supported to the server and d on  $XT_{REE}$ , as is usual on B+tree. The *in-place* in- $\Delta TE(K)$  are shipped to the server and

in the sub-model and its translation ground (see TRAIN\_SUBMODEL in Fig. 6) ar s as usual based on XTREE, Me s can still directly perform re



### **Evaluation of XSTORE**

- We answer the following questions:
- **Comparing to server-centric designs**?
- **?** Comparing to client-direct designs?
- Does XStore provide better trade-off?









[\*] <u>R</u>ead, <u>S</u>can, <u>U</u>pdate, <u>I</u>nsert





[\*] Read, Scan, Update, Insert





[\*] Read, Scan, Update, Insert





### **XSTORE** Cell@ATC'16 DrTM+Tree@Eurosys'16 **RDMA-Memached** $\widehat{\mathbf{U}}$ **Traversing B+Tree** with one-sided RDMA is costly!



[\*] Read, Scan, Update, Insert



### The XCache in details

For a 100M KVs YCSB dataset

- **500K Linear regression as models, each 14B**
- $\therefore$  ~ 8µs to retrain each model
- **>:** ~ 8s to train the entire cache





### The XCache in details

For a 100M KVs YCSB dataset

- **500K** Linear regression as models, each 14B
- **2:** ~ 8μs to retrain each model
- **?:** ~ 8s to train the entire cache

Small model to fit the dataset







### The XCache in details

For a 100M KVs YCSB dataset

- **500K** Linear regression as models, each 14B
- $\sim 8\mu s$  to retrain each model
- **?** ~ 8s to train the entire cache

Quick retrain under dynamic workload





### Sensitive to the datas Different dataset has different May affect the performance •): Throughput drop due to increased error for complex dataset

Peak throughput (100M dataset)



| set      |                      |                   |
|----------|----------------------|-------------------|
| JLL      | Name                 | <u>Workloads</u>  |
| accuracy | Linear               | e.g., YCSB,TPC-C  |
|          | <b>Noised Linear</b> | e.g., YCSB        |
|          | Open street map      | e.g., OpenStreetM |

Average latency ( $\mu$ s)





### Learned cache vs. Tree-based cache XStore provides better *memory-performance trade-off* **YCSB-C** uniform workload

Peak throughput (YCSB-C uniform) 100 ★ XStore Tree-index 80 60 40 20





# **Current limitations and future work XSTORE** currently only supports fixed-length keys **Our paper describes our plan to support variable-length keys** Focus on simple models (e.g., LR)

- **?** Efficient upon retraining under dynamic workloads
- **?** May results in huge error for complex data distribution
- **?** Trade-off: retraining speed vs. accuracy vs. memory

Orthogonal to the design of XSTORE



IN INTERVISE Conclusion XSTORE provides *a new design* for RDMA-enabled KVS • First adopts the learned models for one-sided RDMA READ XSTORE provides better trade-offs: **Server-side CPU vs. Client-side memory vs. Performance** Please check XSTORE@ https://ipads.se.sjtu.edu.cn/projects/xstore Thanks & QA



