

### Preview of Operating Systems and Hardware Session





**Michio Honda**, University of Edinburgh

**Sudarsun Kannan**, Rutgers University



# Memory Scaling Challenges

- Scaling application memory capacity without increasing management cost is becoming critical
- Scaling memory protection and isolation for large address spaces is equally critical

### 1. Memory tuning is tedious

Requires extensive application knowledge and constant tuning

### 2. Memory security is challenging

Hardware memory security is non-scalable

### Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator

A.H. Hunter Jane Street Capital\* Chris Kennelly Google Paul Turner Google Darryl Gove Google

Parthasarathy Ranganathan Google Tipp Moseley Google

# Memory Allocator Challenges

- Long history of memory allocators designed for specific application needs
  - Concurrency and low fragmentation (e.g., Hoard, jemalloc, TCMalloc)
  - Minimize L1 misses (e.g., Dice), increase locality (e.g., mimalloc)
- However, allocators are not optimized for HugePages
  - HugePages are becoming increasingly ubiquitous in large-scale applications
  - Could substantially reduce TLB misses by increasing RAM coverage
- Using existing allocators with HugePages could increase fragmentation, and they are inefficient for warehouse-scale systems running several applications
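
To make the TLB-coverage point concrete, here is a small worked calculation. The entry count is an illustrative assumption, not a figure from the talk; 4 KiB and 2 MiB are the usual x86-64 base and huge page sizes.

```python
# TLB reach = number of TLB entries x page size.
# TLB_ENTRIES is a hypothetical value chosen only for illustration.
TLB_ENTRIES = 1536
BASE_PAGE = 4 * 1024           # 4 KiB base page
HUGE_PAGE = 2 * 1024 * 1024    # 2 MiB huge page


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory that can be mapped without taking a TLB miss."""
    return entries * page_size


base_reach = tlb_reach(TLB_ENTRIES, BASE_PAGE)
huge_reach = tlb_reach(TLB_ENTRIES, HUGE_PAGE)

print(f"4 KiB pages: {base_reach / 2**20:.0f} MiB of reach")
print(f"2 MiB pages: {huge_reach / 2**30:.0f} GiB of reach")
print(f"ratio: {huge_reach // base_reach}x")
```

Under these assumptions the same TLB covers 512 times more memory with huge pages, which is where the potential reduction in TLB misses comes from.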

# TEMERAIRE

- Hugepage-aware user-level allocator built on TCMALLOC
- Aims to pack allocations densely onto huge pages, grouped into a few saturated bins
- Balances memory usage and page allocation costs through adaptive huge page release



Average 6% reduction in TLB misses and 26% reduction in memory usage across a fleet of applications
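
The dense-packing idea can be sketched with a toy model. This is only an illustration under stated assumptions, not TCMalloc's or TEMERAIRE's actual code: every huge page is modeled as a fixed number of equal slots, and the names `HugePage`, `DensePacker`, and `SLOTS_PER_HUGEPAGE` are hypothetical. Requests go to the fullest huge page that still has room, so lightly used huge pages tend to drain and can be released whole.

```python
SLOTS_PER_HUGEPAGE = 512  # assumed model: 2 MiB huge page / 4 KiB slots


class HugePage:
    def __init__(self, ident: int):
        self.ident = ident
        self.used = 0

    @property
    def free_slots(self) -> int:
        return SLOTS_PER_HUGEPAGE - self.used


class DensePacker:
    """Toy allocator that always fills the fullest huge page first."""

    def __init__(self):
        self.pages = []
        self.next_id = 0

    def allocate(self, nslots: int) -> HugePage:
        if not 0 < nslots <= SLOTS_PER_HUGEPAGE:
            raise ValueError("request must fit in one huge page")
        candidates = [p for p in self.pages if p.free_slots >= nslots]
        if candidates:
            # Prefer the most-used page so sparse pages can empty out.
            page = max(candidates, key=lambda p: p.used)
        else:
            page = HugePage(self.next_id)
            self.next_id += 1
            self.pages.append(page)
        page.used += nslots
        return page

    def deallocate(self, page: HugePage, nslots: int) -> None:
        page.used -= nslots

    def release_empty(self) -> int:
        """Drop fully empty huge pages; return how many were released."""
        before = len(self.pages)
        self.pages = [p for p in self.pages if p.used > 0]
        return before - len(self.pages)


packer = DensePacker()
a = packer.allocate(400)       # new page 0
b = packer.allocate(300)       # does not fit on page 0, new page 1
c = packer.allocate(100)       # fits on both; page 0 is fuller, so it wins
packer.deallocate(b, 300)      # page 1 is now empty
released = packer.release_empty()
print(a.ident, c.ident, released)
```

Choosing the fullest candidate is the key design choice here: a first-fit or least-used policy would spread allocations out and leave many partially used huge pages that can never be returned.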



### Divide allocator's caches and serve requests from different caches

### Scalable Memory Protection in the PENGLAI Enclave

Erhu Feng<sup>1†‡</sup>, Xu Lu<sup>1†‡</sup>, Dong Du<sup>†‡</sup>, Bicheng Yang<sup>†‡</sup>, Xueqiang Jiang<sup>†‡</sup>, Yubin Xia<sup>†§‡</sup>, Binyu Zang<sup>†§‡</sup>, Haibo Chen<sup>†§‡</sup> <sup>†</sup>Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University <sup>§</sup>Shanghai AI Laboratory <sup>‡</sup>Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China

# Hardware Enclaves 101

Hardware abstractions and support for trusted execution on untrusted platforms



Hardware enclaves: secure boot, on-chip program isolation, protected external memory, execution integrity, and other capabilities

# Hardware Enclaves Challenges

- Non-scalable memory partition/isolation
  - Current hardware supports only 256 MB enclaves
  - Some platforms restrict the number of enclaves
  - Some require static partitioning
- Non-scalable memory integrity protection
  - Huge memory overhead to store memory integrity information (e.g., hashes)
  - Hardware (e.g., Intel SGX) supports only ~256 MB and requires swapping beyond that
- Non-scalable secure memory initialization
  - High-cost secure memory initialization increases enclave setup cost
  - Impractical for serverless applications



# **PENGLAI Enclave**

- Scalable secure memory protection mechanisms for enclaves
- Approach to scaling: novel *Guarded Page Table* structure
- Guarded Page Table intuition: map secure and non-secure pages through separate page tables, a non-secure host page table and a secure enclave page table



Scaling integrity protection: Mountable Merkle Tree (MMT), a SubTree structure to reduce both on-die and in-memory storage overhead
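
The general technique behind hash-tree integrity protection can be shown with a generic two-level Merkle tree. This sketches the textbook idea only, not PENGLAI's actual MMT layout: the region split, the block contents, and all function names are assumptions made for illustration. Only the short list of subtree roots (standing in for trusted on-die state) has to be kept; each region's subtree is rebuilt and checked on demand.

```python
import hashlib


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Root hash of a binary Merkle tree over a non-empty list of blocks."""
    level = [digest(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:             # duplicate the last node on odd levels
            level.append(level[-1])
        level = [digest(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def verify_region(region_blocks, trusted_root: bytes) -> bool:
    """Recompute one region's subtree and compare with its trusted root."""
    return merkle_root(region_blocks) == trusted_root


# Three hypothetical memory regions of four blocks each.
regions = [[f"region{r}-block{i}".encode() for i in range(4)]
           for r in range(3)]

# Trusted state: one root per region instead of one hash per block.
trusted_roots = [merkle_root(region) for region in regions]

ok_before = verify_region(regions[1], trusted_roots[1])
regions[1][2] = b"tampered"                      # simulate memory corruption
ok_after = verify_region(regions[1], trusted_roots[1])
untouched_ok = verify_region(regions[0], trusted_roots[0])
print(ok_before, ok_after, untouched_ok)
```

Splitting the tree by region means a check touches only the blocks of the region being verified, which is the intuition behind keeping both trusted storage and in-memory hash storage small.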



## (NrOS and nanoPU)

- Single core
  - Just protect critical sections from interrupts
  - I/O was also slow
- Multiple CPU cores
  - Giant lock
  - Fine-grained locks
  - Reader-writer locks
- Multiple CPU packages (sockets)
  - NUMA-aware memory allocation and scheduling
- Fast I/O
  - Interrupt mitigation and load-balancing
  - New APIs
    - (kqueue/epoll/netmap/io\_uring)
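
The step from a giant lock to fine-grained locks can be sketched as follows. This is a user-space illustration with hypothetical class names, not kernel code; Python's global interpreter lock also means it shows the locking structure rather than any real speedup.

```python
import threading

NUM_BUCKETS = 8


class GiantLockTable:
    """One lock guards every bucket: simple, but all updates serialize."""

    def __init__(self):
        self.lock = threading.Lock()
        self.buckets = [{} for _ in range(NUM_BUCKETS)]

    def increment(self, key: int) -> None:
        with self.lock:
            bucket = self.buckets[key % NUM_BUCKETS]
            bucket[key] = bucket.get(key, 0) + 1


class FineGrainedTable:
    """One lock per bucket: updates to different buckets do not contend."""

    def __init__(self):
        self.locks = [threading.Lock() for _ in range(NUM_BUCKETS)]
        self.buckets = [{} for _ in range(NUM_BUCKETS)]

    def increment(self, key: int) -> None:
        index = key % NUM_BUCKETS
        with self.locks[index]:
            bucket = self.buckets[index]
            bucket[key] = bucket.get(key, 0) + 1


def total(table) -> int:
    return sum(sum(bucket.values()) for bucket in table.buckets)


def hammer(table, threads: int = 4, per_thread: int = 1000) -> int:
    """Run concurrent increments and return the final total count."""
    def worker():
        for i in range(per_thread):
            table.increment(i % 16)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total(table)


giant_total = hammer(GiantLockTable())
fine_total = hammer(FineGrainedTable())
print(giant_total, fine_total)
```

Both variants produce the same count; the fine-grained one buys concurrency at the price of more locks to reason about, which is the source of the complexity this slide builds toward.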



All of these make kernel code complex and error-prone, but such a kernel is still not scalable!

- Lock socket (non-sleepable); socket is also ref-counted
- Check another socket lock (sleepable one)

## **NrOS**

| Design      | Synchronization | Kernel programming | Scalability |
|-------------|-----------------|--------------------|-------------|
| Monolithic  | Shared states   | Hard               | Low         |
| Multikernel | Message passing | Easy               | Low         |
| NrOS        | Operation logs  | Easy               | High        |

- Use of shared last-level CPU cache
- Operation logs shared by per-NUMA-node replicas
  - Synchronization batching
- NetBSD LibOS
  - POSIX app support
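
The operation-log idea above can be sketched in a few lines. This is a single-threaded illustration under simplifying assumptions, not NrOS code: the class names are hypothetical, the replicated state is a plain dictionary, and a real implementation would need an atomic log append and per-replica synchronization.

```python
class OperationLog:
    """Shared, append-only log of (key, value) write operations."""

    def __init__(self):
        self.entries = []

    def append(self, op) -> None:
        self.entries.append(op)


class Replica:
    """Per-NUMA-node copy of the state, brought up to date from the log."""

    def __init__(self, log: OperationLog):
        self.log = log
        self.applied = 0          # index of the next log entry to apply
        self.state = {}

    def _sync(self) -> None:
        # Apply all pending entries in one batch.
        for key, value in self.log.entries[self.applied:]:
            self.state[key] = value
        self.applied = len(self.log.entries)

    def write(self, key, value) -> None:
        self.log.append((key, value))
        self._sync()

    def read(self, key):
        self._sync()              # catch up before serving the read locally
        return self.state.get(key)


log = OperationLog()
node0 = Replica(log)
node1 = Replica(log)

node0.write("pid", 42)
node0.write("uid", 1000)
pending_on_node1 = len(log.entries) - node1.applied
value_seen_by_node1 = node1.read("pid")
print(pending_on_node1, value_seen_by_node1)
```

Replicas share only the log, never the data structure itself, and pending entries are applied together, which corresponds to the synchronization batching noted above.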

[Figure: diagram with labels User, Kernel, cache, memory]





### nanoPU

- Co-designing NIC and CPU
  - NIC places received data directly in the CPU register file
- Ultrafast small RPCs (nanoRequests)
  - High-rate small requests are hard to handle because most overheads are per-packet or per-request, not per-byte
  - nanoPU reduces both average and tail latency
- Design highlights
  - Avoid the two latency sources:
    - Host stack
      - Bypass the stack and memory hierarchy
    - Queues in networks
      - Transport protocol in HW
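
A simple cost model shows why per-request overhead dominates for small requests. The numbers are assumptions picked for illustration (1 µs of fixed per-request cost, a 100 Gb/s link), not measurements from the nanoPU paper.

```python
def request_time_ns(size_bytes: int, per_request_ns: float, gbps: float) -> float:
    """Fixed per-request cost plus wire time (1 Gb/s is 1 bit per ns)."""
    wire_ns = size_bytes * 8 / gbps
    return per_request_ns + wire_ns


def overhead_fraction(size_bytes: int, per_request_ns: float, gbps: float) -> float:
    """Share of the total time spent on the fixed per-request cost."""
    return per_request_ns / request_time_ns(size_bytes, per_request_ns, gbps)


PER_REQUEST_NS = 1000.0   # hypothetical host-stack cost per request
LINK_GBPS = 100.0         # hypothetical link speed

small = overhead_fraction(64, PER_REQUEST_NS, LINK_GBPS)          # 64 B RPC
large = overhead_fraction(64 * 1024, PER_REQUEST_NS, LINK_GBPS)   # 64 KiB transfer

print(f"64 B request:   {small:.1%} of time is fixed overhead")
print(f"64 KiB request: {large:.1%} of time is fixed overhead")
```

Under these assumptions a 64-byte request spends over 99% of its time in fixed overhead, so a faster link helps very little; only cutting the per-request cost does.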

[Excerpt from the paper's first page]

Abstract: We present the nanoPU, a new NIC-CPU co-design to accelerate an increasingly pervasive class of datacenter applications: those that utilize many small Remote Procedure Calls [...]

[...] from when a client issues an RPC request until it receives a response) for applications invoking many sequential RPCs; (2) the tail response time (i.e., the longest or 99th %ile RPC response time) for applications with large fanouts (e.g., map-





### The nanoPU: A Nanosecond Network Stack for Datacenters

Stephen Ibanez, Alex Mallery, Serhat Arslan, Theo Jepsen, Muhammad Shahbaz\*, Changhoon Kim, and Nick McKeown Stanford University \*Purdue University



### nanoPU

Tail latency matters in data centers
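
A short calculation shows why tail latency matters for requests that fan out to many servers. It assumes each sub-request independently has a 1% chance of landing in the slow tail (beyond the 99th percentile); the function name and the independence assumption are for illustration only.

```python
def p_any_slow(fanout: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `fanout` parallel RPCs is slow."""
    return 1 - (1 - p_slow) ** fanout


for fanout in (1, 10, 100):
    print(f"fanout {fanout:>3}: {p_any_slow(fanout):.1%} of requests wait on a slow RPC")
```

With a fanout of 100, roughly 63% of top-level requests wait on at least one tail-latency RPC, so the 99th percentile of a single RPC effectively becomes the common case.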



### Operating Systems and Hardware Session

Thursday, July 15, 7:00 am-8:15 am (PDT)