PageForge

Paged KV-Cache Memory Manager

PageForge

A from-scratch implementation of vLLM-style paged KV-cache attention in Rust + CUDA. On-demand page allocation eliminates static VRAM waste, enabling 8–32× more concurrent LLM inference sequences per GPU.

Rust · PyO3CUDA C · nvrtcCuPy RawKernelPyTorch fp16DLPack zero-copy

8–32×

more concurrent seqs / GB

94%

less active VRAM per seq

+33%

P50 latency overhead

75/75

tests passing

System Design

The Problem01

Naive pre-allocation wastes 90%+ of VRAM

Traditional KV caches reserve max_seq_len (512) tokens of memory at sequence start, regardless of how many tokens are actually generated. For GPT-2, that's 18.9 MB per sequence, all committed upfront. A 512-page pool can serve just 16 sequences simultaneously.

16 seqs max

512-page pool · naive pre-alloc

The Solution02

On-demand paged allocation

PageForge allocates memory in fixed-size pages (16 tokens = 0.59 MB) on each decode step. A Rust PageAllocator manages an O(1) VecDeque free-list. Sequences claim pages as they grow and return them immediately when complete, enabling the same pool to serve many more sequences concurrently.

0.59 MB / page

16 tokens · allocated on demand

The Result03

8–32× more concurrent sequences

At decode step 10 (20 tokens), each sequence uses just 2 pages vs 32 pre-allocated, freeing 94% of the pool for other sequences. The same 512-page budget that handles 16 naive sequences can now serve 128–512 concurrent sequences, depending on how far along each sequence is.

128–512 seqs

same 512-page pool · 8–32×

Concurrency Multiplier

16×
max seqs · 512-page pool · step 10vs 16 naive

Active VRAM vs Naive

9.4%
14.2 MB active vs 151 MB naive (8 seqs)−90.6% VRAM

P50 Latency Overhead

+33%
7.5 ms → 10.0 ms vs HF DynamicCachescatter cost

Allocator Throughput

1.5M/s
pages/sec · 500 cycles · 0 leaksPASS

At a Glance

PageForge vs Alternatives

Memory Efficiency

Naive StaticvsPageForge

VRAM / seq at prompt (t=0)

Naive Static

18.9 MB

PageForge

0.59 MB

Δ

32× less

VRAM / seq at step 50

Naive Static

18.9 MB

PageForge

2.36 MB

Δ

8× less

Max concurrent seqs (512-page pool)

Naive Static

16

PageForge

128–512

Δ

8–32×

Allocation strategy

Naive Static

Static (eager max_len)

PageForge

On-demand (page granularity)

Δ

-

Memory fragmentation

Naive Static

Zero (fixed-size pre-alloc)

PageForge

Zero (pool-managed free-list)

Δ

-

Naive baseline pre-allocates max_seq_len=512 tokens per sequence at start, regardless of actual generation length.

Decode Latency

HF DynamicCachevsPageForge

Decode P50 latency

HF DynamicCache

7.5 ms

PageForge

10.0 ms

Δ

+33%

Decode P99 tail latency

HF DynamicCache

10.3 ms

PageForge

11.9 ms

Δ

+16%

KV tensor layout

HF DynamicCache

Contiguous (torch.cat each step)

PageForge

Non-contiguous (page-indexed)

Δ

-

Scatter overhead

HF DynamicCache

None

PageForge

24 ops/step (bottleneck)

Δ

+2.5 ms

HF DynamicCache grows KV tensors by torch.cat each step (contiguous layout). PageForge scatter-copies into non-contiguous pages, adding overhead per step.

Memory Efficiency

VRAM Consumption vs Decode Step

32 concurrent sequences · GPT-2 124M · fp16 · Naive budget = 512 tokens / sequence (fixed)

8.0×

savings at step 50

Naive pre-alloc (max 512 tokens)
PageForge paged

Concurrency

Max Concurrent Sequences

512-page pool · page_size=16 · prompt ≈ 10 tokens · shows PageForge advantage degrades as seqs grow

Latency Analysis

Decode Step Latency

50 iterations · 5 warmup · RTX 4070 Laptop · prompt: "The quick brown fox…"

P50 (median latency)

HF DynamicCache7.5 ms
PageForge Paged10.0 ms

P99 (tail latency)

HF DynamicCache10.3 ms
PageForge Paged11.9 ms

P50 overhead

+33%

P99 overhead

+16%

Bottleneck: 24 scatter kernel dispatches + 24 torch.cat calls per decode step

Pool Lifecycle

Sequential Batch Reuse

8 concurrent seqs · 30 decode steps · free() at step 30 · Batch 2 reuses exact same pages

Zero fragmentation verified

Naive VRAM (8 seqs)

151 MB

PageForge peak (B1)

14.2 MB

% of naive budget

9.4%

Pages after free()

0

Batch 1: initial allocation
Batch 2: same pages reused
free() called
Naive budget: 151 MB

Batch 2 reuses the exact same physical pages freed by Batch 1. Pool never grows, zero fragmentation.

System Architecture

Component Stack

Python User Codepageforge.cache.PagedKVCache

DynamicCache subclass, drop-in for HuggingFace transformers

L0
alloc_for_seq / free_seq
Rust PageForgePyO3 · maturin

PageAllocator: O(1) VecDeque free-list · BlockTable: HashMap<seq_id, Vec<page_id>>

L1
page_ids → gather / scatter
CUDA KernelsCuPy RawKernel (nvrtc)

gather_kv · scatter_kv_layer, 100-200 GB/s on sm_89

L2
non-contiguous ↔ contiguous
GPU Poolfp16 · CuPy

(N_pages, page_size, n_layers × n_heads, d_head), K and V combined

L3

DLPack bridge zero-copy PyTorch ↔ CuPy tensor interop on each decode step. No host round-trips. Verified bit-exact vs HF DynamicCache (max diff = 0.0).

System Context

Verified Hardware & Stack

Hardware

GPUNVIDIA RTX 4070 Laptop GPU
VRAM8.0 GB
Computesm_89 (Ada Lovelace)

Software

CUDA nvcc12.6
CUDA runtime12.8
PyTorch2.11.0+cu128
CuPy14.1.1
transformers5.9.0
Rust1.90.0
Python3.12.7

Allocator Stress Test

PASS

Cycles

500

Seqs/cycle

16

Throughput

1.5M/s

Elapsed

0.04s

Mem leaks

0

OOM errors

0

75 / 75 tests passing

Rust unit (6) · CUDA kernels (10) · GPT-2 integration (7) · VRAM + latency (32) · Multi-seq (19)

Engineering Roadmap

Planned Optimisations

01

Fused scatter-attention kernel

Combine 24 scatter dispatches + attention into one CUDA kernel. Target: ≤8 ms P50 (−20%).

planned
02

Prefix KV-cache sharing

Sequences with identical prompt prefixes share page blocks — eliminates redundant computation.

planned
03

Beam search block tables

Multi-parent block table for copy-on-write page sharing across beam candidates.

planned
04

CUDA Graph capture

Stable page-ID layout per step enables CUDA Graph for zero-dispatch inference.

planned

Inspired by vLLM PagedAttention (Kwon et al., 2023). This implementation demonstrates the core paging mechanism from first principles: Rust allocator, hand-written CUDA scatter/gather kernels, and a DLPack zero-copy bridge to PyTorch.