Paged KV-Cache Memory Manager
PageForge
A from-scratch implementation of vLLM-style paged KV-cache attention in Rust + CUDA. On-demand page allocation eliminates static VRAM waste, enabling 8–32× more concurrent LLM inference sequences per GPU.
8–32×
more concurrent seqs / GB
94%
less active VRAM per seq
+33%
P50 latency overhead
75/75
tests passing
System Design
Naive pre-allocation wastes 90%+ of VRAM
Traditional KV caches reserve max_seq_len (512) tokens of memory at sequence start, regardless of how many tokens are actually generated. For GPT-2, that's 18.9 MB per sequence, all committed upfront. A 512-page pool can serve just 16 sequences simultaneously.
16 seqs max
512-page pool · naive pre-alloc
On-demand paged allocation
PageForge allocates memory in fixed-size pages (16 tokens = 0.59 MB) on each decode step. A Rust PageAllocator manages an O(1) VecDeque free-list. Sequences claim pages as they grow and return them immediately when complete, enabling the same pool to serve many more sequences concurrently.
0.59 MB / page
16 tokens · allocated on demand
8–32× more concurrent sequences
At decode step 10 (20 tokens), each sequence uses just 2 pages vs 32 pre-allocated, freeing 94% of the pool for other sequences. The same 512-page budget that handles 16 naive sequences can now serve 128–512 concurrent sequences, depending on how far along each sequence is.
128–512 seqs
same 512-page pool · 8–32×
Concurrency Multiplier
Active VRAM vs Naive
P50 Latency Overhead
Allocator Throughput
At a Glance
PageForge vs Alternatives
Memory Efficiency
VRAM / seq at prompt (t=0)
Naive Static
18.9 MB
PageForge
0.59 MB
Δ
32× less
VRAM / seq at step 50
Naive Static
18.9 MB
PageForge
2.36 MB
Δ
8× less
Max concurrent seqs (512-page pool)
Naive Static
16
PageForge
128–512
Δ
8–32×
Allocation strategy
Naive Static
Static (eager max_len)
PageForge
On-demand (page granularity)
Δ
-
Memory fragmentation
Naive Static
Zero (fixed-size pre-alloc)
PageForge
Zero (pool-managed free-list)
Δ
-
Naive baseline pre-allocates max_seq_len=512 tokens per sequence at start, regardless of actual generation length.
Decode Latency
Decode P50 latency
HF DynamicCache
7.5 ms
PageForge
10.0 ms
Δ
+33%
Decode P99 tail latency
HF DynamicCache
10.3 ms
PageForge
11.9 ms
Δ
+16%
KV tensor layout
HF DynamicCache
Contiguous (torch.cat each step)
PageForge
Non-contiguous (page-indexed)
Δ
-
Scatter overhead
HF DynamicCache
None
PageForge
24 ops/step (bottleneck)
Δ
+2.5 ms
HF DynamicCache grows KV tensors by torch.cat each step (contiguous layout). PageForge scatter-copies into non-contiguous pages, adding overhead per step.
Memory Efficiency
VRAM Consumption vs Decode Step
32 concurrent sequences · GPT-2 124M · fp16 · Naive budget = 512 tokens / sequence (fixed)
8.0×
savings at step 50
Concurrency
Max Concurrent Sequences
512-page pool · page_size=16 · prompt ≈ 10 tokens · shows PageForge advantage degrades as seqs grow
Latency Analysis
Decode Step Latency
50 iterations · 5 warmup · RTX 4070 Laptop · prompt: "The quick brown fox…"
P50 (median latency)
P99 (tail latency)
P50 overhead
+33%
P99 overhead
+16%
Bottleneck: 24 scatter kernel dispatches + 24 torch.cat calls per decode step
Pool Lifecycle
Sequential Batch Reuse
8 concurrent seqs · 30 decode steps · free() at step 30 · Batch 2 reuses exact same pages
Naive VRAM (8 seqs)
151 MB
PageForge peak (B1)
14.2 MB
% of naive budget
9.4%
Pages after free()
0
Batch 2 reuses the exact same physical pages freed by Batch 1. Pool never grows, zero fragmentation.
System Architecture
Component Stack
DynamicCache subclass, drop-in for HuggingFace transformers
PageAllocator: O(1) VecDeque free-list · BlockTable: HashMap<seq_id, Vec<page_id>>
gather_kv · scatter_kv_layer, 100-200 GB/s on sm_89
(N_pages, page_size, n_layers × n_heads, d_head), K and V combined
DLPack bridge zero-copy PyTorch ↔ CuPy tensor interop on each decode step. No host round-trips. Verified bit-exact vs HF DynamicCache (max diff = 0.0).
System Context
Verified Hardware & Stack
Hardware
Software
Allocator Stress Test
Cycles
500
Seqs/cycle
16
Throughput
1.5M/s
Elapsed
0.04s
Mem leaks
0
OOM errors
0
75 / 75 tests passing
Rust unit (6) · CUDA kernels (10) · GPT-2 integration (7) · VRAM + latency (32) · Multi-seq (19)
Engineering Roadmap
Planned Optimisations
Fused scatter-attention kernel
Combine 24 scatter dispatches + attention into one CUDA kernel. Target: ≤8 ms P50 (−20%).
Prefix KV-cache sharing
Sequences with identical prompt prefixes share page blocks — eliminates redundant computation.
Beam search block tables
Multi-parent block table for copy-on-write page sharing across beam candidates.
CUDA Graph capture
Stable page-ID layout per step enables CUDA Graph for zero-dispatch inference.
Inspired by vLLM PagedAttention (Kwon et al., 2023). This implementation demonstrates the core paging mechanism from first principles: Rust allocator, hand-written CUDA scatter/gather kernels, and a DLPack zero-copy bridge to PyTorch.