LLM Memory Bottleneck Simulator

Understand why massive context windows crash local AI models.

Configuration

System memory: 16 GB, 64 GB, 128 GB

Simulating 4-bit quantization (e.g., Q4_K_M).

Context window: 4k, 128k, 256k tokens

Model Weights (Static)

The core parameters of the model. These files must be loaded into memory before the model can run inference, and their size does not change during a conversation. At 4-bit quantization, they require roughly 0.62 GB per 1 billion parameters.
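The rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, not the simulator's own code: the function name `weights_gb` and the example model sizes (8B, 32B, 70B) are illustrative assumptions, and only the 0.62 GB figure comes from the text.

```python
# Rule of thumb quoted in the text: ~0.62 GB per 1 billion parameters at 4-bit.
GB_PER_BILLION_PARAMS_4BIT = 0.62


def weights_gb(params_billion: float) -> float:
    """Approximate memory needed for 4-bit quantized weights, in GB."""
    return params_billion * GB_PER_BILLION_PARAMS_4BIT


# Example model sizes are hypothetical, chosen only to show the scaling.
for size in (8, 32, 70):
    print(f"{size}B parameters -> ~{weights_gb(size):.1f} GB of weights")
```

Because this cost is fixed once the model is loaded, it sets the floor for memory use before any context is processed.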

KV-Cache (Dynamic)

The model's active working memory. Every input token and every generated output token adds an entry to this cache, so it grows linearly with the length of the conversation context.
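To see how quickly the cache grows, it helps to work through the standard size estimate: one key and one value vector per layer, per KV head, per token. The sketch below assumes a hypothetical architecture (32 layers, 8 KV heads, head dimension 128) and 16-bit cache entries; none of these values come from the simulator, and real models differ.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 32,        # assumption: hypothetical model depth
    kv_heads: int = 8,       # assumption: grouped-query attention with 8 KV heads
    head_dim: int = 128,     # assumption: per-head dimension
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Estimate KV-cache size: a key and a value per layer, head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens


# Treating "k" as 1024 tokens (an assumption) for the three slider positions.
for label, tokens in (("4k", 4 * 1024), ("128k", 128 * 1024), ("256k", 256 * 1024)):
    gb = kv_cache_bytes(tokens) / 1e9
    print(f"{label} context -> ~{gb:.2f} GB of KV-cache")
```

Under these assumptions each token costs 128 KiB, so moving from a 4k to a 128k context multiplies the cache by 32 while the weights stay the same size.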

Memory Distribution

macOS VRAM Limit (~75%)
Legend: OS & Apps, Weights, KV-Cache, Swap Memory (Extremely Slow)
VRAM Available: -- GB
Model Weights: -- GB
KV-Cache: -- GB
Performance: Optimal
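The memory distribution readout can be approximated with a small budget check. This is a sketch under stated assumptions rather than the simulator's actual logic: it treats 75% of total memory as usable by the GPU, counts anything beyond that as swap, and uses a made-up "Swapping" label for the non-optimal state, since the page only shows "Optimal".

```python
VRAM_FRACTION = 0.75  # the "~75%" macOS limit shown in the readout


def memory_report(total_gb: float, weights_gb: float, kv_cache_gb: float) -> dict:
    """Split a memory budget into VRAM, weights, KV-cache, and swap overflow."""
    vram_available = total_gb * VRAM_FRACTION
    needed = weights_gb + kv_cache_gb
    # Anything that does not fit under the VRAM limit is assumed to spill to swap.
    swap = max(0.0, needed - vram_available)
    return {
        "vram_available_gb": vram_available,
        "weights_gb": weights_gb,
        "kv_cache_gb": kv_cache_gb,
        "swap_gb": swap,
        # "Swapping" is a hypothetical label; the source only names "Optimal".
        "performance": "Optimal" if swap == 0 else "Swapping",
    }


# Hypothetical inputs: a 16 GB machine with ~5 GB of weights.
print(memory_report(16, 5.0, 0.5))   # short context fits under the limit
print(memory_report(16, 5.0, 17.0))  # long context overflows into swap
```

The second call illustrates the page's main point: the weights are unchanged, but the larger KV-cache alone pushes the total past the VRAM limit.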