Understand why massive context windows crash local AI models.
Simulating 4-bit quantization (e.g., Q4_K_M).
The core parameters of the AI, stored in the model's weight files. These files must be loaded into memory to run inference. At 4-bit quantization, they take roughly 0.62 GB per 1 billion parameters.
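The rule of thumb above can be turned into a quick estimate. The sketch below is a minimal illustration, not the calculator's actual code: the function name `estimate_weight_memory_gb` and the example model sizes are hypothetical, and the only figure taken from the text is the 0.62 GB per billion parameters.

```python
# Figure quoted in the text for 4-bit quantization (e.g., Q4_K_M).
GB_PER_BILLION_PARAMS_Q4 = 0.62


def estimate_weight_memory_gb(params_billions: float) -> float:
    """Approximate memory needed to hold 4-bit quantized weights, in GB.

    Hypothetical helper: it covers the weights only, not the cache or
    any runtime overhead.
    """
    if params_billions < 0:
        raise ValueError("params_billions must be non-negative")
    return params_billions * GB_PER_BILLION_PARAMS_Q4


if __name__ == "__main__":
    # Example model sizes chosen purely for illustration.
    for size in (7, 13, 70):
        print(f"{size}B parameters: ~{estimate_weight_memory_gb(size):.1f} GB")
```

This memory cost is fixed for a given model and quantization level. It does not change with the length of the conversation.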
The model's active working memory. Every input token and every generated output token expands this cache, which is what keeps the conversation context available.
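To see how the cache scales with context length, here is a rough sketch. It assumes a standard transformer attention layout in which each layer stores one key vector and one value vector per token, held at 16-bit precision. The function name and the model dimensions (32 layers, 8 key/value heads, head size 128) are hypothetical values for illustration, and real runtimes may quantize or otherwise compress the cache.

```python
def estimate_kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Approximate cache size in bytes for a given number of tokens."""
    # Factor of 2: one key vector plus one value vector per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


if __name__ == "__main__":
    # Hypothetical model dimensions, used only to show the linear growth.
    for context in (2_048, 8_192, 32_768, 131_072):
        size = estimate_kv_cache_bytes(context, n_layers=32, n_kv_heads=8, head_dim=128)
        print(f"{context:>7} tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with the token count, so a context window 16 times larger needs 16 times the cache memory on top of the fixed cost of the weights.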