Pick your machine and a model. The meter shows exactly how the memory budget is spent — weights, KV cache, and overhead — against what your hardware can actually hand to the model. Throughput is estimated from memory bandwidth.
| Quant | Weights | Total @ ctx | tok/s | Fit |
|---|
Weights = params × effective-bits ÷ 8. Effective bits use GGUF k-quant averages (Q4_K_M ≈ 4.83 bpw), so numbers track real downloaded file sizes, not the naive bit count.
KV cache = 2 × layers × kv_heads × head_dim × context × 2 bytes (fp16), computed from each model's real architecture. Custom models use a GQA-based estimate. Quantizing the KV cache to 8-bit roughly halves this.
Overhead = compute/activation buffer ≈ 0.8 GB + 5% of weights.
Usable memory — unified: 75% of total (macOS GPU cap; raisable via iogpu.wired_limit_mb); VRAM: 92%; system RAM: 80%.
Throughput ≈ bandwidth ÷ active-weights × efficiency (~0.55 GPU/unified, ~0.4 CPU). This is generation speed and a ballpark — real numbers vary with engine, batch, and prompt length. MoE models use active (not total) params for speed.