What quantization changes (and what it doesn’t)
- Weights: most quantization methods shrink weight memory roughly in proportion to bit-width (FP16 → 4-bit is ~4×). This is the main win.
- KV cache: often stays FP16/FP8 and can become the bottleneck at long context and high concurrency; see the sizing sketch after this list.
- Latency: can improve, stay flat, or even regress depending on kernels and batching.
- Quality: the risk is domain-dependent; measure on your golden set.
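To make the KV-cache point concrete, here is a minimal sizing sketch. The model dimensions are assumptions (they roughly match a Llama-2-7B-class decoder); plug in your model's config values.

```python
# Rough KV-cache sizing for a decoder-only transformer.
# Per token, each layer stores a K and a V tensor of shape
# (n_kv_heads, head_dim); multiply by context length for a full request.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_len

# ASSUMED dims, roughly Llama-2-7B: 32 layers, 32 KV heads, head_dim 128,
# FP16 cache (2 bytes/element), 8k context.
per_request = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                             context_len=8192)
print(f"{per_request / 2**30:.2f} GiB per request at 8k context")  # ~4 GiB
```

At ~4 GiB per 8k-context request, 16 concurrent requests need ~64 GiB of KV cache, far more than the ~14 GiB of FP16 weights for a 7B model. Quantizing weights alone does not touch this term.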
GPTQ (weight-only, post-training)
GPTQ is a post-training, weight-only method that compresses weights to 4-bit (or lower) with good quality retention; it quantizes layer by layer, using approximate second-order information to compensate for rounding error. It is widely used for on-GPU weight-only inference; a loading sketch follows the bullets.
- Good for: serving on GPUs with limited VRAM when you want to keep model capacity.
- Watch out: activation outliers and rare, domain-specific tokens can degrade accuracy; calibrate on in-domain data where possible.
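A minimal loading sketch, assuming the transformers stack with the optimum and auto-gptq packages installed; the model ID is an assumption, so substitute any GPTQ checkpoint you trust.

```python
# Minimal sketch: load a pre-quantized GPTQ checkpoint with transformers.
# Requires: pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # ASSUMED example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization trades memory for", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```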
AWQ (activation-aware)
AWQ (Activation-aware Weight Quantization) chooses its scaling using activation statistics, protecting the small fraction of salient weight channels that matter most for outputs; this often improves quality at the same bit-width.
- Good for: high-quality 4-bit deployments, especially when you can use optimized kernels.
- Watch out: “paper wins” don’t always translate to your runtime; measure throughput and p95 on your own stack (a serving sketch follows).
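A minimal serving sketch with vLLM, whose `quantization="awq"` flag selects its AWQ kernels; the model ID is an assumption.

```python
# Minimal sketch: serve an AWQ checkpoint with vLLM, then benchmark it.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ",  # ASSUMED example checkpoint
          quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Run your real prompt mix through this at target concurrency before trusting any quality or latency number.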
GGUF (llama.cpp ecosystem)
GGUF is a single-file model format (weights plus metadata) and tooling ecosystem used by llama.cpp, widely adopted for CPU-first and edge deployments. It supports many quantization variants (e.g., Q4_K_M, Q8_0); a loading sketch follows the bullets.
- Good for: CPU deployments, laptop/offline use, and environments where GPU availability is constrained.
- Watch out: CPU throughput can be insufficient for enterprise concurrency unless carefully scoped.
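A minimal CPU sketch with llama-cpp-python; the file path and the Q4_K_M variant are assumptions, so pick whatever variant fits your memory/quality budget.

```python
# Minimal sketch: run a GGUF file on CPU with llama-cpp-python.
# Requires: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf",  # ASSUMED path
            n_ctx=4096,    # context window to allocate
            n_threads=8)   # match your physical cores
out = llm("Q: What does a GGUF file contain? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```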
Decision matrix
| Your constraint | Most likely fit | Why |
|---|---|---|
| GPU VRAM is tight | AWQ / GPTQ | Weight memory drops while keeping GPU execution. |
| CPU-first / offline | GGUF | Optimized for llama.cpp runtimes and portability. |
| Long context + concurrency | Quantization + KV strategy | KV cache becomes dominant; measure memory per request (see the fit check below). |
| Regulated outputs | Conservative quantization | Prefer higher precision if error cost is high. |
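To apply the matrix, a back-of-envelope fit check helps: do quantized weights plus worst-case KV cache fit in VRAM at your target concurrency? All numbers below are assumptions; substitute your model and GPU.

```python
# Back-of-envelope: weights at a given bit-width plus KV cache vs. VRAM.
def fits(params_b: float, weight_bits: int, kv_gib_per_request: float,
         concurrency: int, vram_gib: float, overhead_gib: float = 2.0) -> bool:
    weights_gib = params_b * weight_bits / 8  # ~GiB per billion params
    need = weights_gib + kv_gib_per_request * concurrency + overhead_gib
    print(f"{need:.1f} GiB needed of {vram_gib} GiB")
    return need <= vram_gib

# ASSUMED: 7B model at 4-bit, ~4 GiB KV per 8k-context request
# (from the earlier sizing sketch), 24 GiB GPU.
fits(params_b=7, weight_bits=4, kv_gib_per_request=4.0,
     concurrency=4, vram_gib=24)   # 21.5 GiB -> fits
fits(params_b=7, weight_bits=4, kv_gib_per_request=4.0,
     concurrency=8, vram_gib=24)   # 37.5 GiB -> does not fit
```

Note that at 4-bit the weights cost only ~3.5 GiB here; the KV cache, not the weights, is the binding constraint, which is why "quantize the weights harder" stops helping.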
Production checklist
- Measure on your golden set: accuracy, groundedness, refusal rate.
- Measure performance: TTFT (time to first token), p95 latency, tok/s, and concurrency saturation; a minimal harness follows this list.
- Track memory: weights vs KV cache; validate worst-case context length.
- Keep a rollback: switch back to higher precision if regressions appear.
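A minimal, runtime-agnostic harness sketch: `stream_fn` is a hypothetical stand-in for your client's streaming call (vLLM, llama.cpp, an HTTP client, etc.); the fake stream just lets the snippet run standalone.

```python
# Measure TTFT and decode tok/s for any streaming token generator.
import time
from typing import Callable, Iterable

def measure(stream_fn: Callable[[str], Iterable[str]], prompt: str) -> dict:
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    for _tok in stream_fn(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += 1
    decode = time.perf_counter() - start - (ttft or 0.0)
    return {"ttft_s": ttft, "tokens": n_tokens,
            "tok_per_s": n_tokens / decode if decode > 0 else float("inf")}

def fake_stream(prompt: str):      # HYPOTHETICAL: replace with your client
    time.sleep(0.2)                # simulated prefill
    for _ in range(50):
        time.sleep(0.02)           # simulated decode step
        yield "tok"

print(measure(fake_stream, "hello"))
```

Collect many runs per quantization candidate at realistic concurrency and compare distributions (p50/p95), not single samples.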