What quantization changes (and what it doesn’t)
- Weights: most quantization methods shrink weight memory roughly in proportion to bit-width (FP16 → 4-bit is ~4×). This is the main win.
- KV cache: often stays FP16/FP8 and can become the bottleneck at long context and high concurrency; see the sizing sketch after this list.
- Latency: can improve, stay flat, or even regress depending on kernels and batching.
- Quality: the risk is domain-dependent; measure on your golden set.
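To make the KV-cache point concrete, here is a minimal sizing sketch. The model dimensions are assumptions (they roughly match a Llama-2-7B-class decoder); plug in your model's config values.

```python
# Rough KV-cache sizing for a decoder-only transformer.
# Per token, each layer stores a K and a V tensor of shape
# (n_kv_heads, head_dim); multiply by context length for a full request.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_len

# ASSUMED dims, roughly Llama-2-7B: 32 layers, 32 KV heads, head_dim 128,
# FP16 cache (2 bytes/element), 8k context.
per_request = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                             context_len=8192)
print(f"{per_request / 2**30:.2f} GiB per request at 8k context")  # ~4 GiB
```

At ~4 GiB per 8k-context request, 16 concurrent requests need ~64 GiB of KV cache, far more than the ~14 GiB of FP16 weights for a 7B model. Quantizing weights alone does not touch this term.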
GPTQ (weight-only, post-training)
GPTQ is a post-training, weight-only method that compresses weights to 4-bit (or lower) with good quality retention; it quantizes layer by layer, using approximate second-order information to compensate for rounding error. It is widely used for on-GPU weight-only inference; a loading sketch follows the bullets.
- Good for: serving on GPUs with limited VRAM when you want to keep model capacity.
- Watch out: activation outliers and rare, domain-specific tokens can degrade accuracy; calibrate on in-domain data where possible.
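A minimal loading sketch, assuming the transformers stack with the optimum and auto-gptq packages installed; the model ID is an assumption, so substitute any GPTQ checkpoint you trust.

```python
# Minimal sketch: load a pre-quantized GPTQ checkpoint with transformers.
# Requires: pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # ASSUMED example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization trades memory for", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```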
AWQ (activation-aware)
AWQ (Activation-aware Weight Quantization) chooses its scaling using activation statistics, protecting the small fraction of salient weight channels that matter most for outputs; this often improves quality at the same bit-width.
- Good for: high-quality 4-bit deployments, especially when you can use optimized kernels.
- Watch out: “paper wins” don’t always translate to your runtime; measure throughput and p95 on your own stack (a serving sketch follows).
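A minimal serving sketch with vLLM, whose `quantization="awq"` flag selects its AWQ kernels; the model ID is an assumption.

```python
# Minimal sketch: serve an AWQ checkpoint with vLLM, then benchmark it.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ",  # ASSUMED example checkpoint
          quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Run your real prompt mix through this at target concurrency before trusting any quality or latency number.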
GGUF (llama.cpp ecosystem)
GGUF is a single-file model format (weights plus metadata) and tooling ecosystem used by llama.cpp, widely adopted for CPU-first and edge deployments. It supports many quantization variants (e.g., Q4_K_M, Q8_0); a loading sketch follows the bullets.
- Good for: CPU deployments, laptop/offline use, and environments where GPU availability is constrained.
- Watch out: CPU throughput can be insufficient for enterprise concurrency unless carefully scoped.
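A minimal CPU sketch with llama-cpp-python; the file path and the Q4_K_M variant are assumptions, so pick whatever variant fits your memory/quality budget.

```python
# Minimal sketch: run a GGUF file on CPU with llama-cpp-python.
# Requires: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf",  # ASSUMED path
            n_ctx=4096,    # context window to allocate
            n_threads=8)   # match your physical cores
out = llm("Q: What does a GGUF file contain? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```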
Decision matrix
| Your constraint | Most likely fit | Why |
|---|---|---|
| GPU VRAM is tight | AWQ / GPTQ | Weight memory drops while keeping GPU execution. |
| CPU-first / offline | GGUF | Optimized for llama.cpp runtimes and portability. |
| Long context + concurrency | Quantization + KV strategy | KV cache becomes dominant; measure memory per request (see the fit check below). |
| Regulated outputs | Conservative quantization | Prefer higher precision if error cost is high. |
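To apply the matrix, a back-of-envelope fit check helps: do quantized weights plus worst-case KV cache fit in VRAM at your target concurrency? All numbers below are assumptions; substitute your model and GPU.

```python
# Back-of-envelope: weights at a given bit-width plus KV cache vs. VRAM.
def fits(params_b: float, weight_bits: int, kv_gib_per_request: float,
         concurrency: int, vram_gib: float, overhead_gib: float = 2.0) -> bool:
    weights_gib = params_b * weight_bits / 8  # ~GiB per billion params
    need = weights_gib + kv_gib_per_request * concurrency + overhead_gib
    print(f"{need:.1f} GiB needed of {vram_gib} GiB")
    return need <= vram_gib

# ASSUMED: 7B model at 4-bit, ~4 GiB KV per 8k-context request
# (from the earlier sizing sketch), 24 GiB GPU.
fits(params_b=7, weight_bits=4, kv_gib_per_request=4.0,
     concurrency=4, vram_gib=24)   # 21.5 GiB -> fits
fits(params_b=7, weight_bits=4, kv_gib_per_request=4.0,
     concurrency=8, vram_gib=24)   # 37.5 GiB -> does not fit
```

Note that at 4-bit the weights cost only ~3.5 GiB here; the KV cache, not the weights, is the binding constraint, which is why "quantize the weights harder" stops helping.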
Production checklist
- Measure on your golden set: accuracy, groundedness, refusal rate.
- Measure performance: TTFT (time to first token), p95 latency, tok/s, and concurrency saturation; a minimal harness follows this list.
- Track memory: weights vs KV cache; validate worst-case context length.
- Keep a rollback: switch back to higher precision if regressions appear.
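A minimal, runtime-agnostic harness sketch: `stream_fn` is a hypothetical stand-in for your client's streaming call (vLLM, llama.cpp, an HTTP client, etc.); the fake stream just lets the snippet run standalone.

```python
# Measure TTFT and decode tok/s for any streaming token generator.
import time
from typing import Callable, Iterable

def measure(stream_fn: Callable[[str], Iterable[str]], prompt: str) -> dict:
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    for _tok in stream_fn(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += 1
    decode = time.perf_counter() - start - (ttft or 0.0)
    return {"ttft_s": ttft, "tokens": n_tokens,
            "tok_per_s": n_tokens / decode if decode > 0 else float("inf")}

def fake_stream(prompt: str):      # HYPOTHETICAL: replace with your client
    time.sleep(0.2)                # simulated prefill
    for _ in range(50):
        time.sleep(0.02)           # simulated decode step
        yield "tok"

print(measure(fake_stream, "hello"))
```

Collect many runs per quantization candidate at realistic concurrency and compare distributions (p50/p95), not single samples.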