Quantization Explained: What It Means for Local AI
You download a model. You see this:
llama-3.1-8b-instruct-Q4_K_M.gguf
llama-3.1-8b-instruct-Q5_K_M.gguf
llama-3.1-8b-instruct-Q6_K.gguf
llama-3.1-8b-instruct-Q8_0.gguf
llama-3.1-8b-instruct-F16.gguf
And you think: What the hell do these mean? Which one do I pick?
You’re not alone. Quantization is one of those topics where everyone assumes you already know what they’re talking about. Nobody stops to explain it clearly.
This guide fixes that. By the end, you’ll understand what quantization is, why it matters, and exactly which format to choose for your hardware.
The Problem Nobody Explains
Every AI model is, at its core, billions of numbers. A 7B parameter model has 7 billion numerical values that define how it thinks. At full precision (16 bits per number), storing those values takes about 14GB of space.
That’s a problem if your GPU only has 8GB or 12GB of VRAM.
Quantization is the solution. It’s a way to compress those numbers so the model takes less space—and therefore less VRAM to run. The tradeoff is some loss in precision, which might affect quality. The art is finding the compression level where you save the most space with the least quality loss.
That’s what all those letters and numbers (Q4_K_M, Q8_0, etc.) represent: different compression levels with different tradeoffs.
What Quantization Actually Is
The Plain English Version
Imagine you’re storing the number 3.14159265359.
At full precision, you keep all those decimal places. Accurate, but takes space.
With quantization, you might round it to 3.14 or even just 3. Less accurate, but way smaller.
Now multiply that by 7 billion numbers, and you see why this matters. Rounding each number a little bit adds up to massive space savings.
The analogy: Quantization is to AI models what JPEG compression is to photos. A high-quality JPEG looks nearly identical to the RAW file but takes a fraction of the space. Push compression too far, and you see artifacts. Quantization works the same way.
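To make the rounding idea concrete, here is a toy sketch of uniform quantization: a handful of float "weights" squeezed into 4-bit signed integers with a single shared scale, then restored. This is an illustration of the principle, not the actual GGUF algorithm (real quantizers work block-by-block with smarter schemes).

```python
# Toy illustration: quantize a few float weights to 4-bit signed ints
# (range -7..7) using one shared scale, then dequantize and check the error.
weights = [0.12, -0.98, 0.53, 3.14159, -2.71828]

scale = max(abs(w) for w in weights) / 7   # largest value maps to 7

quantized = [round(w / scale) for w in weights]   # what gets stored
restored  = [q * scale for q in quantized]        # what the model computes with

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)             # [0, -2, 1, 7, -6]
print(round(max_error, 3))   # 0.12 -- small, but nonzero
```

Every value now fits in 4 bits instead of 16, and each one is slightly wrong. That per-weight error is exactly the precision you trade away.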
Why Models Need So Much Memory
A 7B parameter model at full precision (FP16) needs:
7 billion parameters × 2 bytes per parameter = 14 GB
That’s just to load the model weights. Inference adds overhead on top: the KV cache, the model’s “working memory” for the current context. A 7B model at FP16 realistically needs 16-18GB of VRAM to run comfortably.
Most consumer GPUs have 8-12GB. Something has to give.
Quantization is what gives. By reducing precision from 16 bits to 8, 6, 5, or 4 bits, you can shrink that memory requirement dramatically:
| Precision | Bits per Weight | 7B Model Size | Approximate VRAM |
|---|---|---|---|
| FP16 | 16 bits | ~14 GB | 16-18 GB |
| Q8_0 | 8 bits | ~7 GB | 9-10 GB |
| Q6_K | 6.5 bits | ~5.7 GB | 7-9 GB |
| Q5_K_M | 5.5 bits | ~4.8 GB | 6-8 GB |
| Q4_K_M | 4.5 bits | ~3.9 GB | 5-6 GB |
| Q4_K_S | 4 bits | ~3.5 GB | 5-6 GB |
That’s why quantization matters: it’s the difference between “runs on my hardware” and “doesn’t.”
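The size column is just arithmetic: parameters × bits per weight ÷ 8 bytes. A quick sanity check for a 7B model (real GGUF files run slightly larger because of per-block scales and file metadata):

```python
# Rough model-size math: parameters x bits-per-weight / 8, in gigabytes.
# Actual GGUF files are a bit bigger (per-block scales, metadata).
PARAMS = 7e9  # 7B model

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q6_K", 6.5),
                   ("Q5_K_M", 5.5), ("Q4_K_M", 4.5), ("Q4_K_S", 4)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:8s} ~{gb:.1f} GB")
```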
The Tradeoff You’re Making
What You Gain
Smaller files. A Llama 3.1 8B model at FP16 is around 16GB. At Q4_K_M, it’s under 5GB. That’s a 70% reduction.
Lower VRAM requirements. The same model that needed 18GB of VRAM now runs on 8GB. Suddenly your RTX 3060 can run models that previously required a 3090.
Faster loading times. Smaller files load faster. A Q4_K_M model loads in seconds versus a minute or more for FP16.
More headroom for context. VRAM not used by model weights can be used for longer conversations (larger KV cache). Quantization indirectly gives you longer context windows.
What You Lose
Some precision. Every quantization level introduces small errors. Those errors compound through the model’s calculations.
Potentially worse output. On complex tasks—multi-step reasoning, precise math, nuanced creative writing—highly quantized models may produce slightly worse results.
Diminishing returns at extremes. Q4 to Q3 saves less space than Q8 to Q4, but the quality drop is more noticeable. Below Q3, quality degrades rapidly.
Here’s the good news: for most tasks, the quality loss is barely perceptible. You’d need to run careful benchmarks to notice the difference between Q4_K_M and Q8_0 in casual conversation.
Common Quantization Formats Explained
The GGUF Naming System
When you see a filename like Q4_K_M, here’s what each part means:
- Q = Quantized
- Number (2-8) = Nominal bits per weight (lower = smaller file, more compression)
- K = K-quant method (newer, better than legacy methods)
- S/M/L = Size variant (Small, Medium, Large—refers to how different layers are quantized)
So Q4_K_M means: 4-bit quantization, using the K-quant method, medium variant.
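The naming scheme is regular enough to parse mechanically. Here is a hypothetical helper (not part of llama.cpp or any library) that unpacks it, including the older legacy names covered below:

```python
import re

# Hypothetical helper: extract bits, method, and size variant from a
# GGUF filename following the naming scheme described above.
def parse_quant(filename: str):
    m = re.search(r"Q(\d+)_K(?:_([SML]))?", filename)   # K-quants: Q4_K_M, Q6_K
    if m:
        return {"bits": int(m.group(1)), "method": "K-quant", "variant": m.group(2)}
    m = re.search(r"Q(\d+)_(\d)", filename)             # legacy: Q8_0, Q4_1
    if m:
        return {"bits": int(m.group(1)), "method": "legacy", "variant": None}
    return None  # not a quantized GGUF name (e.g. F16)

print(parse_quant("llama-3.1-8b-instruct-Q4_K_M.gguf"))
# {'bits': 4, 'method': 'K-quant', 'variant': 'M'}
```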
Why K-Quants Are Better
Older quantization methods (Q4_0, Q4_1, Q5_0, etc.) used simple uniform rounding. K-quants are smarter: they use a two-level scheme that preserves more important weights at higher precision while compressing less critical ones more aggressively.
The result: K-quants achieve better quality at the same file size. Always prefer K-quants over legacy formats. If you see Q4_0 and Q4_K_M available, pick Q4_K_M.
Format Breakdown
| Format | Bits | Relative Size | Quality | Use Case |
|---|---|---|---|---|
| FP16/BF16 | 16 | Largest (100%) | Perfect baseline | Benchmarking, max quality |
| Q8_0 | 8 | ~50% | Near-lossless | When VRAM isn’t tight |
| Q6_K | 6.5 | ~40% | Excellent | Quality-sensitive tasks |
| Q5_K_M | 5.5 | ~35% | Very good | Coding, reasoning, writing |
| Q5_K_S | 5.25 | ~33% | Good | Slight quality trade for size |
| Q4_K_M | 4.5 | ~30% | Good (sweet spot) | General use, recommended |
| Q4_K_S | 4 | ~28% | Acceptable | Memory-constrained |
| Q3_K_M | 3.5 | ~22% | Noticeable loss | Very tight VRAM only |
| Q2_K | 2.5 | ~18% | Significant loss | Extreme cases only |
The Winner: Q4_K_M
For most people, Q4_K_M is the right choice. Here’s why:
- 70% smaller than FP16
- Runs on 8GB GPUs (for 7B models)
- Retains ~90-95% of original quality
- Fast inference
- Widely available for most models
It’s marked as “recommended” by llama.cpp for good reason. Start here unless you have a specific reason not to.
How Much VRAM You Actually Save
Let’s look at real file sizes for popular models:
Llama 3.1 8B Instruct
| Format | File Size | VRAM Needed (approx) |
|---|---|---|
| F16 | 16.1 GB | 18-20 GB |
| Q8_0 | 8.5 GB | 10-12 GB |
| Q6_K | 6.6 GB | 8-10 GB |
| Q5_K_M | 5.7 GB | 7-9 GB |
| Q4_K_M | 4.9 GB | 6-8 GB |
| Q4_K_S | 4.7 GB | 6-7 GB |
Quick VRAM Estimation Formula
For a rough estimate of VRAM needed:
VRAM ≈ (Parameters × Bits per Weight ÷ 8) + 1-2 GB overhead
For a 7B model at Q4 (4 bits):
(7B × 4 ÷ 8) + 1.5GB = 3.5GB + 1.5GB = ~5GB VRAM
Real-world usage is slightly higher due to KV cache and runtime overhead, but this gets you in the ballpark.
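The same estimate as a function, for plugging in your own numbers. The 1.5GB default is a rough allowance for KV cache and runtime buffers, not an exact figure:

```python
# VRAM estimate: (parameters x bits-per-weight / 8) + fixed overhead.
# The overhead is a rough allowance for KV cache and runtime buffers.
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.5) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

print(estimate_vram_gb(7, 4))    # 5.0 -- the worked example above
print(estimate_vram_gb(8, 4.5))  # 6.0 -- Llama 3.1 8B at Q4_K_M
```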
What Different VRAM Amounts Get You
| Your VRAM | What You Can Run |
|---|---|
| 6 GB | 7B at Q4_K_S, smaller models at higher quants |
| 8 GB | 7B at Q4_K_M comfortably, 13B at Q3 (slow) |
| 12 GB | 7B at Q6_K, 13B at Q4_K_M, small 30B at Q3 |
| 16 GB | 7B at Q8, 13B at Q5_K_M, 30B at Q4 |
| 24 GB | Almost anything at Q4_K_M or higher |
→ Use our Planning Tool to check exact VRAM for your setup.
Quality Impact: When You Notice, When You Don’t
Tasks Where Quantization Barely Matters
For these use cases, Q4_K_M performs nearly identically to Q8 or FP16:
- Casual conversation — Chatting, Q&A, brainstorming
- Simple coding tasks — Boilerplate, syntax help, basic debugging
- Summarization — Condensing text
- Translation — Common language pairs
- Creative writing — First drafts, idea generation
If you’re using a local LLM as a general assistant, Q4_K_M is plenty.
Tasks Where Quality Matters More
For these, consider Q5_K_M or Q6_K:
- Complex reasoning — Multi-step logic, math problems
- Precise coding — Subtle bugs, complex algorithms
- Instruction following — Very specific formatting requirements
- Long-context tasks — Maintaining coherence over many pages
- Factual retrieval — When accuracy of specific details matters
The difference isn’t dramatic—maybe 5-10% worse on benchmarks—but if you’re doing serious work and have the VRAM, it’s worth going higher.
The Perplexity Numbers
Perplexity measures how “surprised” a model is by text—lower is better. Here’s how quantization affects it (Llama 2 7B):
| Format | Perplexity | Change from FP16 |
|---|---|---|
| FP16 | 7.49 | baseline |
| Q8_0 | 7.49 | +0.00 (negligible) |
| Q6_K | 7.53 | +0.04 |
| Q5_K_M | 7.54 | +0.05 |
| Q4_K_M | 7.57 | +0.08 |
| Q4_K_S | 7.61 | +0.12 |
| Q3_K_M | 7.76 | +0.27 |
| Q2_K | 8.65 | +1.16 |
Notice how small the differences are until you hit Q3 and below. The jump from Q4_K_M to Q2_K is larger than FP16 to Q4_K_M.
Important caveat: Perplexity doesn’t tell the whole story. Some quantized models score worse on perplexity but perform similarly (or even better) on specific benchmarks. Always test on your actual use case.
How to Choose the Right Quant for Your Hardware
Match Your VRAM
| Your VRAM | Recommended Quant for 7-8B | Recommended Quant for 13B |
|---|---|---|
| 6 GB | Q4_K_S (tight fit) | Too big |
| 8 GB | Q4_K_M | Q3_K_M (slow) |
| 12 GB | Q6_K or Q5_K_M | Q4_K_M |
| 16 GB | Q8_0 | Q5_K_M or Q6_K |
| 24 GB | FP16 (why not?) | Q8_0 |
The Decision Flowchart
- Does the model fit at Q4_K_M? Start there. It’s the sweet spot.
- Want better quality? Try Q5_K_M or Q6_K. Worth it for coding and reasoning.
- Still too big? Drop to Q4_K_S or Q3_K_M. Expect some quality loss.
- Have VRAM to spare? Go Q8_0 or higher. Diminishing returns, but why not.
- Q3 still too big? You need a smaller model, not more compression.
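The flowchart can be sketched as code: try quants from highest quality down, take the first one that fits, and bail out to “get a smaller model” if even Q3 doesn’t. The bits-per-weight values and the 2GB overhead are the rough figures used throughout this guide, not output from any official tool:

```python
# The decision flowchart as a sketch: best quality first, first fit wins.
# Bits-per-weight are approximate; overhead is a conservative allowance
# for KV cache and runtime buffers.
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.5), ("Q5_K_M", 5.5),
          ("Q4_K_M", 4.5), ("Q4_K_S", 4.0), ("Q3_K_M", 3.5)]

def pick_quant(params_billion: float, vram_gb: float, overhead_gb: float = 2.0):
    for name, bits in QUANTS:
        if params_billion * bits / 8 + overhead_gb <= vram_gb:
            return name
    return None  # even Q3 is too big: you need a smaller model

print(pick_quant(7, 6))    # 'Q4_K_M' -- 7B on a 6 GB card
print(pick_quant(70, 12))  # None -- 70B can't fit, compression won't save you
```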
The Bigger Model Rule
Here’s a key insight: a larger model at lower quantization often beats a smaller model at higher quantization.
Example: A 13B model at Q4_K_M typically outperforms a 7B model at Q8_0—even though the 7B has higher precision. Model capability matters more than quantization level.
If you’re choosing between:
- Llama 3.1 8B at Q8_0 (~8.5 GB)
- Llama 3.1 70B at Q4_K_M (~40 GB)
And both fit in your VRAM? Take the 70B. It’s not even close.
Where to Find Quantized Models
Hugging Face
The main source. Look for uploaders like:
- bartowski — Reliable, consistent, well-documented
- TheBloke — Huge library (mostly older models now)
- QuantFactory — Good selection of newer models
Search for your model name + “GGUF” and you’ll find options.
Ollama
Pre-quantized and ready to run. When you do ollama pull llama3.1:8b, you’re getting a quantized version (typically Q4_K_M equivalent). No decisions needed. If you’re new to Ollama, our beginner’s guide walks through the full setup.
ollama pull llama3.1:8b # Default quantization
ollama pull llama3.1:8b-instruct-q8_0 # Higher quality, more VRAM
LM Studio
Built-in model browser with Hugging Face integration. Filter by quantization level, see file sizes, one-click download. Good for exploring options visually.
Quick Reference Table
| Format | File Size (8B) | VRAM (8B) | Quality | Speed | Best For |
|---|---|---|---|---|---|
| FP16 | 16 GB | 18-20 GB | 100% | Baseline | Benchmarking |
| Q8_0 | 8.5 GB | 10-12 GB | ~99% | Fast | When VRAM allows |
| Q6_K | 6.6 GB | 8-10 GB | ~97% | Fast | Quality-sensitive work |
| Q5_K_M | 5.7 GB | 7-9 GB | ~95% | Fast | Coding, reasoning |
| Q4_K_M | 4.9 GB | 6-8 GB | ~92% | Fast | General use (recommended) |
| Q4_K_S | 4.7 GB | 6-7 GB | ~90% | Fastest | Memory-constrained |
| Q3_K_M | 3.8 GB | 5-6 GB | ~85% | Faster | Very tight VRAM |
| Q2_K | 3.0 GB | 4-5 GB | ~70% | Fastest | Last resort |
The Bottom Line
Quantization lets you run AI models that wouldn’t otherwise fit on your hardware. It’s not magic—you’re trading some precision for smaller size—but the tradeoff is usually worth it.
The practical advice:
Start with Q4_K_M. It’s the default for a reason. Good quality, runs on most hardware, widely available.
Go higher if you can. Have 12GB+ VRAM? Try Q5_K_M or Q6_K. The quality bump is noticeable for coding and reasoning tasks.
Go lower only if you must. Q3 and Q2 exist for extreme cases. Expect quality loss.
Model size > quantization level. A bigger model at Q4 beats a smaller model at Q8—almost always, as long as the bigger model still fits.
Test on your actual tasks. Benchmarks are useful, but your experience is what matters. If Q4_K_M works for what you do, that’s your answer.
Stop overthinking it. Download Q4_K_M, start using the model, and only revisit the decision if you hit actual limitations.