Context Length Explained: Why It Eats Your VRAM
More on this topic: VRAM Requirements Guide · Quantization Explained · What Can You Run on 8GB VRAM · What Can You Run on 24GB VRAM
Context length is one of those specs that sounds impressive in marketing (“128K context!”) but causes real confusion when you’re trying to run models locally. More context sounds better, but it directly competes with your VRAM, and most people don’t need anywhere near the advertised maximums.
This guide explains what context length actually is, why it matters, how it affects your hardware, and when you genuinely need those big numbers.
What Is Context Length?
Context length (or “context window”) is the maximum number of tokens the model can process at once. This includes:
- Your system prompt (instructions to the model)
- Conversation history (previous messages back and forth)
- Your current input (what you’re asking now)
- The model’s response (what it generates)
Everything has to fit within the context window. If you exceed it, older content gets truncated: the model literally forgets the beginning of the conversation.
Tokens vs Words
Tokens aren’t exactly words. A rough approximation:
- English: ~1.3 tokens per word (or ~0.75 words per token)
- Code: Higher token density due to special characters
- Non-English: Varies significantly by language
Practical conversions:
| Tokens | Approximate Words | Rough Page Count |
|---|---|---|
| 2,048 | ~1,500 | ~3 pages |
| 4,096 | ~3,000 | ~6 pages |
| 8,192 | ~6,000 | ~12 pages |
| 32,768 | ~24,000 | ~50 pages |
| 128,000 | ~96,000 | ~200 pages |
A typical novel is 80,000-100,000 words, or roughly 100K-130K tokens.
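These conversions are easy to approximate in code. Here is a minimal sketch using the ~1.3 tokens-per-word rule of thumb from above (real counts depend on the model’s tokenizer):

```python
def estimate_tokens(word_count: int, tokens_per_word: float = 1.3) -> int:
    """Rough English-text token estimate; code and non-English text run higher."""
    return round(word_count * tokens_per_word)

# A 90,000-word novel lands around 117,000 tokens
print(estimate_tokens(90_000))
```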
Why Context Length Matters
Conversation Memory
Short context = short memory. With 2K context, a long conversation will “forget” what you discussed earlier. The model isn’t being dumb; it literally can’t see those earlier messages anymore.
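That forgetting is usually an explicit truncation step performed by the chat client. Here is a minimal sketch of the rolling-window behavior, assuming a simple list-of-dicts message format and the crude word-based token estimate from earlier (both illustrative, not any particular tool’s API):

```python
def trim_history(messages: list[dict], max_tokens: int = 8192) -> list[dict]:
    """Drop the oldest non-system messages until a rough token estimate fits."""
    def est(msg: dict) -> int:
        return int(len(msg["content"].split()) * 1.3)  # crude word-based estimate

    system, rest = messages[:1], list(messages[1:])
    while rest and sum(map(est, system + rest)) > max_tokens:
        rest.pop(0)  # the oldest exchange is forgotten first
    return system + rest
```

Real clients count tokens with the model’s own tokenizer, but the shape is the same: the system prompt stays pinned while the oldest turns fall off.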
Document Analysis
Want to analyze a 50-page PDF? That’s roughly 25K-30K tokens. If your context window is 8K, you can only see about 12 of those pages at once.
Complex Tasks
Multi-step tasks accumulate context. If you’re debugging code iteratively, each exchange adds to the history. Long debugging sessions can exhaust context quickly.
Code Understanding
Codebases are token-heavy. A single file might be 1,000+ tokens. Analyzing multiple files requires substantial context.
How Context Length Affects VRAM
Here’s the part most guides skip: longer context requires more VRAM, and the relationship is significant.
The KV Cache Problem
When a model generates text, it stores intermediate calculations in something called the Key-Value (KV) cache. This cache:
- Grows linearly with context length
- Exists separately from model weights
- Can become the dominant memory consumer at long contexts
Example: 8B model VRAM breakdown
| Component | 4K Context | 32K Context |
|---|---|---|
| Model weights (Q4) | ~5GB | ~5GB |
| KV cache (FP16) | ~0.5GB | ~4.5GB |
| Overhead | ~0.5GB | ~0.5GB |
| Total | ~6GB | ~10GB |
The model weights stay constant. The KV cache scales with context. At 32K context, the cache nearly equals the model size.
KV Cache Memory Formula
A rough estimate for KV cache VRAM:
KV Cache (GB) ≈ (context_length × num_layers × kv_dim × 4) / 1024³
where kv_dim = num_kv_heads × head_dim (smaller than the full hidden size in models that use grouped-query attention), and the factor of 4 covers two tensors (K and V) at 2 bytes each for FP16.
Or more practically: ~0.13 MB per token for a typical 7-8B model with FP16 cache.
| Context | KV Cache (8B, FP16) |
|---|---|
| 4,096 | ~0.5 GB |
| 8,192 | ~1.0 GB |
| 16,384 | ~2.1 GB |
| 32,768 | ~4.3 GB |
| 65,536 | ~8.5 GB |
| 131,072 | ~17 GB |
For larger models, the cache scales with layer count and KV width rather than raw parameter count: a Llama 3 70B’s KV cache is roughly 2.5x a Llama 3 8B’s at the same context length, because grouped-query attention keeps the per-layer cache narrow even as the model grows.
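The formula is easy to sanity-check against a published model config. A sketch using Llama 3 8B’s attention geometry (32 layers, 8 KV heads, head dimension 128) as the example values:

```python
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """Per token, each layer stores K and V tensors of n_kv_heads * head_dim elements."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx * per_token_bytes / 1024**3

# Llama 3 8B at 32K context with an FP16 cache
print(f"{kv_cache_gib(32_768, 32, 8, 128):.1f} GiB")  # -> 4.0 GiB
```

Note how much grouped-query attention matters here: with full multi-head attention (32 KV heads instead of 8), the same cache would be four times larger.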
The Real Constraint
This is why a model that “supports 128K context” might not actually run at 128K on your hardware. The model weights fit fine, but the KV cache at full context would overflow your VRAM.
Practical Context Limits by VRAM
Given the KV cache overhead, here are realistic maximum contexts for different VRAM tiers:
8GB VRAM (RTX 3060, 4060)
| Model Size | Max Practical Context |
|---|---|
| 7B Q4 | 8K-16K |
| 7B Q8 | 4K-8K |
| 13B Q4 | 4K-8K |
12GB VRAM (RTX 3060 12GB, 4070)
| Model Size | Max Practical Context |
|---|---|
| 7B Q4 | 32K+ |
| 7B Q8 | 16K-32K |
| 13B Q4 | 8K-16K |
16GB VRAM (RTX 4060 Ti 16GB, 4080)
| Model Size | Max Practical Context |
|---|---|
| 7B Q4 | 64K+ |
| 13B Q4 | 16K-32K |
| 30B Q4 | 8K-16K |
24GB VRAM (RTX 3090, 4090)
| Model Size | Max Practical Context |
|---|---|
| 7B Q4 | 128K+ |
| 13B Q4 | 32K-64K |
| 30B Q4 | 16K-32K |
| 70B Q4 | 4K-8K |
These are approximate โ actual limits depend on the specific model architecture and your inference software.
→ Use our Planning Tool to check exact VRAM for your setup.
Reducing Context VRAM Usage
If you’re hitting VRAM limits, several techniques can help:
1. KV Cache Quantization
Just like model weights, the KV cache can be quantized. Many inference engines support this:
- FP16 cache: Default, highest quality
- Q8 cache: Half the memory, minimal quality loss
- Q4 cache: Quarter the memory, some quality loss
In llama.cpp, pass --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0). In Ollama, set the OLLAMA_KV_CACHE_TYPE environment variable to q8_0 or q4_0; this requires flash attention to be enabled (OLLAMA_FLASH_ATTENTION=1).
Impact example (8B model at 32K context):
| Cache Type | KV Cache Size |
|---|---|
| FP16 | ~4.5 GB |
| Q8 | ~2.3 GB |
| Q4 | ~1.1 GB |
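Because these cache types store roughly 16, 8, and 4 bits per element, you can derive the table from any FP16 estimate. A quick sketch, using the ~4.5 GB FP16 figure above as the starting point:

```python
def quantized_cache_gb(fp16_gb: float, bits: int) -> float:
    """Scale an FP16 (16-bit) KV-cache estimate to a quantized cache type."""
    return fp16_gb * bits / 16

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name}: ~{quantized_cache_gb(4.5, bits):.1f} GB")
```

In practice q8_0 and q4_0 carry a small per-block scale overhead, so measured sizes run slightly above this idealized ratio.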
2. Flash Attention
Flash Attention is an optimized attention algorithm that computes exactly the same result while never materializing the full attention matrix in memory. Most modern inference engines (llama.cpp, Ollama, LM Studio) support it, though some require enabling it explicitly (for example OLLAMA_FLASH_ATTENTION=1 in Ollama).
Benefits:
- Attention memory grows linearly with context instead of quadratically
- Faster inference, especially at long contexts
- No quality loss (the computation is mathematically equivalent)
3. Sliding Window Attention
Some models (like Mistral) use sliding window attention, which only attends to recent tokens rather than the full context. This caps memory usage regardless of total context length.
4. Just Use Less Context
The simplest solution. If you don’t need 32K context, don’t configure it. Most inference tools let you set the context length; use only what you need.
When Do You Actually Need Long Context?
Marketing pushes big context numbers, but most use cases don’t need them.
4K-8K Context (Most People)
Sufficient for:
- Normal conversations and Q&A
- Analyzing short documents (10-15 pages)
- Code assistance on individual files
- Writing and editing tasks
- Most daily use
16K-32K Context (Power Users)
Useful for:
- Longer documents (30-50 pages)
- Multi-file code analysis
- Extended conversations with full history
- Detailed research summaries
64K-128K+ Context (Specialized Use)
Actually needed for:
- Full book analysis
- Large codebase understanding
- Legal document review
- Academic paper analysis with citations
- Long-form content generation
The Honest Assessment
Most local LLM users operate fine with 8K context. The jump to 32K covers almost all remaining use cases. 128K is genuinely useful only for specific professional workflows.
The “Lost in the Middle” Problem
Here’s something the marketing doesn’t mention: models are worse at using information in the middle of long contexts.
Research consistently shows:
- Models excel at information near the beginning (primacy bias)
- Models excel at information near the end (recency bias)
- Models struggle with information buried in the middle
This means stuffing 100K tokens of context doesn’t guarantee the model will use all of it effectively. For tasks requiring information from throughout a long document, RAG (Retrieval-Augmented Generation) often outperforms raw long context.
Practical Implication
If you’re analyzing a long document:
- Put the most important context at the beginning or end
- Consider chunking and summarizing rather than feeding everything at once
- RAG might work better than maxing out context
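Chunking can be as simple as an overlapping word-window split, summarizing each piece separately. A minimal sketch (the chunk size and overlap are arbitrary illustrative values, sized so each chunk fits a ~4K-token window):

```python
def chunk_text(text: str, max_words: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of ~3,000 words (~4K tokens each)."""
    words = text.split()
    chunks, step = [], max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```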
Long context is a tool, not a magic solution.
Context Length by Model
Different models support different maximum contexts:
| Model | Native Context | Notes |
|---|---|---|
| Llama 3.1/3.2/3.3 | 128K | Full 128K support |
| Llama 3 (original) | 8K | Extended versions available |
| Qwen 2.5 / Qwen 3 | 32K-128K | Varies by size |
| Mistral 7B | 32K | Sliding window attention |
| Mistral Nemo 12B | 128K | Full 128K support |
| Mixtral 8x7B | 32K | MoE architecture |
| DeepSeek V3 | 128K | Full support |
| Phi-4 | 16K | Smaller context |
“Native context” means what the model was trained on. You can sometimes extend beyond this with techniques like RoPE scaling, but quality degrades.
Checking Your Context Usage
In Ollama
# Check the model's configured parameters
ollama show modelname --modelfile
# Set the context length inside an interactive session
ollama run modelname
>>> /set parameter num_ctx 8192
In LM Studio
Context length is visible in the model settings panel. You can adjust it before loading the model.
Monitoring During Use
Watch your VRAM usage as conversations grow. If you’re approaching your GPU’s limit, the model may slow down or crash.
Bottom Line
Context length determines how much the model can “see” at once; your conversation history, documents, and its own responses all count against this limit.
Key takeaways:
- More context = more VRAM. The KV cache grows linearly and can exceed the model weights at long contexts.
- Most people need 4K-8K. Normal conversations and document work fit fine in modest context.
- 16K-32K covers power use cases. Extended conversations, larger documents, multi-file code.
- 128K+ is specialized. Full books, large codebases, professional document analysis.
- Bigger isn’t always better. The “lost in the middle” problem means models don’t use long context perfectly.
- You can reduce KV cache memory. A quantized cache (Q8, Q4) saves significant VRAM with minimal quality loss.
The practical approach: Start with the default context (usually 2K-4K). Increase only when you hit actual limits. Don’t configure 128K context “just in case”; you’re wasting VRAM that could run a larger model or enable faster inference.
# Sensible defaults for most use
ollama run qwen3:8b # Default context is usually fine
# When you actually need more, raise num_ctx in the session
ollama run qwen3:8b
>>> /set parameter num_ctx 16384
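If you always want a specific context length for a model, Ollama also lets you bake it into a custom tag with a Modelfile (the qwen3-16k tag name here is just an example):

```
FROM qwen3:8b
PARAMETER num_ctx 16384
```

Then ollama create qwen3-16k -f Modelfile gives you a model that always loads with 16K context.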
Match your context to your actual needs, not the marketing specs.