Llama 3 Guide: Every Size from 1B to 405B
๐ More on this topic: Qwen Models Guide ยท DeepSeek Models Guide ยท Mistral & Mixtral Guide ยท VRAM Requirements
Meta’s Llama 3 is the most recognizable name in open-weight AI. It’s the model most people start with, the base for thousands of community fine-tunes, and the reason “run your own LLM” became a mainstream idea.
But the naming is a mess. Llama 3.1, 3.2, 3.3 โ they’re not sequential upgrades. They’re different model families released for different purposes, and picking the wrong one wastes your VRAM on a worse model. This guide cuts through it.
The Llama 3 Family at a Glance
| Model | Parameters | Release | Context | Type | Ollama Command |
|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | Sep 2024 | 128K | Text only | ollama pull llama3.2:1b |
| Llama 3.2 3B | 3.21B | Sep 2024 | 128K | Text only | ollama pull llama3.2:3b |
| Llama 3.1 8B | 8.03B | Jul 2024 | 128K | Text only | ollama pull llama3.1:8b |
| Llama 3.2 Vision 11B | 10.67B | Sep 2024 | 128K | Text + Images | ollama pull llama3.2-vision:11b |
| Llama 3.3 70B | 70.6B | Dec 2024 | 128K | Text only | ollama pull llama3.3:70b |
| Llama 3.2 Vision 90B | 88.8B | Sep 2024 | 128K | Text + Images | ollama pull llama3.2-vision:90b |
| Llama 3.1 405B | 405B | Jul 2024 | 128K | Text only | Too large for most local setups |
All models support 128K context. All have instruct-tuned versions (the ones you want for chat). The base/pretrained versions exist too but are only useful for fine-tuning.
Which Version? 3.1 vs 3.2 vs 3.3
This is where people get confused. The version numbers don’t mean “3.3 is always better than 3.2.”
Llama 3.1 (July 2024) โ The big launch. 8B, 70B, and 405B text models. Extended context to 128K tokens. Multilingual support for 8 languages. The 8B is still the one to use at that size.
Llama 3.2 (September 2024) โ Two separate things packaged under one name:
- Small text models (1B, 3B) built for edge devices and mobile. Not downsized 8B models โ they’re distinct architectures using Grouped-Query Attention, optimized to be tiny.
- Vision models (11B, 90B) that can process images. Built by adding vision adapters to the Llama 3.1 text backbone. The 11B runs on 8 GB VRAM.
Llama 3.3 (December 2024) โ One model only: 70B text. Trained to match 3.1 405B performance at 70B size. Massive improvements over 3.1 70B in math (+9.2 on MATH), instruction following (92.1 IFEval, beating 405B), and multilingual (+4.2 on MGSM). This is the 70B you should use โ 3.1 70B is obsolete.
The version you actually want
| Size | Use This | Not This |
|---|---|---|
| 1B or 3B | Llama 3.2 | (only option) |
| 8B | Llama 3.1 | (only option) |
| 11B Vision | Llama 3.2 Vision | (only option) |
| 70B | Llama 3.3 | |
| 405B | Llama 3.1 | (only option, but consider the API instead) |
Every Size, Honestly
Llama 3.2 1B โ Edge Device Territory
VRAM: ~1-2 GB (Q4), ~3 GB (FP16)
Ollama: ollama pull llama3.2:1b
This is for phones, tablets, and embedded systems. It can summarize short text, do simple instruction following, and handle basic classification. It cannot hold a real conversation or do anything requiring reasoning.
Competitive with Gemma 2 at this size. Useful if you need something that runs on almost anything, including a Raspberry Pi with enough RAM. Not useful for much else.
Llama 3.2 3B โ Lightweight but Limited
VRAM: ~2-3 GB (Q4_K_M), ~6-7 GB (FP16)
Ollama: ollama pull llama3.2:3b
The 3B can handle short conversations, simple Q&A, summarization, and basic tool calling. Meta designed it for on-device use and it shows โ it outperforms Gemma 2 2.6B and Phi 3.5-mini on instruction following and summarization.
Supports 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai). The 128K context window is supported at full precision, though quantized versions typically cap at 8K.
Good for: a fast assistant on your laptop that doesn’t eat your GPU, preprocessing/routing tasks, or situations where you need an answer in milliseconds and don’t care if it’s occasionally wrong.
Not good for: coding, complex reasoning, creative writing, or anything you’d trust with an important decision.
Llama 3.1 8B โ The Workhorse
VRAM: ~5 GB (Q4_K_M), ~8.5 GB (Q8_0), ~16 GB (FP16)
Ollama: ollama pull llama3.1:8b
This is where Llama gets genuinely useful. The 8B handles conversation, coding assistance, writing, summarization, and light reasoning well enough for daily use. It runs comfortably on any modern GPU with 8 GB VRAM at Q4_K_M.
The benchmarks tell the story: it outperforms every open model from 2023 regardless of size, including the original Llama 2 70B. For a model that fits on a single consumer GPU, that’s remarkable.
The honest comparison: Qwen 3 8B now beats Llama 3.1 8B on most benchmarks, especially math, reasoning, and multilingual tasks. Qwen 3 also has a thinking mode toggle that Llama lacks. If you’re choosing purely on capability, Qwen 3 8B wins in early 2026.
So why use Llama 3.1 8B at all? Three reasons:
- Fine-tune ecosystem. Thousands of Llama 3 fine-tunes exist. Dolphin, Hermes, WizardLM โ the most popular community models are Llama-based. Qwen fine-tunes exist but the selection is smaller.
- Speed. Llama 3.1 8B is slightly faster at inference than Qwen 3 8B, particularly when Qwen’s thinking mode is active.
- Compatibility. Every tool, framework, and tutorial supports Llama. It’s the default model in almost every getting-started guide.
Llama 3.3 70B โ The Flagship
VRAM: ~43 GB (Q4_K_M), ~70 GB (Q8_0), ~140 GB (FP16)
Ollama: ollama pull llama3.3:70b
This is the model that matters. Llama 3.3 70B was trained to match the 405B’s performance, and on several benchmarks it actually exceeds it:
| Benchmark | Llama 3.1 70B | Llama 3.3 70B | Llama 3.1 405B |
|---|---|---|---|
| MATH (0-shot) | 67.8 | 77.0 | 73.8 |
| IFEval | โ | 92.1 | 88.6 |
| HumanEval | โ | 88.4 | 89.0 |
| MMLU (0-shot) | 86.0 | 86.0 | 88.6 |
| MGSM (multilingual) | 86.9 | 91.1 | โ |
| GPQA Diamond | 48.0 | 50.5 | โ |
The math improvement (+9.2 over 3.1 70B) is the biggest jump. Instruction following at 92.1 beats every model in its class including 405B. Multilingual went from good to excellent.
The hardware reality: 43 GB at Q4_K_M means no single consumer GPU can run it comfortably. Your options:
- Dual RTX 3090s (48 GB total) โ tight but works with short context
- Single RTX A6000 (48 GB) or similar workstation card
- CPU offload with 64+ GB system RAM โ slow (2-5 tok/s) but functional. Pull the model in Ollama and it handles offloading automatically
- Pure CPU with 64 GB+ RAM โ expect 1-3 tok/s depending on your processor
For more on running large models, see our 24 GB VRAM guide.
Llama 3.1 405B โ Impractical Locally
VRAM: ~230+ GB (Q4_K_M), ~405 GB (Q8_0)
Ollama: Technically ollama pull llama3.1:405b but you almost certainly can’t run it
Unless you have a rack of GPUs or a machine with 512+ GB of RAM (for pure CPU inference at glacial speeds), this isn’t a local model. Use it via API providers like Together AI, Fireworks, or Groq.
And honestly, Llama 3.3 70B makes 405B mostly redundant. The 70B matches or beats it on several benchmarks at 1/6th the hardware cost. The 405B’s advantage shows only on narrow edge cases and the hardest reasoning tasks.
VRAM Requirements
Text Models
| Model | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| 1B | ~1 GB | ~1.2 GB | ~1.5 GB | ~2.5 GB |
| 3B | ~2.5 GB | ~3 GB | ~4 GB | ~6.5 GB |
| 8B | ~5 GB | ~5.5 GB | ~8.5 GB | ~16 GB |
| 70B | ~43 GB | ~48 GB | ~70 GB | ~140 GB |
| 405B | ~230 GB | ~260 GB | ~405 GB | ~810 GB |
These are weights only. Add 1-2 GB for KV cache at default context lengths, more for longer context. A 70B model at 32K context adds ~14 GB for KV cache alone. At 128K, the KV cache exceeds 40 GB.
Tip: Set OLLAMA_KV_CACHE_TYPE=q8_0 to halve KV cache memory with minimal quality loss. This is especially valuable for the 70B model.
Vision Models
| Model | Q4_K_M | FP16 |
|---|---|---|
| 11B Vision | ~6-7 GB | ~22 GB |
| 90B Vision | ~50+ GB | ~180 GB |
The 11B Vision model is the practical choice. It fits on 8 GB GPUs at Q4 and handles image understanding, OCR, chart reading, and visual Q&A.
Which GPU for Which Model
| GPU | VRAM | Best Llama Model |
|---|---|---|
| GTX 1660 / RTX 3060 | 6 GB | 3B (Q4_K_M), 8B (Q3-Q4 tight) |
| RTX 3060 12GB / 4060 Ti 16GB | 12-16 GB | 8B (Q8_0), 11B Vision (Q4) |
| RTX 3090 / 4090 | 24 GB | 8B (FP16), 11B Vision (Q8) |
| 2x RTX 3090 / RTX A6000 | 48 GB | 70B (Q4_K_M, short context) |
| 4x RTX 3090 / 2x A6000 | 96 GB | 70B (Q8_0 with headroom) |
โ Use our Planning Tool to check exact VRAM for your setup.
Llama 3 vs the Competition (Early 2026)
The open model landscape has shifted since Llama 3’s releases. Here’s where things stand.
At 8B: Llama 3.1 vs Qwen 3 vs Gemma 3
| Llama 3.1 8B | Qwen 3 8B | Gemma 3 12B | |
|---|---|---|---|
| Math | Good | Best | Good |
| Coding | Good | Best | Good |
| Reasoning | Good | Best (thinking mode) | Better |
| Multilingual | 8 languages | 29+ languages | Good |
| Speed | Fastest | Slightly slower (thinking mode) | Moderate |
| Fine-tune ecosystem | Largest | Growing | Smaller |
| Vision | No (separate 11B) | Qwen 3 VL available | Yes (built-in) |
Verdict at 8B: Qwen 3 8B wins on benchmarks. Llama 3.1 8B wins on ecosystem and compatibility. If you want the best model out of the box, pull Qwen 3. If you want the most fine-tune options or need a specific community model, Llama is still the base to build on.
At 70B: Llama 3.3 vs Qwen 2.5 72B vs DeepSeek R1
| Llama 3.3 70B | Qwen 2.5 72B | DeepSeek R1 Distill 70B | |
|---|---|---|---|
| General knowledge | Strong | Strong | Good |
| Math / Reasoning | Good (77.0 MATH) | Good | Best (94.5 MATH-500) |
| Coding | Strong (88.4 HumanEval) | Strong | Best (57.5 LiveCodeBench) |
| Instruction following | Best (92.1 IFEval) | Good | Weaker |
| Speed | Fastest | Comparable | Slower (thinking tokens) |
| Multilingual | Strong (91.1 MGSM) | Best | Moderate |
Verdict at 70B: It depends on your task. Llama 3.3 is the best general-purpose 70B โ fast, great at following instructions, strong across the board. DeepSeek R1 Distill 70B crushes it on math and reasoning but is slower and narrower. Qwen 2.5 72B is comparable to Llama 3.3 overall. For pure reasoning, see our DeepSeek guide.
The Bigger Picture
Qwen 3 (from Alibaba) has arguably overtaken Llama as the leading open model family on raw benchmarks. But Llama retains three major advantages:
- The fine-tune ecosystem. More community fine-tunes exist for Llama than all other families combined. Dolphin, Hermes, WizardLM, Dark Champion โ if you want an uncensored, creative, or specialized model, it’s probably Llama-based.
- Licensing simplicity. Meta’s Llama license is permissive for commercial use (under 700M monthly users). Qwen uses Apache 2.0, which is also fine, but Llama’s ecosystem maturity matters.
- Tool compatibility. Every local AI tool is tested against Llama first. When something breaks, Llama gets fixed first.
Multimodal: Llama 3.2 Vision
Llama 3.2 Vision is the easiest way to run a multimodal model locally. The 11B version handles images alongside text โ describe photos, read charts, extract text from documents, answer questions about diagrams.
Setup
ollama pull llama3.2-vision:11b
Using It
Command line:
ollama run llama3.2-vision:11b "Describe this image" --image ./photo.jpg
From the API:
import ollama
response = ollama.chat(
model='llama3.2-vision:11b',
messages=[{
'role': 'user',
'content': 'What does this chart show?',
'images': ['./chart.png']
}]
)
print(response['message']['content'])
What It Can Do
- Image captioning โ describe what’s in a photo
- OCR โ read text from images, receipts, screenshots
- Chart/graph analysis โ interpret data visualizations
- Visual Q&A โ answer specific questions about image content
- Document understanding โ extract information from scanned pages
Limitations
- Single image per prompt (no multi-image comparison)
- No video support
- The 11B is noticeably less capable than the 90B on complex visual reasoning
- Can hallucinate details in images, especially small text or crowded scenes
- Knowledge cutoff means it won’t recognize very recent products or people
VRAM
The 11B Vision fits on 8 GB at Q4 quantization. At default precision, expect ~11 GB. The 90B requires 64+ GB โ impractical for most local setups.
The Fine-Tune Ecosystem
This is Llama’s killer advantage. No other model family has as many community fine-tunes available in Ollama and on Hugging Face.
Worth Knowing About
| Fine-Tune | Base | Best For | Ollama |
|---|---|---|---|
| Dolphin 3.0 | Llama 3 | General purpose, uncensored, function calling | ollama pull dolphin-llama3 |
| Nous Hermes 3 | Llama 3.2 8B | Creative writing, roleplay, long-form | ollama pull hermes3 |
| Nemotron | Llama 3.1 70B | NVIDIA-tuned, strong benchmarks | ollama pull nemotron |
Dolphin (by Eric Hartford) is the go-to uncensored Llama model. Trained on diverse datasets, it removes refusals while maintaining quality. Dolphin 3.0 emphasizes precision and function calling.
Nous Hermes 3 from NousResearch is consistently praised as the best creative writing model in the open-source space. Uses ChatML format for structured conversations. If you want fiction, roleplay, or imaginative brainstorming, this is the model.
Nemotron is NVIDIA’s fine-tune of Llama 3.1 70B, optimized for instruction following. It benchmarks well but requires the same 70B hardware.
Fine-Tuning Your Own
If you want to create a custom Llama 3 fine-tune, the most popular tools in 2026:
- Unsloth โ 2x faster, 70% less VRAM than standard fine-tuning. Supports LoRA and QLoRA. The easiest starting point.
- Axolotl โ More configurable, supports RLHF/RLVR, good for advanced users.
- Hugging Face TRL + PEFT โ The standard library approach, most tutorials available.
You can fine-tune Llama 3.1 8B on a single 16 GB GPU with QLoRA. The 70B needs at least 48 GB for LoRA fine-tuning. For a deep dive, our fine-tuning guide is coming soon.
Setup Guide
Basic: Pull and Run
# Install Ollama if you haven't
# https://ollama.com/download
# The 8B โ fits on any modern GPU
ollama pull llama3.1:8b
ollama run llama3.1:8b
# The 3B โ for lightweight tasks
ollama pull llama3.2:3b
# The 70B โ needs 48+ GB VRAM or CPU offload with 64 GB RAM
ollama pull llama3.3:70b
# Vision โ image understanding
ollama pull llama3.2-vision:11b
Custom Modelfile
Create a file called Modelfile:
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM """You are a helpful coding assistant. Be concise. Show code examples when relevant. If you're unsure about something, say so."""
Then:
ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant
Recommended Quantizations
| Your VRAM | Model | Quantization |
|---|---|---|
| 4 GB | 3B | Q4_K_M |
| 6 GB | 8B | Q4_K_M (tight) |
| 8 GB | 8B | Q4_K_M (comfortable) |
| 12 GB | 8B | Q8_0 or 11B Vision Q4 |
| 16 GB | 8B | FP16 (full quality) |
| 24 GB | 8B FP16 + 11B Vision Q8 | Switch between them |
| 48 GB+ | 70B | Q4_K_M |
For how quantization affects quality, see our quantization guide.
Common Problems
“The 70B is too slow.”
If you don’t have 48+ GB of VRAM, the model spills to system RAM. That’s a 10-30x slowdown. Check with ollama ps โ if “GPU%” is below 100%, you’re offloading. Options: use Q3_K_M instead of Q4_K_M, reduce context with num_ctx 4096, or use the 8B instead.
“Which Llama 3 should I pull?”
- 8B text:
llama3.1:8b - 70B text:
llama3.3:70b(not llama3.1:70b) - Small/edge:
llama3.2:3borllama3.2:1b - Vision:
llama3.2-vision:11b
“Context is too short.”
Default Ollama context is 2048 tokens. Override it: ollama run llama3.1:8b --num-ctx 8192. Or set it in your Modelfile with PARAMETER num_ctx 8192. Going above 32K on the 8B starts eating significant VRAM โ each doubling of context roughly doubles the KV cache.
“My Llama 3 model refuses everything.”
The official instruct models have safety guardrails. For less restrictive behavior, use community fine-tunes like Dolphin (ollama pull dolphin-llama3) or Nous Hermes (ollama pull hermes3). See our uncensored models guide (coming soon) for details.
“Llama 3 keeps repeating itself.”
Lower the temperature (try 0.6-0.7) and increase repeat_penalty (try 1.1-1.2). This is more common at lower quantizations and with longer conversations. Using Q5_K_M instead of Q4_K_M can help.
“Should I use Llama 3 or Qwen 3?” For raw capability in 2026: Qwen 3 wins at most sizes. For the fine-tune ecosystem, compatibility, and community support: Llama 3. If you need a specific community model (Dolphin, Hermes, etc.), those are Llama-based. If you want the best out-of-the-box experience, try Qwen 3 first.
Bottom Line
Llama 3 isn’t the benchmark leader it was in 2024. Qwen 3 has overtaken it on most leaderboards, and DeepSeek R1 distills dominate for reasoning. But Llama remains the most important open model family for three reasons: the fine-tune ecosystem is unmatched, tool compatibility is universal, and Meta keeps shipping.
The practical recommendations:
- Tight on VRAM (4-8 GB)? Pull
llama3.1:8bat Q4_K_M. It’s the reliable workhorse. - Want the best small model? Try
qwen3:8bfirst, fall back tollama3.1:8bif you need a specific Llama fine-tune. - Have 48+ GB VRAM?
llama3.3:70bis excellent for general use. Still one of the best 70B models available. - Need vision?
llama3.2-vision:11bis the easiest multimodal setup in local AI. - Need a specialized model? Check if a Llama fine-tune exists for your use case โ it probably does.
For choosing between all the model families, see our best models for chat guide and DeepSeek models guide.