Mistral & Mixtral Guide: Every Model Worth Running Locally
More on this topic: Llama 3 Guide · Qwen Models Guide · DeepSeek Models Guide · VRAM Requirements
Mistral AI burst onto the scene in late 2023 with a 7B model that embarrassed much larger competitors. Mixtral 8x7B introduced Mixture of Experts to the open-source world. For a while, Mistral was the default answer to “what should I run locally?”
That’s no longer true. Llama 3 and Qwen 3 have caught up and passed Mistral on most benchmarks. But Mistral models are still solid, particularly Mistral Nemo 12B with its 128K context window, and understanding the lineup helps you make informed choices.
This guide covers what’s worth running, what’s been superseded, and where Mistral still makes sense.
The Mistral Lineup
Every Mistral model relevant for local use:
| Model | Parameters | Context | VRAM (Q4) | License | Status |
|---|---|---|---|---|---|
| Mistral 7B | 7B | 8K | ~4 GB | Apache 2.0 | Dated but usable |
| Mistral Nemo 12B | 12B | 128K | ~8 GB | Apache 2.0 | Recommended |
| Codestral 22B | 22B | 32K | ~12 GB | Non-commercial | Good, but license limits use |
| Mixtral 8x7B | 46.7B (12.9B active) | 32K | ~26-32 GB | Apache 2.0 | Superseded by dense models |
| Mixtral 8x22B | 141B (39B active) | 64K | ~66 GB | Apache 2.0 | Requires serious hardware |
Not locally runnable: Mistral Small, Medium, Large, and Large 2 are API-only or require datacenter hardware. Skip them for local use.
Why Mistral Still Matters
Mistral occupies a specific niche in 2026:
Apache 2.0 licensing: unlike some competitors, Mistral’s open models are genuinely open. No usage restrictions, no commercial limitations (except Codestral).
European AI: based in Paris, Mistral represents European AI development. If you care about geographic diversity in AI, that matters.
Mistral Nemo’s context: 128K tokens with Apache 2.0 licensing is still relatively rare. The collaboration with NVIDIA produced a model specifically optimized for single-GPU deployment.
Legacy compatibility: many existing projects and fine-tunes are built on Mistral. If you’re using a specific Mistral-based model, understanding the base helps.
What Mistral doesn’t have: benchmark leadership. Qwen 3 and Llama 3 outperform Mistral at most comparable sizes. That’s the honest assessment.
Mistral 7B: The Original
Released September 2023, Mistral 7B was a breakthrough: a 7B model outperforming Llama 2 13B on most benchmarks. It introduced sliding window attention for efficient long-context handling.
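Sliding window attention limits each token to attending over only the most recent W positions instead of the entire prefix, which keeps attention cost roughly linear in sequence length (the Mistral 7B paper uses a window of 4,096 tokens). A minimal sketch of what that mask looks like, in plain NumPy and for illustration only:
import numpy as np
def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """True where a query position may attend to a key position."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # never attend to future tokens
    recent = j > i - window          # only the last `window` tokens
    return causal & recent
# Tiny example: 8 tokens, window of 4
print(sliding_window_mask(8, 4).astype(int))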
Specs
- Parameters: 7 billion
- Context: 8K (sliding window)
- VRAM: ~4 GB at Q4_K_M, ~15 GB at FP16
- License: Apache 2.0
The Reality in 2026
Mistral 7B is dated. The benchmarks that impressed in 2023 are now beaten by newer 7-8B models:
| Model | MMLU | Notes |
|---|---|---|
| Qwen 3 8B | 73.8% | Current leader |
| Llama 3.1 8B | 73.0% | Strong all-rounder |
| Mistral 7B | ~62% | Shows its age |
When Mistral 7B still makes sense:
- You have exactly 4 GB VRAM and need the smallest viable model
- You’re running a fine-tune specifically based on Mistral 7B
- Cost matters more than quality (62.5% cheaper than Llama 3 on AWS Bedrock)
When to use something else:
- Any scenario where you can fit Qwen 3 8B or Llama 3.1 8B instead
ollama run mistral:7b
Mistral Nemo 12B: The Current Pick
Mistral Nemo is a collaboration between Mistral AI and NVIDIA, released July 2024. It’s specifically designed for efficient single-GPU deployment with a 128K context window.
Specs
- Parameters: 12 billion
- Context: 128K tokens
- Tokenizer: Tekken (trained on 100+ languages)
- VRAM: ~8 GB at Q4, ~24 GB at FP16
- License: Apache 2.0
What Makes Nemo Special
Quantization-aware training: Mistral trained Nemo with FP8 inference in mind. Unlike most models, where quantization hurts quality, Nemo maintains performance at reduced precision. This is huge for local deployment.
128K context at 12B: most 12B models top out at 32K. Nemo’s 128K window means you can feed it entire codebases, long documents, or extended conversations without truncation.
Multilingual strength: the Tekken tokenizer handles 100+ languages better than most competitors. Strong on English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.
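Note that Ollama defaults to a far smaller context than 128K, so you have to raise num_ctx yourself, either in a Modelfile (see Setup below) or per request. A minimal sketch against the local Ollama HTTP API; it assumes Ollama is running on its default port with mistral-nemo already pulled, and long_doc.txt is a file you supply:
import json, urllib.request
payload = {
    "model": "mistral-nemo",
    "prompt": "Summarize the following document:\n\n" + open("long_doc.txt").read(),
    "stream": False,
    "options": {"num_ctx": 65536},  # raise the context window; this costs KV-cache memory
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])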
Benchmarks vs Competition
| Benchmark | Mistral Nemo 12B | Llama 3.1 8B |
|---|---|---|
| MMLU | 68% | 73% |
| HellaSwag | 83.5% | n/a |
| TriviaQA | 73.8% | n/a |
| Coding (practical) | Strong | Strong |
Llama 3.1 8B beats Nemo on MMLU despite being smaller, and both models offer 128K context. Nemo’s advantages are its quantization-aware training and broader multilingual coverage; for long-context work on limited hardware, it is often still the better choice.
Setup
ollama run mistral-nemo
Custom Modelfile for long context:
FROM mistral-nemo
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant with access to a large context window."
Mixtral 8x7B: The MoE Pioneer
Mixtral 8x7B introduced Mixture of Experts (MoE) to mainstream local AI. Eight expert networks, two active per token, for a total of 46.7B parameters but only 12.9B active during inference.
How MoE Works
Instead of one massive feedforward network in each transformer layer, Mixtral has 8 smaller expert networks. A router decides which 2 experts handle each token at every layer. You get the knowledge of a 47B model with the compute cost of a 13B model.
The catch: you still need VRAM for all 47B parameters. MoE saves compute, not memory.
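To make the compute/memory split concrete, here is a toy sketch of top-2 routing. It is illustrative only, not Mixtral’s actual implementation: all eight expert matrices must sit in memory, but only the two the router selects do any work for a given token.
import numpy as np
rng = np.random.default_rng(0)
dim, n_experts, top_k = 16, 8, 2
experts = [rng.standard_normal((dim, dim)) for _ in range(n_experts)]  # all 8 resident in memory
router = rng.standard_normal((dim, n_experts))
def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                           # indices of the 2 highest-scoring experts
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over the chosen experts
    # Only the chosen experts run: roughly 2/8 of the FLOPs of a dense layer this wide
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))
print(moe_layer(rng.standard_normal(dim)).shape)  # (16,)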
VRAM Requirements
This is where Mixtral becomes impractical for most local users:
| Precision | VRAM | Hardware |
|---|---|---|
| FP16 | ~90 GB | 2x A100 80GB |
| Q5_0 | ~32 GB | Beyond single RTX 3090/4090 |
| Q4_K_M | ~26-28 GB | Still exceeds 24 GB |
Even at aggressive quantization, Mixtral 8x7B doesn’t fit on a single consumer GPU. You need dual RTX 3090s, a Mac with 48GB+ unified memory, or substantial CPU offloading (which kills speed).
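If you do try it on a single 24 GB card, Ollama and llama.cpp will spill the layers that don’t fit into system RAM, and generation speed drops sharply. You can estimate the split yourself; the per-layer size and layer count below are rough assumptions for Mixtral 8x7B at Q4:
# How many layers fit on a single 24 GB GPU? (rough, assumed numbers)
model_gb, n_layers = 26.0, 32     # assumed Q4_K_M size and decoder layer count
vram_gb, overhead_gb = 24.0, 2.0  # card size minus room for KV cache and runtime overhead
per_layer_gb = model_gb / n_layers
gpu_layers = min(n_layers, int((vram_gb - overhead_gb) / per_layer_gb))
print(f"~{gpu_layers} of {n_layers} layers on GPU, the rest on CPU")
In Ollama, the number of layers pushed to the GPU is controlled by the num_gpu option if you want to pin the split explicitly.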
Performance When You Can Run It
If you have the hardware:
| Hardware | Speed |
|---|---|
| Dual RTX 4090 | ~59 tok/s |
| Dual RTX 3090 | ~54 tok/s |
| Mac M2 Ultra | ~44 tok/s |
| Mac M3 Max | ~22 tok/s |
Quantization tip: Use Q5_0, not K-quants. Community testing shows Q5_K_M causes more rambling and inconsistency with Mixtral specifically. Q5_0 performs better, especially for coding tasks.
The Honest Take
In 2024, Mixtral was exciting. In 2026, if you have 32 GB+ VRAM:
- Qwen 3 32B (dense) gives similar or better quality at ~20 GB
- Llama 3.3 70B at heavy quantization gives better quality
- DeepSeek R1-Distill-32B beats Mixtral on reasoning tasks
Mixtral pioneered MoE for local AI, but dense models have caught up. Unless you specifically need MoE behavior or have a Mixtral-based fine-tune, the hardware requirements aren’t justified by the performance.
ollama run mixtral:8x7b
Codestral: The Code Specialist
Codestral is Mistral’s dedicated coding model: 22B parameters, 32K context, trained on 80+ programming languages.
Specs
- Parameters: 22 billion
- Context: 32K tokens
- Languages: 80+ including Python, Java, C, C++, JavaScript, Bash
- Fill-in-the-middle: Supported
- License: Mistral AI Non-Production License (commercial license available separately)
The License Problem
Codestral’s biggest issue isn’t performance; it’s licensing. The Non-Production License means you can’t use it for commercial projects without paying Mistral separately. Compare:
| Model | License | HumanEval |
|---|---|---|
| Qwen 2.5 Coder 32B | Apache 2.0 | 88.4% |
| Codestral 22B | Non-commercial | 81.1% |
| DeepSeek Coder 33B | Permissive | 77.4% |
Qwen 2.5 Coder is fully open, smaller, AND scores higher on HumanEval. For most users, there’s no reason to accept Codestral’s license restrictions.
When Codestral Makes Sense
- Long-range code completion: Codestral’s 32K context gives it an edge on RepoBench (34.0% vs 28.4% for DeepSeek), useful when you need to understand large codebases
- Personal projects: If you’re not building commercial software, the license doesn’t matter
- IDE integration: Codestral works well with Continue and similar tools
ollama run codestral
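Fill-in-the-middle is what makes Codestral useful for editor-style completion, where the model generates the code between a prefix and a suffix. A minimal sketch against the local Ollama API, which accepts a suffix field for FIM-capable models; this assumes codestral is pulled and your Ollama version supports that field:
import json, urllib.request
payload = {
    "model": "codestral",
    "prompt": "def fibonacci(n):\n    ",   # code before the cursor
    "suffix": "\n\nprint(fibonacci(10))",  # code after the cursor
    "stream": False,
    "options": {"temperature": 0.2},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])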
The Better Choice
For most coding needs, Qwen 2.5 Coder is the recommendation. Higher benchmarks, no license restrictions, and active development.
Mixtral 8x22B: Serious Hardware Only
The big Mixtral: 141B total parameters, 39B active per token, 64K context.
VRAM Requirements
| Precision | VRAM |
|---|---|
| FP16 | ~260-300 GB |
| Q4 | ~66 GB |
This is multi-A100 or 4x RTX 4090 territory. For home users, it’s not realistic.
If you have access to this hardware, Mixtral 8x22B delivers:
- 90% on GSM8K (maj@8)
- Strong coding (HumanEval)
- Competitive with much larger models
But at this VRAM tier, you’re comparing against Llama 3.3 70B and the larger DeepSeek R1 distills. The MoE advantage doesn’t justify the complexity for most use cases.
VRAM Requirements Table
Complete reference for all Mistral models:
| Model | Q3_K_M | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Mistral 7B | ~3 GB | ~4 GB | ~5 GB | ~7 GB | ~15 GB |
| Mistral Nemo 12B | ~6 GB | ~8 GB | ~9 GB | ~13 GB | ~24 GB |
| Codestral 22B | ~10 GB | ~12 GB | ~15 GB | ~23 GB | ~44 GB |
| Mixtral 8x7B | ~22 GB | ~26 GB | ~32 GB | ~48 GB | ~90 GB |
| Mixtral 8x22B | ~50 GB | ~66 GB | ~80 GB | ~140 GB | ~260 GB |
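These figures all follow the same rule of thumb: weight memory is roughly parameter count times bits per weight divided by 8, plus overhead for the runtime and KV cache. A quick sanity check; the ~4.5 bits for Q4_K_M and the 10% overhead factor are approximations, and real usage grows with context length:
def est_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough VRAM estimate in GB: weights plus ~10% for runtime overhead."""
    return params_b * bits_per_weight / 8 * overhead
for name, params in [("Mistral 7B", 7), ("Mistral Nemo 12B", 12), ("Mixtral 8x7B", 46.7)]:
    print(f"{name}: ~{est_vram_gb(params, 4.5):.0f} GB at Q4_K_M")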
What to run on your GPU:
| GPU | VRAM | Best Mistral Model |
|---|---|---|
| RTX 3060 | 8/12 GB | Mistral 7B or Nemo Q4 |
| RTX 4060 Ti | 16 GB | Mistral Nemo Q5 |
| RTX 3090/4090 | 24 GB | Mistral Nemo Q8 or Codestral Q5 |
| 2x RTX 3090 | 48 GB | Mixtral 8x7B Q4 |
| Mac M2/M3 Ultra | 64-192 GB | Mixtral 8x7B or 8x22B |
→ Use our Planning Tool to check exact VRAM for your setup.
Mistral vs the Competition
Honest positioning of Mistral models against alternatives:
At 7-8B (Tight VRAM)
| Model | MMLU | Best For |
|---|---|---|
| Qwen 3 8B | 73.8% | Overall winner |
| Llama 3.1 8B | 73.0% | Fine-tune ecosystem |
| Mistral 7B | ~62% | Legacy compatibility, lowest cost |
Verdict: Mistral 7B is outclassed. Use Qwen 3 8B or Llama 3.1 8B unless you have a specific reason not to.
At 12-14B (Mid-Range)
| Model | Context | VRAM (Q4) | Notes |
|---|---|---|---|
| Mistral Nemo 12B | 128K | ~8 GB | Best context-per-VRAM |
| Qwen 3 14B | 32K+ | ~10 GB | Better benchmarks |
| Llama 3.1 8B | 128K | ~5 GB | Smaller but competitive |
Verdict: Mistral Nemo is competitive here. If you need 128K context on limited VRAM, it’s a solid choice. For raw capability, Qwen 3 14B edges it out.
At 30-50B (High-End Consumer)
| Model | Type | VRAM (Q4) | Notes |
|---|---|---|---|
| Qwen 3 32B | Dense | ~20 GB | Best on 24 GB |
| Mixtral 8x7B | MoE | ~26-32 GB | Needs dual GPU |
| DeepSeek R1-32B | Dense | ~18 GB | Best for reasoning |
Verdict: Mixtral loses to dense models on practical deployments. The MoE overhead isn’t worth it anymore.
For Coding
| Model | HumanEval | License | Recommendation |
|---|---|---|---|
| Qwen 2.5 Coder 32B | 88.4% | Apache 2.0 | Best overall |
| Codestral 22B | 81.1% | Non-commercial | Only if license is OK |
| DeepSeek Coder 33B | 77.4% | Permissive | Decent alternative |
Verdict: Qwen 2.5 Coder wins. Codestral’s license kills it for most use cases.
Setup Guide
Quick Start
# The recommendation for most users
ollama run mistral-nemo
# Original 7B for tight VRAM
ollama run mistral:7b
# MoE if you have the hardware
ollama run mixtral:8x7b
# Coding (check license first)
ollama run codestral
Modelfile for Mistral Nemo
FROM mistral-nemo
# Maximize context (adjust based on VRAM)
PARAMETER num_ctx 32768
# Slightly lower temperature for consistency
PARAMETER temperature 0.6
# Stop sequences
PARAMETER stop "<|endoftext|>"
PARAMETER stop "</s>"
SYSTEM "You are a helpful, concise assistant."
ollama create my-nemo -f Modelfile
ollama run my-nemo
Mixtral-Specific Settings
If running Mixtral, avoid K-quants:
# Pull Q5_0 specifically (community recommendation)
ollama pull mixtral:8x7b-instruct-v0.1-q5_0
Bottom Line
Mistral was the model to run in 2023-2024. In 2026, the landscape has shifted:
Still worth running:
- Mistral Nemo 12B: best-in-class for 128K context on limited VRAM, Apache 2.0
- Mistral 7B: only if you have <6 GB VRAM or need the absolute cheapest option
Superseded:
- Mixtral 8x7B: dense models (Qwen 3 32B) give better value at the same VRAM tier
- Codestral: Qwen 2.5 Coder beats it AND has a better license
Not practical for local:
- Mixtral 8x22B: datacenter hardware required
- Mistral Large/Small/Medium: API-only
If you’re starting fresh, Qwen 3 or Llama 3 should be your first choice. If you specifically need long context on modest hardware, Mistral Nemo earns its place.
# The Mistral model worth running
ollama run mistral-nemo