📚 More on this topic: Llama 3 Guide · Qwen Models Guide · DeepSeek Models Guide · VRAM Requirements

Mistral AI burst onto the scene in late 2023 with a 7B model that embarrassed much larger competitors. Mixtral 8x7B introduced Mixture of Experts to the open-source world. For a while, Mistral was the default answer to “what should I run locally?”

That’s no longer true. Llama 3 and Qwen 3 have caught up with and passed Mistral on most benchmarks. But Mistral models are still solid, particularly Mistral Nemo 12B with its 128K context window, and understanding the lineup helps you make informed choices.

This guide covers what’s worth running, what’s been superseded, and where Mistral still makes sense.


The Mistral Lineup

Every Mistral model relevant for local use:

| Model | Parameters | Context | VRAM (Q4) | License | Status |
|---|---|---|---|---|---|
| Mistral 7B | 7B | 8K | ~4 GB | Apache 2.0 | Dated but usable |
| Mistral Nemo 12B | 12B | 128K | ~8 GB | Apache 2.0 | Recommended |
| Codestral 22B | 22B | 32K | ~12 GB | Non-commercial | Good, but license limits use |
| Mixtral 8x7B | 46.7B (12.9B active) | 32K | ~26-32 GB | Apache 2.0 | Superseded by dense models |
| Mixtral 8x22B | 141B (39B active) | 64K | ~66 GB | Apache 2.0 | Requires serious hardware |

Not locally runnable: Mistral Small, Medium, Large, and Large 2 are API-only or require datacenter hardware. Skip them for local use.


Why Mistral Still Matters

Mistral occupies a specific niche in 2026:

Apache 2.0 licensing: Unlike some competitors, Mistral’s open models are genuinely open. No usage restrictions, no commercial limitations (except Codestral).

European AI: Based in Paris, Mistral represents European AI development. If you care about geographic diversity in AI, that matters.

Mistral Nemo’s context: 128K tokens with Apache 2.0 licensing is still relatively rare. The collaboration with NVIDIA produced a model specifically optimized for single-GPU deployment.

Legacy compatibility: Many existing projects and fine-tunes are built on Mistral. If you’re using a specific Mistral-based model, understanding the base helps.

What Mistral doesn’t have: benchmark leadership. Qwen 3 and Llama 3 outperform Mistral at most comparable sizes. That’s the honest assessment.


Mistral 7B: The Original

Released in September 2023, Mistral 7B was a breakthrough: a 7B model outperforming Llama 2 13B on most benchmarks. It introduced sliding window attention for efficient long-context handling.

Specs

  • Parameters: 7 billion
  • Context: 8K (sliding window)
  • VRAM: ~4 GB at Q4_K_M, ~15 GB at FP16
  • License: Apache 2.0
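Sliding window attention restricts each token to a fixed-size band of recent tokens (4,096 in Mistral 7B), so attention cost per token stays constant instead of growing with sequence length. A toy sketch of the mask, illustrative only and not Mistral’s actual implementation:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal banded mask: token i attends to itself and the
    previous window-1 tokens, never to future tokens."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window + 1): i + 1] = True
    return mask

# With window=3, token 5 sees tokens 3, 4, 5 -- not token 2 or earlier.
m = sliding_window_mask(6, 3)
```

Per-token attention work is O(window) rather than O(seq_len), which is where the efficiency claim comes from.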

The Reality in 2026

Mistral 7B is dated. The benchmarks that impressed in 2023 are now beaten by newer 7-8B models:

| Model | MMLU | Notes |
|---|---|---|
| Qwen 3 8B | 73.8% | Current leader |
| Llama 3.1 8B | 73.0% | Strong all-rounder |
| Mistral 7B | ~62% | Shows its age |

When Mistral 7B still makes sense:

  • You have exactly 4 GB VRAM and need the smallest viable model
  • You’re running a fine-tune specifically based on Mistral 7B
  • Cost matters more than quality (62.5% cheaper than Llama 3 on AWS Bedrock)

When to use something else:

  • Any scenario where you can fit Qwen 3 8B or Llama 3.1 8B instead

ollama run mistral:7b

Mistral Nemo 12B: The Current Pick

Mistral Nemo is a collaboration between Mistral AI and NVIDIA, released July 2024. It’s specifically designed for efficient single-GPU deployment with a 128K context window.

Specs

  • Parameters: 12 billion
  • Context: 128K tokens
  • Tokenizer: Tekken (trained on 100+ languages)
  • VRAM: ~8 GB at Q4, ~24 GB at FP16
  • License: Apache 2.0

What Makes Nemo Special

Quantization-aware training: Mistral trained Nemo with FP8 inference in mind. Unlike most models, where quantization hurts quality, Nemo maintains performance at reduced precision. This is huge for local deployment.

128K context at 12B: Most 12B models top out at 32K. Nemo’s 128K window means you can feed it entire codebases, long documents, or extended conversations without truncation.

Multilingual strength: The Tekken tokenizer handles 100+ languages better than most competitors. Strong on English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

Benchmarks vs Competition

| Benchmark | Mistral Nemo 12B | Llama 3.1 8B |
|---|---|---|
| MMLU | 68% | 73% |
| HellaSwag | 83.5% | n/a |
| TriviaQA | 73.8% | n/a |
| Coding (practical) | Strong | Strong |

Llama 3.1 8B beats Nemo on MMLU despite being smaller, and both offer 128K context. Nemo’s advantages are its quantization behavior and multilingual tokenizer: for long-context tasks on limited hardware, Nemo is often the better choice.

Setup

ollama run mistral-nemo

Custom Modelfile for long context:

FROM mistral-nemo

PARAMETER num_ctx 32768
PARAMETER temperature 0.7

SYSTEM "You are a helpful assistant with access to a large context window."
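The same options can also be set per request through Ollama’s REST API instead of baking them into a Modelfile. A minimal sketch, assuming a local Ollama server on the default port with mistral-nemo already pulled:

```python
import json
import urllib.request

def build_generate_request(prompt, num_ctx=32768, temperature=0.7):
    """Build the JSON payload for Ollama's /api/generate endpoint.
    Options set here override the model's defaults for this request only."""
    return {
        "model": "mistral-nemo",
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }

def generate(prompt):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_generate_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Raising num_ctx increases KV-cache memory use, so scale it to your VRAM rather than defaulting to the full 128K.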

Mixtral 8x7B: The MoE Pioneer

Mixtral 8x7B introduced Mixture of Experts (MoE) to mainstream local AI. Eight expert networks, two active per token, for a total of 46.7B parameters but only 12.9B active during inference.

How MoE Works

Instead of one massive feedforward network, Mixtral has 8 smaller expert networks. A router decides which 2 experts handle each token. You get the knowledge of a 47B model with the compute cost of a 13B model.

The catch: you still need VRAM for all 47B parameters. MoE saves compute, not memory.
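The routing step can be sketched in a few lines. This is a toy illustration with random weights, not Mixtral’s actual architecture; in the real model the router is a learned linear layer and the experts are full feedforward blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16

# Stand-ins: each "expert" is a random linear map, the router a random projection.
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((DIM, NUM_EXPERTS))

def moe_layer(x):
    """Route one token vector through its top-2 experts, softmax-weighted."""
    logits = x @ router                   # score all 8 experts
    top = np.argsort(logits)[-TOP_K:]     # indices of the 2 highest scores
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the chosen experts only
    # Only 2 of 8 expert matmuls execute: compute scales with active params,
    # but all 8 expert weight matrices must still sit in memory.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(DIM)
out = moe_layer(token)
```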

VRAM Requirements

This is where Mixtral becomes impractical for most local users:

| Precision | VRAM | Hardware |
|---|---|---|
| FP16 | ~90 GB | 2x A100 80GB |
| Q5_0 | ~32 GB | Beyond single RTX 3090/4090 |
| Q4_K_M | ~26-28 GB | Still exceeds 24 GB |

Even at aggressive quantization, Mixtral 8x7B doesn’t fit on a single consumer GPU. You need dual RTX 3090s, a Mac with 48GB+ unified memory, or substantial CPU offloading (which kills speed).

Performance When You Can Run It

If you have the hardware:

| Hardware | Speed |
|---|---|
| Dual RTX 4090 | ~59 tok/s |
| Dual RTX 3090 | ~54 tok/s |
| Mac M2 Ultra | ~44 tok/s |
| Mac M3 Max | ~22 tok/s |

Quantization tip: Use Q5_0, not K-quants. Community testing shows Q5_K_M causes more rambling and inconsistency with Mixtral specifically. Q5_0 performs better, especially for coding tasks.

The Honest Take

In 2024, Mixtral was exciting. In 2026, if you have 32 GB+ VRAM:

  • Qwen 3 32B (dense) gives similar or better quality at ~20 GB
  • Llama 3.3 70B at heavy quantization gives better quality
  • DeepSeek R1-Distill-32B beats Mixtral on reasoning tasks

Mixtral pioneered MoE for local AI, but dense models have caught up. Unless you specifically need MoE behavior or have a Mixtral-based fine-tune, the hardware requirements aren’t justified by the performance.

ollama run mixtral:8x7b

Codestral: The Code Specialist

Codestral is Mistral’s dedicated coding model: 22B parameters, 32K context, trained on 80+ programming languages.

Specs

  • Parameters: 22 billion
  • Context: 32K tokens
  • Languages: 80+ including Python, Java, C, C++, JavaScript, Bash
  • Fill-in-the-middle: Supported
  • License: Mistral AI Non-Production License (commercial license available separately)
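Fill-in-the-middle (FIM) means the model completes code between a given prefix and suffix, which is what IDE plugins rely on. A sketch against Ollama’s /api/generate endpoint, which accepts a suffix field for FIM-capable models; it assumes a local Ollama server with codestral pulled:

```python
import json
import urllib.request

def build_fim_request(prefix, suffix):
    """Payload for Ollama's /api/generate: with a suffix set, the model
    generates the code that belongs between prefix and suffix."""
    return {
        "model": "codestral",
        "prompt": prefix,   # code before the cursor
        "suffix": suffix,   # code after the cursor
        "stream": False,
        "options": {"temperature": 0.2},  # low temperature for completions
    }

def fill_in_middle(prefix, suffix):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_fim_request(prefix, suffix)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]  # the generated middle
```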

The License Problem

Codestral’s biggest issue isn’t performance; it’s licensing. The Non-Production License means you can’t use it for commercial projects without paying Mistral separately. Compare:

| Model | License | HumanEval |
|---|---|---|
| Qwen 2.5 Coder 32B | Apache 2.0 | 88.4% |
| Codestral 22B | Non-commercial | 81.1% |
| DeepSeek Coder 33B | Permissive | 77.4% |

Qwen 2.5 Coder is fully open, smaller, AND scores higher on HumanEval. For most users, there’s no reason to accept Codestral’s license restrictions.

When Codestral Makes Sense

  • Long-range code completion: Codestral’s 32K context gives it an edge on RepoBench (34.0% vs 28.4% for DeepSeek), useful when you need to understand large codebases
  • Personal projects: If you’re not building commercial software, the license doesn’t matter
  • IDE integration: Codestral works well with Continue and similar tools

ollama run codestral

The Better Choice

For most coding needs, Qwen 2.5 Coder is the recommendation. Higher benchmarks, no license restrictions, and active development.


Mixtral 8x22B: Serious Hardware Only

The big Mixtral: 141B total parameters, 39B active per token, 64K context.

VRAM Requirements

| Precision | VRAM |
|---|---|
| FP16 | ~260-300 GB |
| Q4 | ~66 GB |

This is multi-A100 or 4x RTX 4090 territory. For home users, it’s not realistic.

If you have access to this hardware, Mixtral 8x22B delivers:

  • 90% on GSM8K (maj@8)
  • Strong coding (HumanEval)
  • Competitive with much larger models

But at this VRAM tier, you’re comparing against Llama 3.3 70B and considering the full DeepSeek R1. The MoE advantage doesn’t justify the complexity for most use cases.


VRAM Requirements Table

Complete reference for all Mistral models:

| Model | Q3_K_M | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Mistral 7B | ~3 GB | ~4 GB | ~5 GB | ~7 GB | ~15 GB |
| Mistral Nemo 12B | ~6 GB | ~8 GB | ~9 GB | ~13 GB | ~24 GB |
| Codestral 22B | ~10 GB | ~12 GB | ~15 GB | ~23 GB | ~44 GB |
| Mixtral 8x7B | ~22 GB | ~26 GB | ~32 GB | ~48 GB | ~90 GB |
| Mixtral 8x22B | ~50 GB | ~66 GB | ~80 GB | ~140 GB | ~260 GB |
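These numbers follow roughly from parameter count times bits per weight. A back-of-envelope estimator; the bits-per-weight values and overhead factor are my own rough assumptions, and KV cache for long contexts comes on top:

```python
# Approximate effective bits per weight for common GGUF quant levels
# (assumed values; exact figures vary by model and llama.cpp version).
BITS = {"Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

def estimate_vram_gb(params_billions, quant="Q4_K_M", overhead=1.15):
    """Weights VRAM in GB: params * bits/8, plus ~15% for buffers
    and activations. Excludes KV cache, which grows with context."""
    return params_billions * (BITS[quant] / 8) * overhead

# e.g. Mistral Nemo 12B at Q4_K_M lands near the ~8 GB in the table above
nemo_q4 = estimate_vram_gb(12, "Q4_K_M")
```

MoE models like Mixtral must be estimated from total parameters (46.7B), not active ones, which is why they look so expensive here.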

What to run on your GPU:

| GPU | VRAM | Best Mistral Model |
|---|---|---|
| RTX 3060 | 8/12 GB | Mistral 7B or Nemo Q4 |
| RTX 4060 Ti | 16 GB | Mistral Nemo Q5 |
| RTX 3090/4090 | 24 GB | Mistral Nemo FP16 or Codestral Q4 |
| 2x RTX 3090 | 48 GB | Mixtral 8x7B Q4 |
| Mac M2/M3 Ultra | 64-192 GB | Mixtral 8x7B or 8x22B |

→ Use our Planning Tool to check exact VRAM for your setup.


Mistral vs the Competition

Honest positioning of Mistral models against alternatives:

At 7-8B (Tight VRAM)

| Model | MMLU | Best For |
|---|---|---|
| Qwen 3 8B | 73.8% | Overall winner |
| Llama 3.1 8B | 73.0% | Fine-tune ecosystem |
| Mistral 7B | ~62% | Legacy compatibility, lowest cost |

Verdict: Mistral 7B is outclassed. Use Qwen 3 8B or Llama 3.1 8B unless you have a specific reason not to.

At 12-14B (Mid-Range)

| Model | Context | VRAM (Q4) | Notes |
|---|---|---|---|
| Mistral Nemo 12B | 128K | ~8 GB | Best context-per-VRAM |
| Qwen 3 14B | 32K+ | ~10 GB | Better benchmarks |
| Llama 3.1 8B | 128K | ~5 GB | Smaller but competitive |

Verdict: Mistral Nemo is competitive here. If you need 128K context on limited VRAM, it’s a solid choice. For raw capability, Qwen 3 14B edges it out.

At 30-50B (High-End Consumer)

| Model | Type | VRAM (Q4) | Notes |
|---|---|---|---|
| Qwen 3 32B | Dense | ~20 GB | Best on 24 GB |
| Mixtral 8x7B | MoE | ~26-32 GB | Needs dual GPU |
| DeepSeek R1-32B | Dense | ~18 GB | Best for reasoning |

Verdict: Mixtral loses to dense models on practical deployments. The MoE overhead isn’t worth it anymore.

For Coding

| Model | HumanEval | License | Recommendation |
|---|---|---|---|
| Qwen 2.5 Coder 32B | 88.4% | Apache 2.0 | Best overall |
| Codestral 22B | 81.1% | Non-commercial | Only if license is OK |
| DeepSeek Coder 33B | 77.4% | Permissive | Decent alternative |

Verdict: Qwen 2.5 Coder wins. Codestral’s license kills it for most use cases.


Setup Guide

Quick Start

# The recommendation for most users
ollama run mistral-nemo

# Original 7B for tight VRAM
ollama run mistral:7b

# MoE if you have the hardware
ollama run mixtral:8x7b

# Coding (check license first)
ollama run codestral

Modelfile for Mistral Nemo

FROM mistral-nemo

# Maximize context (adjust based on VRAM)
PARAMETER num_ctx 32768

# Slightly lower temperature for consistency
PARAMETER temperature 0.6

# Stop sequences
PARAMETER stop "<|endoftext|>"
PARAMETER stop "</s>"

SYSTEM "You are a helpful, concise assistant."

ollama create my-nemo -f Modelfile
ollama run my-nemo

Mixtral-Specific Settings

If running Mixtral, avoid K-quants:

# Pull Q5_0 specifically (community recommendation)
ollama pull mixtral:8x7b-instruct-v0.1-q5_0

Bottom Line

Mistral was the model to run in 2023-2024. In 2026, the landscape has shifted:

Still worth running:

  • Mistral Nemo 12B: Best-in-class for 128K context on limited VRAM, Apache 2.0
  • Mistral 7B: Only if you have <6 GB VRAM or need the absolute cheapest option

Superseded:

  • Mixtral 8x7B: Dense models (Qwen 3 32B) give better value at the same VRAM tier
  • Codestral: Qwen 2.5 Coder beats it AND has a better license

Not practical for local:

  • Mixtral 8x22B: Datacenter hardware required
  • Mistral Large/Small/Medium: API-only

If you’re starting fresh, Qwen 3 or Llama 3 should be your first choice. If you specifically need long context on modest hardware, Mistral Nemo earns its place.

# The Mistral model worth running
ollama run mistral-nemo