DeepSeek Models Guide: R1, V3, and Coder
๐ More on this topic: Best Models for Math & Reasoning ยท Qwen Models Guide ยท Llama 3 Guide ยท VRAM Requirements
DeepSeek made the biggest splash in local AI when R1 dropped in January 2025 โ a reasoning model that matched OpenAI’s o1 on math benchmarks, fully open-source, with distilled versions that run on a single consumer GPU.
But DeepSeek has a whole family of models now: R1, V3, V3.1, Coder V2, and more. It’s confusing. This guide cuts through the noise: which ones actually matter for local use, what hardware they need, and when you should pick something else instead.
The DeepSeek Lineup at a Glance
| Model | Type | Total Params | Active Params | Best For |
|---|---|---|---|---|
| R1 Distills | Reasoning (dense) | 1.5Bโ70B | All (dense) | Math, logic, analysis, coding puzzles |
| R1 Full | Reasoning (MoE) | 671B | 37B | Same but better โ needs datacenter hardware |
| V3 / V3.1 | General-purpose (MoE) | 671Bโ685B | 37B | Everything โ needs datacenter hardware |
| Coder V2 | Code (MoE) | 16B / 236B | 2.4B / 21B | Programming โ largely superseded |
For consumer hardware, the R1 distills are the story. Everything else is either too big to run locally or has been outpaced by newer alternatives. The rest of this guide focuses on what you can actually use.
DeepSeek R1: The Reasoning Model
How It Works
R1 generates “thinking tokens” before answering. Ask it a math problem and it produces a chain-of-thought wrapped in <think>...</think> tags โ working through the problem step by step, backtracking when it hits dead ends, verifying its own logic โ before giving you a final answer.
This was trained via pure reinforcement learning. The model learned to reason by being rewarded for correct answers, not by imitating human reasoning traces. The result: it sometimes discovers solution paths that humans wouldn’t take.
The tradeoff: Thinking tokens add overhead. A simple question might generate a few hundred thinking tokens. A hard math problem can produce tens of thousands. This makes R1 slower and more verbose than non-reasoning models โ by design.
The Distilled Versions
DeepSeek distilled R1’s reasoning ability into smaller, dense models. These are what you run locally:
| Model | Base | VRAM (Q4_K_M) | AIME ‘24 | MATH-500 | GPQA | Ollama Command |
|---|---|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | ~2 GB | 28.9% | 83.9% | 33.8% | ollama pull deepseek-r1:1.5b |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | ~6 GB | 55.5% | 92.8% | 49.1% | ollama pull deepseek-r1:7b |
| R1-Distill-Llama-8B | Llama-3.1-8B | ~6 GB | 50.4% | 89.1% | 49.0% | ollama pull deepseek-r1:8b |
| R1-Distill-Qwen-14B | Qwen2.5-14B | ~11 GB | 69.7% | 93.9% | 59.1% | ollama pull deepseek-r1:14b |
| R1-Distill-Qwen-32B | Qwen2.5-32B | ~22 GB | 72.6% | 94.3% | 62.1% | ollama pull deepseek-r1:32b |
| R1-Distill-Llama-70B | Llama-3.3-70B | ~43 GB | 70.0% | 94.5% | 65.2% | ollama pull deepseek-r1:70b |
The 7B Qwen distill outperforms GPT-4o on several math benchmarks. The 32B distill outperforms OpenAI’s o1-mini. These are genuinely impressive numbers for models you can run on a single GPU.
Between the Qwen and Llama variants at similar sizes, pick Qwen. The 7B Qwen distill beats the 8B Llama distill on every benchmark (AIME: 55.5 vs 50.4, MATH-500: 92.8 vs 89.1).
What R1 Is Good At
- Math competitions: AIME, MATH-500, competition-level problems
- Logic puzzles: Multi-step deduction, constraint satisfaction
- Code reasoning: Algorithm design, debugging complex logic
- Analysis: Breaking down complex questions into structured reasoning
What R1 Is Bad At
- Simple chat: It overthinks. Ask “what’s the capital of France?” and it might generate 200 thinking tokens first. Use a regular chat model for casual conversation.
- Creative writing: Not conversationally fluent. The writing is functional, not engaging.
- Speed: Thinking tokens mean more output per query. Expect 2-5x more total tokens generated compared to a non-reasoning model answering the same question.
- Instruction following on non-reasoning tasks: DeepSeek themselves say “R1 falls short of V3 in general-purpose tasks.”
Recommended R1 Distill by GPU
| GPU | Best R1 Distill | Quant | Speed (est.) |
|---|---|---|---|
| 8 GB (RTX 3060 8GB, RTX 4060) | 7B Qwen | Q4_K_M | ~50-55 tok/s |
| 12 GB (RTX 3060 12GB, RTX 4070) | 14B Qwen | Q4_K_M | ~28-35 tok/s |
| 16 GB (RTX 4060 Ti 16GB, RTX 5060 Ti) | 14B Qwen | Q8_0 | ~25-30 tok/s |
| 24 GB (RTX 3090, RTX 4090) | 32B Qwen | Q4_K_M | ~15-25 tok/s |
| 48 GB+ (dual 3090, A6000) | 70B Llama | Q4_K_M | ~8-12 tok/s |
The 14B on 12 GB is the value sweet spot. You get 69.7% on AIME and 93.9% on MATH-500 on a $200 used GPU. That’s competitive with models that need $1,600 GPUs.
โ Check what fits your hardware with our Planning Tool.
Important: Increase Context Length
Ollama defaults to 4096 tokens. That’s not enough for a reasoning model โ thinking chains can easily exceed that. Create a Modelfile:
FROM deepseek-r1:14b
PARAMETER num_ctx 16384
ollama create deepseek-r1-14b-16k -f Modelfile
DeepSeek V3: The Flagship (Reality Check)
V3 is DeepSeek’s general-purpose model. It’s a 671B MoE (37B active) that matches or beats GPT-4o and Claude 3.5 Sonnet on most benchmarks. It’s genuinely one of the best open models ever released.
But you can’t run it on consumer hardware. The math:
| Quantization | Memory Needed |
|---|---|
| FP16 | ~1,400 GB |
| FP8 | ~700 GB |
| Q4_K_M | ~400 GB |
| Dynamic 1-bit (Unsloth) | ~170 GB |
Even the most aggressive quantization (1-bit, significant quality loss) needs 170 GB of memory. You could load it into system RAM on a machine with 128+ GB, offloading from a single GPU, but expect about 3-5 tok/s. Not practical for interactive use.
V3.1 (August 2025) is the current version worth knowing about. It combines V3 and R1 into a hybrid that switches between thinking and non-thinking modes. It surpasses both V3 and R1 by 40%+ on coding agent benchmarks (SWE-bench). Available as ollama pull deepseek-v3.1 โ but same hardware requirements.
If you want V3-class intelligence, use the API. DeepSeek’s API is extremely cheap ($0.27/M input tokens) compared to OpenAI or Anthropic. Or use it through OpenRouter from within Open WebUI.
DeepSeek Coder V2: Largely Superseded
Coder V2 was impressive when it launched โ 90.2% on HumanEval, matching GPT-4-Turbo on coding benchmarks. The 16B Lite version runs on an 8 GB GPU:
ollama pull deepseek-coder-v2:16b
But it’s been largely overtaken:
| DeepSeek Coder V2 (236B) | Qwen 2.5 Coder 32B | |
|---|---|---|
| HumanEval | 90.2% | Competitive |
| LiveCodeBench | 43.4% | ~65-70% |
| Architecture | MoE (21B active) | Dense 32B |
| VRAM (Q4) | ~133 GB | ~22 GB |
| Practical? | Multi-GPU only | Single 24 GB GPU |
Qwen 2.5 Coder 32B is the better choice for coding now. It outperforms DeepSeek Coder V2 on real-world coding benchmarks, runs on a single GPU, and is easier to deploy. See our best models for coding guide for current recommendations.
The 16B Lite version is still fine if you have limited VRAM and want a dedicated coding model, but Qwen3-8B in thinking mode handles code well enough that there’s less reason to run a separate coding model.
DeepSeek R1 vs the Competition
This is the question everyone asks: should you run DeepSeek R1 or Qwen3?
R1 Distills vs Qwen3 (Same Size)
| Tier | R1 Distill | Qwen3 | Reasoning Winner | General Winner |
|---|---|---|---|---|
| 7-8B | R1-Distill-Qwen-7B (MATH: 92.8%) | Qwen3-8B (MATH: ~88.8%) | R1 | Qwen3 |
| 14B | R1-Distill-Qwen-14B (MATH: 93.9%) | Qwen3-14B (MATH: ~92.6%) | R1 (slight edge) | Qwen3 |
| 32B | R1-Distill-Qwen-32B (AIME: 72.6%) | Qwen3-32B (AIME: est. high 70s-80s) | Qwen3 | Qwen3 |
At 7-8B and 14B: R1 distills win on pure math and reasoning. The gap is real โ 92.8% vs 88.8% on MATH-500 at 7B is significant. But Qwen3 is better at everything else: chat, coding, multilingual, instruction following. And Qwen3 can toggle thinking on and off, while R1 distills always reason.
At 32B: Qwen3 wins on both reasoning and general tasks. The R1-Distill-Qwen-32B’s advantage has been erased by Qwen3’s training improvements.
R1 Distills vs Llama 3.3 (Same Size)
| R1-Distill-Llama-8B | Llama 3.1 8B | |
|---|---|---|
| MATH-500 | 89.1% | ~52% |
| R1-Distill-Llama-70B | Llama 3.3 70B | |
|---|---|---|
| MATH-500 | 94.5% | 77.0% |
| GPQA | 65.2% | 50.5% |
The R1 distills destroy the base Llama models on reasoning โ it’s not even close. But Llama 3.3 70B is still better for general chat, creative writing, and instruction following where deep reasoning isn’t needed.
When to Choose DeepSeek R1
- You’re solving math problems, logic puzzles, or competition-style questions
- You want to see the model’s reasoning process (the
<think>tags are useful for learning) - You need a dedicated reasoning model for a specific workflow
- You have 12 GB VRAM and want the best possible reasoning (14B distill)
When to Choose Something Else
- General chat: Qwen3 (see our chat guide)
- Coding: Qwen 2.5 Coder 32B or Qwen3 with thinking mode
- Creative writing: Qwen3 or Llama 3.3 (see our writing guide)
- Speed matters: Any non-reasoning model โ R1’s thinking overhead makes it 2-5x slower per query
Running DeepSeek R1 in Ollama
Basic Setup
# Pull the model for your GPU
ollama pull deepseek-r1:7b # 8 GB VRAM
ollama pull deepseek-r1:14b # 12 GB VRAM
ollama pull deepseek-r1:32b # 24 GB VRAM
# Run it
ollama run deepseek-r1:14b
Recommended Settings
Create a Modelfile for optimal R1 performance:
FROM deepseek-r1:14b
PARAMETER num_ctx 16384
PARAMETER temperature 0.6
PARAMETER top_p 0.95
ollama create my-r1 -f Modelfile
ollama run my-r1
Temperature 0.6 is the sweet spot. Lower causes repetitive/degenerate output. Higher causes incoherence. The official recommendation is 0.5โ0.7.
Prompting Tips
R1 works differently than regular chat models:
Don’t use system prompts. R1 performs worse with them. Put all instructions in the user message.
Don’t use few-shot examples. They consistently degrade R1’s reasoning. Use zero-shot, direct prompts.
Don’t say “think step by step.” R1 already thinks internally. Explicit chain-of-thought prompting actually hurts performance because it conflicts with the model’s native reasoning.
Do give clear, direct questions:
Solve this: What is the largest integer n such that n^3 + 2n^2 - 5n - 6 < 0?
Common Problems
Thinking tokens cluttering the output.
The <think>...</think> tags are part of how R1 works. Most frontends (Open WebUI, SillyTavern) can hide them. In Ollama’s CLI, they show by default. If you’re piping output to another tool, strip everything between <think> and </think>.
Empty thinking tags (<think>\n\n</think>).
Sometimes R1 skips reasoning entirely, which usually degrades answer quality. This happens more on simple questions. No reliable fix โ the model decides when to think.
Excessive verbosity. R1 can generate thousands of thinking tokens, then repeat the same content in the final answer. Try adding “Be concise” to your prompt. The R1-0528 update made this worse (~23K tokens per question vs ~12K for original R1).
Slow generation. Thinking overhead is inherent. To speed things up:
- Use Q4_K_M instead of Q8 quantization
- Enable flash attention in Ollama (
OLLAMA_FLASH_ATTENTION=1) - Don’t set context length higher than you need โ more context = more VRAM = slower
- Accept that reasoning models are slower. If you need speed, use Qwen3 in non-thinking mode.
Censorship on sensitive topics. R1 has Chinese political censorship baked into its weights โ not just an API filter. It persists when run locally. Standard “uncensoring” techniques (abliteration) don’t work on R1. If this is a dealbreaker, try Perplexity’s R1-1776 (a post-trained uncensored version) or switch to Qwen3/Llama 3.3 for sensitive topics.
Ollama context too short.
Default 4096 tokens is insufficient. Reasoning chains regularly exceed this. Set num_ctx to at least 8192, ideally 16384. See the Modelfile example above.
Bottom Line
The DeepSeek R1 distills are the best reasoning models you can run on consumer hardware. The 14B on a 12 GB GPU gives you competition-level math performance that was unthinkable a year ago.
But they’re specialists. For everyday use โ chat, coding, writing โ Qwen3 is the better all-rounder at every size tier. The ideal local setup is both: Qwen3 for general tasks, R1 for when you need serious reasoning.
Start here:
- 12 GB GPU:
ollama pull deepseek-r1:14bโ the value king - 24 GB GPU:
ollama pull deepseek-r1:32bโ outperforms o1-mini - Any GPU:
ollama pull qwen3:8bโ for everything else
The full V3 is one of the best models ever made, but it needs datacenter hardware. Use the API if you want that level of intelligence, or wait for the next generation of distills.