Running LLMs on Mac: M1 Through M4 Guide
More on this topic: Mac vs PC for Local AI · Ollama vs LM Studio · Run Your First Local LLM · VRAM Requirements
Apple Silicon Macs have a superpower for local AI: unified memory. Unlike a PC where GPU VRAM is separate (and limited to 24GB on consumer cards), your Mac’s entire RAM pool is available to both CPU and GPU. An M4 Max with 128GB can load models that would require a $3,000+ multi-GPU PC setup.
The tradeoff is speed: an RTX 3090 generates tokens faster for models that fit in its 24GB. But for models that don’t fit in 24GB, Mac wins by running them at all.
This guide covers what you can actually run on each M-series chip, which tools to use, realistic performance expectations, and how to set up a Mac Mini as an always-on AI server.
M-Series Chips at a Glance
| Chip | Memory Options | Memory Bandwidth | GPU Cores | Best For |
|---|---|---|---|---|
| M1 | 8-16 GB | 68.25 GB/s | 7-8 | 3B-7B models, light use |
| M1 Pro | 16-32 GB | 200 GB/s | 14-16 | 8B-14B models |
| M1 Max | 32-64 GB | 400 GB/s | 24-32 | 14B-32B models |
| M1 Ultra | 64-128 GB | 800 GB/s | 48-64 | 32B-70B models |
| M2 | 8-24 GB | 100 GB/s | 8-10 | 7B-8B models |
| M2 Pro | 16-32 GB | 200 GB/s | 16-19 | 8B-14B models |
| M2 Max | 32-96 GB | 400 GB/s | 30-38 | 14B-32B models |
| M2 Ultra | 64-192 GB | 800 GB/s | 60-76 | 32B-70B+ models |
| M3 | 8-24 GB | 100 GB/s | 8-10 | 7B-8B models |
| M3 Pro | 18-36 GB | 150 GB/s | 11-14 | 8B-14B models |
| M3 Max | 36-128 GB | 300-400 GB/s | 30-40 | 14B-70B models |
| M4 | 16-32 GB | 120 GB/s | 10 | 7B-14B models |
| M4 Pro | 24-64 GB | 273 GB/s | 16-20 | 14B-32B models |
| M4 Max | 36-128 GB | 546 GB/s | 40 | 32B-70B+ models |
Key insight: Memory bandwidth determines how fast tokens generate. The M4 Max at 546 GB/s generates tokens roughly 5x faster than the base M4 at 120 GB/s for the same model.
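Why bandwidth matters: for a dense model, generating one token streams every weight through the GPU once, so bandwidth divided by model size gives a rough ceiling on decode speed. A back-of-envelope sketch (real-world throughput typically lands at 50-80% of this bound):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on decode speed: each generated token reads all
    model weights once, so bandwidth / model size bounds tokens/sec."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4 occupies ~4.5 GB
print(f"base M4: <= {max_tokens_per_sec(120, 4.5):.0f} tok/s")   # ~27
print(f"M4 Max:  <= {max_tokens_per_sec(546, 4.5):.0f} tok/s")   # ~121
```

The 546/120 bandwidth ratio is where the "roughly 5x" figure above comes from.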
What You Can Run by Memory Tier
8GB Unified Memory
Models that fit: 3B-7B at Q4 quantization
| Model | Size | Performance |
|---|---|---|
| Llama 3.2 3B | ~2 GB | 25-35 tok/s (M1), 30-45 tok/s (M4) |
| Phi-4 Mini | ~2.3 GB | 25-40 tok/s |
| Qwen 2.5 3B | ~2 GB | 25-40 tok/s |
| Mistral 7B Q4 | ~4.5 GB | 12-18 tok/s (tight fit) |
| Llama 3.1 8B Q3 | ~4 GB | 10-15 tok/s (quality tradeoff) |
With 8GB, you’re in small model territory. The system needs 2-3GB for macOS itself, leaving 5-6GB for models. 7B models at Q4 (~4.5GB) fit but leave little room for context. Stick to 3B models for comfortable use, or use aggressive quantization (Q3/Q2) for 7B.
Realistic experience: 8GB Macs work for casual use with small models. Don’t expect to run coding assistants or models that need long context.
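A quick way to sanity-check a model before downloading gigabytes: compare its size against what’s left after macOS takes its share. The reserve figures below are rough assumptions, not measurements:

```python
def fits(total_ram_gb: float, model_gb: float,
         os_reserve_gb: float = 3.0, context_gb: float = 1.0) -> bool:
    """Crude fit check: model weights plus context must fit in the RAM
    left after macOS. os_reserve_gb and context_gb are rough estimates."""
    return model_gb + context_gb <= total_ram_gb - os_reserve_gb

print(fits(8, 2.0))    # Llama 3.2 3B Q4 on an 8GB Mac: True
print(fits(8, 4.5))    # Mistral 7B Q4 on 8GB: False with 1GB of context
print(fits(16, 4.5))   # the same 7B on 16GB: True, with room to spare
```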
16GB Unified Memory
Models that fit: 7B-8B comfortably, 13B-14B at Q4 (tight)
| Model | Size | Performance |
|---|---|---|
| Llama 3.1 8B Q4 | ~4.5 GB | 25-40 tok/s |
| Mistral 7B Q6 | ~5.5 GB | 25-40 tok/s |
| Qwen 2.5 7B Q4 | ~4.5 GB | 25-40 tok/s |
| DeepSeek R1 Distill 8B | ~4.5 GB | 25-40 tok/s |
| Qwen 2.5 14B Q4 | ~8.5 GB | 15-25 tok/s (needs reduced context) |
16GB is the sweet spot for 7B-8B models. You have room for the model plus healthy context (8K-16K tokens). 14B models fit but require reduced context length (4K or less) or lower quantization.
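Where does context memory go? The KV cache grows linearly with context length. A sketch using Llama 3.1 8B’s published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an fp16 cache:

```python
def kv_cache_gb(tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes
    per token. Defaults match Llama 3.1 8B with an fp16 cache."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * per_token / 1024**3

print(f"8K context:  {kv_cache_gb(8192):.2f} GB")   # 1.00 GB
print(f"16K context: {kv_cache_gb(16384):.2f} GB")  # 2.00 GB
```

So an 8B Q4 model at 16K context needs roughly 4.5 + 2 GB before macOS overhead, which is why 16GB is comfortable and 8GB is not.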
Realistic experience: This is the minimum for serious local LLM use. 8B models like Llama 3.1 8B and Qwen 2.5 7B are genuinely useful for coding, writing, and general assistance.
24GB Unified Memory
Models that fit: 8B at high quality, 14B comfortably, 32B at Q3-Q4 (tight)
| Model | Size | Performance |
|---|---|---|
| Llama 3.1 8B Q8 | ~8.5 GB | 25-45 tok/s |
| Mistral Nemo 12B Q4 | ~7.5 GB | 20-35 tok/s |
| Qwen 2.5 14B Q4 | ~8.5 GB | 18-30 tok/s |
| DeepSeek R1 Distill 14B Q4 | ~8.5 GB | 18-30 tok/s |
| Qwen 2.5 32B Q3 | ~15 GB | 8-15 tok/s |
24GB opens up the 14B tier properly. You can run Qwen 2.5 14B, DeepSeek R1 Distill 14B, and similar models with room for 8K+ context. 32B models barely fit at Q3 with minimal context.
Realistic experience: This is the Mac Mini M4 Pro base config and a great entry point. 14B models offer noticeably better reasoning than 8B.
36-48GB Unified Memory
Models that fit: 14B at high quality, 32B comfortably, 70B at Q2-Q3 (tight)
| Model | Size | Performance |
|---|---|---|
| Qwen 2.5 14B Q8 | ~15 GB | 18-35 tok/s |
| Qwen 2.5 32B Q4 | ~20 GB | 12-22 tok/s |
| Llama 3.3 70B Q2 | ~30 GB | 5-10 tok/s (quality tradeoff) |
| DeepSeek R1 Distill 32B Q4 | ~20 GB | 12-22 tok/s |
| Mixtral 8x7B Q4 | ~26 GB | 15-25 tok/s |
The 32B tier becomes practical. Models like Qwen 2.5 32B and DeepSeek R1 Distill 32B run with good context windows. 70B is technically possible at Q2 but the quality loss is significant.
Realistic experience: 32B models are a major step up in capability. This is where you start getting expert-level responses on complex topics.
64-96GB Unified Memory
Models that fit: 32B at high quality, 70B at Q4 comfortably
| Model | Size | Performance |
|---|---|---|
| Qwen 2.5 32B Q6 | ~26 GB | 12-25 tok/s |
| Llama 3.1 70B Q4 | ~40 GB | 8-15 tok/s |
| Qwen 2.5 72B Q4 | ~42 GB | 8-14 tok/s |
| DeepSeek R1 Distill 70B Q4 | ~40 GB | 8-14 tok/s |
| Mixtral 8x22B Q4 | ~80 GB | 5-10 tok/s |
This is the sweet spot for 70B models. You have room for 40GB of model weights plus generous context (32K+ tokens). The M3 Max 96GB and M4 Max 64GB configurations hit this tier.
Realistic experience: 70B models are genuinely impressive, matching or exceeding GPT-3.5 on most tasks. 8-15 tok/s is around reading speed and perfectly usable for interactive chat.
128GB+ Unified Memory
Models that fit: 70B at high quality, 100B+
| Model | Size | Performance |
|---|---|---|
| Llama 3.1 70B Q6 | ~55 GB | 8-15 tok/s |
| Qwen 2.5 72B Q8 | ~75 GB | 8-12 tok/s |
| Qwen3 235B Q3 | ~88 GB | 3-5 tok/s |
| Llama 3.1 405B Q2 | ~150 GB | Not practical (too slow) |
At 128GB, you’re in territory no single consumer GPU can reach. The M4 Max 128GB configuration costs ~$3,500 and runs 70B models that would require $1,600+ in dual GPUs on PC. For models above 100B parameters, you’re looking at M3 Ultra with 192GB+ ($5,500+).
Realistic experience: This is the “money is no object, I want the biggest models” tier. 70B at Q6/Q8 is noticeably better than Q4. Models above 100B are possible but slow.
→ Check what fits your hardware with our Planning Tool.
Which Tool Should You Use?
Ollama: Simplicity
Ollama is the easiest way to run local LLMs on Mac. One install, one command, you’re running.
```shell
# Install (the curl script is Linux-only; on macOS use Homebrew
# or download the app from ollama.com)
brew install ollama

# Run a model
ollama run llama3.1
```
Pros:
- Dead simple to use
- Automatic GPU acceleration via Metal
- Handles model downloads and updates
- API server for integrations
Cons:
- Slightly slower than MLX (10-20%)
- Less control over advanced settings
- Model library limited to what’s on ollama.com
Use Ollama when: You want to run models with minimal setup, or you need an API server for other apps.
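Ollama’s API listens on localhost:11434 by default. A minimal stdlib-only sketch of calling `/api/generate` (the request is built but only sent if you uncomment the last lines on a machine with Ollama running):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str,
                           host: str = "http://localhost:11434"):
    """Build a request for Ollama's /api/generate endpoint.
    stream=False asks for a single JSON response instead of a stream."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("llama3.1", "Why is the sky blue?")
# Uncomment with an Ollama server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```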
LM Studio: Visual Interface
LM Studio gives you a ChatGPT-style interface with full control over settings.
Pros:
- Nice GUI with conversation management
- Direct HuggingFace model downloads
- Fine control over temperature, context, sampling
- Built-in server mode
Cons:
- Heavier than command-line tools
- Same underlying speed as Ollama (llama.cpp backend)
Use LM Studio when: You prefer a visual interface, want to browse and download models directly, or need precise parameter control.
MLX: Maximum Speed
MLX is Apple’s machine learning framework built specifically for unified memory. It’s typically 20-30% faster than llama.cpp on Apple Silicon.
```shell
# Install
pip install mlx-lm

# Download and run
mlx_lm.generate --model mlx-community/Meta-Llama-3-8B-Instruct-4bit --prompt "Hello"
```
Pros:
- Fastest inference on Apple Silicon
- Optimized for unified memory
- Growing model library (mlx-community on HuggingFace)
Cons:
- Requires Python knowledge
- Smaller ecosystem than llama.cpp
- Some features still catching up
Use MLX when: You want maximum speed and are comfortable with Python, or you’re building applications on Mac.
Speed Comparison
| Tool | 8B Q4 on M4 Max | Notes |
|---|---|---|
| MLX | ~95-110 tok/s | Fastest |
| Ollama / llama.cpp | ~75-85 tok/s | Most compatible |
| LM Studio | ~75-85 tok/s | Same as Ollama |
The 20-30% speed advantage of MLX is noticeable but not transformative. For most users, Ollama’s simplicity wins over MLX’s speed.
Metal Acceleration: What You Need to Know
Metal is Apple’s GPU framework, equivalent to NVIDIA’s CUDA. Every M-series Mac supports Metal, and every local LLM tool uses it automatically.
You don’t need to do anything to enable it. When you run Ollama, LM Studio, or MLX, Metal acceleration is on by default.
Verifying Metal Is Working
In Ollama:
```shell
ollama ps
# Should show "GPU" in the Processor column, not "CPU"
```
In LM Studio: Check the bottom status bar, which shows GPU usage.
In Activity Monitor: Open GPU History (Window → GPU History). You should see activity when generating.
When Metal Doesn’t Work
Metal issues are rare, but can happen:
- macOS version too old: Metal for LLMs requires macOS 12.6+ (M1) or 13.3+ (M2/M3/M4). Update if you’re behind.
- Tool version too old: Update Ollama/LM Studio to the latest version.
- Memory pressure: If macOS is swapping heavily, performance collapses. Close other apps.
Realistic Performance Expectations
Let’s be honest about what you’re getting.
Speed vs. NVIDIA
| Model | M4 Max 40c | RTX 3090 | Difference |
|---|---|---|---|
| 8B Q4 | ~83 tok/s | ~100 tok/s | NVIDIA 20% faster |
| 14B Q4 | ~38 tok/s | ~55 tok/s | NVIDIA 45% faster |
| 32B Q4 | ~20 tok/s | ~40 tok/s | NVIDIA 100% faster |
| 70B Q4 | ~10 tok/s | ~3 tok/s (offload) | Mac 3x faster |
For models up to 32B, an RTX 3090 is faster. At 70B+, Mac wins because NVIDIA has to offload to system RAM over PCIe, killing performance.
What Speeds Feel Like
| Speed | Experience |
|---|---|
| 80+ tok/s | Instant; faster than you can read |
| 40-80 tok/s | Very responsive; slight typing delay |
| 20-40 tok/s | Comfortable; noticeable but not annoying |
| 10-20 tok/s | Acceptable; clear delay but usable |
| 5-10 tok/s | Slow; patience required |
| <5 tok/s | Painful; only for batch jobs |
Most Mac users land in the 15-50 tok/s range depending on model size. That’s perfectly usable for interactive chat.
Prompt Processing (Prefill)
Prompt processing (how fast the model reads your input) is where Mac struggles most. NVIDIA’s compute advantage is 5-10x here.
For short prompts, you won’t notice. For RAG with long documents or code analysis with large files, prefill times can be frustrating on Mac.
| Prompt Length | M4 Max | RTX 3090 |
|---|---|---|
| 500 tokens | 1-2 sec | <1 sec |
| 2,000 tokens | 3-5 sec | <1 sec |
| 8,000 tokens | 15-30 sec | 2-4 sec |
Mac Mini as an Always-On AI Server
The Mac Mini is quietly one of the best local AI servers you can buy:
- Small and silent: No fan noise at idle, quiet under load
- Low power: 5-15W idle, 30-60W under AI load
- Unified memory: Run larger models than any GPU-based server
- macOS reliability: Set it up and forget it
Recommended Configuration
| Use Case | Config | Price | What It Runs |
|---|---|---|---|
| Budget AI server | Mac Mini M4 16GB | $599 | 7B-8B models |
| Balanced | Mac Mini M4 Pro 24GB | $1,399 | 8B-14B models |
| Power user | Mac Mini M4 Pro 48GB | $1,799 | 14B-32B models |
| Maximum | Mac Mini M4 Pro 64GB | $1,999 | 32B models, 70B tight |
For most home server use, the Mac Mini M4 Pro 48GB at $1,799 hits the sweet spot: enough memory for 32B models and quiet enough to sit in your living room.
Setup as a Headless Server
- Enable Remote Login: System Settings → General → Sharing → Remote Login
- Install Ollama: `brew install ollama` (the curl install script is Linux-only; you can also download the app from ollama.com)
- Enable network access: `launchctl setenv OLLAMA_HOST "0.0.0.0:11434"`, then restart Ollama so it binds to all interfaces
- Set to always-on: System Settings → Energy → Prevent automatic sleeping
Now you can access your Mac Mini from any device on your network:
```shell
curl http://mac-mini-ip:11434/api/generate -d '{"model": "llama3.1", "prompt": "Hello"}'
```
Power Consumption
| State | Power Draw | Annual Cost (US avg $0.12/kWh) |
|---|---|---|
| Idle | 5-7W | ~$6/year |
| Light AI load | 15-25W | ~$20/year |
| Heavy AI load | 30-60W | ~$35/year |
Compare to a PC with an RTX 3090: 150-350W under AI load, ~$200-400/year. The Mac Mini is dramatically cheaper to run 24/7.
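The comparison is simple arithmetic; a sketch you can adapt to your own electricity rate (the wattages are the averages assumed above):

```python
def annual_cost_usd(avg_watts: float, rate_per_kwh: float = 0.12) -> float:
    """Electricity cost of running 24/7 for a year at a given average draw."""
    kwh_per_year = avg_watts * 24 * 365 / 1000
    return kwh_per_year * rate_per_kwh

print(f"Mac Mini idle (6W):            ${annual_cost_usd(6):.0f}/yr")    # $6/yr
print(f"RTX 3090 PC under load (250W): ${annual_cost_usd(250):.0f}/yr")  # $263/yr
```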
Tips for Better Performance
1. Close Memory-Hungry Apps
Safari, Chrome, and Electron apps (Slack, Discord, VS Code) consume significant memory. Each GB they use is a GB you can’t use for models.
Check memory pressure in Activity Monitor. If it’s yellow or red, close applications.
2. Use Appropriate Quantization
Don’t run Q8 models if Q4 fits your needs. The quality difference is subtle; the memory savings are substantial.
| Quantization | Quality | Memory vs FP16 |
|---|---|---|
| Q8_0 | ~99% | ~50% |
| Q6_K | ~98% | ~42% |
| Q5_K_M | ~96% | ~35% |
| Q4_K_M | ~94% | ~28% |
| Q3_K_M | ~88% | ~22% |
For most tasks, Q4_K_M is the sweet spot. Use Q5 or Q6 for coding or reasoning-heavy tasks where precision matters.
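You can estimate any quantized model’s footprint from parameter count and average bits per weight; real GGUF files deviate slightly because some layers keep higher precision:

```python
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate model file size: parameters * average bits per weight / 8.
    Q4_K_M averages roughly 4.5 bits per weight across mixed-precision layers."""
    return params_billions * bits_per_weight / 8

print(f"8B  at ~4.5 bits: {approx_size_gb(8, 4.5):.1f} GB")   # 4.5 GB
print(f"70B at ~4.5 bits: {approx_size_gb(70, 4.5):.1f} GB")  # 39.4 GB
```

These estimates line up with the sizes in the tables above (8B Q4 at ~4.5 GB, 70B Q4 at ~40 GB).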
3. Adjust Context Length
Longer context = more memory. If you’re hitting limits, reduce the context window from inside an interactive Ollama session (`/set` is typed at the `>>>` prompt, not on the shell command line):

```shell
ollama run llama3.1
>>> /set parameter num_ctx 4096
```
4K context is plenty for most conversations. Only use 8K+ when you actually need it.
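Rather than setting the parameter every session, you can bake a smaller context into a model variant with an Ollama Modelfile (`llama3.1-4k` is just a name we chose):

```
FROM llama3.1
PARAMETER num_ctx 4096
```

Build and run it with `ollama create llama3.1-4k -f Modelfile`, then `ollama run llama3.1-4k`.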
4. Use MLX for Speed-Critical Workflows
If you’re running the same model repeatedly and every millisecond counts, MLX’s 20-30% speed advantage adds up.
5. Keep macOS Updated
Apple regularly improves Metal performance. macOS updates have delivered meaningful speed improvements for AI workloads.
Troubleshooting
Model Won’t Load (Out of Memory)
Your model is too large for available memory.
Fixes:
- Close other applications
- Use a smaller quantization (Q4 instead of Q6)
- Use a smaller model
- Reduce context length
Check available memory: Activity Monitor → Memory → Memory Pressure should be green.
Painfully Slow Generation
Check that Metal is being used (see verification section above). If it’s on CPU, something is wrong.
Common causes:
- macOS too old → update
- Tool misconfigured → reinstall
- Heavy memory pressure → close apps
Garbled or Wrong Output
Usually a corrupted model download.
Fix:
ollama rm llama3.1
ollama pull llama3.1
For more issues, see our Local AI Troubleshooting Guide.
Which Mac Should You Buy for Local AI?
| Budget | Recommendation | What You Get |
|---|---|---|
| $599 | Mac Mini M4 16GB | 7B-8B models, basic use |
| $1,399 | Mac Mini M4 Pro 24GB | 8B-14B models, good balance |
| $1,799 | Mac Mini M4 Pro 48GB | 14B-32B models, best value for AI |
| $2,700 | Mac Studio M4 Max 64GB | 32B-70B models |
| $3,500 | Mac Studio M4 Max 128GB | 70B+ models, no compromises |
| $4,000+ | MacBook Pro M4 Max 48-128GB | Portable large model inference |
The sweet spot: Mac Mini M4 Pro 48GB at $1,799. It runs 32B models comfortably, is silent, uses minimal power, and costs less than a used RTX 3090 PC.
The Bottom Line
Apple Silicon Macs are genuinely good for local LLMs. The unified memory architecture lets you load models that would require expensive multi-GPU setups on PC.
The honest tradeoffs:
- Mac is slower than NVIDIA for models that fit in 24GB VRAM
- Mac wins for models larger than 24GB (70B+)
- Mac is simpler โ no driver issues, no CUDA version conflicts
- Mac is quieter and more power-efficient
For most Mac users:
- 8GB: Stick to 3B models
- 16GB: 7B-8B models work great
- 24GB+: 14B models are practical
- 48GB+: 32B models shine
- 64GB+: 70B models become accessible
- 128GB: No practical model limits
Install Ollama, download Llama 3.1 or Qwen 2.5, and start chatting. Your Mac is already an AI workstation.