Local RAG: Search Your Documents with a Private AI
📚 More on this topic: Open WebUI Setup Guide · Text Generation WebUI Guide · Best Models for Chat · VRAM Requirements
You’ve got a local LLM running. It answers general questions fine. But ask it about your company docs, your research notes, or a PDF you downloaded — and it hallucinates confidently. The model doesn’t know your data because it was never trained on it.
RAG fixes this. Instead of retraining the model (expensive, slow, overkill), RAG searches your documents for relevant chunks and feeds them to the LLM as context. The model reads your actual text and answers based on it, not from memory.
The best part: everything stays on your machine. No API calls, no cloud storage, no wondering who’s reading your files. This guide walks through three ways to set it up, from zero-config to fully custom.
What RAG Actually Is
RAG stands for Retrieval-Augmented Generation. Here’s what happens when you ask a question:
- Your question gets converted to an embedding — a numerical representation of its meaning
- The system searches your documents for chunks that are semantically similar to your question
- The most relevant chunks get injected into the prompt as context
- The LLM reads those chunks and generates an answer based on your actual documents
It’s not magic. It’s a search engine feeding results to a language model. The LLM isn’t “learning” your documents — it’s reading them on the fly, every time you ask a question.
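The four steps above can be sketched in a few lines of Python. This toy version uses word overlap in place of a real embedding model, so retrieval quality is crude; the point is the structure (embed, retrieve, inject, generate), not the scoring:

```python
def embed(text):
    # Toy "embedding": a set of lowercase words. A real system would
    # call an embedding model like nomic-embed-text here instead.
    return set(text.lower().split())

def retrieve(question, chunks, k=2):
    # Rank chunks by word overlap with the question (a stand-in for
    # cosine similarity between real embedding vectors).
    scored = sorted(chunks, key=lambda c: -len(embed(question) & embed(c)))
    return scored[:k]

def build_prompt(question, chunks):
    # Inject the retrieved chunks into the prompt as context.
    context = "\n---\n".join(retrieve(question, chunks))
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"

docs = [
    "The invoice process requires manager approval above $5,000.",
    "Our office coffee machine is cleaned every Friday.",
    "Expense reports are due on the last day of each month.",
]
prompt = build_prompt("When are expense reports due?", docs)
print(prompt)  # this prompt, not the bare question, is what the LLM sees
```

The LLM never sees your whole document collection, only the few chunks that scored highest for this particular question.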
What RAG Is Not
- Not fine-tuning. Your model weights don’t change. You can swap models anytime.
- Not a database replacement. RAG is for natural language questions, not structured queries. Don’t use it to ask “how many invoices were over $10K” — use SQL for that.
- Not perfect. If your document chunking is bad, the retrieval is bad, and the answers are bad. Garbage in, garbage out.
RAG vs Fine-Tuning
| | RAG | Fine-Tuning |
|---|---|---|
| Setup time | Minutes | Hours to days |
| Hardware needed | Same as inference | Much more (training) |
| Update documents | Just re-embed | Retrain the model |
| Best for | Answering questions about specific docs | Changing model behavior/style |
| Accuracy on your data | High if chunks are good | Variable, can overfit |
| Cost | Low | High |
For most people reading this: you want RAG. Fine-tuning is for changing how a model behaves, not for teaching it facts.
Hardware Requirements
RAG adds two things on top of normal LLM inference: an embedding model and a vector database. The embedding model is the only one that matters for hardware.
| Setup | VRAM/RAM Needed | What You Can Run |
|---|---|---|
| 8GB VRAM | 7-8B LLM (Q4) + embeddings | Good for personal use |
| 12GB VRAM | 13-14B LLM (Q4) + embeddings | Better quality answers |
| 16GB VRAM | 14B LLM (Q6) + embeddings | Sweet spot |
| 24GB VRAM | 32B LLM (Q4) + embeddings | Excellent quality |
| CPU only (16GB RAM) | 7B LLM (Q4) + embeddings | Works, slower |
| CPU only (32GB RAM) | 13B LLM (Q4) + embeddings | Usable for smaller doc sets |
The embedding model typically uses 200MB-1.2GB of VRAM depending on which one you pick. It runs alongside your LLM, so account for both. The vector database (ChromaDB, FAISS) uses minimal RAM — a few hundred MB even for thousands of documents.
Bottom line: If you can run a local LLM today, you can run RAG. The embedding model is tiny by comparison.
→ Use our Planning Tool to check exact VRAM for your setup.
The Easy Way: Open WebUI + Ollama
If you already have Ollama installed, this is the fastest path. Open WebUI has RAG built in — you upload documents and ask questions. No code, no config files.
Setup
1. Pull an embedding model:
ollama pull nomic-embed-text
2. Install Open WebUI:
# With pip (simplest)
pip install open-webui
open-webui serve
# Or with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
3. Pull a model good at RAG:
ollama pull llama3.1:8b
4. Upload documents: In the Open WebUI chat interface, click the + button and upload PDFs, text files, or markdown. Ask questions about them.
That’s it. For a lot of people, this is all you need.
Critical Setting: Fix the Context Window
By default, Ollama sets num_ctx to 2048 tokens. That’s way too small for RAG — your retrieved chunks plus the question will easily exceed that, causing the model to silently drop context.
Fix it in Open WebUI: go to Settings → Models → your model → Parameters and set num_ctx to at least 8192. For longer documents, use 16384 or 32768 if your VRAM allows.
Or create a Modelfile containing:
FROM llama3.1:8b
PARAMETER num_ctx 16384
Then build it with: ollama create llama3.1-rag -f Modelfile
Open WebUI RAG Settings
Under Settings → Documents, you can tune:
- Chunk size: Default is 1500 characters. Good starting point. Decrease to 500-800 for precise Q&A, increase to 2000-3000 for longer-form answers.
- Chunk overlap: Default is 100 characters. Set to 15-20% of chunk size (so ~225 for 1500-char chunks). This prevents answers from being cut off mid-sentence.
- Embedding model: Select the model you pulled (nomic-embed-text).
- Top K: How many chunks to retrieve per question. Default 4 is fine. Increase to 6-8 for complex questions spanning multiple sections.
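What overlap actually buys you is easiest to see in a minimal character-based chunker. This is a sketch of the general technique, not Open WebUI's implementation:

```python
def chunk(text, size=1500, overlap=225):  # overlap is ~15% of size
    # Step forward by (size - overlap) so each boundary region
    # appears in two chunks: a sentence cut at the end of one chunk
    # shows up whole at the start of the next.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "".join(chr(65 + i % 26) for i in range(4000))
pieces = chunk(text)
# The last 225 chars of chunk 0 are the first 225 chars of chunk 1:
print(pieces[0][-225:] == pieces[1][:225])  # → True
```

Zero overlap would make every chunk boundary a hard cut, which is exactly where retrieval loses the sentence that held the answer.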
When Open WebUI Isn’t Enough
Open WebUI’s RAG works but has limitations:
- No workspace separation — all documents are in one pool
- Limited chunking strategies (character-based only)
- Can’t mix different embedding models per collection
- No hybrid search (keyword + semantic)
If you hit these limits, move to AnythingLLM or the Python approach.
The Power User Way: AnythingLLM
AnythingLLM is a desktop app that gives you more control over RAG without writing code. The key feature: workspaces. You can have separate document collections for different projects, each with their own settings.
Setup
1. Download AnythingLLM from anythingllm.com. Available for Windows, Mac, and Linux.
2. Connect to Ollama: In AnythingLLM settings, set your LLM provider to Ollama and point it at http://localhost:11434.
3. Set your embedding model: Under Settings → Embedding, choose Ollama and select nomic-embed-text.
4. Create a workspace: Give it a name (e.g., “Research Papers” or “Work Docs”).
5. Upload documents: Drag files into the workspace. AnythingLLM chunks and embeds them automatically.
6. Chat: Ask questions and the system retrieves relevant chunks from that workspace only.
Why Choose AnythingLLM Over Open WebUI
| Feature | Open WebUI | AnythingLLM |
|---|---|---|
| Workspaces | No (all docs in one pool) | Yes (separate collections) |
| Supported file types | PDF, TXT, MD | PDF, TXT, MD, DOCX, CSV, and more |
| Chunking control | Basic | More options |
| Vector DB options | Built-in | LanceDB, ChromaDB, Pinecone, others |
| Citation/sources | Basic | Shows which chunks were used |
| Setup difficulty | Easy | Easy |
| Also does chat | Yes (primary purpose) | Yes |
AnythingLLM is the right choice if you want separate document collections for different projects, need to process many file types, or want to see exactly which chunks the model used to answer.
The Developer Way: Python + LangChain
If you want full control — custom chunking, preprocessing, filtering, or integration into your own tools — use Python. This is about 30 lines of working code.
Prerequisites
pip install langchain langchain-community langchain-ollama chromadb
Make sure Ollama is running with a model and embedding model pulled:
ollama pull llama3.1:8b
ollama pull nomic-embed-text
Minimal Working Example
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain.chains import RetrievalQA
# 1. Load documents from a folder
loader = DirectoryLoader("./my_docs", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)
# 3. Create embeddings and store in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
# 4. Create a retrieval chain
llm = ChatOllama(model="llama3.1:8b", num_ctx=8192)
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
return_source_documents=True,
)
# 5. Ask questions
result = qa.invoke({"query": "What are the main findings?"})
print(result["result"])
for doc in result["source_documents"]:
print(f" Source: {doc.metadata['source']}")
Loading Different File Types
from langchain_community.document_loaders import PyPDFLoader, CSVLoader
# PDFs
loader = PyPDFLoader("report.pdf")
# CSVs
loader = CSVLoader("data.csv")
# Multiple file types from a directory
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
Using FAISS Instead of ChromaDB
FAISS is faster for large collections (100K+ chunks):
pip install faiss-cpu # or faiss-gpu if you have CUDA
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("./faiss_index")
# Load later
vectorstore = FAISS.load_local("./faiss_index", embeddings,
allow_dangerous_deserialization=True)
Why Go Custom
- Preprocessing: Strip headers, clean OCR output, normalize formatting before chunking
- Hybrid search: Combine semantic search with keyword (BM25) for better retrieval
- Metadata filtering: Only search documents from a specific date, author, or category
- Pipeline integration: Feed RAG into a larger application, API, or automation
- Evaluation: Measure retrieval quality and answer accuracy programmatically
Best LLMs for RAG
Not all models are equally good at RAG. The model needs to actually read and reason about the provided context, not just generate plausible-sounding text. Models that are good at instruction following and have decent context handling work best.
| VRAM Budget | Model | Why It Works |
|---|---|---|
| 8GB | Llama 3.1 8B (Q4) | Strong instruction following, good at citing context |
| 8GB | Qwen 2.5 7B (Q4) | Excellent at structured answers from context |
| 12GB | Qwen 2.5 14B (Q4) | Noticeable jump in answer quality |
| 16GB | Qwen 2.5 14B (Q6) | Higher quant = better comprehension |
| 24GB | Qwen 2.5 32B (Q4) | Best local RAG quality for most people |
| 24GB | Mistral Small 24B (Q4) | Strong alternative, good at synthesis |
| 48GB+ | Llama 3.1 70B (Q4) | Cloud-quality answers, needs dual GPUs or Mac |
Avoid for RAG: Models that tend to ignore context and rely on training data. Older models (Llama 2, Mistral 7B v0.1) are noticeably worse at staying grounded in retrieved chunks.
Context Length Matters — But Not How You Think
Models advertise 128K context windows, but research consistently shows RAG performance peaks around 16K-32K tokens of actual context. Stuffing more chunks in doesn’t help — it hurts. The model loses focus and starts hallucinating.
Practical rule: Retrieve 4-6 chunks of 1000-1500 characters each. That’s roughly 2000-4000 tokens of context, well within any model’s strong range. If you need more coverage, use better chunking and retrieval rather than cramming more text.
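You can sanity-check that budget with the rough heuristic of ~4 characters per token for English text (an approximation, not a real tokenizer):

```python
def fits_budget(chunks, question, num_ctx=8192, reserve=1024):
    # Estimate prompt tokens at ~4 chars/token and keep `reserve`
    # tokens free for the model's answer.
    est = (sum(len(c) for c in chunks) + len(question)) // 4
    return est, est <= num_ctx - reserve

chunks = ["x" * 1200] * 5  # 5 chunks of 1200 characters each
est, ok = fits_budget(chunks, "What are the main findings?")
print(est, ok)  # → 1506 True
```

At num_ctx 2048 the same five chunks would leave almost no room for the answer, which is the silent-truncation failure described below.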
Best Embedding Models
The embedding model determines how well your system finds relevant chunks. A bad embedding model means the right information never reaches the LLM. This matters more than most people realize.
| Model | Dimensions | Size | MTEB Score | Best For |
|---|---|---|---|---|
| nomic-embed-text | 768 | ~270MB | 62.4 | General purpose, good default |
| mxbai-embed-large | 1024 | ~670MB | 64.7 | Better quality, still reasonable size |
| bge-m3 | 1024 | ~1.2GB | 68.2 | Multilingual, highest quality |
| all-minilm (sentence-transformers) | 384 | ~80MB | 56.3 | Minimal resources, CPU-friendly |
Start with nomic-embed-text. It’s small, fast, and good enough for most use cases. Move to mxbai-embed-large if you notice retrieval quality issues. Use bge-m3 if you work with multiple languages.
Pull them with Ollama:
ollama pull nomic-embed-text
# or
ollama pull mxbai-embed-large
Important: Once you embed your documents with a specific model, you can’t switch without re-embedding everything. The dimensions and vector space are model-specific. Pick one and stick with it, or plan to re-embed if you upgrade.
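The incompatibility is structural: vectors from different models have different dimensions and live in unrelated vector spaces, so similarity between them is meaningless. A quick illustration with a plain cosine-similarity function:

```python
def cosine(a, b):
    if len(a) != len(b):
        raise ValueError("embeddings from different models are not comparable")
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

v_nomic = [0.1] * 768    # e.g. a nomic-embed-text vector (768 dims)
v_mxbai = [0.1] * 1024   # e.g. an mxbai-embed-large vector (1024 dims)
try:
    cosine(v_nomic, v_mxbai)
except ValueError as e:
    print(e)  # → embeddings from different models are not comparable
```

Even if two models happened to share a dimension count, their vectors would still not be comparable; the dimension check just makes the mismatch fail loudly instead of silently.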
Tips for Better RAG Results
These are the most common mistakes and how to fix them.
1. Your Chunks Are Too Big or Too Small
Too big (3000+ chars): Multiple topics per chunk. The LLM gets irrelevant information mixed with relevant information and produces muddled answers.
Too small (200-300 chars): Individual sentences with no context. The LLM gets fragments that don’t make sense on their own.
Sweet spot: 800-1500 characters with 15% overlap. Start at 1000 and adjust based on your documents. Technical docs with clear sections can go larger. Conversational text should go smaller.
2. Your Context Window Is Too Small
The single most common RAG failure. If num_ctx is 2048 (Ollama’s default) and you’re injecting 4 chunks of 1000 characters, you’ve already used most of your context before the model even starts generating. It silently drops the oldest context.
Set num_ctx to at least 8192. Check this first if your answers seem to ignore the documents.
3. The Model Is Ignoring Context
Some models are better at grounding answers in provided context than others. If the model keeps giving generic answers instead of citing your documents:
- Try a different model (Qwen 2.5 and Llama 3.1 are both strong here)
- Add explicit instructions in the system prompt: “Answer ONLY based on the provided context. If the context doesn’t contain the answer, say so.”
- Reduce the number of retrieved chunks — sometimes less is more
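If you assemble prompts yourself, that grounding instruction can live in the template. A minimal sketch; the exact wording is an assumption you should tune for your model:

```python
SYSTEM = (
    "Answer ONLY based on the provided context. "
    "If the context doesn't contain the answer, say so."
)

def grounded_prompt(context, question):
    # Keep the instruction, context, and question visually separated so
    # the model can tell which text is evidence and which is the task.
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}"

print(grounded_prompt("Revenue grew 12% in Q3.", "How did revenue change?"))
```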
4. PDFs Are Messy
PDF text extraction is unreliable. Tables come out garbled. Headers and footers repeat on every page. Columns get merged.
Fixes:
- Use PyMuPDF (fitz) instead of basic PDF extractors — it handles layouts better
- Preprocess extracted text to remove headers/footers and fix formatting
- For scanned PDFs, run OCR first (Tesseract) and clean the output
- Consider converting PDFs to markdown before indexing
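Header and footer removal is mostly frequency counting: a line that repeats on nearly every page is boilerplate. A minimal sketch of the idea (the 0.6 threshold is an assumption to tune for your documents):

```python
from collections import Counter

def strip_repeats(pages, threshold=0.6):
    # Drop any line appearing on more than `threshold` of pages —
    # repeated headers/footers show up on nearly every page, while
    # real content lines appear once.
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    limit = threshold * len(pages)
    return [
        "\n".join(l for l in page.splitlines() if counts[l] <= limit)
        for page in pages
    ]

pages = [
    "ACME Corp Annual Report\nRevenue grew 12%.\nPage 1",
    "ACME Corp Annual Report\nCosts fell 3%.\nPage 2",
    "ACME Corp Annual Report\nOutlook is positive.\nPage 3",
]
print(strip_repeats(pages)[0])  # header line removed, content lines kept
```

Running this before chunking keeps "ACME Corp Annual Report" from polluting every chunk's embedding.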
5. You’re Not Seeing Sources
Always configure your RAG pipeline to return source documents alongside the answer. This lets you verify the answer and catch hallucinations. Open WebUI shows this by default. In LangChain, use return_source_documents=True.
6. Re-Embedding Is Slow
Embedding thousands of documents takes time on first run. After that, only embed new or changed documents. Both ChromaDB and FAISS support incremental updates. AnythingLLM handles this automatically.
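Incremental updating boils down to change detection. One common approach (a sketch of the technique, not how any of these tools implement it internally) hashes each document and re-embeds only when the hash changes:

```python
import hashlib

def needs_embedding(docs, seen_hashes):
    # docs: {path: text}; seen_hashes: {path: sha256 from the last run}.
    # Returns only the documents whose content changed since then.
    changed = {}
    for path, text in docs.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if seen_hashes.get(path) != h:
            changed[path] = text
            seen_hashes[path] = h
    return changed

seen = {}
first = needs_embedding({"a.txt": "hello", "b.txt": "world"}, seen)
second = needs_embedding({"a.txt": "hello", "b.txt": "WORLD!"}, seen)
print(sorted(first), sorted(second))  # → ['a.txt', 'b.txt'] ['b.txt']
```

Persist `seen_hashes` (a JSON file is enough) and the second run only pays for documents that actually changed.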
What Hardware to Buy for RAG
If you’re building or upgrading specifically for local RAG, here’s what to prioritize:
| Budget | GPU | What You Get |
|---|---|---|
| ~$750 | RTX 3090 (used) | 24GB VRAM, runs 32B models + embeddings. Best value for RAG. |
| ~$400 | RTX 4060 Ti 16GB | 16GB VRAM, runs 14B + embeddings. Adequate. |
| ~$550 | RTX 5060 Ti 16GB | 16GB VRAM, newer architecture. See our 16GB guide. |
| ~$2000 | RTX 4090 / 5090 | 24-32GB VRAM, fastest inference. |
| ~$2500+ | Mac Studio M4 Max (128GB) | 128GB unified memory, runs 70B+ models. Mac vs PC comparison. |
For most RAG use cases, an RTX 3090 for ~$750 is hard to beat. 24GB VRAM comfortably runs a 32B model plus embeddings — that’s genuinely excellent RAG quality, and it’s the cheapest way to get 24GB.
If you’re CPU-only, RAG still works — it’s just slower. A 7B model on CPU with 16-32GB RAM handles personal document collections fine. Speed is the tradeoff, not quality.
Getting Started Today
Here’s the fastest path from zero to working RAG:
- Install Ollama if you haven’t already
- Pull a model: ollama pull llama3.1:8b
- Pull an embedding model: ollama pull nomic-embed-text
- Install Open WebUI: pip install open-webui && open-webui serve
- Fix context window: set num_ctx to 8192+ in model settings
- Upload a document and ask it a question
You’ll have private, local document search running within minutes. No API keys, no subscriptions, no data leaving your machine.
If you outgrow Open WebUI, AnythingLLM gives you workspaces and better organization. If you outgrow that, Python and LangChain give you unlimited control. The beauty of local RAG: you own the whole stack, so you can swap any piece at any time.