Building a Private RAG Stack with Ollama, Qdrant, and AnythingLLM
If your documents are sensitive, your compliance team is grumpy, or you just do not want every paragraph of every internal wiki page going through an external API, you have one real option: run the whole retrieval-augmented generation stack yourself. The good news is that the open-source pieces have matured to the point where you can stand a useful private RAG system up in an afternoon.
This post walks through a stack that I have found to be a sane default: Ollama for the local model, Qdrant for the vector store, and AnythingLLM for the front end and ingestion. We will also touch on Onyx as the heavier alternative when AnythingLLM hits its ceiling.
Why these three
Each of these projects covers a clear slice of the stack:
- Ollama is the local LLM runner. It pulls a model, exposes an HTTP API on localhost:11434, and gets out of your way. The README lists many supported models, including Llama, Mistral, Qwen, DeepSeek, and Gemma, served through the llama.cpp backend.
- Qdrant is the vector database. Written in Rust, Apache 2.0 licensed, with on-disk storage, payload filtering, hybrid search through sparse and dense vectors, and quantization to keep memory in check.
- AnythingLLM is the application layer. It wraps document ingestion, chunking, embedding, retrieval, and the chat UI. It supports many vector stores, Qdrant among them, and many LLM providers, Ollama included.
If you want a deeper introduction to self-hosted setups in general, our piece on self-hosted AI coding tools covers the broader picture and many of the same trade-offs apply here.
Docker Compose layout
Here is a minimal compose file that gets all three running on one box. Adjust volumes and ports for your environment.
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant-data:/qdrant/storage
  anythingllm:
    image: mintplexlabs/anythingllm:latest
    ports:
      - "3001:3001"
    environment:
      - STORAGE_DIR=/app/server/storage
    volumes:
      - anythingllm-data:/app/server/storage
volumes:
  ollama-data:
  qdrant-data:
  anythingllm-data:
Bring it up with docker compose up -d. Hit http://localhost:3001 for AnythingLLM, http://localhost:6333/dashboard for the Qdrant dashboard, and http://localhost:11434 to confirm Ollama is healthy; it should answer with a short "Ollama is running" message.
Pulling a model into Ollama
You need at least one chat model and one embedding model. From the host:
docker exec -it $(docker ps -qf name=ollama) ollama pull llama3.1:8b
docker exec -it $(docker ps -qf name=ollama) ollama pull nomic-embed-text
The 8B class chat model is a reasonable starting point on a single-GPU machine. The nomic-embed-text embedding model is small, fast, and a sensible default. If your domain has heavy jargon you can switch to a domain-specific embedder later.
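To confirm that the embedding model is actually serving vectors, here is a minimal sketch using only the Python standard library. It assumes Ollama is reachable from the host at http://localhost:11434 (from another container on the same Docker network it would be http://ollama:11434) and that the /api/embeddings endpoint is available in your Ollama version. The helper names build_embedding_request, embed, and cosine are illustrative and not part of any of these projects.

```python
import json
import math
import urllib.request

# Assumption: the port is published on the host as in the compose file above.
OLLAMA_URL = "http://localhost:11434"


def build_embedding_request(text, model="nomic-embed-text"):
    """Build (without sending) a POST request for Ollama's /api/embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, model="nomic-embed-text"):
    """Send the request and return the embedding as a list of floats."""
    request = build_embedding_request(text, model)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read())["embedding"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With the stack running, cosine(embed("reset my VPN password"), embed("how do I change VPN credentials")) should come out noticeably higher than the score for an unrelated pair of sentences. If it does not, the embedder is probably not the one you think you pulled.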
Wiring AnythingLLM to Qdrant and Ollama
In AnythingLLM, point the LLM provider at Ollama using the URL http://ollama:11434 since both containers share a Docker network. Pick the model you pulled, for example llama3.1:8b.
For embeddings, use the same Ollama instance and point it at nomic-embed-text.
For the vector database, choose Qdrant and point it at http://qdrant:6333. AnythingLLM will create collections automatically as you create workspaces.
Document ingestion: where most quality lives
You can drop PDFs, DOCX, TXT, and other formats straight into a workspace through the AnythingLLM UI. Each workspace becomes its own collection in Qdrant, which gives you tenant isolation by default.
A few things to think about during ingestion:
- Chunk size. Defaults are usually 500 to 1000 tokens with a small overlap. Smaller chunks improve precision but hurt synthesis. Larger chunks improve synthesis but blunt retrieval.
- Metadata. Even simple metadata such as the source filename, date, and section helps a lot during filtered retrieval, and Qdrant's payload filtering makes good use of it.
- Cleaning. Removing nav menus, footers, and boilerplate before ingestion is the single highest-leverage thing you can do for retrieval quality.
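To make the chunk size and overlap trade-off concrete, here is a rough sketch of overlapping chunking with simple metadata attached to each chunk. It is not AnythingLLM's splitter: it counts whitespace-separated words as a stand-in for tokens, and the function names and payload keys (text, source, chunk_index) are hypothetical choices for illustration.

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into chunks of up to `size` words.

    Each chunk after the first repeats the last `overlap` words of the
    previous chunk. Words are a crude approximation of tokens.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def to_payloads(text, source, size=200, overlap=20):
    """Pair each chunk with metadata that a payload filter could use later."""
    return [
        {"text": chunk, "source": source, "chunk_index": i}
        for i, chunk in enumerate(chunk_words(text, size, overlap))
    ]
```

Running the same document through a few size and overlap combinations and eyeballing where sentences get cut is a cheap way to pick settings before re-embedding everything.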
Retrieval tuning
Once you have data flowing, here is where to spend your time:
- Hybrid search. Qdrant supports sparse plus dense vectors, an analogue to BM25 plus semantic search. AnythingLLM exposes the knobs that drive this, and turning it on usually helps on technical content full of acronyms and exact tokens.
- Top-k. Start at five and only raise it when you have a real reason. A larger top-k gives the model more context, but it also feeds in more noise and lengthens the prompt.
- Reranking. If you have headroom in your latency budget, route the top-k through a rerank step before passing to the LLM. AnythingLLM supports reranker integrations.
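One common way to combine a sparse (keyword-style) ranking with a dense (semantic) ranking is reciprocal rank fusion. The sketch below is a plain-Python illustration of the idea and of the top-k cutoff, not the code that Qdrant or AnythingLLM runs. The function name rrf_fuse is hypothetical, and k=60 is a conventional smoothing constant rather than a tuned value.

```python
def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    A document's score is the sum of 1 / (k + rank) over every list it
    appears in, with rank starting at 1. Higher scores rank first; ties
    are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


dense_hits = ["a", "b", "c"]   # e.g. from the semantic search
sparse_hits = ["c", "a", "d"]  # e.g. from the keyword-style search
print(rrf_fuse([dense_hits, sparse_hits], top_k=3))  # prints ['a', 'c', 'b']
```

Documents that appear in both lists ("a" and "c" here) float to the top, which is the behavior you want on content full of acronyms and exact tokens.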
When to graduate to Onyx
AnythingLLM is the friendliest path. It is a single container, has a clean UI, and covers most needs. If you outgrow it, the natural step up is Onyx. Onyx ships with 50-plus connectors for SaaS sources, agentic RAG with hybrid search, deep research multi-step flows, code execution sandboxes, role-based access control, and full audit logging. It takes more work to deploy and to operate, but it is built for whole organizations rather than a single team.
Verifying everything end to end
A quick smoke test from the command line:
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"hello","stream":false}'
curl http://localhost:6333/collections
If the first command returns text and the second lists collections, the plumbing is right. From there it is all about your data and your retrieval choices.
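If you would rather script that check, the following is a minimal standard-library sketch of the same two calls. It assumes the reply shapes I would expect from these endpoints: a "response" field in a non-streaming /api/generate reply, and result.collections[].name in the Qdrant /collections reply. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_json(url, payload=None, timeout=120):
    """GET the URL, or POST `payload` as JSON when one is given, and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())


def check_generate(reply):
    """True when a non-streaming /api/generate reply contains non-empty text."""
    return bool(str(reply.get("response", "")).strip())


def collection_names(reply):
    """Extract collection names from a Qdrant GET /collections reply."""
    return [c["name"] for c in reply.get("result", {}).get("collections", [])]


# Usage, with the stack running (URLs assume the published ports above):
#   gen = fetch_json("http://localhost:11434/api/generate",
#                    {"model": "llama3.1:8b", "prompt": "hello", "stream": False})
#   cols = fetch_json("http://localhost:6333/collections")
#   print(check_generate(gen), collection_names(cols))
```

An empty collection list is expected until you create your first AnythingLLM workspace and add a document to it.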
Closing thoughts
A private RAG stack is no longer exotic. With three open-source projects, a few container images, and a careful pass over your documents, you can serve internal queries without sending a single token to a third party. Start small, measure retrieval quality on real questions from real users, and only add complexity when you can show it helps.
Source code for the three projects: Ollama on GitHub, Qdrant on GitHub, and AnythingLLM on GitHub.
Tools mentioned in this post
- Ollama: local LLM runtime that exposes a simple HTTP API.
- Qdrant: Rust-based vector database with hybrid search and on-disk storage.
- AnythingLLM: all-in-one document chat application with broad vector store and LLM support.
- Onyx: heavier open-source enterprise search and RAG platform with many connectors.