Building a Private RAG Stack with Ollama, Qdrant, and AnythingLLM

Billy C

If your documents are sensitive, your compliance team is grumpy, or you just do not want every paragraph of every internal wiki page going through an external API, you have one real option: run the whole retrieval-augmented generation stack yourself. The good news is that the open-source pieces have matured to the point where you can stand up a useful private RAG system in an afternoon.

This post walks through a stack that I have found to be a sane default: Ollama for the local model, Qdrant for the vector store, and AnythingLLM for the front end and ingestion. We will also touch on Onyx as the heavier alternative when AnythingLLM hits its ceiling.

Why these three

Each of these projects covers a clear slice of the stack:

  • Ollama is the local LLM runner. It pulls a model, exposes an HTTP API on localhost:11434, and gets out of your way. The README lists many supported models, including Llama, Mistral, Qwen, DeepSeek, and Gemma, served through the llama.cpp backend.
  • Qdrant is the vector database. It is written in Rust, Apache 2.0 licensed, and offers on-disk storage, payload filtering, hybrid search through sparse and dense vectors, and quantization to keep memory in check.
  • AnythingLLM is the application layer. It wraps document ingestion, chunking, embedding, retrieval, and the chat UI. It supports many vector stores, Qdrant among them, and many LLM providers, Ollama among them.

If you want a deeper introduction to self-hosted setups in general, our piece on self-hosted AI coding tools covers the broader picture and many of the same trade-offs apply here.

Docker Compose layout

Here is a minimal compose file that gets all three running on one box. Adjust volumes and ports for your environment.

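# Pin version tags instead of latest if you need reproducible deploys.
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"   # Ollama HTTP API
    volumes:
      - ollama-data:/root/.ollama   # pulled models persist here

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"   # REST API and dashboard
      - "6334:6334"   # gRPC
    volumes:
      - qdrant-data:/qdrant/storage   # collections persist here

  anythingllm:
    image: mintplexlabs/anythingllm:latest
    ports:
      - "3001:3001"   # web UI
    environment:
      - STORAGE_DIR=/app/server/storage
    volumes:
      - anythingllm-data:/app/server/storage   # workspaces and settings

# Named volumes so data survives container restarts.
volumes:
  ollama-data:
  qdrant-data:
  anythingllm-data: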

Bring it up with docker compose up -d. Hit http://localhost:3001 for AnythingLLM and http://localhost:6333/dashboard for the Qdrant dashboard. To confirm Ollama is healthy, open http://localhost:11434, which should answer with "Ollama is running".
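If you would rather script that check than click through three tabs, here is a minimal sketch using only the Python standard library. The URLs come from the port mappings in the compose file; the names ENDPOINTS and is_up are my own, and "answers with a non-error status" is a deliberately crude definition of healthy.

```python
"""Reachability check for the three services in the compose file."""
import urllib.request

# Host-side URLs taken from the port mappings in the compose file.
ENDPOINTS = {
    "ollama": "http://localhost:11434",
    "qdrant": "http://localhost:6333",
    "anythingllm": "http://localhost:3001",
}


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        # URLError, HTTPError, timeouts, and refused connections all land here.
        return False


if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        state = "up" if is_up(url) else "DOWN"
        print(f"{name:12s} {state:5s} {url}")
```

A container that is still starting will show as DOWN for a few seconds, so give the stack a moment before you start debugging.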

Pulling a model into Ollama

You need at least one chat model and one embedding model. From the host, in the directory that holds the compose file:

docker compose exec ollama ollama pull llama3.1:8b
docker compose exec ollama ollama pull nomic-embed-text

An 8B-class chat model is a reasonable starting point on a single-GPU machine. The nomic-embed-text embedding model is small, fast, and a sensible default. If your domain has heavy jargon, you can switch to a domain-specific embedder later, though vectors from different embedders are not comparable, so switching means re-embedding your documents.
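To see the embedding model doing its job before AnythingLLM gets involved, you can call Ollama directly. The sketch below assumes Ollama's /api/embed endpoint with a model and an input list, and the three example sentences are made up. The cosine helper is plain math and works on any pair of vectors.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # host-side port from the compose file


def embed(texts, model="nomic-embed-text"):
    """Return one embedding vector per input string via Ollama's /api/embed."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Illustrative sentences only; swap in text from your own documents.
    sentences = [
        "How do I rotate the API keys?",
        "Key rotation procedure for service credentials",
        "Office holiday schedule",
    ]
    try:
        query, related, unrelated = embed(sentences)
    except OSError as exc:
        print(f"Ollama not reachable or model missing: {exc}")
    else:
        print("related:  ", round(cosine(query, related), 3))
        print("unrelated:", round(cosine(query, unrelated), 3))
```

If the related pair does not score clearly higher than the unrelated one on sentences from your own domain, that is an early hint to try a different embedder.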

Wiring AnythingLLM to Qdrant and Ollama

In AnythingLLM, point the LLM provider at Ollama using the URL http://ollama:11434. The service name works as a hostname because Compose puts all the containers on a shared Docker network. Pick the model you pulled, for example llama3.1:8b.

For embeddings, use the same Ollama instance and point it at nomic-embed-text.

For the vector database, choose Qdrant and point it at http://qdrant:6333. AnythingLLM will create collections automatically as you create workspaces.
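If a model does not show up when you go to pick it, it is worth confirming that Ollama actually has it. Here is a small sketch that reads Ollama's /api/tags listing and reports what is missing. It assumes untagged pulls are listed with a :latest suffix; the function names are mine.

```python
import json
import urllib.request


def normalize(name: str) -> str:
    """Assume Ollama lists untagged pulls as name:latest and compare on that form."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_response: dict, required) -> list:
    """Return the required models that do not appear in an /api/tags response."""
    available = {m["name"] for m in tags_response.get("models", [])}
    return [r for r in required if normalize(r) not in available]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    required = ["llama3.1:8b", "nomic-embed-text"]
    try:
        missing = missing_models(fetch_tags(), required)
    except OSError as exc:
        print(f"Ollama not reachable: {exc}")
    else:
        print("all models present" if not missing else f"still need to pull: {missing}")
```

Note that the script runs on the host and so talks to localhost:11434, while AnythingLLM inside its container uses http://ollama:11434 for the same service.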

Document ingestion: where most quality lives

You can drop PDFs, DOCX, TXT, and other formats straight into a workspace through the AnythingLLM UI. Each workspace becomes its own collection in Qdrant, which gives you tenant isolation by default.

A few things to think about during ingestion:

  • Chunk size. Defaults are usually 500 to 1000 tokens with a small overlap. Smaller chunks improve precision but hurt synthesis. Larger chunks improve synthesis but blunt retrieval.
  • Metadata. Even simple metadata such as the source filename, date, and section helps a lot during filtered retrieval, and Qdrant's payload filtering makes good use of it.
  • Cleaning. Removing nav menus, footers, and boilerplate before ingestion is the single highest-leverage thing you can do for retrieval quality.
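To make the chunk size and overlap trade-off concrete, here is a minimal sketch of a sliding-window chunker that also attaches metadata to each chunk. It is an illustration, not AnythingLLM's splitter: it counts whitespace-separated words as a stand-in for tokens, and the record layout (text, source, section, chunk_index) is a hypothetical schema of my own.

```python
def chunk_words(text, size=200, overlap=20):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


def to_records(text, source, section=None, size=200, overlap=20):
    """Pair every chunk with simple metadata that a payload filter could use later."""
    return [
        {
            "text": chunk,
            "metadata": {"source": source, "section": section, "chunk_index": i},
        }
        for i, chunk in enumerate(chunk_words(text, size=size, overlap=overlap))
    ]


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for record in to_records(sample, source="handbook.txt", size=4, overlap=1):
        print(record["metadata"]["chunk_index"], record["text"])
```

Running it on a real document at a few different sizes and reading the chunks by eye is a quick way to see whether sentences and tables are being cut in unhelpful places.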

Retrieval tuning

Once you have data flowing, here is where to spend your time:

  • Hybrid search. Qdrant supports sparse plus dense vectors, an analogue to BM25 plus semantic search. AnythingLLM exposes the knobs that drive this, and turning it on usually helps on technical content full of acronyms and exact tokens.
  • Top-k. Start at five and only raise it when you have a real reason. Larger top-k feeds more context, which is good, but it also feeds more noise.
  • Reranking. If you have headroom in your latency budget, route the top-k through a rerank step before passing to the LLM. AnythingLLM supports reranker integrations.
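For intuition on how hybrid search and top-k interact, here is a small sketch of reciprocal rank fusion (RRF), one common way to merge a dense ranking with a sparse one. I am using it purely as an illustration: the document IDs are made up, k=60 is just the conventional constant, and this is not a claim about how AnythingLLM or Qdrant combine results internally.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Merge ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def top_k(rankings, limit=5, k=60):
    """Fuse the rankings, then keep only the first `limit` results."""
    return rrf_fuse(rankings, k=k)[:limit]


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # hypothetical semantic-search order
    sparse = ["c", "a", "d"]  # hypothetical keyword-search order
    print(top_k([dense, sparse], limit=3))
```

Documents that both retrievers agree on ("a" and "c" here) outrank ones that only a single retriever found, which is the behavior you want on acronym-heavy content.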

When to graduate to Onyx

AnythingLLM is the friendliest path. It is a single container, has a clean UI, and covers most needs. If you outgrow it, the natural step up is Onyx. Onyx ships with 50-plus connectors for SaaS sources, agentic RAG with hybrid search, deep research multi-step flows, code execution sandboxes, role-based access control, and full audit logging. It is more work to operate and more work to deploy, but it is built for organizations rather than a single team.

Verifying everything end to end

A quick smoke test from the command line:

curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"hello","stream":false}'

curl http://localhost:6333/collections

If the first command returns generated text and the second returns a JSON list of collections (which may be empty on a fresh install), the plumbing is right. From there it is all about your data and your retrieval choices.
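The same smoke test can be scripted so it checks the responses rather than relying on you eyeballing them. This is a sketch under two assumptions about response shape: that a non-streaming /api/generate reply carries "response" and "done" fields, and that Qdrant's /collections reply nests names under result.collections. The helper names are mine.

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"
QDRANT = "http://localhost:6333"


def fetch_json(url, body=None, timeout=120):
    """GET the URL, or POST `body` as JSON when one is given."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def generate_ok(payload: dict) -> bool:
    """True when a non-streaming generate reply finished and contains text."""
    return payload.get("done") is True and bool(payload.get("response", "").strip())


def collection_names(payload: dict) -> list:
    """Pull collection names out of a Qdrant /collections reply."""
    return [c["name"] for c in payload.get("result", {}).get("collections", [])]


if __name__ == "__main__":
    prompt = {"model": "llama3.1:8b", "prompt": "hello", "stream": False}
    try:
        generated = fetch_json(f"{OLLAMA}/api/generate", prompt)
        collections = fetch_json(f"{QDRANT}/collections", timeout=10)
    except OSError as exc:
        print(f"stack not reachable: {exc}")
    else:
        print("ollama generate:", "ok" if generate_ok(generated) else "unexpected reply")
        print("qdrant collections:", collection_names(collections))
```

The first generate call can take a while because Ollama has to load the model into memory, which is why the timeout is generous.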

Closing thoughts

A private RAG stack is no longer exotic. With three open-source projects, a few container images, and a careful pass over your documents, you can serve internal queries without sending a single token to a third party. Start small, measure retrieval quality on real questions from real users, and only add complexity when you can show it helps.

Source code for the three projects: Ollama on GitHub, Qdrant on GitHub, and AnythingLLM on GitHub.

Tools mentioned in this post

  • Ollama: local LLM runtime that exposes a simple HTTP API.
  • Qdrant: Rust-based vector database with hybrid search and on-disk storage.
  • AnythingLLM: all-in-one document chat application with broad vector store and LLM support.
  • Onyx: heavier open-source enterprise search and RAG platform with many connectors.