Self-Hosting an Open WebUI ChatGPT Clone with Model Rotation
If you have ever wanted a ChatGPT-style web app that you fully control, with the freedom to point each conversation at a different model on a different backend, the open source stack has caught up to that wish. The combination most teams converge on right now is Open WebUI for the front end, Ollama for local model hosting, and LiteLLM for proxying remote providers behind an OpenAI-compatible API. In this post I will walk through that exact setup, including a docker compose file, document upload, and multi-user access.
This is not a theoretical post. It is the boring middle of self-hosting where you actually wire things together. If you want a wider survey of the tools in this space, my colleague has a piece on self-hosted AI coding tools that pairs well with this one.
Why this stack
Open WebUI started life as the Ollama Web UI but has grown into something much closer to a polished ChatGPT clone. Its README lists features that go well beyond a chat box. There is a local RAG integration that supports multiple vector databases, including ChromaDB, PGVector, Qdrant, Milvus, and Pinecone. There is web search through providers like SearXNG, Brave, Kagi, and DuckDuckGo, and a hands-free voice and video call mode that can use Whisper or OpenAI for speech to text. There is also role-based access control, LDAP and SSO, and SCIM 2.0 provisioning for Okta, Azure AD, and Google Workspace.
Ollama is the friendlier face of llama.cpp. You install it with a one-line script and run ollama pull llama3.1. It then exposes a REST API on port 11434, with both its native shape and an OpenAI-compatible endpoint under /v1. That OpenAI-compatible piece is what lets it slot in cleanly behind any client that already speaks the OpenAI protocol.
LiteLLM is the glue when you want to mix local and remote models. It sits in front of providers like Anthropic, OpenAI, Bedrock, Vertex AI, and Ollama, and exposes one OpenAI-shaped API. You can also skip it and use Open WebUI's built-in support for OpenAI-compatible APIs to add Anthropic or OpenRouter directly. I will show both approaches.
A docker compose starting point
Open WebUI ships an official image at ghcr.io/open-webui/open-webui:main. Their README also documents a bundled image that includes Ollama for GPU machines, but I find it cleaner to keep services separate so you can restart one without the other.
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=True
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    restart: unless-stopped
volumes:
  ollama:
  open-webui:
Bring it up with docker compose up -d, hit http://localhost:3000, and the first account you create becomes the admin. From there, pull a model, either from the admin settings or by running ollama pull llama3.1 (or whatever fits your hardware) inside the Ollama container. Open WebUI lists the model automatically because it queries the Ollama instance at the OLLAMA_BASE_URL you set.
For GPU support, the Open WebUI README documents adding --gpus all to the run command and using the cuda tagged image. The compose translation is to add a deploy block with nvidia device reservations on the Ollama service, since that is the container actually running the model.
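As a minimal sketch of that deploy block, assuming the NVIDIA Container Toolkit is already installed on the host, the fragment below merges into the ollama service from the compose file above. The count: all value is my choice for illustration; an integer or a device_ids list also works if you want to reserve specific GPUs.

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            # Assumes the NVIDIA Container Toolkit is installed on the host.
            - driver: nvidia
              count: all            # or an integer to reserve fewer GPUs
              capabilities: [gpu]
```

The Open WebUI container does not need the reservation in this layout, because inference happens in Ollama.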
Adding remote models
Open WebUI supports OpenAI-compatible APIs out of the box. In the admin settings under Connections you can add base URLs and API keys for OpenAI itself, Anthropic via a compatible shim, or any provider. Each connection becomes a model source you can pick per chat.
That works fine for two or three providers. Once you start mixing local models, multiple cloud providers, and want unified spend tracking, LiteLLM earns its keep. Run it as a sidecar:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    volumes:
      - ./litellm.config.yaml:/app/config.yaml
    command: ["--config", "/app/config.yaml"]
    ports:
      - "4000:4000"
The LiteLLM README describes virtual key management, per-user and per-project spend tracking, load balancing across deployments, and request routing with retry and fallback logic. Its config takes a list of model entries, each pointing at an upstream provider, and exposes them all under a single OpenAI-shaped endpoint. In Open WebUI, point one OpenAI-compatible connection at http://litellm:4000 and every model you defined shows up in the model picker.
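To make the config shape concrete, here is a sketch of what litellm.config.yaml might contain for one OpenAI model, one Anthropic model, and the local Ollama instance. The model_name aliases and the specific upstream model IDs are illustrative assumptions, so swap in whatever your accounts and hardware support.

```yaml
# litellm.config.yaml (illustrative sketch; aliases and model IDs are assumptions)
model_list:
  - model_name: gpt-4o                 # alias shown in Open WebUI
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY      # read from the container env
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: llama3.1-local
    litellm_params:
      model: ollama/llama3.1
      api_base: http://ollama:11434    # the ollama service from the compose file
```

The os.environ/ prefix tells LiteLLM to read the key from the environment variables passed to the sidecar, which keeps secrets out of the file. If you prefer to wire the connection through configuration instead of the admin UI, Open WebUI also reads OPENAI_API_BASE_URL and OPENAI_API_KEY from its environment.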
Documents and RAG
Click the plus icon in any chat and you can upload a PDF, a Word doc, or a folder of text files. Open WebUI's RAG integration handles the chunking and embedding for you. By default it uses an internal vector store, but the README lists ChromaDB, PGVector, Qdrant, Milvus, Elasticsearch, OpenSearch, Pinecone, S3Vector, and Oracle 23ai as options you can switch to via environment variables. For a small team, the default is fine. For a larger workspace, point it at PGVector and your existing Postgres and call it a day.
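As a rough sketch of the PGVector switch, the fragment below sets the two environment variables I would expect to need on the open-webui service. The hostname postgres, the webui user and database, and the PG_PASSWORD variable are all hypothetical placeholders. Check the variable names against the current Open WebUI environment documentation before relying on them.

```yaml
services:
  open-webui:
    environment:
      # Select the vector store backend (assumed value: pgvector).
      - VECTOR_DB=pgvector
      # Hypothetical connection string; replace host, user, and database.
      - PGVECTOR_DB_URL=postgresql://webui:${PG_PASSWORD}@postgres:5432/webui
```

The target Postgres instance needs the pgvector extension available, so confirm that before pointing a production workspace at it.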
There is also a Knowledge feature where admins can build curated collections that all users can query. Think of it as a shared folder of source material that any chat can pull from by referencing the collection with the # prefix in the message box.
Multi-user setup
With WEBUI_AUTH=True in the environment, as in the compose file above, Open WebUI requires login. The first registered account is the admin; subsequent registrations are queued for approval by default. From the admin panel you can create user groups, set per-group model access, and toggle features like web search or document upload. The README documents granular permissions for who can pull models, who can create custom prompts, and who can manage knowledge bases.
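If you would rather pin the signup behaviour in configuration than rely on defaults, a sketch like the following should do it. ENABLE_SIGNUP and DEFAULT_USER_ROLE are the variable names as I understand them from the Open WebUI environment documentation, so treat them as assumptions and verify them against your version.

```yaml
services:
  open-webui:
    environment:
      - WEBUI_AUTH=True            # require login
      - ENABLE_SIGNUP=True         # allow self-registration
      - DEFAULT_USER_ROLE=pending  # new accounts wait for admin approval
```

Setting ENABLE_SIGNUP=False after the team is onboarded is a reasonable way to close the door once your identity provider takes over.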
For an actual production rollout, plug in your identity provider. Open WebUI supports OAuth, trusted header SSO, LDAP, and SCIM 2.0. The SCIM piece matters if you have an existing IdP that should be the source of truth for who joins or leaves the team.
Where it bites
Two practical notes. First, the Ollama base URL needs to be reachable from inside the Open WebUI container. If you skip docker compose and run Ollama on the host, use host.docker.internal on Mac and Windows, or --network host on Linux. With host networking the port mapping no longer applies, so the UI is served on port 8080 instead of 3000. Second, RAG quality depends on your embedding model and chunk size. The defaults are reasonable, but if your documents are technical or non-English, swap the embedding model in the admin settings.
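On Linux there is an alternative to host networking that keeps the port mapping intact: map host.docker.internal to the host gateway yourself. The fragment below is a sketch of that approach for a host-installed Ollama, assuming Docker 20.10 or newer and an Ollama that listens on an address the Docker bridge can reach (for example by setting OLLAMA_HOST=0.0.0.0 on the host).

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    extra_hosts:
      # Makes host.docker.internal resolve to the Docker host on Linux.
      - "host.docker.internal:host-gateway"
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3000:8080"
```

This keeps the UI on port 3000 and avoids exposing every container port on the host network.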
For the official source, the Open WebUI repo lives at https://github.com/open-webui/open-webui and tracks new features fast.
Tools mentioned in this post
- Open WebUI: self-hosted ChatGPT-style web app with RAG, voice, RBAC, and OpenAI-compatible API support.
- Ollama: local model runtime with REST API and OpenAI-compatible endpoint, built on llama.cpp.
- LiteLLM: proxy gateway that unifies 100+ LLM providers behind one OpenAI-shaped API with routing, fallbacks, and spend tracking.