From OpenAI to LiteLLM: Cutting the AI Bill with Smart Routing
From OpenAI to LiteLLM: Cutting the AI Bill with Smart Routing
For about two years our backend talked to one provider. The OpenAI Python SDK was imported in roughly forty places, the API key was in one secret, and life was simple. Then a few things happened. Anthropic released a model that was clearly better for our document summarization endpoint. We started running Ollama locally for an internal tool that did not need cloud quality. And our finance team asked, politely, if we could stop spending the GDP of a small island on chat tokens.
The pragmatic answer was LiteLLM. This post is a write-up of how we adopted it, what worked, and where it did not pull its weight. If you are also navigating the multi-provider question, see also how to evaluate AI developer tools for some general principles that helped us avoid analysis paralysis.
What LiteLLM actually is
The LiteLLM README describes it as an open source AI gateway with drop-in OpenAI compatibility. It supports the standard OpenAI endpoints including chat completions, embeddings, images, and audio, plus Anthropic's messages endpoint, in front of 100+ providers. You can use it two ways. The first is as a Python library you import: from litellm import completion and you pass a model string like anthropic/claude-3.5-sonnet or openai/gpt-4o-mini. The second is the proxy server, which is what we run.
The proxy is a small service that exposes an OpenAI-shaped API. Your apps keep talking OpenAI. The proxy translates to whatever upstream you configured. You manage virtual keys, routing rules, and budgets through a config file or admin UI.
The migration
Step one was the boring part: pointing every existing call site at the proxy URL instead of https://api.openai.com/v1. Because LiteLLM is OpenAI-compatible, this was a base URL change and nothing else. The OpenAI SDK accepts a base_url parameter, and once we set that, behavior was identical.
Step two was actually defining models. LiteLLM's config takes a list of model entries. Each entry has a name your apps will use, an upstream model and provider, and any provider-specific overrides.
model_list:
- model_name: chat-default
litellm_params:
model: openai/gpt-4o-mini
- model_name: chat-quality
litellm_params:
model: anthropic/claude-3-5-sonnet-latest
- model_name: chat-local
litellm_params:
model: ollama/llama3.1
api_base: http://ollama:11434
Now chat-default, chat-quality, and chat-local are first class names from the app's perspective. We do not have to know that quality means Anthropic and local means Ollama. That decoupling is the whole point.
Routing and fallbacks
The next layer is routing. LiteLLM lets you define a list of upstreams under a single model name and balance load across them. The README describes load balancing across multiple deployments, request routing, and retry and fallback logic.
Two routing patterns earned their keep for us. First, weighted load balancing across two regional Azure OpenAI deployments to absorb rate limit spikes. Second, fallback chains, so when our primary provider returned a 429 or a 5xx, the request transparently retried against the secondary. We learned the hard way to make the fallback explicit per model rather than global, since some workloads care about determinism in ways others do not.
What we did not do, even though LiteLLM supports it, is automatic cost-based routing where every request shops for the cheapest provider. The latency variance was not worth the savings, and our usage profile already had a clear "small fast vs big quality" split that we wanted to make explicit per call site.
Observability and budgets
The bill anxiety was the original motivation, and this is where the LiteLLM proxy paid off in a way the bare library cannot. The proxy includes per user and per project spend tracking and an admin dashboard for operational monitoring. We assigned a virtual key per service, attached a monthly budget, and got real numbers without instrumenting forty call sites.
When a key hits its soft budget, we get a Slack ping. When it hits the hard budget, the proxy returns 429 and the calling service handles it like any other rate limit. That bit is worth thinking through up front. You do not want a forgotten background job to lock you out of production.
LiteLLM also forwards request metadata to popular logging stacks. The README mentions comprehensive logging and observability integration, and in practice we ship traces to our existing OTLP collector. The cardinality is fine because the proxy normalizes provider names.
When to keep using one provider
LiteLLM is an obvious win when you have multiple real providers in play, when finance wants a unified bill, or when you are mixing local and cloud models for a self-hosted product. It is overkill in three cases.
The first is a small project with one provider and no immediate plans to add another. The proxy is one more thing to deploy. Unless you have a concrete second provider on the roadmap, the import is fine.
The second is a workload where every millisecond matters and you cannot tolerate the proxy hop. LiteLLM is fast, but it is still a network hop in front of a network hop. For interactive code completion at the cursor level, we kept a direct provider connection.
The third is a workload that needs provider-specific features the OpenAI shape does not express. Anthropic's prompt caching, OpenAI's structured output JSON schema, and provider-specific thinking modes all work through LiteLLM, but feature lag is real. New flagship features sometimes take a week or two to land in the proxy. If you are an early adopter team, account for that.
Practical notes
A few things we wish we had known. First, model name discipline matters. Once an app uses chat-default, that name is a contract. Renaming it across services is just like renaming a database column: doable but annoying. Pick names you will be happy with in eighteen months.
Second, set sensible default timeouts. LiteLLM's defaults are reasonable, but provider hiccups manifest as long-tail latencies, and you will want a circuit breaker pattern in your client.
Third, when you add a new provider, do it on a single low-stakes endpoint first. We made the mistake of swapping a high-traffic summarization endpoint to a new provider in one go and then debugging quality regressions in production. Canary it.
The official LiteLLM repo and docs live at https://github.com/BerriAI/litellm and the proxy server documentation is the best entry point if you are evaluating the gateway pattern.
Tools mentioned in this post
Related Tools
More Articles
Self-Hosting an Open WebUI ChatGPT Clone with Model Rotation
A practical walkthrough for standing up Open WebUI on your own box, plugging Ollama in for local models, and rotating to remote backends per chat through a unified proxy.
Building a Private RAG Stack with Ollama, Qdrant, and AnythingLLM
An end-to-end blueprint for a fully self-hosted RAG system using Ollama for inference, Qdrant for the vector store, and AnythingLLM for ingestion and chat.