Self-Hosted AI Coding Tools: Run Your Own Copilot
Not everyone can send their code to OpenAI or Anthropic. Regulated industries, government contractors, and security-conscious teams need AI coding assistance that runs on their own infrastructure. The good news: self-hosted options have gotten dramatically better.
Here is what actually works for running your own AI coding assistant.
Why Self-Host?
Three legitimate reasons to self-host AI coding tools:
- Compliance. HIPAA, SOC 2, FedRAMP, and similar frameworks may prohibit sending source code to third-party APIs. Self-hosted tools keep code on your infrastructure.
- IP protection. If your codebase is your competitive advantage, you may not want it processed by external AI providers, even with their data retention policies.
- Cost at scale. For large teams (50+ developers), self-hosted models can be cheaper than per-seat SaaS pricing. The math comes down to your GPU costs versus per-seat SaaS fees.
Note what is NOT on this list: "because cloud AI is bad." Cloud AI tools like Cursor and Copilot are genuinely better than most self-hosted alternatives. You are trading quality for control.
Option 1: Tabby
Tabby is the most production-ready self-hosted coding assistant. It provides:
- VS Code and JetBrains extensions
- Code completion (tab autocomplete)
- Chat interface
- Fine-tuning on your codebase
Setup
```sh
# Docker (requires an NVIDIA GPU)
docker run -it --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder-3B --device cuda
```
For a team, deploy on a GPU instance (AWS g5.xlarge or similar) and point everyone's extensions at the server URL.
Hardware Requirements
| Model | GPU VRAM | Quality |
|---|---|---|
| StarCoder-1B | 4GB | Basic completions |
| StarCoder-3B | 8GB | Good completions |
| StarCoder-7B | 16GB | Near-Copilot quality |
| CodeLlama-34B | 48GB (2x A6000) | Excellent quality |
For most teams, StarCoder-7B on a single A10G GPU ($1.50/hour on AWS) provides good-enough completions at reasonable cost.
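The "cost at scale" claim from earlier can be sanity-checked with quick arithmetic. A sketch, assuming an always-on A10G at $1.50/hour and a $20/seat/month cloud tool for comparison (an instance stopped outside working hours costs proportionally less):

```python
# Break-even point for one shared GPU server vs per-seat SaaS pricing.
# Assumed numbers: A10G at $1.50/hour on-demand, $20/seat/month SaaS.
gpu_hourly = 1.50
gpu_monthly = gpu_hourly * 24 * 30          # always-on instance, per month
seat_price = 20
break_even_seats = gpu_monthly / seat_price

print(f"GPU server: ${gpu_monthly:.0f}/mo")         # GPU server: $1080/mo
print(f"Break-even: {break_even_seats:.0f} seats")  # Break-even: 54 seats
```

One shared GPU serves the whole team, so past roughly 54 seats the always-on server undercuts per-seat pricing; below that, cloud tools win on cost alone, which is why the threshold in this article is "50+ developers."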
Fine-Tuning
Tabby supports fine-tuning on your codebase. This dramatically improves suggestion quality — the model learns your naming conventions, patterns, and internal APIs:
```sh
tabby fine-tune \
  --model StarCoder-3B \
  --data-dir /path/to/your/repos \
  --output /data/models/custom
```
Fine-tuning takes 2-4 hours on a single GPU and the quality improvement is noticeable, especially for internal framework usage.
Option 2: Ollama + Continue
This combo gives you the most flexibility:
Ollama runs LLMs locally with a simple CLI. Continue is an open-source VS Code extension that connects to any LLM endpoint.
```sh
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a coding model
ollama pull codellama:13b

# Or for better quality:
ollama pull deepseek-coder-v2:16b
```
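Once a model is pulled, you can also talk to Ollama's local HTTP API directly (it listens on port 11434 by default). A minimal standard-library sketch against the `/api/generate` endpoint; verify the fields against the Ollama version you install:

```python
import json
import urllib.request

def build_generate_payload(prompt, model="codellama:13b"):
    # stream=False makes Ollama return a single JSON object
    # instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt, model="codellama:13b", host="http://localhost:11434"):
    body = json.dumps(build_generate_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires the Ollama daemon to be running:
# print(ollama_generate("Write a Python function that reverses a string."))
```

This is the same endpoint the Continue extension talks to under the hood, which is why the two compose so cleanly.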
Then configure Continue in VS Code:
```json
{
  "models": [{
    "title": "Local CodeLlama",
    "provider": "ollama",
    "model": "codellama:13b"
  }],
  "tabAutocompleteModel": {
    "title": "Local Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b"
  }
}
```
This runs entirely on your machine. No server, no network, no data leaves your laptop.
Hardware for Local Development
For running models on a development laptop:
- Apple Silicon Mac (M2 Pro+, 32GB): Runs 13B models at usable speed
- NVIDIA RTX 4090 (24GB VRAM): Runs 13B-33B models well
- NVIDIA RTX 3090 (24GB VRAM): Budget option, runs 13B models
Anything below 16GB of unified memory or VRAM will struggle to run useful coding models.
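These cutoffs follow from a simple rule of thumb: weight memory is parameter count times bytes per weight, and the runtime needs extra headroom on top for the KV cache and activations. A rough sketch:

```python
def weight_gb(params_billions, bits_per_weight=16):
    # Memory for the model weights only; the runtime adds
    # KV-cache and activation overhead on top of this.
    return params_billions * bits_per_weight / 8

print(weight_gb(13))     # fp16: 26.0 GB -- too big for a 24GB card
print(weight_gb(13, 4))  # 4-bit quantized: 6.5 GB -- fits comfortably
```

This is why 13B models need quantization to run on 24GB cards, and why machines under 16GB get tight: even a quantized 13B model plus cache and overhead leaves little room to spare.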
Option 3: vLLM + Custom Setup
For teams that want maximum performance and control, vLLM provides a high-performance inference server:
```sh
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/deepseek-coder-v2-lite-instruct \
  --port 8000 \
  --tensor-parallel-size 2  # For multi-GPU
```
vLLM exposes an OpenAI-compatible API, meaning any tool that works with OpenAI (Continue, Aider, most AI coding tools) works with your self-hosted model.
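"OpenAI-compatible" means the standard /v1 endpoints and request shape. A minimal standard-library client sketch; the model string must match the --model flag passed to the server, and the host/port are whatever your deployment uses:

```python
import json
import urllib.request

def build_completion_request(prompt, model, max_tokens=64):
    # Wire format shared by OpenAI's completions API and vLLM's server.
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt, model, base="http://localhost:8000/v1"):
    body = json.dumps(build_completion_request(prompt, model)).encode()
    req = urllib.request.Request(
        f"{base}/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# Requires the vLLM server from above to be running:
# print(complete("def fibonacci(n):", "deepseek-ai/deepseek-coder-v2-lite-instruct"))
```

Swapping a tool from OpenAI to your self-hosted model is usually just changing the base URL in its settings, which is exactly what Continue and Aider support.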
Quality Comparison
Being honest about quality:
| Setup | Completion Quality | Chat Quality | Cost |
|---|---|---|---|
| GitHub Copilot (cloud) | 9/10 | 7/10 | $10/mo/user |
| Cursor (cloud) | 9/10 | 9/10 | $20/mo/user |
| Tabby + StarCoder-7B | 6/10 | 5/10 | ~$100/mo (GPU) |
| Ollama + DeepSeek-Coder | 7/10 | 6/10 | Hardware cost |
| Tabby + CodeLlama-34B | 7/10 | 7/10 | ~$250/mo (GPU) |
Self-hosted models are 2-3 quality points behind cloud tools. The gap is closing but it is real. You are paying a quality tax for data control.
When Self-Hosting Makes Sense
Yes, self-host if:
- Compliance requires it (no choice)
- Your team is 50+ developers (cost savings)
- You have existing GPU infrastructure
- Your codebase benefits heavily from fine-tuning
No, use cloud if:
- You are a small team with no compliance constraints
- Maximum code quality matters more than data control
- You do not want to maintain ML infrastructure
The pragmatic approach for most teams: use cloud AI tools with a clear data retention policy from the provider. Only self-host when there is a genuine requirement.