Tags: self-hosted · privacy · ai-coding · open-source

Self-Hosted AI Coding Tools: Run Your Own Copilot

Billy C

Not everyone can send their code to OpenAI or Anthropic. Regulated industries, government contractors, and security-conscious teams need AI coding assistance that runs on their own infrastructure. The good news: self-hosted options have gotten dramatically better.

Here is what actually works for running your own AI coding assistant.

Why Self-Host?

Three legitimate reasons to self-host AI coding tools:

  1. Compliance. HIPAA, SOC 2, FedRAMP, and similar frameworks may prohibit sending source code to third-party APIs. Self-hosted tools keep code on your infrastructure.

  2. IP protection. If your codebase is your competitive advantage, you may not want it processed by external AI providers — even with their data retention policies.

  3. Cost at scale. For large teams (50+ developers), self-hosted models can be cheaper than per-seat SaaS pricing. The math depends on your GPU costs versus SaaS costs.
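The break-even arithmetic in point 3 can be sketched with illustrative numbers. The per-seat price and GPU rate below are assumptions for the sake of the example, not vendor quotes:

```python
# Rough break-even sketch: one shared self-hosted GPU vs per-seat SaaS pricing.
# All figures are illustrative assumptions, not vendor quotes.

SAAS_PER_SEAT_MONTHLY = 20.0   # assumed Cursor-style per-seat plan
GPU_MONTHLY = 1.50 * 24 * 30   # one A10G-class instance running 24/7

def breakeven_team_size(gpu_monthly: float, seat_monthly: float) -> int:
    """Smallest team size at which one shared GPU beats per-seat SaaS."""
    seats = 1
    while seats * seat_monthly < gpu_monthly:
        seats += 1
    return seats

print(GPU_MONTHLY)                                               # 1080.0
print(breakeven_team_size(GPU_MONTHLY, SAAS_PER_SEAT_MONTHLY))   # 54
```

At these assumed rates the break-even lands near the 50-developer mark; reserved pricing or stopping the instance outside business hours shifts it lower.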

Note what is NOT on this list: "because cloud AI is bad." Cloud AI tools like Cursor and Copilot are genuinely better than most self-hosted alternatives. You are trading quality for control.

Option 1: Tabby

Tabby is the most production-ready self-hosted coding assistant. It provides:

  • VS Code and JetBrains extensions
  • Code completion (tab autocomplete)
  • Chat interface
  • Fine-tuning on your codebase

Setup

# Docker (requires NVIDIA GPU)
docker run -it --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder-3B --device cuda

For a team, deploy on a GPU instance (AWS g5.xlarge or similar) and point everyone's extensions at the server URL.

Hardware Requirements

Model          | GPU VRAM        | Quality
StarCoder-1B   | 4GB             | Basic completions
StarCoder-3B   | 8GB             | Good completions
StarCoder-7B   | 16GB            | Near-Copilot quality
CodeLlama-34B  | 48GB (2x A6000) | Excellent quality

For most teams, StarCoder-7B on a single A10G GPU ($1.50/hour on AWS) provides good-enough completions at reasonable cost.

Fine-Tuning

Tabby supports fine-tuning on your codebase. This dramatically improves suggestion quality — the model learns your naming conventions, patterns, and internal APIs:

tabby fine-tune \
  --model StarCoder-3B \
  --data-dir /path/to/your/repos \
  --output /data/models/custom

Fine-tuning takes 2-4 hours on a single GPU and the quality improvement is noticeable, especially for internal framework usage.

Option 2: Ollama + Continue

This combo gives you the most flexibility:

Ollama runs LLMs locally with a simple CLI. Continue is an open-source VS Code extension that connects to any LLM endpoint.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a coding model
ollama pull codellama:13b
# Or for better quality:
ollama pull deepseek-coder-v2:16b

Then configure Continue in VS Code:

{
  "models": [{
    "title": "Local CodeLlama",
    "provider": "ollama",
    "model": "codellama:13b"
  }],
  "tabAutocompleteModel": {
    "title": "Local Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b"
  }
}

This runs entirely on your machine. No server, no network, no data leaves your laptop.
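Under the hood, Continue talks to Ollama's local HTTP API, which listens on port 11434 by default. You can hit the same endpoint directly. A minimal stdlib-only sketch (the prompt is illustrative):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a request for Ollama's /api/generate endpoint (default port 11434)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request(
    "codellama:13b",
    "Write a Python function that reverses a string.",
)
# With the Ollama daemon running:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["response"])
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the sketch simple.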

Hardware for Local Development

For running models on a development laptop:

  • Apple Silicon Mac (M2 Pro+, 32GB): Runs 13B models at usable speed
  • NVIDIA RTX 4090 (24GB VRAM): Runs 13B-33B models well
  • NVIDIA RTX 3090 (24GB VRAM): Budget option, runs 13B models

Anything below 16GB of unified memory or VRAM will struggle to run useful coding models.
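A rough rule of thumb behind these hardware numbers: the weights alone take roughly parameter count times bytes per parameter, before any KV cache or runtime overhead. A sketch:

```python
def approx_weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone (no KV cache or runtime overhead)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

# fp16 (16 bits per parameter): why 13B is tight on a 24GB card
print(approx_weight_gb(13, 16))  # 26.0
# 4-bit quantization (Ollama's default quantized models): fits in 16GB with room to spare
print(approx_weight_gb(13, 4))   # 6.5
```

This is why quantized models are the norm for laptop inference: a 4-bit 13B model fits where the fp16 version does not.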

Option 3: vLLM + Custom Setup

For teams that want maximum performance and control, vLLM provides a high-performance inference server:

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --port 8000 \
  --tensor-parallel-size 2  # For multi-GPU

vLLM exposes an OpenAI-compatible API, meaning any tool that works with OpenAI (Continue, Aider, most AI coding tools) works with your self-hosted model.
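Because the API is OpenAI-compatible, a standard chat-completions request works against it. A stdlib-only sketch (the base URL and prompt are illustrative; the model name matches the one served above):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, user_msg: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/chat/completions request for a vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8000",
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    "Explain what this regex matches: ^\\d{3}-\\d{4}$",
)
# With the vLLM server running:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

Tools like Continue and Aider accept a custom base URL, so pointing them at your server is usually a one-line config change.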

Quality Comparison

Being honest about quality:

Setup                    | Completion Quality | Chat Quality | Cost
GitHub Copilot (cloud)   | 9/10               | 7/10         | $10/mo/user
Cursor (cloud)           | 9/10               | 9/10         | $20/mo/user
Tabby + StarCoder-7B     | 6/10               | 5/10         | ~$100/mo (GPU)
Ollama + DeepSeek-Coder  | 7/10               | 6/10         | Hardware cost
Tabby + CodeLlama-34B    | 7/10               | 7/10         | ~$250/mo (GPU)

Self-hosted models are 2-3 quality points behind cloud tools. The gap is closing but it is real. You are paying a quality tax for data control.

When Self-Hosting Makes Sense

Yes, self-host if:

  • Compliance requires it (no choice)
  • Your team is 50+ developers (cost savings)
  • You have existing GPU infrastructure
  • Your codebase benefits heavily from fine-tuning

No, use cloud if:

  • You are a small team with no compliance constraints
  • Maximum code quality matters more than data control
  • You do not want to maintain ML infrastructure

The pragmatic approach for most teams: use cloud AI tools with a clear data retention policy from the provider. Only self-host when there is a genuine requirement.
