Tags: vllm, qwen3, self-hosting, llm-inference, local-llm

Running Qwen3 Locally with vLLM on a Single 4090, Setup and Notes

Billy C

This is a setup walkthrough for serving a Qwen3 model on a single consumer GPU using vLLM. I am writing this from notes I took while wiring this up on a workstation with a 4090 and 64 GB of system RAM. The goal is a local, OpenAI-compatible endpoint I can point editor extensions and tools at.

I will not be quoting tokens-per-second figures. Hardware, drivers, prompt length, and batch size move those numbers around enough that any number I post would be misleading. I will speak in qualitative terms: "interactive on a single user," "fine for batch use," "uncomfortably slow."

For a broader survey of self-hosted setups, see Self-Hosted AI Coding Tools.

Picking a Qwen3 size that fits

The Qwen3 series, per Alibaba's Qwen3 repository, ships in dense sizes of 0.6B, 1.7B, 4B, 8B, 14B, and 32B, plus MoE variants at 30B-A3B and 235B-A22B. The later 2507 release line adds updated 4B, 30B-A3B, and 235B-A22B variants.

On a single 24 GB GPU, the realistic options are:

  • 0.6B, 1.7B, 4B, 8B in BF16 or FP16. These fit comfortably and leave room for KV cache.
  • 14B in BF16 does not fit. At two bytes per parameter the weights alone come to roughly 28 GB, which exceeds the card's 24 GB before any KV cache is allocated.
  • 14B in 4-bit (AWQ or GPTQ) fits cleanly.
  • 32B in 4-bit fits with a tight context budget. Practical, but you give up some KV cache headroom.
  • 30B-A3B MoE in 4-bit can fit but you should plan capacity carefully and test.
  • 235B-A22B is not realistic on a single 4090.
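
The list above mostly falls out of simple arithmetic on weight size. Here is a rough sketch of that arithmetic. It uses the nominal parameter counts from the model names (the real counts differ slightly) and ignores quantization metadata, activations, and KV cache, so treat the results as lower bounds. The function name and the 24 GiB budget are illustrative assumptions.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / GIB


VRAM_GIB = 24  # assumed budget for a single 4090

# (label, nominal parameters in billions, bits per parameter)
candidates = [
    ("8B BF16", 8, 16),
    ("14B BF16", 14, 16),
    ("14B 4-bit", 14, 4),
    ("32B 4-bit", 32, 4),
]

for label, params, bits in candidates:
    size = weight_gib(params, bits)
    verdict = "weights fit" if size < VRAM_GIB else "weights alone exceed VRAM"
    print(f"{label:>10}: {size:5.1f} GiB  ({verdict})")
```

Whatever is left after the weights is what vLLM can use for KV cache, which is why 32B in 4-bit fits but leaves a tight context budget.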

For a coding assistant where I want fast turnaround for a single user, I have been using Qwen3 8B or 14B. For a general Q-and-A endpoint where I want stronger reasoning, I switch to 32B in 4-bit and accept a smaller context window.

Quantization options

Qwen3 weights are released under Apache-2.0. The Qwen team and the community publish:

  • BF16/FP16 base weights from the official Qwen Hugging Face org.
  • Community AWQ and GPTQ 4-bit quants for most sizes.
  • GGUF quants for use with llama.cpp.

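For vLLM specifically, AWQ and GPTQ are the practical 4-bit options at the moment. vLLM also supports FP8, and the 4090 can run it: Ada Lovelace (compute capability 8.9) has FP8 tensor cores, as do Hopper-class cards. FP8 still costs one byte per parameter, though. That makes 14B workable at roughly 14 GB of weights, but 32B at about 32 GB does not fit, so 4-bit weight-only remains the typical choice for the larger sizes.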

Installing vLLM

vLLM is normally installed in a fresh virtual environment. I use uv but plain pip works fine.

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install vllm

Make sure your CUDA driver is recent enough for the vLLM build you installed. The vLLM project documents supported CUDA versions for each release; check the upstream notes if you hit a CUDA library mismatch at startup.
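
Before comparing against the upstream notes, it helps to know what the machine actually reports. A minimal sketch that shells out to nvidia-smi and parses its CSV output; the helper names are my own, and it assumes nvidia-smi is on PATH.

```python
import subprocess

# Query fields supported by nvidia-smi's --query-gpu option.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_line(line: str) -> dict:
    """Split one CSV row into name, driver version, and total memory."""
    # rsplit keeps the GPU name intact even if it ever contains a comma.
    name, driver, memory = [field.strip() for field in line.rsplit(",", 2)]
    return {"name": name, "driver": driver, "memory": memory}


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]
```

Calling query_gpus() on the workstation returns the driver version to check against the vLLM release notes.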

Starting the OpenAI-compatible server

vLLM ships an OpenAI-compatible HTTP server. The basic command for an 8B model in BF16 looks like this:

vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --dtype bfloat16

A few flags worth knowing:

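  • --max-model-len controls the maximum context window per request. vLLM sizes its KV cache pool from the VRAM left over after the weights are loaded, so a larger value lets a single request occupy more of that pool and leaves less for concurrent requests. If the pool cannot hold even one full-length sequence, vLLM refuses to start. On a single GPU, set it to the smallest value that covers your real prompts.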
  • --gpu-memory-utilization is the fraction of VRAM vLLM is allowed to use. 0.90 is a reasonable default for a dedicated machine. Lower it if you also want to run a desktop session on the same GPU.
  • --dtype bfloat16 is fine on the 4090. Ada Lovelace supports BF16 natively.
  • --tensor-parallel-size 1 is the default and what you want on a single GPU.
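
To pick a sensible --max-model-len, it helps to estimate how much KV cache one full-length sequence needs. The sketch below uses the standard per-token formula for grouped-query attention. The layer count, KV head count, and head dimension are what I believe Qwen3-8B uses, but treat them as assumptions and check num_hidden_layers, num_key_value_heads, and head_dim in the model's config.json.

```python
GIB = 2**30  # bytes per GiB


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache stored for one token across all layers."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_gib(tokens: int, layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> float:
    """KV cache needed to hold `tokens` tokens, in GiB."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes) / GIB


# Assumed Qwen3-8B shape (verify against config.json), BF16 cache.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

print(kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM))  # 147456 bytes, i.e. 144 KiB per token
print(kv_gib(16384, LAYERS, KV_HEADS, HEAD_DIM))       # 2.25 GiB for one 16384-token sequence
```

Under those assumptions, one 16384-token request needs about 2.25 GiB of cache, which sits comfortably in what an 8B BF16 model leaves free on a 24 GB card.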

For a 4-bit AWQ build:

vllm serve Qwen/Qwen3-14B-AWQ \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --quantization awq

The exact Hugging Face repo names will depend on which community 4-bit build you trust. Always check that the publisher is a name you recognize, and prefer the official Qwen org when an official quant exists.

Talking to the endpoint

Once vLLM is up, it exposes the OpenAI-compatible API at /v1. Any client that already speaks OpenAI works against it. The simplest curl test:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Write a haiku about kv cache."}],
    "temperature": 0.2
  }'
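
The same request can be sent from Python with nothing but the standard library. This is a sketch that mirrors the curl call above; the function names are my own, and the base URL and model name assume the 8B server command from earlier.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the server started above


def build_chat_request(prompt: str, model: str = "Qwen/Qwen3-8B",
                       temperature: float = 0.2,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /chat/completions without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, chat("Write a haiku about kv cache.") returns the model's reply as a string.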

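For Python, the official openai SDK works if you set base_url to http://localhost:8000/v1. vLLM does not check the API key unless you start it with --api-key, so any placeholder string will do. Aider and similar tools need the same two settings: the base URL and the model name.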
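
Clients need the exact model name the server registered, and a typo there is the most common reason a tool fails to connect. One way to look the name up is the /v1/models endpoint. A small sketch follows; the helper names are hypothetical and the base URL assumes the server from above.

```python
import json
import urllib.request


def model_ids(models_payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Ask the running server which model names it serves."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))
```

Whatever fetch_model_ids() returns is the string to put in the "model" field of requests and in your tool's configuration.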

Qualitative experience on a single 4090

Some honest observations from running this for a few weeks:

  • Qwen3 8B in BF16 is interactive at moderate context lengths. Single-turn coding questions feel snappy. Long conversations with full context start to feel slower simply because attention scales with sequence length, not because vLLM is doing anything wrong.
  • Qwen3 14B in AWQ is the model I default to for harder coding questions. Single-stream latency is noticeably higher than 8B, but still fine for an interactive editor session.
  • Qwen3 32B in 4-bit feels like a different tool. It is more capable on reasoning-heavy prompts. For one user it is usable. For a small team hitting it concurrently, you would want better hardware or a smaller model.
  • Thinking mode produces longer outputs. If you enable it, expect longer end-to-end response times on every request that triggers reasoning.
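
If the extra latency from thinking mode is not worth it for a given request, it can usually be switched off per request. The sketch below rests on two assumptions: that vLLM forwards a chat_template_kwargs field to the chat template, and that the Qwen3 template reads enable_thinking from it. Both match my understanding of the Qwen documentation, but some variants may not honor the switch. The helper names are my own.

```python
import re


def chat_payload(prompt: str, model: str = "Qwen/Qwen3-8B", thinking: bool = True) -> dict:
    """Build a chat completion body with thinking explicitly on or off."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: vLLM passes this through to the Qwen3 chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Separate a leading <think>...</think> block from the final answer."""
    match = THINK_BLOCK.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
    return reasoning, answer
```

split_thinking is useful when the reasoning arrives inline in the message content and you only want to show the answer in an editor.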

A note on alternative engines

If 4-bit GGUF is your preferred format, llama.cpp is the canonical engine and Qwen3 is supported there too. If you want a vLLM-derived engine with broader sampler and quantization options, Aphrodite Engine is a fork built on vLLM's PagedAttention with a wider list of supported quants. Aphrodite is AGPL-3.0, so check the license against your use case.

Wrapping up

A single 4090 will not host a frontier MoE, but it will host a very competent Qwen3 dense model with vLLM and a 4-bit quant. The path is short: install vLLM, pick a size that fits, choose your quant, point your tools at http://localhost:8000/v1. From there it is the same code you would write against any cloud provider, just running on your desk.

Tools mentioned in this post

  • vLLM: Apache-2.0 inference and serving engine with PagedAttention, continuous batching, and an OpenAI-compatible server.
  • Qwen3: Alibaba's Apache-2.0 model series with dense and MoE variants and a thinking mode for harder reasoning.
  • Aphrodite Engine: AGPL-3.0 vLLM-derived engine with extended quantization and sampler support.
