Running Qwen3 Locally with vLLM on a Single 4090, Setup and Notes
This is a setup walkthrough for serving a Qwen3 model on a single consumer GPU using vLLM. I am writing this from notes I took while wiring this up on a workstation with a 4090 and 64 GB of system RAM. The goal is a local, OpenAI-compatible endpoint I can point editor extensions and tools at.
I will not be quoting tokens-per-second figures. Hardware, drivers, prompt length, and batch size move those numbers around enough that any number I post would be misleading. I will speak in qualitative terms: "interactive on a single user," "fine for batch use," "uncomfortably slow."
If you want a broader survey of self-hosted setups, see Self-Hosted AI Coding Tools.
Picking a Qwen3 size that fits
The Qwen3 series, per Alibaba's Qwen3 repository, ships in dense sizes of 0.6B, 1.7B, 4B, 8B, 14B, and 32B, plus MoE variants at 30B-A3B and 235B-A22B. The 2025-era release line includes additional 4B, 30B-A3B, and 235B-A22B variants.
On a single 24 GB GPU, the realistic options are:
- 0.6B, 1.7B, 4B, 8B in BF16 or FP16. These fit comfortably and leave room for KV cache.
- 14B in BF16 does not fit without offloading. The weights alone come to roughly 28 GB, which is more than the card's 24 GB before any KV cache is allocated.
- 14B in 4-bit (AWQ or GPTQ) fits cleanly.
- 32B in 4-bit fits with a tight context budget. Practical, but you give up some KV cache headroom.
- 30B-A3B MoE in 4-bit can fit but you should plan capacity carefully and test.
- 235B-A22B is not realistic on a single 4090.
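The arithmetic behind this list is simple enough to sketch. The snippet below is a rough estimate, not a measurement: it uses the nominal parameter counts from the model names, counts weights only, and ignores quantization scales, activations, and KV cache. The function names and the 0.90 budget are my own choices for illustration.

```python
# Rough weight-memory estimate: parameters x bits per weight.
# Parameter counts are nominal (taken from the model names), not exact,
# and 4-bit formats carry extra scale/zero-point data that is ignored here.
GIB = 1024**3


def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / GIB


def fits(params_billions: float, bits_per_weight: float,
         vram_gib: float = 24.0, utilization: float = 0.90) -> bool:
    """True if the weights alone fit in the share of VRAM given to the server."""
    return weight_gib(params_billions, bits_per_weight) < vram_gib * utilization


if __name__ == "__main__":
    candidates = [
        ("8B BF16", 8, 16),
        ("14B BF16", 14, 16),
        ("14B 4-bit", 14, 4),
        ("32B 4-bit", 32, 4),
    ]
    for name, params, bits in candidates:
        size = weight_gib(params, bits)
        verdict = "fits" if fits(params, bits) else "does not fit"
        print(f"{name}: ~{size:.1f} GiB of weights, {verdict} in the budget")
```

Whatever is left of the budget after the weights is what the KV cache gets, which is why the larger 4-bit models come with a tighter context window.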
For a coding assistant where I want fast turnaround on a single user, I have been using Qwen3 8B or 14B. For a general Q-and-A endpoint where I want stronger reasoning, I switch to 32B in 4-bit and accept a smaller context window.
Quantization options
Qwen3 weights are released under Apache-2.0. The Qwen team and the community publish:
- BF16/FP16 base weights from the official Qwen Hugging Face org.
- Community AWQ and GPTQ 4-bit quants for most sizes.
- GGUF quants for use with llama.cpp.
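For vLLM specifically, AWQ and GPTQ are the practical 4-bit options at the moment. vLLM also supports FP8 on GPUs with compute capability 8.9 or newer, which covers the Ada Lovelace 4090 as well as Hopper-class cards. FP8 only halves the weight footprint relative to BF16, though, so for the larger sizes on 24 GB, 4-bit weight-only is still the typical fit.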
Installing vLLM
vLLM is normally installed in a fresh virtual environment. I use uv but plain pip works fine.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install vllm
Make sure your CUDA driver is recent enough for the vLLM build you installed. The vLLM project documents supported CUDA versions for each release; check the upstream notes if you hit a CUDA library mismatch at startup.
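Before comparing against the upstream notes, it helps to know what the machine actually reports. The following is a small sketch that shells out to nvidia-smi and parses its CSV output; the helper names are hypothetical, and it returns an empty list if nvidia-smi is not on the PATH.

```python
import shutil
import subprocess

# Standard nvidia-smi query fields; output is one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_line(line: str) -> dict:
    """Split one 'name, memory, driver' CSV line into a dict."""
    name, memory, driver = [field.strip() for field in line.split(",")]
    return {"name": name, "memory_total": memory, "driver_version": driver}


def query_gpus() -> list:
    """Return one dict per GPU, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

The driver version it prints is the number to check against the CUDA requirements of the vLLM release you installed.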
Starting the OpenAI-compatible server
vLLM ships an OpenAI-compatible HTTP server. The basic command for an 8B model in BF16 looks like this:
vllm serve Qwen/Qwen3-8B \
    --port 8000 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90 \
    --dtype bfloat16
A few flags worth knowing:
- --max-model-len controls the maximum context window. Larger values let a single request claim more of the KV cache and reduce headroom for batched users. On a single GPU, set it to the smallest value that covers your real prompts.
- --gpu-memory-utilization is the fraction of VRAM vLLM is allowed to use. 0.90 is a reasonable default for a dedicated machine. Lower it if you also want to run a desktop session on the same GPU.
- --dtype bfloat16 is fine for any modern Ada Lovelace card.
- --tensor-parallel-size 1 is the default and what you want on a single GPU.
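To get a feel for how context length translates into KV cache, here is a back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are assumptions for an 8B-class model with grouped-query attention; read the real values from the model's config.json before trusting the result.

```python
GIB = 1024**3


def kv_cache_gib(tokens: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, dtype_bytes: int = 2) -> float:
    """KV cache needed to hold `tokens` tokens, in GiB.

    Each token stores one key and one value vector per layer per KV head,
    hence the leading factor of 2. dtype_bytes=2 corresponds to BF16/FP16.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return tokens * bytes_per_token / GIB


# ASSUMED architecture values for illustration only; check config.json.
ASSUMED_CONFIG = {"num_layers": 36, "num_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for context in (8192, 16384, 32768):
        size = kv_cache_gib(context, **ASSUMED_CONFIG)
        print(f"{context} tokens: ~{size:.2f} GiB of KV cache for one full sequence")
```

Under those assumptions a single full 16384-token sequence needs about 2.25 GiB, and the requirement doubles each time the context doubles.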
For a 4-bit AWQ build:
vllm serve Qwen/Qwen3-14B-AWQ \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.92 \
    --quantization awq
The exact Hugging Face repo names will depend on which community 4-bit build you trust. Always check that the publisher is a name you recognize, and prefer the official Qwen org when an official quant exists.
Talking to the endpoint
Once vLLM is up, it exposes the OpenAI-compatible API at /v1. Any client that already speaks OpenAI works against it. The simplest curl test:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Write a haiku about kv cache."}],
        "temperature": 0.2
    }'
For Python, the official openai SDK works if you set base_url to your local server. For an editor like Aider, you set the base URL and the model name and it just works.
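A minimal sketch of the Python side, assuming the server from the earlier command is running on localhost:8000 and the openai package is installed. The helper names are mine. The API key is a placeholder: the SDK insists on one, and the server only checks it if you started it with a key configured.

```python
def client_settings(host: str = "localhost", port: int = 8000) -> dict:
    """Connection settings for a local OpenAI-compatible server."""
    return {
        "base_url": f"http://{host}:{port}/v1",
        "api_key": "EMPTY",  # placeholder; the SDK requires some value
    }


def ask(prompt: str, model: str = "Qwen/Qwen3-8B") -> str:
    """Send one user message and return the assistant's reply text."""
    # Imported here so the settings helper works without the SDK installed.
    from openai import OpenAI  # pip install openai

    client = OpenAI(**client_settings())
    response = client.chat.completions.create(
        model=model,  # must match the model name the server was started with
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask("Write a haiku about kv cache."))
```

The only local-specific parts are the base URL and the model name; everything else is the same call you would make against a hosted provider.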
Qualitative experience on a single 4090
Some honest observations from running this for a few weeks:
- Qwen3 8B in BF16 is interactive at moderate context lengths. Single-turn coding questions feel snappy. Long conversations with full context start to feel slower simply because attention scales with sequence length, not because vLLM is doing anything wrong.
- Qwen3 14B in AWQ is the model I default to for harder coding questions. Single-stream latency is noticeably higher than 8B, but still fine for an interactive editor session.
- Qwen3 32B in 4-bit feels like a different tool. It is more capable on reasoning-heavy prompts. For one user it is usable. For a small team hitting it concurrently, you would want better hardware or a smaller model.
- Thinking mode produces longer outputs. If you enable it, expect longer end-to-end response times on every request that triggers reasoning.
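If the extra latency from thinking mode is a problem for some requests, it can be switched per request. The sketch below builds a chat payload with the chat_template_kwargs field that, as far as I can tell from the Qwen and vLLM documentation, the server passes through to the chat template. Treat the enable_thinking switch as applying to the hybrid Qwen3 checkpoints, and the helper name and max_tokens default as my own.

```python
import json


def chat_payload(prompt: str, model: str = "Qwen/Qwen3-8B",
                 thinking: bool = True, max_tokens: int = 1024) -> dict:
    """Build a /v1/chat/completions request body with thinking on or off."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # caps total output, reasoning included
        # Forwarded to the model's chat template by the server.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


if __name__ == "__main__":
    # Print the JSON body; it can be sent with curl -d or any HTTP client.
    print(json.dumps(chat_payload("Summarize PagedAttention.", thinking=False), indent=2))
```

With the openai SDK, the same field goes in the extra_body argument of the create call rather than at the top level of your own dict.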
A note on alternative engines
If 4-bit GGUF is your preferred format, llama.cpp is the canonical engine and Qwen3 is supported there too. If you want a vLLM-derived engine with broader sampler and quantization options, Aphrodite Engine is a fork built on vLLM's PagedAttention with a wider list of supported quants. Aphrodite is AGPL-3.0, so check the license against your use case.
Wrapping up
A single 4090 will not host a frontier MoE, but it will host a very competent Qwen3 dense model with vLLM and a 4-bit quant. The path is short: install vLLM, pick a size that fits, choose your quant, point your tools at http://localhost:8000/v1. From there it is the same code you would write against any cloud provider, just running on your desk.
Tools mentioned in this post
- vLLM: Apache-2.0 inference and serving engine with PagedAttention, continuous batching, and an OpenAI-compatible server.
- Qwen3: Alibaba's Apache-2.0 model series with dense and MoE variants and a thinking mode for harder reasoning.
- Aphrodite Engine: AGPL-3.0 vLLM-derived engine with extended quantization and sampler support.