Fine-Tuning Llama 3.3 with Unsloth on a 16GB GPU, Step-by-Step
Fine-tuning used to mean either renting eight A100s for a week or giving up. The combination of LoRA, 4-bit quantization, and a few clever kernel tricks has changed that. Today you can take a Llama 3.3 base model, teach it your domain, export it to GGUF, and run inference on the same laptop, all with one open-source library doing the heavy lifting.
This post walks through that loop using Unsloth, with a quick look at Axolotl as the more configuration-driven alternative, and llama.cpp as the runtime you will probably ship to. If you have not done a fine-tune before, this is a workable path from zero to a working LoRA adapter.
Why Unsloth
Unsloth is an open-source framework for faster, more memory-efficient training of large language models. The README claims around 2x faster training with about 70 percent less VRAM versus stock approaches, and the project supports a long list of models including Gemma, Qwen, Llama, DeepSeek, Mistral, gpt-oss, and Phi. It supports LoRA, full fine-tuning, 4-bit quantization, GGUF export, and reinforcement learning methods like GRPO.
The relevant features for a 16 GB consumer GPU are LoRA fine-tuning, 4-bit quantization, and GGUF export. Together, those three are why this is plausible on a single mid-range card.
Step 1: Prepare your dataset
Unsloth, like most fine-tuning frameworks, eats JSONL. Each line is a JSON object with at least one field for the prompt and one for the expected completion. A common shape is the messages format:
{"messages": [{"role": "user", "content": "What is our PTO policy?"}, {"role": "assistant", "content": "All full-time employees get 20 days per year, accrued monthly."}]}
{"messages": [{"role": "user", "content": "Who do I contact for IT issues?"}, {"role": "assistant", "content": "Email it-help at example.com or open a ticket in the help portal."}]}
A few practical notes:
- Quality beats quantity. A few hundred clean examples beat ten thousand noisy ones.
- Diversity matters. Cover the question phrasings your users actually use.
- Keep formatting consistent. If half your responses end with a period and half do not, the model will learn the pattern, which usually is not what you want.
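Before training, it is worth checking the file mechanically. The following is a minimal sketch, not part of Unsloth, that checks each JSONL line against the messages shape shown above. The function name, the set of allowed roles, and the rule that a conversation should end with an assistant turn are all assumptions made for illustration.

```python
import json

# Assumed role names; adjust if your data uses others.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means every line passed."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        for msg in messages:
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                problems.append((number, "message has an unexpected role"))
                break
            if not isinstance(msg.get("content"), str) or not msg["content"].strip():
                problems.append((number, "message has empty content"))
                break
        else:
            # Only reached when every message passed the checks above.
            if messages[-1]["role"] != "assistant":
                problems.append((number, "last message is not from the assistant"))
    return problems

# Usage, with "train.jsonl" as a placeholder path:
# with open("train.jsonl", encoding="utf-8") as f:
#     for line_number, problem in validate_jsonl_lines(f):
#         print(f"line {line_number}: {problem}")
```

Catching a malformed line here is much cheaper than discovering it halfway through a training run.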
Step 2: Install and load the model
Unsloth installs cleanly with pip or uv:
pip install unsloth
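Then in Python, load a 4-bit quantized base model. The snippet uses Unsloth's pre-quantized Llama 3.1 8B checkpoint; substitute the model name for whichever checkpoint you are targeting, as long as it fits on your card: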
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
The load_in_4bit flag is what makes this fit. The model weights live in 4-bit precision in VRAM, while activations and gradients still flow in higher precision during training. It is a meaningful win on memory.
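To put a rough number on that win, here is a back-of-the-envelope sketch. It counts weight storage only and assumes a round 8 billion parameters. It ignores quantization metadata, any layers kept in higher precision, activations, and optimizer state, so treat the result as a lower bound, not a prediction of actual VRAM use.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

params_8b = 8e9  # round figure for an 8B-parameter model (assumption)
fp16_gib = weight_memory_gib(params_8b, 16)
int4_gib = weight_memory_gib(params_8b, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
```

In 16-bit the weights alone come to roughly 15 GiB, which leaves almost nothing on a 16 GB card for training. In 4-bit they take under 4 GiB, which is what leaves room for activations, adapters, and optimizer state.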
Step 3: Configure LoRA
LoRA freezes the original weights and trains a small number of low-rank adapter matrices instead. You will end up with an adapter that is typically tens to a few hundred megabytes, depending on rank and target modules, instead of a full multi-gigabyte checkpoint.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
Two knobs to know:
- r is the LoRA rank. Higher r means more capacity, more memory, and slower training. A starting value of 8 to 32 is common.
- target_modules controls which layers get adapters. Including all the projection matrices is the safe default.
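To see how r translates into adapter size, the sketch below counts trainable parameters. A rank-r adapter on a weight matrix with d_in inputs and d_out outputs adds r * (d_in + d_out) parameters. The layer dimensions are assumptions matching the published Llama 3.1 8B architecture (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); check your model's config before relying on them.

```python
def lora_params(d_in, d_out, r):
    """A rank-r adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Assumed dimensions for an 8B Llama-style model; verify against the model config.
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 1024, 14336, 32

# (d_in, d_out) for each targeted projection in one transformer layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def total_lora_params(r):
    """Trainable adapter parameters across all layers for a given rank."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in SHAPES.values())
    return per_layer * LAYERS

for r in (8, 16, 32):
    n = total_lora_params(r)
    print(f"r={r}: {n:,} trainable parameters, ~{n * 2 / 1e6:.0f} MB at 16-bit")
```

Under these assumptions, r=16 gives about 42 million trainable parameters, well under one percent of the base model. The count scales linearly with r, so doubling the rank doubles the adapter size.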
Step 4: Train
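The trainer needs a dataset with a single text column, while the JSONL from Step 1 stores structured messages. One way to bridge the two is to render each conversation through the tokenizer's chat template. The sketch below assumes the Hugging Face datasets library is installed and that the tokenizer already has a chat template. Base-model tokenizers do not always ship one, in which case Unsloth's chat template helpers are worth a look. The function names and the file path are placeholders.

```python
def format_example(example, tokenizer):
    """Render one {"messages": [...]} record into a single training string."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,               # return a string, not token ids
        add_generation_prompt=False,  # the assistant reply is already present
    )
    return {"text": text}

def build_dataset(path, tokenizer):
    """Load a JSONL file and add the "text" column the trainer reads."""
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import load_dataset
    raw = load_dataset("json", data_files=path, split="train")
    return raw.map(lambda ex: format_example(ex, tokenizer))

# Usage, with "train.jsonl" as a placeholder path:
# dataset = build_dataset("train.jsonl", tokenizer)
```

Printing one or two rendered examples before training is a quick way to confirm the special tokens look the way you expect.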
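Use SFTTrainer from Hugging Face's TRL library to run the training loop. The call assumes dataset is a Hugging Face Dataset with a "text" column that holds each conversation rendered as a single string: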
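from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # must expose the "text" column named below
    # Newer TRL releases may expect the next two arguments on SFTConfig instead.
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        # Use bf16 where the GPU supports it, otherwise fall back to fp16.
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
trainer.train()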
If you run out of memory, lower per_device_train_batch_size and raise gradient_accumulation_steps. The effective batch size is the product of the two, so you can keep the same training dynamics on less memory.
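As a worked example of that trade, the helper below keeps the effective batch size fixed while shrinking the per-device batch. It is a small illustrative function, not part of any library, and it assumes a single GPU.

```python
def rebalance(per_device, accumulation, new_per_device):
    """Return (new_per_device, new_accumulation) with the same effective batch size."""
    effective = per_device * accumulation
    if new_per_device <= 0 or effective % new_per_device != 0:
        raise ValueError(
            f"per-device batch {new_per_device} does not evenly divide "
            f"the effective batch size {effective}"
        )
    return new_per_device, effective // new_per_device

# The settings above give an effective batch size of 2 * 4 = 8.
# Halving the per-device batch means doubling the accumulation steps.
print(rebalance(2, 4, 1))
```

Accumulation does not make training faster. Each optimizer step still processes the same number of examples, just in smaller slices that fit in memory.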
Step 5: Export to GGUF for llama.cpp
Once training is done, you can save the merged model in GGUF, the format llama.cpp uses for inference. Unsloth's README documents GGUF export as a first-class feature.
model.save_pretrained_gguf(
    "outputs/llama3-finetuned",
    tokenizer,
    quantization_method="q4_k_m",
)
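q4_k_m is a popular 4-bit quantization that balances size and quality. From there you can run with llama.cpp. Check the output directory for the exact .gguf file name, which can differ from the one shown here depending on your Unsloth version: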
./llama-cli -m outputs/llama3-finetuned/llama3-finetuned.gguf -p "What is our PTO policy?"
llama.cpp supports a wide spectrum of quantization options from 1.5-bit through 8-bit, plus a long list of hardware backends including CUDA, Metal, Vulkan, and SYCL.
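Because the exported file name is easy to get wrong, a small helper can locate it and assemble the llama-cli invocation. This is a sketch using only the standard library. The function names are made up for illustration, and the default binary path assumes llama-cli sits in the current directory.

```python
from pathlib import Path

def find_gguf_files(output_dir):
    """Return all .gguf files under output_dir, newest first."""
    files = Path(output_dir).rglob("*.gguf")
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)

def llama_cli_command(model_path, prompt, binary="./llama-cli"):
    """Build the argument list for the llama-cli call shown above."""
    return [binary, "-m", str(model_path), "-p", prompt]

# Usage, assuming the export directory from Step 5:
# import subprocess
# candidates = find_gguf_files("outputs/llama3-finetuned")
# if candidates:
#     subprocess.run(llama_cli_command(candidates[0], "What is our PTO policy?"), check=True)
```

Passing the command as a list avoids shell quoting problems when the prompt contains spaces or punctuation.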
When to reach for Axolotl instead
Unsloth is fast and friendly. Axolotl is more configurable. If you want to fine-tune with a YAML config that drives the entire pipeline, want preference tuning methods like DPO, KTO, or ORPO, or want full multi-GPU FSDP and DeepSpeed support, Axolotl is the tool. The same dataset usually works in both. Our best AI tools for Python developers post covers more of the surrounding ecosystem.
You can read the projects on GitHub: Unsloth, llama.cpp, and Axolotl.