
KTransformers

Heterogeneous CPU and GPU inference framework for very large language models on limited hardware.

Open Source · Self Hosted · Offline Capable · GPU Required (24GB+ VRAM)


About

KTransformers tackles the problem of running very large language models on hardware that cannot hold them in GPU memory. Developed by Tsinghua University's MADSys Lab together with Approaching.AI, it splits inference across CPU and GPU. On the CPU side it uses optimized kernels with Intel AMX and AVX acceleration for INT4 and INT8 quantized weights, along with NUMA-aware memory management tuned for Mixture-of-Experts (MoE) architectures. Models such as DeepSeek-V3 and R1, Qwen3, Kimi, GLM, and MiniMax variants can be run, and even fine-tuned, on machines that pair a single consumer GPU with a capable CPU, while multi-GPU Xeon servers reach hundreds of tokens per second. The kt-kernel serving stack integrates with SGLang for production deployment, and fine-tuning hooks into LLaMA-Factory, with reported speedups of 6 to 12x over ZeRO-Offload on MoE workloads. The open-source project targets researchers and practitioners serving huge MoE models on constrained budgets.
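To make the CPU and GPU split concrete, here is a rough memory-budget sketch in Python. It is not KTransformers code and uses none of its APIs. The names Group, size_gb, and plan_placement are hypothetical, the parameter counts are made-up illustrative numbers, and the placement rule (dense weights stay on the GPU at 16 bits while they fit, routed MoE experts go to CPU memory at INT4) is an assumption about how such a split might be planned, not a description of the project's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """A hypothetical bundle of model weights (illustrative only)."""
    name: str
    params_billion: float   # parameter count, in billions
    is_routed_expert: bool  # sparse MoE expert weights; few are active per token


def size_gb(params_billion: float, bits: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return params_billion * bits / 8


def plan_placement(groups, vram_gb, gpu_bits=16, cpu_bits=4):
    """Assumed rule: dense groups go to the GPU while they fit, everything else to CPU."""
    placement, gpu_used, cpu_used = {}, 0.0, 0.0
    for g in groups:
        gpu_cost = size_gb(g.params_billion, gpu_bits)
        if not g.is_routed_expert and gpu_used + gpu_cost <= vram_gb:
            placement[g.name] = "gpu"
            gpu_used += gpu_cost
        else:
            placement[g.name] = "cpu"
            cpu_used += size_gb(g.params_billion, cpu_bits)
    return placement, gpu_used, cpu_used


# Made-up model shape: 8B dense parameters, 200B routed-expert parameters.
groups = [
    Group("attention_and_dense", 8.0, is_routed_expert=False),
    Group("routed_experts", 200.0, is_routed_expert=True),
]
placement, gpu_gb, cpu_gb = plan_placement(groups, vram_gb=24.0)
print(placement, gpu_gb, cpu_gb)
```

With these illustrative numbers the dense weights take 16 GB of a 24 GB card and the INT4 experts take 100 GB of system RAM, which is the general shape of the trade-off described above: a small GPU-resident portion plus a large quantized portion held in CPU memory.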