Tools/Large Language Models (LLMs)/Pixtral

Pixtral

Multimodal vision-language model by Mistral AI for image understanding.

Open SourceSelf HostedOffline CapableGPU Required (12GB+ VRAM)

0.0 (0)

About

Mistral AI's first natively multimodal open-weight model, Pixtral 12B pairs a 12 billion parameter language decoder with a 400 million parameter vision encoder trained from scratch on interleaved image and text data. It ingests images at variable sizes alongside text, handles multiple images in a single conversation, and supports a 128,000 token context window. The model targets document and chart understanding, visual question answering, and multimodal reasoning, scoring 52.5 on MMMU and 58.0 on MathVista while keeping strong text-only performance, including 69.2 percent on MMLU and 72 percent on HumanEval. Weights are published on Hugging Face under the Apache 2.0 license, and Mistral recommends serving with vLLM in production, which supports multi-image, multi-turn usage; the mistral-inference library works for local experimentation. The model ships without built-in moderation, so deployers add their own safeguards. It suits teams wanting a permissively licensed vision-language model they can self-host.