Mistral, a French AI startup, has released Pixtral 12B, its first model that can handle both images and text.
Pixtral 12B is based on Mistral NeMo 12B, a text model developed by Mistral. The new model adds a 400-million-parameter vision adapter, letting users input images alongside text for tasks such as image captioning, counting objects in an image, and image classification, similar to other multimodal models like Anthropic’s Claude and OpenAI’s GPT-4. Images can be provided either through URLs or encoded in base64.
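To make the two input paths concrete, here is a minimal sketch of what a request might look like. This is not official Mistral sample code: the model identifier is hypothetical, and the endpoint and message format are assumptions based on Mistral’s existing chat-completions API, which accepts mixed text and image chunks in a message’s content list.

```python
import base64
import requests

API_URL = "https://api.mistral.ai/v1/chat/completions"  # assumed endpoint
API_KEY = "YOUR_API_KEY"

def encode_image(path: str) -> str:
    """Read a local image file and return it as a base64 data URI."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{data}"

# The content list mixes text and image chunks; the image value can be a
# remote URL or the base64 data URI produced above.
payload = {
    "model": "pixtral-12b",  # hypothetical model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": encode_image("photo.jpg")},
            ],
        }
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
print(response.json()["choices"][0]["message"]["content"])
```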
When processing images, Pixtral 12B divides them into 16 x 16 pixel patches, which lets it handle high-resolution images more effectively. The vision encoder uses 2D RoPE (rotary position embeddings), allowing the model to better capture spatial relationships within an image.
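The patching step can be illustrated with a short sketch. The snippet below (a minimal illustration assuming a simple non-overlapping grid; the model’s actual preprocessing may differ) splits an image into 16 x 16 patches and keeps a 2D (row, column) coordinate for each, the kind of position a 2D rotary embedding would encode:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles.

    Returns the flattened patches plus their (row, col) grid coordinates,
    which a 2D RoPE scheme would use to encode spatial position.
    """
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    # Crop to a multiple of the patch size, then reshape into a grid of tiles.
    tiles = image[: rows * patch, : cols * patch].reshape(
        rows, patch, cols, patch, c
    ).transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, C)
    patches = tiles.reshape(rows * cols, patch * patch * c)
    # Each patch keeps its 2D coordinate instead of a single flattened index.
    coords = [(r, col) for r in range(rows) for col in range(cols)]
    return patches, coords

image = np.zeros((1024, 768, 3), dtype=np.float32)
patches, coords = patchify(image)
print(patches.shape)          # (3072, 768): a 64 x 48 grid of 16x16x3 patches
print(coords[0], coords[-1])  # (0, 0) ... (63, 47)
```

Keeping a (row, column) coordinate per patch, rather than a single flattened index, is what allows a 2D rotary scheme to encode vertical and horizontal distances separately.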
We dropped a new model – Pixtral 12B, our first-ever multimodal model. Enjoy! 🥰🎉 https://t.co/uvXnpJf6mQ
— Sophia Yang, Ph.D. (@sophiamyang) September 11, 2024
Pixtral 12B features 12 billion parameters, a rough proxy for a model’s capacity: all else being equal, more parameters typically mean better performance on complex problems. For comparison, GPT-3, which OpenAI released back in 2020, has 175 billion parameters, suggesting Pixtral 12B still has a long way to go to compete with OpenAI’s far larger, four-year-old model.
Pixtral 12B is available for download via a torrent link, as well as on GitHub and Hugging Face. Mistral hasn’t clarified which license Pixtral 12B is released under, but several of Mistral’s previous models were released under Apache 2.0, so Pixtral 12B may follow the same licensing.
As of now, the model is free to use for research and academic purposes but requires a paid license for commercial use. Additionally, Sophia Yang, Mistral’s Head of Developer Relations, said the model will soon be available for testing on Mistral’s chatbot and API platforms, Le Chat and La Plateforme.