Google recently released and open-sourced Gemma 4 12B, a multimodal model designed to let consumer-grade devices run AI models locally. According to Google's tests, its relatively small 12B parameter count lets it run on laptops and desktops with 16GB of RAM or VRAM, while its benchmark performance is close to that of the 26B Gemma model.

Model advantages include:
New unified architecture: No separate multimodal encoders are needed; the model directly accepts text, image, video, and audio input.
Advanced reasoning capabilities: Benchmark performance is close to that of the 26B Gemma mixture-of-experts (MoE) model, enabling multi-step reasoning locally.
Low memory requirements: Only 16GB of RAM or VRAM is required to run locally, although more memory will improve performance.
Open-source release: The model is released under the Apache 2.0 license, and Google and the community provide full developer ecosystem support.
Predictive selectors: Gemma 4 12B ships with several token predictive selectors, which effectively reduce latency.
More about the model:
In benchmark tests, Gemma 4 12B comes close to the 26B MoE model that Google previously released as open source. The 12B version, however, has much lower memory requirements and can run directly on consumer-grade laptops and desktops with 16GB of RAM or VRAM, giving users capable multimodal interaction locally.
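As a rough back-of-the-envelope check on the 16GB figure, the sketch below estimates the weight footprint of a 12-billion-parameter model at a few common precisions. The precisions listed and the function name are illustrative assumptions, not figures published by Google, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate for a 12B-parameter model.
# Assumption: memory is dominated by weights; KV cache, activations and
# runtime overhead are ignored, so real usage will be higher.

PARAMS = 12e9  # 12 billion parameters, from the model name

# Illustrative precisions (assumed, not taken from Google's release notes).
BITS_PER_WEIGHT = {"bf16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(params: float, bits: int) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    gb = weight_memory_gb(PARAMS, bits)
    fits = "fits" if gb <= 16 else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {fits} in 16 GB")
```

Under these assumptions, 16-bit weights alone would exceed 16GB, so running within that budget presumably relies on a quantized build. The source does not say which precision Google tested.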
Another notable advantage is simplified handling of image, video, and audio input. Traditional multimodal models usually rely on separate encoders to convert images and audio and then pass the converted representations to the language model. Because these separate encoders add latency and memory usage, Google trained Gemma 4 12B with an encoder-free architecture so that the model ingests audio and visual input directly.
Vision: A lightweight embedding module replaces the Gemma 4 vision encoder. The module consists of only a single matrix multiplication, a position embedding, and a normalization operation, leaving visual processing to the model backbone itself.
Audio: Google removed the audio encoder entirely and projects the raw audio signal into the same dimensional space as the text tokens.
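To make the encoder-free idea concrete, here is a minimal toy sketch in dependency-free Python. It is not Google's implementation: the sizes (D_MODEL, PATCH, FRAME), the grayscale image, the order of operations, the unlearned normalization, and all function names are assumptions chosen for illustration. It only shows how one linear projection per modality can turn pixels or raw samples into vectors of the same width as text token embeddings.

```python
import math
import random

D_MODEL = 8   # hypothetical embedding width shared with text tokens
PATCH = 4     # hypothetical patch side length in pixels
FRAME = 16    # hypothetical number of raw audio samples per frame


def matmul_row(vec, weight):
    """Multiply one row vector (length n) by an n x d weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def normalize(vec, eps=1e-6):
    """Scale a vector to zero mean and unit variance (no learned gain)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def patchify(image, patch):
    """Split an H x W grayscale image into flattened patch x patch vectors."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


def embed_image(image, weight, pos):
    """Linear projection, then position embedding, then normalization."""
    tokens = []
    for idx, flat in enumerate(patchify(image, PATCH)):
        projected = matmul_row(flat, weight)  # same weight for every patch
        shifted = [p + q for p, q in zip(projected, pos[idx])]
        tokens.append(normalize(shifted))
    return tokens


def embed_audio(samples, weight):
    """Project fixed-size frames of raw samples into the D_MODEL space."""
    starts = range(0, len(samples) - FRAME + 1, FRAME)
    return [matmul_row(samples[s:s + FRAME], weight) for s in starts]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8x8 toy image
samples = [math.sin(0.1 * t) for t in range(64)]              # toy waveform

w_img = random_matrix(PATCH * PATCH, D_MODEL, rng)
pos = random_matrix(4, D_MODEL, rng)  # one position vector per patch
w_aud = random_matrix(FRAME, D_MODEL, rng)

image_tokens = embed_image(image, w_img, pos)
audio_tokens = embed_audio(samples, w_aud)

# Both modalities are now sequences of D_MODEL-wide vectors, the same shape
# as text token embeddings, so a single backbone could consume them.
print(len(image_tokens), len(image_tokens[0]))
print(len(audio_tokens), len(audio_tokens[0]))
```

The point of the design is that nothing modality-specific sits in front of the backbone beyond these cheap projections, which is where the latency and memory savings described above would come from.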
Try and download the model:
Gemma 4 12B is now available on multiple platforms. Interested developers can try it directly in Ollama and similar tools, or download the model weights from HuggingFace or Kaggle. Developers can also use Unsloth for efficient fine-tuning to build a customized version.
Ollama: https://ollama.com/library/gemma4
HuggingFace: https://huggingface.co/collections/google/gemma-4
Unsloth: https://unsloth.ai/docs/models/gemma-4
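For readers who want to script against a local Ollama install, the following is a minimal sketch using only Python's standard library and Ollama's local REST endpoint (/api/generate on port 11434 by default). The model tag "gemma4" is an assumption taken from the library URL above; the exact tag for the 12B build may differ, so check the Ollama page before use. The helper names are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumed tag, inferred from the library URL; verify it on the Ollama page.
MODEL_TAG = "gemma4"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the Ollama server running and the model pulled, calling generate("Describe an encoder-free multimodal model in one sentence.") would return the model's reply as a string.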