In December 2024, Microsoft launched Phi-4, a small language model (SLM) with state-of-the-art performance in its class. Today, Microsoft is expanding the Phi-4 series with two new models: Phi-4-multimodal and Phi-4-mini. Phi-4-multimodal handles speech, vision and text simultaneously, while Phi-4-mini focuses on text-based tasks.

Phi-4-multimodal is a 5.6B-parameter model and Microsoft's first multimodal language model, integrating speech, vision and text processing into a unified architecture. As shown in the table below, Phi-4-multimodal outperforms other state-of-the-art omni models, such as Google's Gemini 2.0 Flash and Gemini 2.0 Flash Lite, on multiple benchmarks.

In speech-related tasks, Phi-4-multimodal outperforms specialized speech models such as Whisper V3 and SeamlessM4T-v2-Large in both automatic speech recognition (ASR) and speech translation (ST). The model topped the Hugging Face OpenASR leaderboard with a word error rate of 6.14%.
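For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word-level substitutions, deletions and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. A WER of 6.14% therefore means roughly six word errors per hundred reference words. The sketch below illustrates that standard definition only; the function name and the simple normalization (lowercasing and splitting on whitespace) are assumptions for illustration, not the leaderboard's actual scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Normalization here (lowercase + whitespace split) is an illustrative
    assumption; real evaluations typically apply their own text normalizer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.2%}")
```

In this toy example the hypothesis contains one substituted word and one missing word against a six-word reference, so the rate is 2/6.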

In vision-related tasks, Phi-4-multimodal performs well in mathematical and scientific reasoning. It matches or exceeds popular models such as Gemini-2-Flash-lite-preview and Claude-3.5-Sonnet on common multimodal capabilities such as document and diagram understanding, OCR, and visual scientific reasoning.

Phi-4-mini is a 3.8B parameter model that outperforms several popular large-scale LLMs on text-based tasks including reasoning, mathematics, coding, instruction following, and function calling.

To ensure the security of these new models, Microsoft tested them with internal and external security experts, using strategies developed by the Microsoft AI Red Team (AIRT). Both Phi-4-mini and Phi-4-multimodal can be deployed on-device after further optimization with ONNX Runtime for cross-platform use, making them suitable for low-cost, low-latency scenarios.

Both Phi-4-multimodal and Phi-4-mini are now available to developers in Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog. Developers can consult the technical documentation to understand each model's recommended uses and limitations.

These new Phi-4 models represent a major advance in efficient AI, bringing powerful multimodal and text-based capabilities to a wide range of applications.