Microsoft releases Fara-7B, which runs directly on PCs and delivers performance comparable to GPT-4o

On November 24, Microsoft announced Fara-7B, a 7-billion-parameter AI model positioned as a "Computer Use Agent (CUA)" that can carry out complex tasks directly on the user's local device. Fara-7B not only achieves the best performance among models of its size, but also frees AI agents from dependence on huge cloud models, delivering low latency and stronger data privacy guarantees on resource-constrained systems.

According to reports, Fara-7B's architecture directly addresses the data security needs that matter most to enterprise users. Because the model is compact enough to run locally, users can automate sensitive workflows (such as internal account management or confidential data processing) without the relevant information ever leaving the device, which greatly improves privacy and compliance.

Fara-7B interacts with web pages by looking at the screen. It uses screenshots to visually perceive the page layout as a human would, then predicts coordinates to perform clicks, typing, scrolling, and other actions, without relying on the browser's underlying accessibility tree. Because it operates solely on pixel-level visual information, it can work on websites with messy code structures and pages that are difficult to parse.
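The screenshot-in, coordinates-out loop described above can be sketched roughly as follows. This is a minimal illustration, not Fara-7B's actual interface: the `Action` type, the `FakeBrowser` class, `run_agent`, and `scripted_policy` are all hypothetical names, and the "policy" is a hard-coded script standing in for the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """Hypothetical action format: a kind plus optional pixel coordinates or text."""
    kind: str                  # "click", "type", "scroll", or "done"
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


class FakeBrowser:
    """Stand-in for a real browser; it only records what it was asked to do."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        # A real agent would capture the rendered page here.
        return b"fake-png-bytes"

    def execute(self, action: Action) -> None:
        if action.kind == "click":
            self.log.append(f"click@({action.x},{action.y})")
        elif action.kind == "type":
            self.log.append(f"type:{action.text}")
        elif action.kind == "scroll":
            self.log.append(f"scroll:{action.y}")
        else:
            raise ValueError(f"unknown action kind: {action.kind}")


def run_agent(browser: FakeBrowser,
              policy: Callable[[bytes, int], Action],
              max_steps: int = 20) -> int:
    """Observe pixels, ask the policy for an action, execute it, repeat.

    Returns the number of actions executed before the policy said "done".
    """
    for step in range(max_steps):
        action = policy(browser.screenshot(), step)
        if action.kind == "done":
            return step
        browser.execute(action)
    return max_steps


def scripted_policy(screenshot: bytes, step: int) -> Action:
    """Placeholder for the model: replays a fixed script instead of predicting."""
    script = [
        Action("click", x=320, y=180),
        Action("type", text="hello"),
        Action("done"),
    ]
    return script[step]
```

Note that nothing in the loop reads the page's DOM or accessibility tree; the only observation passed to the policy is the screenshot, which is the property the article highlights.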

Yash Lara, senior product manager at Microsoft Research, said that processing visual input entirely on the device achieves true "pixel sovereignty": automation and data reasoning stay local, which meets the compliance needs of highly regulated industries such as healthcare and finance.

On the standard WebVoyager benchmark, Fara-7B achieves a task success rate of 73.5%, ahead of GPT-4o (65.1%), which consumes far more resources, and UI-TARS-1.5-7B (66.4%). Fara-7B also needs only about 16 steps on average to complete a task, compared with 41 for UI-TARS-1.5-7B, a significant efficiency gain. In addition, Fara-7B offers the best trade-off between accuracy and cost.
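To make the size of these gaps concrete, the short calculation below works only from the figures quoted above. The dictionary and function names are illustrative, and the numbers are the article's, not independently measured.

```python
# Figures as quoted in the article (WebVoyager success rate in %, average steps).
success_rate = {"Fara-7B": 73.5, "GPT-4o": 65.1, "UI-TARS-1.5-7B": 66.4}
avg_steps = {"Fara-7B": 16, "UI-TARS-1.5-7B": 41}


def point_gap(model: str, baseline: str) -> float:
    """Difference in success rate, in percentage points."""
    return round(success_rate[model] - success_rate[baseline], 1)


def step_ratio(model: str, baseline: str) -> float:
    """How many times more steps the baseline needs than the model."""
    return avg_steps[baseline] / avg_steps[model]
```

On these figures, Fara-7B leads GPT-4o by 8.4 percentage points and UI-TARS-1.5-7B by 7.1, while UI-TARS-1.5-7B takes roughly 2.6 times as many steps per task (41 / 16).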

However, Microsoft also stressed that the model still has problems common to AI systems in general, such as hallucinations and errors when handling complex instructions. To reduce risk, Fara-7B introduces a "critical point" mechanism: before any step that involves the user's personal data or an irreversible action (such as sending an email or performing a financial transaction), the model pauses and asks the user for confirmation. Microsoft has also designed a companion human-computer interaction UI, Magentic-UI, that lets users step in when needed without being interrupted excessively.
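The pause-and-confirm behavior described above can be pictured as a gate in front of action execution. The sketch below is an assumption-laden illustration, not Microsoft's implementation: the action labels, `is_critical`, and `guarded_execute` are hypothetical, and in the real model the decision to pause is made by the model itself rather than by a fixed lookup.

```python
from typing import Callable

# Hypothetical labels for actions that cannot be undone once performed.
IRREVERSIBLE_ACTIONS = {"send_email", "submit_payment"}


def is_critical(action: str, touches_personal_data: bool) -> bool:
    """An action needs confirmation if it uses personal data or is irreversible."""
    return touches_personal_data or action in IRREVERSIBLE_ACTIONS


def guarded_execute(action: str,
                    touches_personal_data: bool,
                    ask_user: Callable[[str], bool],
                    execute: Callable[[str], None]) -> bool:
    """Run the action, pausing for user approval first when it is critical.

    Returns True if the action ran and False if the user declined it.
    """
    if is_critical(action, touches_personal_data) and not ask_user(action):
        return False
    execute(action)
    return True
```

Routine actions such as scrolling pass straight through, so the user is only interrupted at the steps where a mistake would be costly.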

Fara-7B was developed with a "knowledge distillation" approach that compresses a large set of successful runs from a multi-agent system (145,000 autonomous navigation trajectories generated by Magentic-One) into a single model. Its base model is Qwen2.5-VL-7B, which offers a context window of up to 128,000 tokens and strong alignment between text and visual elements. Training relies on supervised fine-tuning, in which the model learns to imitate these successful demonstrations.
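One way to picture how trajectories become supervised fine-tuning data is to keep only the successful runs and flatten each into (input, target action) pairs. The sketch below assumes a simplified, hypothetical data layout (`Step`, `Trajectory`, `build_sft_pairs`) with strings standing in for screenshots; the article does not describe the actual pipeline at this level of detail.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str   # stand-in for a screenshot plus any page context
    action: str        # the action the teacher system took at this step


@dataclass
class Trajectory:
    task: str
    steps: List[Step]
    succeeded: bool


def build_sft_pairs(trajectories: List[Trajectory]) -> List[Tuple[str, str]]:
    """Turn successful trajectories into (input, target) pairs for imitation.

    Failed trajectories are dropped so the student only imitates runs
    that actually completed the task.
    """
    pairs: List[Tuple[str, str]] = []
    for traj in trajectories:
        if not traj.succeeded:
            continue
        for step in traj.steps:
            model_input = f"{traj.task}\n{step.observation}"
            pairs.append((model_input, step.action))
    return pairs
```

A single student model trained on such pairs learns to predict the next action directly, without running the multi-agent system that produced the demonstrations.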

Looking ahead, Microsoft stressed that it will not blindly scale up the model, and will instead focus on "making small models smarter and safer." The next step is to bring reinforcement learning (RL) in synthetic environments into training, so that Fara-7B can learn autonomously in a sandbox.

Fara-7B is now available on Hugging Face and Microsoft Foundry under the MIT license, which permits commercial use. However, Microsoft cautions that the model is not yet production-ready and is mainly suited to prototyping and testing.