When news spread that Google's Gemini 3 was about to launch, Musk moved first and quietly made a big release. Early this morning, xAI's latest large model, Grok 4.1, went live: response speed is significantly improved, the hallucination rate is significantly reduced, and the answers are both accurate and "humane".

This release comes in two "forms": Grok 4.1 and Grok 4.1 Thinking. The Thinking version is an enhanced reasoning variant of the former. Both are built on the same underlying model and differ only in inference configuration.

It is worth mentioning that Grok 4.1 is free for everyone. In addition to the Grok official website and X, it is also available in the mobile apps for both iOS and Android.


If you want a more in-depth and professional answer, you can switch to Thinking mode with one click and have it "think harder".

According to the latest LMArena results, Grok 4.1 Thinking leads by a clear margin with 1483 Elo, 31 points higher than Gemini 2.5 Pro.

Even without chain-of-thought enabled, Grok 4.1 still holds second place on the list, which shows the stability of the underlying capabilities.


Many netizens were impressed, exclaiming that it is "really good".


Of course, there are also some doubts. For example, some people pointed out that Grok is not very competitive at code generation.


"Dual-form" Grok 4.1 dominates LMArena

First, as for what Grok 4.1 and Grok 4.1 Thinking are, we might as well look at Grok 4.1's own explanation:

Grok 4.1 is the latest cutting-edge large language model (an upgraded version of Grok 4) released by xAI on November 17, 2025. It has greatly improved conversational intelligence, emotional understanding, creative writing, factual accuracy and response speed.

Grok 4.1 Thinking (codenamed quasarflux) is a thinking/reasoning mode of the same model. It additionally uses "thinking tokens" for chain-of-thought reasoning and is suited to complex mathematics, programming, or multi-step problems.

Grok 4.1 Thinking is an enhanced reasoning variant of Grok 4.1; both are based on the same underlying model and differ only in inference configuration.


On LMArena, the world's largest and most influential blind-testing platform for large models, Grok 4.1 showed breakthrough capabilities.

As an "unofficial standard leaderboard" generally recognized by the industry, LMArena evaluates model quality through anonymous double-blind battles and real user voting. It is a regular venue where leading companies such as OpenAI, Google, Anthropic, and Meta test new models, and it is often used to preview unreleased versions.

Therefore, winning here amounts to dual recognition of real user preference and the model's overall ability. It is the most credible indicator for observing a model's true strength.

In such a highly competitive public arena, xAI's Grok 4.1 series won a very valuable "double crown": the Grok 4.1 Thinking version took first place with 1483 Elo, while the non-reasoning version, Grok 4.1, took second with 1465 Elo.
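
To put the 18-point gap between the two Grok 4.1 variants in perspective, the sketch below converts an Elo difference into an expected head-to-head win rate. It assumes the classic logistic Elo formula with a 400-point scale; the leaderboard's actual rating procedure is not described here and may differ, and the function name `expected_win_rate` is purely illustrative.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes won by A under the classic Elo model.

    Assumption: standard logistic Elo with a 400-point scale. The leaderboard's
    own fitting procedure is not given in the article and may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article.
thinking, non_thinking = 1483, 1465

p = expected_win_rate(thinking, non_thinking)
print(f"Grok 4.1 Thinking vs Grok 4.1: {p:.1%} expected win rate")
```

Under this assumption, an 18-point lead corresponds to an expected win rate of a little under 53%, so the two variants are close to each other even though they hold the top two places.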

What is particularly noteworthy is that this "instant response" non-reasoning version actually outperforms every other vendor's reasoning models. For the first time, a "fast model" has reached the top tier of performance, leaving the previous-generation Grok 4, now in 33rd place, far behind.

The key behind these results lies in a rebuilt training method.

For Grok 4.1, xAI introduced a large-scale reinforcement learning system and used cutting-edge reasoning models as reward models, allowing it to evaluate responses autonomously and iterate quickly during training. This directly leads to more stable stylistic output, more reliable factual judgment, and a lower hallucination rate.

In the post-training phase of Grok 4.1, xAI focused its optimization on reducing hallucinations in information-retrieval prompts.

These changes in underlying methods quickly turned into significant factual improvements in actual testing. The latest data shows that Grok 4.1's hallucination rate has dropped from 12.09% to 4.22%, a nearly threefold reduction and one of the most prominent improvements in this upgrade.

To further verify this "more accurate" ability, the team also introduced a more stringent external benchmark. One of the most critical metrics is FActScore, a set of 500 biographical questions about real people designed to test the model's performance in search, fact determination, and answer consistency.


In this test, Grok 4.1's FActScore figure dropped from 9.89 to 2.97 (lower is better here, since the figure counts errors), an equally significant improvement in credibility. The official chart makes this clearer: in the same non-reasoning mode, Grok 4.1 makes fewer errors, shows smaller deviations, and produces more reliable output overall.
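
The two before/after pairs quoted above can be compared on a common footing. The following is a minimal sketch that treats both figures as lower-is-better error measures, which is how the article presents them; the helper names `relative_reduction` and `fold_change` are illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate fell, e.g. 0.25 means 'down 25%'."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def fold_change(before: float, after: float) -> float:
    """How many times smaller the new value is than the old one."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Figures quoted in the article (both treated as error measures).
metrics = {
    "hallucination rate": (12.09, 4.22),
    "FActScore errors": (9.89, 2.97),
}

for name, (before, after) in metrics.items():
    print(
        f"{name}: {before} -> {after}, "
        f"down {relative_reduction(before, after):.1%} "
        f"({fold_change(before, after):.2f}x lower)"
    )
```

Both measures fall by roughly two thirds, which is where the "nearly three times" description comes from.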

This means that in scenarios involving retrieving, referencing, or invoking external facts, the new model no longer relies on semantic guessing but can more accurately give evidence-based answers.

In other words, Grok 4.1 has taken a key step in "factual stability", the area that is hardest for large models to break through: it not only reduces the number of errors but also reduces "false confidence". This is exactly the threshold that large models must cross to move from "able to speak" to "worthy of trust".

Meanwhile, Grok 4.1's "emotional intelligence" has also made significant progress.

In the EQ-Bench test, Grok 4.1 scored 1586 Elo, more than a hundred points higher than the previous generation. If the numbers alone are not intuitive enough, the leaderboard makes the point: Grok 4.1 and the Thinking version firmly occupy the top two spots, leaving a number of flagship models behind. Established heavyweights such as GPT-5 Chat, Gemini 2.5 Pro, and Claude Opus 4 all trail by a clear margin.

EQ-Bench is an emotional intelligence benchmark for large models, scored by an LLM judge, that assesses active emotional understanding, insight, empathy, and interpersonal skills. Rather than relying on single-turn questions and answers, it consists of 45 role-playing scenarios of 3 turns each, simulating real-world emotional conversations. Models need to maintain a consistent style, understand the emotional context, and respond appropriately across consecutive turns. The final results are obtained through pairwise comparison and normalized as Elo scores. EQ-Bench can therefore serve as an authoritative leaderboard for testing each model's "emotional intelligence".


Why can Grok 4.1 achieve such outstanding results in EQ-Bench?

We can find the answer in an official comparison about comforting someone who has lost their cat.

The old version of Grok already replies gently and considerately, but Grok 4.1's wording is noticeably more delicate: it not only says "I understand your sadness" but also captures the subtler, more real details of the emotion, such as the empty sleeping spot, the meow you keep expecting but can no longer hear, and the sadness that returns like a tide. The tone is steadier, the rhythm is more natural, and the emotional resonance lands better. It reads like a conversation with someone who really understands you.


This puts Grok 4.1 in the top tier of emotional understanding.

In addition to its factual reliability, Grok 4.1's creative writing skills have also made a huge jump.

In Creative Writing v3, Grok 4.1's score jumped to 1722 Elo, almost 600 points higher than the previous version. Narrative rhythm, stylistic range, and creativity have all improved markedly.

The benchmark itself, Creative Writing v3, is not a simple "single round of scoring". In the test, the model is required to complete three independent rounds of writing across 32 different categories of prompts, covering complex tasks such as narrative, style imitation, world building, and portraying characters' emotions. What is tested is not the cleverness of a single sentence but sustained, stable writing ability. The scoring method is similar to EQ-Bench: a standardized Elo score is obtained from a scoring rubric and head-to-head model comparisons.


In this list, Grok 4.1 Thinking and Grok 4.1 occupy second and third place, only a dozen or so points apart, while other strong models such as o3, Claude Sonnet 4.5, Kimi K2, and the older Grok 3 are left well behind, forming a clear tier gap.

In other words, Grok 4.1 has entered the world’s strongest “creative writing echelon”.

From the official comparison of the old and new versions, we can clearly see that Grok 4.1 has jumped from a model that can write jokes to a creator with real literary touches: deeper narratives, more complex emotions, more mature rhetoric, and more immersive characters.


These upgrades are ultimately reflected in a better interactive experience. Grok 4.1 has a more stable "personality", a finer understanding of user intent, and more natural style adjustment. Even in non-reasoning mode, it can reliably produce high-quality answers while maintaining extremely fast response speed.

An intuitive example is the official travel-guide comparison. The old version of Grok produces something like an "encyclopedia-style overview of attractions": dense with information but lacking rhythm. When writing about San Francisco, Grok 4.1 reads like a local guide who has truly "been there" and "understands the atmosphere". It proactively suggests the best times for photos, recommends routes that suit you, and even conveys the city's particular character, so the exchange feels more like talking to a real person.


In complex task processing, Grok 4.1's context window is expanded to 256K tokens, and up to 2 million in Fast mode, allowing it to maintain high coherence and significantly reduce "fragments" in long document understanding, continuous collaboration, and large-scale content generation.

Overall, the improvement of Grok 4.1 is not a single breakthrough, but a comprehensive upgrade from performance and factuality to emotional intelligence, creativity and interactive experience.

Before its official debut, Grok 4.1 had actually quietly gone through a two-week "silent release". From November 1 to 14, 2025, xAI gradually switched a portion of real user traffic to Grok 4.1 on grok.com, X, and the mobile apps to observe its performance in real environments.

The most intuitive result of this stage is the 64.78% figure in the official pie chart: in blind pairwise comparisons, where users did not know which model they were seeing, Grok 4.1's answer was chosen as "better" 64.78% of the time. In other words, faced with the same question, users preferred Grok 4.1 in more than 60% of cases.

It can be said that the higher emotional understanding, more stable factual responses, and more natural interaction style demonstrated by Grok 4.1 were all "stamped" by real users' votes through silent testing.


Whether it is the LMArena double crown, the cliff-like drop in the hallucination rate, or the overall enhancement of creative writing and emotional capabilities, the new generation of Grok has moved from "strong functionality" to "strong experience" and has also delivered a very convincing answer for xAI in this year's large model competition.

We actually tested Grok 4.1

AI Frontline also tested Grok 4.1 hands-on.

First is reasoning ability. For this test, we designed a question that looks normal but actually contains a "trap": it has 2 sets of solutions (you can verify this yourself):

"Four students, Little A, Little B, Little C, and Little D, took part in a mathematics competition. After the competition, they made the following four statements about their rankings: (1) Little A said: 'I am not in first place.' (2) Little B said: 'I am not in last place either.' (3) Little C said: 'I am in second place.' (4) Little D said: 'I am not in last place.' It is known that only one of these four statements is true, and that the four rankings are all different.

Question: Which statement is true? What is each person's ranking? Please give your reasoning process."

Grok successfully found both sets of solutions and proactively offered a fix for the flaw in the question.


However, it should be noted that Grok actually stumbled when it volunteered this fix. It proposed that if Little C's statement were changed to "Little B is in second place", the answer would be unique.

But after this modification there are still multiple solutions. First, if only B is telling the truth, the ranking is uniquely determined as A1, C2, B3, D4. Second, if only D is telling the truth, only A1 and B4 can be determined; C and D take 2nd and 3rd place in either order, so the result is not unique.
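
Both claims are easy to check by brute force. The sketch below enumerates all 24 possible rankings and keeps those in which exactly one statement is true; the encoding of each statement as a lambda and the names `solve`, `original`, and `modified` are our own illustration, not anything Grok produced.

```python
from itertools import permutations

NAMES = ("A", "B", "C", "D")


def solve(statements):
    """Return (ranking, truthful speaker) for every ranking in which
    exactly one of the four statements is true."""
    solutions = []
    for ranks in permutations((1, 2, 3, 4)):
        rank = dict(zip(NAMES, ranks))
        truths = [statement(rank) for statement in statements]
        if sum(truths) == 1:
            solutions.append((rank, NAMES[truths.index(True)]))
    return solutions


# Statements in speaking order: A, B, C, D.
original = [
    lambda r: r["A"] != 1,  # A: "I am not in first place"
    lambda r: r["B"] != 4,  # B: "I am not in last place"
    lambda r: r["C"] == 2,  # C: "I am in second place"
    lambda r: r["D"] != 4,  # D: "I am not in last place"
]
# Grok's proposed fix: C instead says "B is in second place".
modified = original[:2] + [lambda r: r["B"] == 2] + original[3:]

for label, puzzle in (("original", original), ("modified", modified)):
    solutions = solve(puzzle)
    print(f"{label}: {len(solutions)} solution(s)")
    for rank, speaker in solutions:
        print(f"  only {speaker} is truthful: {rank}")
```

Run as written, the search reports two rankings for the original question and three for the modified one, so the proposed fix does not make the answer unique.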

Next, let's look at Grok's writing ability.

We gave it the following prompt:

Use a storytelling tone to tell the story of the release of Grok 4.1 by Musk's xAI accurately, vividly, and engagingly. Required length: 500-600 words. It must include the release time, product highlights, market background, and so on.

Grok 4.1's answer is as follows, and it also thoughtfully reported the word count: 578. However, Grok was probably counting English words (or is it just bad at math?); when we counted manually using Word, the result was 861 words.
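
One plausible reason for such a gap is that the two counts follow different conventions. The sketch below contrasts a whitespace-based word count with a count that treats each Chinese character as one unit. It rests on assumptions the article does not confirm: that the answer was written in Chinese, and that the manual count worked roughly character by character. The sample sentence and function names are hypothetical.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Runs of Latin letters/digits, allowing internal ".", "'" or "-" (e.g. "4.1").
LATIN_TOKEN = re.compile(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")


def whitespace_word_count(text: str) -> int:
    """English-style count: split on whitespace."""
    return len(text.split())


def cjk_aware_count(text: str) -> int:
    """Count each Chinese character as one unit, plus each Latin token."""
    return len(CJK_CHAR.findall(text)) + len(LATIN_TOKEN.findall(text))


sample = "马斯克的 xAI 发布了 Grok 4.1"  # hypothetical sentence, not Grok's output
print(whitespace_word_count(sample), cjk_aware_count(sample))
```

On mixed Chinese and English text the two conventions diverge quickly, which is why a stated length requirement should name the counting unit.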


Finally, we tested Grok 4.1's image generation capabilities, and the results are good: Grok generated two pictures from a single prompt, and they look very much like real photos (but please judge the details for yourself).


Moreover, it can also generate videos directly based on images with one click. The effect is as follows:


Interested readers can also try it out.