LLM ranking update: Google Bard surpasses GPT-4; Chinese players are not in the top ten

Today, Google Bard overtook GPT-4 on LMSYS's LLM leaderboard and jumped straight to second place (though it still trails OpenAI's latest GPT-4 Turbo model). With news this good, Google's chief scientist Jeff Dean was of course the first to "show off" and promote his own Gemini Pro model.

Ranking introduction

This LLM ranking (the Chatbot Arena benchmark platform) was initiated by LMSYS (Large Model Systems Organization), a group led by UC Berkeley researchers. Rankings are computed with the Elo rating system from random, anonymous 1v1 battles among LLMs.
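To make the Elo idea concrete, here is a minimal sketch of how ratings could be updated from a stream of 1v1 battles. It is an illustration only, not the leaderboard's actual code: the K-factor of 32, the starting rating of 1000, the scale of 400, the model names, and the choice to score every non-decisive vote as half a win are all assumptions made for this example.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a, rating_b, outcome_a, k=32.0):
    """Return new (rating_a, rating_b) after one battle.

    outcome_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k=32 is an assumed K-factor, not the leaderboard's setting.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (outcome_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rate_battles(battles, initial=1000.0, k=32.0):
    """Replay battles in order; each battle is (model_a, model_b, winner).

    winner is "a", "b", or "tie" (assumption: "both bad" votes also count as "tie").
    """
    outcome_for = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(
            ra, rb, outcome_for[winner], k
        )
    return ratings


if __name__ == "__main__":
    # Hypothetical battle log with made-up model names.
    log = [("model-x", "model-y", "a"), ("model-y", "model-x", "tie")]
    print(rate_battles(log))
```

Because each update depends on the ratings at that moment, the final numbers depend on the order in which battles are replayed; a small K-factor reduces that sensitivity.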

As shown in the figure below, you can ask any question. The answer of model A appears on the left and the answer of model B on the right, and you then rate the two answers. There are four options in total: "A is better; B is better; A is as good as B; A is as bad as B." If you can't decide after one round of chat, you can keep chatting until you pick the one you think is better, but if a model's identity is revealed during the chat, the vote is not counted.

The figure below shows the fraction of battles that model A won against model B (excluding ties).

The figure below shows the number of battles for each pair of models (excluding ties).

The figure below shows the average win rate of each model against all other models.
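As a rough sketch of how the quantities in these three figures could be derived from raw votes, the snippet below computes pairwise win fractions and then averages them per model. The record format, the model names, and the unweighted mean over opponents are assumptions for illustration; the leaderboard's own computation may differ.

```python
from collections import defaultdict


def pairwise_win_fraction(battles):
    """Map (model, opponent) -> fraction of decisive battles that model won.

    Each battle is (model_a, model_b, winner) with winner "a", "b", or
    anything else for a non-decisive vote, which is skipped.
    """
    wins = defaultdict(int)
    counts = defaultdict(int)
    for model_a, model_b, winner in battles:
        if winner not in ("a", "b"):
            continue  # ties are excluded, as in the figures
        counts[(model_a, model_b)] += 1
        counts[(model_b, model_a)] += 1
        if winner == "a":
            wins[(model_a, model_b)] += 1
        else:
            wins[(model_b, model_a)] += 1
    return {pair: wins[pair] / n for pair, n in counts.items()}


def average_win_rate(fractions):
    """Unweighted mean of a model's win fraction over the opponents it faced."""
    per_model = defaultdict(list)
    for (model, _opponent), frac in fractions.items():
        per_model[model].append(frac)
    return {model: sum(v) / len(v) for model, v in per_model.items()}


if __name__ == "__main__":
    # Hypothetical votes with made-up model names.
    votes = [
        ("x", "y", "a"), ("x", "y", "b"), ("x", "y", "a"),
        ("x", "z", "a"), ("y", "z", "tie"),
    ]
    fractions = pairwise_win_fraction(votes)
    print(fractions)
    print(average_win_rate(fractions))
```

Averaging per opponent rather than per battle keeps a model from looking strong just because it was matched unusually often against a weak rival.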

OpenAI dominates the list, but Chinese players are not in the top ten

The picture below shows the current top 10 on the list. The GPT-series models still hold a clear lead (three of the top four), while Anthropic's Claude series models occupy three of the top ten. Mistral, a company billed as the European version of OpenAI, also has two models in the top ten this time.

Also, look at the rightmost column in the picture above. Among the top 10 models, 9 are closed-source proprietary models, which shows that open-source models still have a way to go.

Unfortunately, no large language model from a Chinese player made the top ten.

The highest-ranking among them is the Yi-34B-Chat model from 01.AI, the startup founded by Kai-Fu Lee, ranked 13th.

It is followed by Alibaba's Tongyi Qianwen model Qwen-14B-Chat, ranked 36th:

Then come the ChatGLM series models from Zhipu AI, the startup of Tsinghua professor Tang Jie:

Three points need to be explained:

1. Many models developed by major Chinese companies may not be included in this list;

2. This list is open to the global public, so far more users chat in English than in Chinese, which may put large language models developed by Chinese players at a disadvantage;

3. This list only counts the random questions and chats of 200,000 users, so it reflects how real users judge their conversations with LLMs. However, because users' questions and levels of expertise vary widely, the evaluation is somewhat subjective.

Finally, let's talk about Google. With layoffs under way and scientists leaving to start their own businesses, the company faces troubles both inside and out (for details, see "Google's Crisis Breakout! Scientists are leaving to start businesses, employees are being laid off..."). Can Google pull off an "Empire Strikes Back" in 2024?

Let’s wait and see!