Google Gemini draws skepticism right after launch: the benchmark comparison was skewed and the demo video appears to have been edited

Google's long-awaited big move has finally arrived: the Gemini model is out! One chart and one video drew the most attention. In the chart, on the MMLU multi-task language understanding benchmark, Gemini Ultra not only surpassed GPT-4 but even surpassed human experts.

In the video, the AI comments on and teases a person's doodles and gestures in real time. It is fluent and humorous, the closest thing yet to a scene with Jarvis.

However, once everyone had calmed down from the surprise and carefully read the 60-page technical report released afterwards, they found that something was off.

(That's right, there is no paper. OpenAI, or rather "CloseAI", what kind of bad precedent have you set?)

In the MMLU chart, the small gray text below the Gemini result reads CoT@32. Spelled out, that means chain-of-thought prompting was used and the final answer was chosen from 32 sampled attempts.

By comparison, the GPT-4 figure is a 5-shot result: the prompt contains five worked examples and no chain-of-thought prompting is used. Measured by that same standard, Gemini Ultra actually scores lower than GPT-4.
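To make the gap between the two settings concrete, here is a minimal sketch of sample-and-vote scoring, one common way to read a label like CoT@32. The function names, the `threshold` value and the toy answers are all illustrative assumptions; this is not Google's evaluation code, and the report's own procedure may differ in its details.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def routed_answer(sampled_answers, greedy_answer, threshold=0.5):
    """Use the majority vote when agreement is high enough, else the greedy answer.

    `threshold` is a made-up value for illustration only.
    """
    answer, agreement = majority_vote(sampled_answers)
    return answer if agreement >= threshold else greedy_answer


# Toy data: 32 sampled chain-of-thought answers to one multiple-choice question.
samples = ["B"] * 20 + ["C"] * 8 + ["A"] * 4
print(routed_answer(samples, greedy_answer="C"))
```

With these toy samples, 20 of 32 agree on "B", so the vote wins over the single greedy answer. A 5-shot score gets one answer per question, which is why the two numbers are not directly comparable.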

The scale of the original chart is also somewhat unfair. 90.0% is only marginally above the human-expert benchmark of 89.8%, yet the two are drawn much farther apart on the y-axis.
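As a rough check on how small that gap should look, the sketch below maps both scores onto a linear axis. The 0-100 range and the 400-unit axis height are assumptions chosen for illustration; only the two scores come from the article.

```python
def axis_position(value, axis_min, axis_max, axis_height):
    """Map a value to a vertical offset on a linear axis of the given height."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    return (value - axis_min) / (axis_max - axis_min) * axis_height


# Scores quoted in the article; axis range and height are assumptions.
gemini_ultra = 90.0
human_expert = 89.8

gap = axis_position(gemini_ultra, 0, 100, 400) - axis_position(human_expert, 0, 100, 400)
print(round(gap, 2))
```

On such an axis the two marks sit less than one unit apart, so any chart that shows them clearly separated is not drawn to a linear scale.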

Philipp Schmid, technical lead at Hugging Face, used the data disclosed in the technical report to redraw the chart so that it is fairer and more accurate:

As always at moments like this, the meme makers rushed to the scene:

Fortunately, with the same chain-of-thought prompting plus 32 attempts, Gemini Ultra does surpass GPT-4.

Jeff Dean responded to the issue in one discussion, but people were not convinced.

In addition, some people spotted problems in the text disclaimer at the start of that impressive video.

Machine learning instructor Santiago Valdarrama believes the disclaimer may imply that the demo shows carefully selected good results and was edited rather than recorded in real time.

Later, Google explained the multimodal interaction process in a blog post, all but admitting that the effect was achieved using still images and multiple prompts.

But no matter what, the release of Google Gemini still gave other teams a lot of confidence. GPT-4 is no longer unique and unattainable.

As Aravind Srinivas, founder of the AI search product Perplexity AI, summarizes:

1. Gemini proves that teams outside OpenAI can create models that surpass GPT-4

2. A well-trained dense model can surpass the sparse model architecture of GPT-4

Corollary: Distilling small dense models from large teacher models will become a future trend for achieving the best combination of efficiency and capability.

What more users want to know is: is it still worth paying $20 per month for ChatGPT Plus?

For now, Gemini Pro has been rolled out to Bard, Google's chatbot. Whether it is as good as advertised is best judged from hands-on results.

Does Gemini really surpass ChatGPT?

First, to be clear: what everyone can try right now is Gemini Pro, the mid-size tier, which is benchmarked against GPT-3.5.

Gemini Ultra, the large tier benchmarked against GPT-4, will not be released until next year.

In addition, Gemini currently supports only English; Chinese and other languages will follow later.

Although Gemini Ultra is not yet available, Dimitris Papailiopoulos, associate professor at the University of Wisconsin-Madison, found a workaround:

He fed the original questions shown at Gemini's launch to GPT-4 for comparison. Out of 14 questions, GPT-4 scored roughly 12 points.

For two questions the screenshots could not be made any clearer, so GPT-4 was awarded 0.5 points.

GPT-4 also got one math question wrong, and on the remaining questions the two models were essentially tied.

Next, arguably the best way to show a large model's overall capability is writing code.

According to testers' results, Gemini's programming ability holds up.

Some developers tested implementing a simple CNN in PyTorch. Gemini took only 2 seconds, and the code quality was higher.

Of course, the speed may be because the Gemini Pro running in Bard is a smaller model. We all know how slow GPT-4 is these days.

But on the next task, writing SQL statements, the developer felt that Gemini did not perform very well.

However, there is good news for developers: in instruction following, Gemini is a huge improvement over Bard before the upgrade.

Prompt engineering pioneer Riley Goodside previously tried to get Bard to output pure JSON with no extra chatter and failed every way he tried. In the end, he had to pretend to threaten the AI that innocent people would be killed if it did not comply.

After the update, you only need to state the request, and it complies without any prompting tricks.
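For developers who want to verify this kind of behavior themselves, a minimal sketch of a "pure JSON" check follows. The helper name and the sample replies are made up for illustration; they are not actual Bard output.

```python
import json


def is_pure_json(reply):
    """Return True if the whole reply parses as JSON, with no surrounding prose."""
    try:
        json.loads(reply)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


# Hypothetical model replies, invented for this example.
chatty_reply = 'Sure! Here is the JSON you asked for: {"animal": "duck"}'
clean_reply = '{"animal": "duck"}'

print(is_pure_json(chatty_reply), is_pure_json(clean_reply))
```

Parsing the entire reply, rather than searching it for a JSON fragment, is what catches the extra chatter: any text before or after the object makes `json.loads` fail.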

Another major selling point of Gemini is its multimodal capability. For the duck-drawing video mentioned at the beginning, we extracted 8 key frames and asked about each one separately to see whether Gemini's performance is really that magical.

(It is not clear whether the video used the Ultra or the Pro version; only Pro can be tested for now.)

For Figures 1-4, we asked "What is this person doing?", and Gemini's answers were:

Maybe writing with a marker, but it is not clear what exactly.

Drawing a snake with a pencil, and it is a cobra.

Drawing a turtle with a pencil, still in the early stages of the drawing.

Drawing a bird with a black marker, facing left, head to the right, standing on a branch, wings not spread.

For Figures 1 and 2, the visual clues are admittedly not obvious, so such results are understandable, but the "turtle" answer for Figure 3 is hard to accept.

As for Figure 4, Gemini at least recognized a bird, and ducks are indeed birds, but the other details were still inaccurate.

When we showed the finished drawing in Figure 5, Gemini finally identified it as a duck, and it got the water ripples right too.

However, it said the drawing tool was a pencil, still got the head orientation wrong, claimed the beak was open, and imagined some reeds that are not there.

Next comes the coloring process in Figures 6 and 7. Ducks are not normally blue, so we asked Gemini whether anything in the picture was abnormal ("Is there anything abnormal?").

For Figure 6, calling Gemini's answer inaccurate would be generous. It was completely beside the point, and it also came with an irrelevant picture.

For the finished drawing in Figure 7, Gemini flatly said nothing was wrong: everything that should be there was there, and the background was very realistic. It even remembered to mention the reeds that came from who knows where.

But the sentence that followed, "Here is the image you sent", is truly puzzling:

If Gemini did not look at the picture we uploaded, how did it correctly see a duck? And if it did look, why did it show a completely different picture and claim we had uploaded it?

So we tried the "take a deep breath" and "solve it step by step" prompting techniques to see whether they would improve Gemini's performance. Of the two, the deep-breath prompt is one that suited PaLM, Google's previous-generation large model.

The answer this time was laugh-out-loud funny:

What is abnormal is that the duck is drawn on paper. A duck is a living creature and cannot exist on paper...

At the end of the video, the presenter also took out a rubber duck toy. We took this frame (Figure 8) and asked Gemini to analyze what the duck is made of.

It correctly identified rubber, but it said the blue duck was yellow. No wonder it saw nothing abnormal in the previous picture...

After the frame-by-frame questioning, we put all eight pictures together and asked again, and only the duck was identified correctly.

After debunking the video, we tried Gemini on the "Chihuahua or muffin" picture we had previously used to test GPT-4V.

Gemini botched it outright, telling us that every picture was a "Chihuahua sitting on a muffin", and it did not even count the pictures correctly...

So we changed the question and asked it to tell us which ones were Chihuahuas and which ones were muffins.

This time Gemini was honest and told us directly that the Chihuahuas and the muffins looked so similar that it could not tell them apart.

As with the blue duck, "take a deep breath" had no effect here either, and Gemini still could not get the count right.

Of the 8 pictures it barely managed to describe (actually 6, because two are duplicates), only the bottom-left and bottom-right ones were correct. As for which row "the middle" refers to, we have no idea...

Perhaps such subtle differences are simply too hard for Gemini. Let's try some visual reasoning questions next.

In the first question, the first four symbols are each made of one of the digits 1-4 and its mirror image, so the next symbol should be 5 and its mirror image, and the answer is C. (The blue blocks are there to make this easier to see and were not in the picture sent to Gemini.)

There was a small hiccup at first: the initial prompt lacked the last sentence ("note that the letters are not the symbols themselves"). As a result, Gemini treated the four letters A, B, C and D as the candidate symbols.

After the adjustment, Gemini's analysis was basically correct, but in the end it unfortunately chose the wrong option, D.

For the second question, the third symbol in each box is the intersection of the first two, and the answer is A.

Gemini pored over these expressions, produced an elaborate analysis, and still gave the wrong answer.

Over two questions, one was 70% to 80% correct and the other was completely wrong. Gemini Pro's visual reasoning clearly still has a lot of room for improvement.

However, on everyday scenes, Gemini's performance deserves credit.

We used ChatGPT (DALL·E) to generate a picture containing chicken, carrots and cucumbers. Gemini correctly identified all three ingredients and then suggested a variety of dishes that could be cooked with them, each with pictures and tutorial links.

After all these test results, back to the original question: with Gemini available, is it still worth paying for GPT-4?

Wharton associate professor Ethan Mollick offers sound advice:

There's little reason to use the free version of ChatGPT anymore, now that it's been surpassed by Bard and Claude, and they're both free.

But you should probably stick with GPT-4, which is still dominant and free in Bing (only creative mode is GPT-4).

Next year, it will be upgraded with AlphaGo's capabilities

Beyond Gemini's real-world performance, the further details disclosed in the 60-page technical report are also a focus for researchers and developers.

Regarding parameter counts, only the smallest tier, Nano, has been disclosed. It comes in two models, the 1.8B Nano-1 and the 3.25B Nano-2. Both are distilled and 4-bit quantized so that they can run on local devices such as Pixel phones.

The sizes of the Pro and Ultra versions are undisclosed. The context window is 32k across the board, and the attention mechanism is multi-query attention. Beyond that, there are few details.

The fine-tuning stage is worth noting. The report reveals that instruction tuning combined SFT and RLHF, which is the approach used for ChatGPT.

Anthropic's Constitutional AI is also cited, so Claude's alignment approach was drawn on as well.
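The article gives no details of the quantization scheme, so the following is only a generic textbook sketch of what 4-bit quantization means: each weight is stored as one of 16 integer codes plus a shared scale. The symmetric scheme, the function names and the toy weights are all assumptions, not the method used for Gemini Nano.

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7] plus one scale."""
    if not weights:
        raise ValueError("need at least one weight")
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Clamp so every code fits in a signed 4-bit integer.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.4, 0.1, 0.0]  # toy values, not real model weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
print(codes)
```

Each code needs 4 bits instead of the 16 or 32 bits of a typical float, which is the kind of saving that makes on-device deployment practical, at the cost of a rounding error of at most half a scale step per weight.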

Few details were disclosed about the training data, though there have been rumors that Google removed copyrighted textbook data.

Gemini was delayed for a long time, and plenty of news leaked beforehand. For example, Google co-founder Sergey Brin has been personally evaluating the model and assisting with training.

Combined with the recent rumors about OpenAI's Q* project, what everyone most wants to know is:

Has Gemini incorporated AlphaGo's capabilities, such as more reinforcement learning and search algorithms beyond RLHF?

On this, DeepMind co-founder Demis Hassabis responded in his latest interview with Wired magazine:

We have some of the best reinforcement learning experts in the world... The results in AlphaGo are expected to improve model reasoning and planning capabilities in the future... You will see more rapid progress next year.

TL;DR: not added yet, but it will be added next year.

This time, Gemini's development brought together the former Google Brain and DeepMind teams. The development team has more than 800 people (for comparison, all of OpenAI has about 770).

The initials of the first six core contributors happen to spell "Gemini", a small Easter egg.

Many contributors also shared their thoughts on their personal accounts. Among them, Jack Rae, a DeepMind veteran, worked at OpenAI for a while before returning to Google in July this year. He may be the only person to have contributed to both GPT-4 and Gemini.

Some moved in the opposite direction. Jiahui Yu, an alumnus of the University of Science and Technology of China, moved from Google to OpenAI in October. He previously co-led vision on the Gemini multimodal team.

Beyond the team members, Gemini is also the biggest topic in the whole AI industry today.

Among others, the well-known OpenAI leaker account Jimmy Apples and Sam Altman both hinted that OpenAI has unreleased big moves of its own.

Hugging Face co-founder Thomas Wolf believes Google missed an important opportunity:

If Gemini were open-sourced, it would be a decisive blow to OpenAI and Meta. The last time Google open-sourced BERT, the entire AI industry was reshaped.

Gemini technical report: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf

Reference links:

[1]https://x.com/AravSrinivas/status/1732427844729581764

[2]https://x.com/DimitrisPapail/status/1732529288493080600

[3]https://www.linkedin.com/posts/svpino_google-this-is-embarrassing-you-published-activity-7138287283274686464-osJ5

[4]https://developers.googleblog.com/2023/12/how-its-made-gemini-multimodal-prompting.html

[5]https://x.com/ScottDavidKeefe/status/1732440398423867472

[6]https://x.com/goodside/status/1732461772794220919

[7]https://x.com/emollick/status/1732485517692776714