Google has finally arrived with Gemini, and its multimodal capabilities have stunned the Internet. The next-generation model will incorporate AlphaGo's deep reinforcement learning techniques and is expected in 2024. The model that can truly challenge GPT-4 is Google Gemini. As soon as Gemini was released, demonstrations of its multimodal capabilities swept the Internet, and GPT-5 instantly became a trending topic.


PaLM 2 is being set aside: Google Bard and the rest of Google's product line, including its office suite, will be reborn with Gemini behind them.

Google officials said that Gemini Ultra, the largest size, will be released next year.


Before Gemini was officially released, people who had been exposed to internal testing commented, "If 2023 is the first year of large models, 2024 is likely to be the year of Gemini."

As Demis Hassabis, head of Google DeepMind, said, the era of Gemini has arrived.

It was also revealed that AlphaGo's deep reinforcement learning techniques are being integrated into Gemini, and that the next version, due in 2024, will be a major leap forward.

32K context, three sizes

ChatGPT has been in the limelight since its launch, which made Sergey Brin, the Google co-founder who had retreated behind the scenes, anxious.

In July, it was reported that he had returned to the company to take part in developing its next-generation AI system.


His name is clearly listed in the list of authors of the Gemini paper.


https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf


Netizens have condensed the 60-page Gemini technical report into the following key points.


1. Written in JAX and trained on TPUs. Although not explained in detail, its architecture appears to be similar to Flamingo.

2. Gemini Pro's performance is similar to GPT-3.5, while Gemini Ultra is said to be better than GPT-4. Nano-1 (1.8B parameters) and Nano-2 (3.25B parameters) are designed to run on-device.

3. 32K context length.

4. Very good at understanding vision and speech.

5. Coding ability: on HumanEval, Gemini Ultra shows a large jump over GPT-4 (74.4% vs. 67%). However, the Natural2Code benchmark shows a much smaller gap (74.9% vs. 73.9%).

6. Regarding MMLU: relying on CoT@32 (chain-of-thought with 32 samples) to show that Gemini beats GPT-4 seems a bit of a stretch. In the 5-shot setting, GPT-4 is better (86.4% vs. 83.7%).

7. No information about the training data is given, other than an assurance that "all data enrichment workers are paid at least a local living wage".
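To make point 6 more concrete, here is a minimal sketch of how a k-sample chain-of-thought score such as CoT@32 can be computed. It assumes a consensus vote over the sampled answers with a fallback to a single greedy answer when agreement is low. The function names, the 0.5 threshold, and the sample answers are all illustrative assumptions, not values taken from the report.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples that chose it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def cot_at_k(sample_answers, greedy_answer, threshold=0.5):
    """Use the consensus answer if agreement reaches the threshold,
    otherwise fall back to the greedy answer (threshold is an assumption)."""
    answer, share = majority_vote(sample_answers)
    return answer if share >= threshold else greedy_answer


# 32 hypothetical final answers sampled for one multiple-choice question
samples = ["B"] * 20 + ["C"] * 8 + ["A"] * 4
print(cot_at_k(samples, greedy_answer="C"))
```

A score obtained this way spends 32 model calls per question, which is why comparing it with a single-pass 5-shot score is not an apples-to-apples comparison.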


Language understanding and generation performance of the Gemini models in three sizes (medium, large, and extra large) with different capabilities.


The following figures show the key comparative data.

Gemini performance on text benchmarks, compared with external models and PaLM 2-L.


In terms of image understanding, Gemini Ultra consistently outperforms all other models.


Evaluation results on speech benchmarks show that Gemini Pro outperforms other models in speech recognition and automatic speech translation.


Netizen comments

Interleaved text image generation

One developer, Brian Roemmele, found Gemini Ultra to be slightly better.

According to the technical report, the Gemini Ultra model is trained extensively on YouTube data, so it can reason over a series of still frames from a movie scene ("The Matrix") and write a text narrative from them.

After trying the same test on GPT-4 Turbo in ChatGPT, Roemmele found that it could not produce this kind of output.



Gemini Ultra can also respond with a combination of images and text. This is called "interleaved text and image generation".

This is possible because the model is trained on multimodal inputs.


In the following example, Gemini Ultra generates text and images that go from balls of yarn to finished knitted items.


Multimodal+Tools

In this example, Gemini Ultra makes full use of its multimodal training and fine-tuning when performing a task.

Synergy on this scale is a first among current AI models. It combines multimodality with tool use: drawing a picture to search for music.



Revealing the "Magic"

What's even more amazing is that Gemini Ultra can also understand magic tricks.

Roemmele said that the way it identifies a classic magic trick shows the strengths of Gemini's unified multimodal model. Thanks to the YouTube videos the model was trained on, it can follow the sequence of events and draw conclusions through logic.


Now it is time to witness the trick. Gemini is given a picture of a coin in the right hand and asked to describe it.


After the coin disappears, Gemini Ultra can summarize, step by step, everything it has just watched.



Finally, the results are derived based on logical reasoning.


Gemini multi-modal test questions

A Reddit user uploaded a screenshot that he said was the result of an actual test of Gemini.

The picture shows a high school student's handwritten solution to a physics problem. The user asked Gemini to check the solution and, if it contained an error, to give the correct answer.

Gemini read the student's reasoning from the picture and answered the question correctly.


When we gave the same question to GPT-4 ourselves, its first attempt suddenly stalled halfway through the answer.


When it was asked to answer again, GPT-4 correctly judged that there was a problem with the student's problem-solving ideas and gave the correct answer.


The same thing happened to the Reddit user.


Gemini Nano brings the large-model war to the mobile phone

The Gemini release is not only Google's response to the large-model state of the art that OpenAI defined with GPT-4; it also brings the large-model war directly to mobile devices. Now the pressure is on Apple.

Gemini's three versions, Ultra, Pro, and Nano, are optimized for everything from data centers to mobile phones and can meet the needs of different users in different scenarios.


Gemini Nano is the most efficient model Google has built for on-device tasks. It is already running on Google's Pixel 8 Pro.

As the first smartphone designed specifically for Gemini Nano, Pixel 8 Pro uses Google's AI-focused Tensor G3 SoC to offer two expanded features: Summarize in Recorder and Smart Reply in Gboard.

Running Gemini Nano locally keeps users' sensitive data on the phone and lets them use large-model capabilities without a network connection.

In addition to Gemini Nano, which is now running on Pixel 8 Pro, Pixel phones will in the future be able to access more powerful Gemini versions through Bard's assistant features.

Summarize in Recorder

Gemini Nano can now perform AI summarization of content in the audio recorder on Pixel 8 Pro.

Users can directly generate summaries of their recorded conversations, interviews, presentations, etc. without being connected to the Internet.



This feature helps users quickly and clearly sort through lengthy recordings for later use and organization, which is genuinely convenient.

Smart Reply in Gboard

On Pixel 8 Pro, Gemini Nano powers the Smart Reply feature in Gboard.


The on-device model is now available to try in WhatsApp, with more apps to follow next year. It uses conversation awareness to suggest high-quality replies and save users time.


The Gemini era is coming

Demis Hassabis, the head of Google DeepMind, is also very excited, saying that "the era of Gemini has arrived."


In his latest interview with Wired, Hassabis said bluntly that Gemini, the AI model Google announced today, opens up an untrodden path for artificial intelligence and may lead to major new breakthroughs.

"As a neuroscientist and computer scientist, I have wanted to try and create a new generation of artificial intelligence models for many years. These models are inspired by the way all of our senses interact and understand the world."

"Gemini is a big step towards this 'multi-modal' model."


He continued, "To date, most models have achieved multimodal capabilities by training separate modules and then stitching them together."

"That is fine for some tasks, but it does not allow deep, complex reasoning in multimodal space."

This seems to be an allusion to OpenAI’s technology.

ChatGPT's multimodal capabilities are widely understood to come from a combination of the GPT-4, DALL·E 3, and Whisper models.

At the Google I/O developer conference in May this year, Pichai officially announced for the first time that Google was training a new, more powerful successor to PaLM named Gemini.
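As a toy illustration of the "stitching" Hassabis describes, the sketch below passes an audio clip through a separate speech-to-text stage before a language model sees it. Everything here is hypothetical (the `AudioClip` type, the `transcribe` stub, the `tone` field); it does not reflect how GPT-4, Whisper, or Gemini is actually implemented. It only shows how information can be lost at the boundary between modules.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    words: str  # what was said
    tone: str   # how it was said; carried only by the audio signal


def transcribe(clip: AudioClip) -> str:
    """Hypothetical speech-to-text stage: only the words survive."""
    return clip.words


def stitched_input(clip: AudioClip) -> dict:
    """Stitched pipeline: the language model sees the transcript and nothing else."""
    return {"words": transcribe(clip)}


def joint_input(clip: AudioClip) -> dict:
    """Idealized jointly trained model: the whole signal is available."""
    return {"words": clip.words, "tone": clip.tone}


clip = AudioClip(words="Great, another meeting.", tone="sarcastic")
print(stitched_input(clip))
print(joint_input(clip))
```

In the stitched case the downstream model can never reason about the speaker's tone, because that information was discarded before it arrived.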


The name Gemini is also meaningful: it commemorates the merger of the two labs, Google Brain and DeepMind, and pays tribute to NASA's Project Gemini.

Over the past seven months, leaks about Gemini have emerged one after another.

Now, having developed Gemini at astonishing speed, Google has launched a major counterattack before the end of the year.

Hassabis said the new model's ability to handle different forms of data, including data beyond text, was a key part of the project's vision from the beginning.

Many AI researchers believe that the ability to leverage data in different formats is a key capability of natural intelligence that machines lack.

Large AI models such as ChatGPT have gained flexible and powerful generalization capabilities by learning from massive amounts of Internet data.

But while ChatGPT and similar chatbots can use the same techniques to discuss or answer questions about the physical world, this superficial understanding can quickly fall apart.


Many artificial intelligence experts believe that significant progress in machine intelligence will require AI systems to be given bodies in physical reality, that is, "embodied."

Hassabis said that Google DeepMind is already studying how to combine Gemini with robotics to physically interact with the world.

"To be truly multimodal, you need to include touch and tactile feedback. There's a lot of promise in applying these basic models to robotics, and we're exploring that vigorously."

Currently, Google has taken a small step in this direction.

In May 2022, the company announced an AI model called Gato that can learn to perform a variety of tasks, including playing Atari games, captioning images, and stacking blocks with a robotic arm.

In July this year, the Google RT-2 robot model used language models to help robots understand and perform actions.


In order for an AI agent to be more reliable, the algorithms that power it need to be smarter.

Some time ago, OpenAI was reported to be developing a project called "Q*". Netizens speculated that it may use reinforcement learning, the core technology behind AlphaGo.

Hassabis said that Google is currently conducting research along similar lines.

He said AlphaGo's advances are expected to help improve planning and reasoning in future models like the one launched today. "We are working on some interesting innovations to bring into future versions of Gemini."

"Next year, you will see Gemini's super evolution."

It seems that, as netizens said, we are not far away from the day when GPT-5 comes.


References:

https://twitter.com/sundarpichai/status/1732414873139589372

https://www.wired.com/story/google-deepmind-demis-hassabis-gemini-ai/