Give your imagination a rest. ChatGPT is updating faster than netizens' imaginations can keep up. On Monday, OpenAI announced a major update that gives ChatGPT full multimodal capabilities. In the future, if something feels off about the shared bike you're riding home from work, you can snap a photo of the part and ask ChatGPT directly.
Then, back home, staring into a fridge full of random ingredients with no idea what to cook, ChatGPT can tell you which items to pull out for dinner.
And if you're still not sleepy after dinner and bedtime, it can even provide some ASMR, in case you've grown tired of the usual creators on Bilibili or YouTube.
In September 1985, Italo Calvino, the author of "Invisible Cities", died of a sudden stroke. Earlier that summer he had sought medical help for headaches; the surgeon who operated on him said he had never seen a brain so complex and delicate.
ChatGPT started out as an incredibly beautiful, invisible brain. Now it finally has eyes, ears, and a mouth.
Netizens around the world: come on then, show us what you've got.
Source: Twitter
Someone tried it, and found it can essentially handle software development for you.
A software project is born roughly like this: first you sketch a wireframe on a whiteboard, work out the layout logic, then write the code, and finally produce the interface. Now the whiteboard part is yours; everything after the whiteboard is ChatGPT's.
One developer photographed his wireframe, handed it to ChatGPT, and it wrote the software outright.
He also played a few little tricks, such as marking the layout with irregular arrows instead of the usual notation. ChatGPT not only noticed but took it in stride.
We probably still underestimate what multimodality will bring.
Here, artificial intelligence developed in the opposite order from human intelligence. Humans had eyes first: only after seeing the world did they form language and logic, which in turn let them describe and understand what they saw. Six million years of improving human intelligence has been one giant machine-learning furnace.
ChatGPT, by contrast, already has a top-tier level of intelligence and can understand a great deal. What limited it was text's compression of information, which kept it from reaching more complex problems. So what happens when you give such a brain a pair of eyes? Once it can see image information directly, its ability to break down problems explodes.
Someone fed ChatGPT an interface mockup of a SaaS product and asked it to break the UI down into small components and write out all the code. It did.
You can even hand it a rough screenshot of Unity's editor and ask it to walk through the process of adding actions to a model.
Source: Twitter
With its multimodal capabilities opened up, ChatGPT's understanding and reasoning have become more intuitive, even a little scary.
Give yourself a minute and see if you can work out what this set of pictures means:
Source: Twitter
This is ChatGPT’s interpretation:
Source: Twitter
"This set of comics seems to emphasize the importance of communication, understanding and alignment in a team." ChatGPT concluded at the end.
This level of understanding left Pietro Schirano, an AI engineer who previously worked at Facebook and Uber, speechless.
Beyond the eyes, there are also ears and a mouth.
Behind this upgrade, ChatGPT's speech recognition is based on the open-source Whisper model, and its voice generation on a separate TTS (text-to-speech) model. Speech synthesis currently supports five voices, all produced in collaboration with professional voice actors.
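The voice mode described above amounts to a three-stage loop: speech in, text through the language model, speech back out. Here is a minimal sketch of that loop in Python. The `transcribe`, `complete`, and `synthesize` functions are hypothetical stand-ins for the Whisper, chat, and TTS models respectively, and the voice name is invented for illustration; none of this is OpenAI's actual API.

```python
# Sketch of the voice-mode pipeline: ASR (Whisper-style) -> LLM -> TTS.
# All three model calls are stubbed out; only the wiring is real.

def transcribe(audio: bytes) -> str:
    """Stand-in for a Whisper-style speech-to-text model."""
    # For the sketch, pretend the audio bytes already "are" the transcript.
    return audio.decode("utf-8")

def complete(prompt: str) -> str:
    """Stand-in for the chat model producing a text reply."""
    return f"Reply to: {prompt}"

def synthesize(text: str, voice: str = "breeze") -> bytes:
    """Stand-in for the TTS model; `voice` mimics picking one of the presets."""
    return f"[{voice}] {text}".encode("utf-8")

def voice_turn(audio_in: bytes, voice: str = "breeze") -> bytes:
    """One conversational turn: hear, think, speak."""
    user_text = transcribe(audio_in)      # ears
    reply_text = complete(user_text)      # brain
    return synthesize(reply_text, voice)  # mouth

audio_out = voice_turn(b"Has any user tried to sing karaoke with you?")
print(audio_out.decode("utf-8"))
```

The point of the sketch is that the three capabilities are separate models glued together, which is why speech recognition (open-source Whisper) and speech generation (a closed TTS model) can come from different places.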
But watching ChatGPT on two phones discuss itself in front of you ("Has any user tried to sing karaoke with you?" it asks, not you, but the other ChatGPT) feels a little ahead of its time.
It also seems to have potential as a therapist. Lilian Weng, a member of OpenAI's safety team, had a very emotional private conversation with ChatGPT in voice mode about stress and work-life balance.
"The funny thing is, I feel heard and warmed," Lilian Weng said on Twitter. She suggested that if you only use it as a productivity tool, it's better to try its more delicate side.
Source: Twitter
As for ChatGPT's own evolution, opening up multimodal capabilities that had already been trained in 2022 also lays a new foundation for what comes next.
A month ago, on a podcast hosted by Pieter Abbeel (John Schulman's advisor when he studied reinforcement learning during his PhD at UC Berkeley), ChatGPT chief architect John Schulman said he felt that the performance gains from existing data and model-scaling methods may hit a ceiling after a while, after which improvements from algorithms, datasets, dataset size, and compute will gradually diminish.
"So adding multimodal capabilities will bring huge performance improvements. This allows the model to gain knowledge that cannot be obtained from text and potentially master tasks that pure language models cannot accomplish. For example, models can gain huge benefits from watching videos interacting with the physical world or even with computer screens. All software is designed for humans, and if the model can observe pixels and understand the video, we can use all kinds of existing software or help people use it. Giving the model new capabilities and allowing the model to interact with new things will greatly enhance the actual capabilities of the model."
So what will ChatGPT be able to do next month? We can hardly wait.