OpenAI officially launches multimodal ChatGPT, which can see, hear, and speak

On Monday evening Beijing time, OpenAI, the well-known artificial intelligence startup, published an announcement titled "ChatGPT can now see, hear, and speak", saying the new features will roll out to paying users over the next two weeks. At the GPT-4 launch event in March this year, the most striking moment was arguably when OpenAI president Greg Brockman drew a sketch of a website on a piece of scratch paper, took a photo of it, and had GPT-4 generate the code for the site in 10 seconds.

(Source: OpenAI)

ChatGPT previously launched a "code interpreter" feature that accepts image uploads and offers some preliminary ability to process images and photos of text. But there is no doubt that today's "take a photo and ask a question" feature is much closer to how most users expect to use an AI assistant.

Photograph your refrigerator and it tells you what to eat tonight

Following the order in the title, today's update has two main features: image-based conversations and real-time voice conversations.

Let's start with the image conversation feature, which has attracted the most attention. According to OpenAI, users can now take a photo of their refrigerator and have ChatGPT recommend recipes, or take a photo of a landmark while traveling and have ChatGPT explain what is interesting about the place. Of course, you can also take a photo of a math problem and have ChatGPT solve it.

In the official example, ChatGPT is given a photo of a bike and asked how to lower the seat. ChatGPT replies that it depends on the bike model: some bikes have a quick-release lever, while others are secured with a bolt. It then gives detailed steps.

The demonstrator then feigned confusion, took a photo of the bolt, circled it with the app's built-in drawing tool for emphasis, and asked ChatGPT whether it was a quick-release lever. ChatGPT said it was a bolt, so an Allen wrench would be needed.

The demonstrator then took a photo of the toolbox and asked ChatGPT which wrench to use. ChatGPT correctly identified the wrench and told the user exactly which size to pick.

ChatGPT can talk!

In addition, OpenAI has bundled speech recognition, transcription, and audio generation into an AI voice chat feature, which is available only in the iOS and Android apps. According to OpenAI, users can have it tell bedtime stories to their children, or, when an argument breaks out over a meal at home, put ChatGPT on the table to settle it.

According to OpenAI, the feature uses Whisper, its open-source speech recognition system, to transcribe what the user says into text. It also uses a new text-to-speech model, and OpenAI worked with professional voice actors to offer five voices for users to choose from.

More advanced AI also brings new risks and limitations

OpenAI says its new speech technology can create realistic synthetic voices from just a few seconds of real speech. This capability opens the door to creative uses, but it also creates new risks, such as criminals impersonating public figures to commit fraud. OpenAI has therefore decided to release the technology only through specific use cases such as voice chat.

At the same time, OpenAI is also working with other organizations. For example, the streaming company Spotify is piloting the technology for a voice translation feature that translates podcast audio into other languages in the hosts' own voices, helping podcasters expand their global reach.

Image input also brings new challenges, such as hallucinations and the risk that users rely on the model's interpretation of images in high-stakes domains. Before launch, OpenAI therefore conducted risk testing in areas such as extremism and scientific proficiency.

For Chinese-speaking readers, the image conversation experience is probably worth looking forward to, but expectations for voice conversation may need to be tempered. OpenAI said the model is proficient at transcribing English text but performs poorly in some other languages, especially those with non-Roman scripts, and it advises non-English users against using ChatGPT for this purpose.