Most of OpenAI's changes to ChatGPT involve the capabilities of the AI bot: the questions it can answer, the information it can access, and improved underlying models. This time, though, the company is tweaking the way you interact with ChatGPT itself. It is launching a new version of its service that lets you prompt the AI bot not just by typing sentences into a text box, but also by speaking out loud or uploading a picture.
According to OpenAI, the new features will roll out to paying ChatGPT subscribers within the next two weeks, with other users getting them "soon after."
The voice chat part feels very familiar: you tap a button and speak your question; ChatGPT converts the speech to text, feeds it to a large language model, gets a response back, converts that response to speech, and reads the answer out loud. It feels like talking to Alexa or Google Assistant, only OpenAI hopes the answers will be better thanks to improvements in the underlying technology. Most virtual assistant makers now appear to be rebuilding their products around large language models, but OpenAI is further along than most.
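The loop described above (speech in, text, LLM response, speech out) can be sketched as a simple pipeline. This is a hypothetical illustration, not OpenAI's actual implementation: the three model calls are stubbed placeholders standing in for a speech-to-text model (such as Whisper), a large language model, and a text-to-speech model.

```python
# Hypothetical sketch of a ChatGPT-style voice-chat turn.
# Each stub stands in for a real model call in the pipeline.

def transcribe(audio: bytes) -> str:
    """Stub for a speech-to-text model such as Whisper."""
    return "What is the tallest mountain?"

def ask_llm(question: str) -> str:
    """Stub for a large language model generating a text answer."""
    return f"Answer to: {question}"

def synthesize(text: str) -> bytes:
    """Stub for a text-to-speech model; returns audio bytes."""
    return text.encode("utf-8")

def voice_chat_turn(audio_in: bytes) -> bytes:
    """One full turn of voice chat: spoken question in, spoken answer out."""
    question = transcribe(audio_in)   # speech -> text
    answer = ask_llm(question)        # text -> LLM response
    return synthesize(answer)         # response -> speech
```

The point of the sketch is that voice is a thin layer on both ends of an ordinary text exchange; the quality of the experience still rests on the transcription and speech models bracketing the LLM.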
OpenAI's excellent Whisper model does much of the speech-to-text work, and the company is launching a new text-to-speech model that it says can "generate human-like audio from text and a few seconds of speech samples." You can choose ChatGPT's voice from five options, but OpenAI seems to think the model has potential for much more than that. For example, OpenAI is working with Spotify to translate podcasts into other languages while preserving the podcaster's own voice. There are many interesting uses for synthesized speech, and OpenAI could become an important part of that industry.
But the fact that it takes just a few seconds of audio to build a capable synthetic voice also opens the door to a variety of problematic use cases. "These features also bring new risks, such as the possibility of malicious actors impersonating public figures or committing fraud. It is for this reason that OpenAI is not using this model broadly: it will be more tightly controlled and limited to specific use cases and partnerships," the company said in a blog post announcing the new features.
Image search, meanwhile, works a bit like Google Lens. You take a photo of whatever interests you, and ChatGPT tries to work out what you're asking about and responds accordingly. You can also use the app's drawing tools to make your question clearer, or speak or type questions alongside the picture. This is where ChatGPT's back-and-forth nature helps: rather than running a search, getting the wrong answer, and searching again, you can prompt the bot and refine the answer as you go. (This is similar to what Google is doing with multimodal search.)
Obviously, image search has its own potential problems. One is what might happen when you prompt a chatbot with a photo of a person: OpenAI says it has deliberately limited ChatGPT's ability to analyze people and make direct statements about them, citing accuracy and privacy concerns. That means one of the most sci-fi visions of artificial intelligence, pointing a camera at someone and asking "Who's that?", won't be coming to fruition anytime soon. And maybe that's a good thing.
Nearly a year after ChatGPT was first released, OpenAI still seems to be figuring out how to give its bot more features and capabilities without introducing new problems and drawbacks. With these releases, the company has tried to thread that needle by deliberately limiting what its new models can do. But that approach won't work forever. As more people use voice control and image search, and as ChatGPT moves closer to being a genuinely useful multimodal virtual assistant, those guardrails will become increasingly difficult to maintain.