OpenAI is developing a bidirectional speech model that adapts instantly when interrupted, making conversations more natural

According to media reports, OpenAI is developing a new speech model designed to make users' conversations with ChatGPT feel more natural and fluid. The core advance is that when a user interrupts while the AI is speaking, the AI can adjust its response in real time instead of stopping abruptly, as it does today.

Currently, ChatGPT's advanced voice mode uses a turn-based dialogue mechanism: the user must finish speaking before the AI processes the audio and generates an answer. If the user interjects a short response such as "okay" or "mm-hm" while the AI is speaking, the system usually just stops talking and cannot carry on the way a person would in a normal conversation.

To solve this problem, the BiDi (bidirectional) speech model that OpenAI is developing continuously processes the user's speech input, so it can adjust its response immediately when interrupted. In contrast, once existing speech models start generating an answer, the output is essentially fixed and cannot change in response to new input.
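
The contrast between the two approaches can be pictured with a toy simulation. The sketch below is purely illustrative and says nothing about how OpenAI's model is actually built: the `Step` type, the `BACKCHANNELS` set, and the `replan` callback are hypothetical names invented for this example, and plain text chunks stand in for audio.

```python
from dataclasses import dataclass

# Hypothetical set of short acknowledgements that should not derail a reply.
BACKCHANNELS = {"okay", "mm-hm", "right", "yeah"}


@dataclass
class Step:
    """One time slice: what the assistant plans to say and what it hears."""
    planned_chunk: str
    heard: str = ""  # empty string means the user stayed silent


def turn_based(steps):
    """Toy turn-based speaker: any user sound halts the reply."""
    spoken = []
    for step in steps:
        if step.heard:
            break  # stops outright, even for a simple "okay"
        spoken.append(step.planned_chunk)
    return spoken


def bidirectional(steps, replan):
    """Toy bidirectional speaker: keeps listening while it speaks."""
    spoken = []
    for step in steps:
        heard = step.heard.strip().lower()
        if not heard or heard in BACKCHANNELS:
            spoken.append(step.planned_chunk)  # carry on as planned
            continue
        # A real interruption: drop the stale plan and respond to the new input.
        spoken.append(replan(heard))
        break
    return spoken


if __name__ == "__main__":
    steps = [
        Step("Your refund"),
        Step("will be issued", heard="okay"),
        Step("within five days."),
    ]
    print(turn_based(steps))
    print(bidirectional(steps, replan=lambda h: f"[revised reply to: {h}]"))
```

In this toy setup, the turn-based speaker goes silent at the first "okay", while the bidirectional one finishes its sentence and only changes course when it hears something that is not a simple acknowledgement.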

The technology is still in development. According to people familiar with the matter, the prototype model has been prone to glitches and has sometimes produced unnatural sounds after several minutes of sustained conversation. OpenAI researchers originally hoped to release BiDi in the first quarter of this year, but the release may now slip to the second quarter or later.

OpenAI believes that if its speech models can approach the performance of its text models, the range of uses for AI will expand further, because most people find it more natural to talk to an AI than to type. The BiDi model may be particularly valuable in customer service scenarios.

For example, if a customer talking to a retailer's AI customer service agent decides partway through the conversation to exchange a product instead of returning it, the BiDi model could in theory let the agent adapt smoothly, without abrupt stops or confusion.

People familiar with the matter also said that the BiDi model is more flexible in calling external tools and applications. OpenAI has previously stated that it plans to improve its voice models for a future AI device that is operated mainly by voice, and that it is considering developing a smart speaker that can check email or book services through voice commands.