OpenAI today released three new real-time speech models, aiming to "unlock a new generation of speech applications" for developers. The three models target distinct scenarios: reasoning-driven dialogue, real-time translation, and real-time transcription.

According to OpenAI, the new series comprises three models: GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper. GPT‑Realtime‑2 is positioned as the first speech model with GPT‑5-level reasoning capabilities, able to handle complex requests and carry conversations forward more naturally. According to the official introduction, the model is purpose-built for real-time voice interaction: when users ask questions or give instructions, it can reason while keeping the conversation coherent, and it can also call tools, handle user interruptions and corrections, and respond appropriately to the situation at hand.
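Tool calling in such a session might be configured along these lines. Everything below is an illustrative assumption modeled on OpenAI's existing function-calling JSON schema conventions: the tool name, its parameters, and the session fields are hypothetical, and the exact GPT‑Realtime‑2 session format should be checked against the Realtime API reference.

```python
import json

# Hypothetical tool definition in OpenAI's function-calling style.
# Neither the tool name nor the schema comes from the announcement.
weather_tool = {
    "type": "function",
    "name": "get_weather",  # hypothetical tool name
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

# Sketch of a session configuration handing the tool to the model.
# The model identifier is taken from the announcement; the actual API id may differ.
session_config = {
    "model": "gpt-realtime-2",
    "tools": [weather_tool],
}

print(json.dumps(session_config, indent=2))
```

During a conversation, the model would decide on its own when a user request calls for the tool, pause to invoke it, and fold the result back into its spoken reply.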

The second model, GPT‑Realtime‑Translate, focuses on real-time translation, supporting "more than 70 input languages and 13 output languages" while keeping pace with the speaker. In scenarios such as cross-language calls, meetings, and live broadcasts, the model is therefore expected to deliver an experience closer to "simultaneous interpretation".

The third model, GPT‑Realtime‑Whisper, is a streaming speech transcription model focused on low-latency speech-to-text. OpenAI says the model transcribes speech as it is spoken, making real-time products feel faster, more responsive, and more natural. Live captions that appear as the words are spoken, and meeting notes that keep pace with the discussion, are cited as the primary applications for GPT‑Realtime‑Whisper.
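Conceptually, a client for a streaming transcription model consumes small text deltas and appends them to a running transcript while the speaker is still talking. The sketch below simulates that loop with hard-coded deltas; the delta format is an assumption for illustration only, not the actual GPT‑Realtime‑Whisper wire format.

```python
# Simulated streaming transcription: the model emits partial text chunks
# ("deltas") with low latency, and the client accumulates them so captions
# can be rendered before the utterance is finished.
deltas = ["Hel", "lo, ", "every", "one. ", "Let's ", "get ", "started."]

transcript = ""
for delta in deltas:
    transcript += delta
    # A real client would render `transcript` immediately as a live caption
    # on each iteration, rather than waiting for the final result.

print(transcript)  # -> Hello, everyone. Let's get started.
```

The point of the pattern is that every intermediate state of `transcript` is already usable, which is what makes "subtitles while speaking" feel instantaneous.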

In terms of access and pricing, OpenAI says all three models are available through its Realtime API. GPT‑Realtime‑2 is priced at $32 per 1 million audio input tokens ($0.40 per 1 million cached input tokens) and $64 per 1 million audio output tokens. GPT‑Realtime‑Translate costs $0.034 per minute, and GPT‑Realtime‑Whisper costs $0.017 per minute.
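For a rough sense of what these prices mean in practice, the sketch below turns the quoted figures into per-session estimates. The token counts in the example are illustrative assumptions, since the announcement does not say how many tokens a minute of audio consumes.

```python
# Prices as quoted in the announcement (USD).
INPUT_PER_M = 32.00         # per 1M audio input tokens (GPT-Realtime-2)
CACHED_INPUT_PER_M = 0.40   # per 1M cached audio input tokens
OUTPUT_PER_M = 64.00        # per 1M audio output tokens
TRANSLATE_PER_MIN = 0.034   # GPT-Realtime-Translate
WHISPER_PER_MIN = 0.017     # GPT-Realtime-Whisper

def realtime2_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of a GPT-Realtime-2 session from token counts."""
    return (input_tokens * INPUT_PER_M
            + cached_tokens * CACHED_INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# Example: 100k input + 50k output tokens (purely hypothetical counts).
session = realtime2_cost(100_000, 50_000)  # -> 6.4

# The per-minute models scale linearly with duration.
hour_of_translation = 60 * TRANSLATE_PER_MIN    # -> 2.04
hour_of_transcription = 60 * WHISPER_PER_MIN    # -> 1.02

print(session, hour_of_translation, hour_of_transcription)
```

At these rates, an hour of continuous translation runs about $2 and an hour of transcription about $1, which hints at the product categories OpenAI expects these per-minute models to serve.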

OpenAI says developers can try the new models directly in the Playground. Developers who already have Codex installed can submit a provided prompt to add GPT‑Realtime‑2 to an existing application or quickly scaffold a new one around the model. OpenAI's website also details the technical underpinnings of the three models and how some partner companies have used them in production.

As generative AI continues to evolve toward multimodality and real-time interaction, the three new speech models mark another significant step by OpenAI in the direction of voice intelligence. With reasoning, translation, and transcription unified in one family, developers should find it easier to deliver on-demand voice AI experiences to users. From assistant tools and productivity applications to content creation and accessibility services, the release is expected to spark a new round of exploration and innovation.