OpenAI dropped three new realtime audio models into the API in a single release: GPT-Realtime-2, the company's first voice model with GPT-5-class reasoning; GPT-Realtime-Translate, a live speech-to-speech translator across 70+ input languages and 13 output languages; and GPT-Realtime-Whisper, a streaming speech-to-text model built for low-latency captions and transcription. Together they push the Realtime API past simple turn-taking voice bots and into voice agents that can actually do work while a conversation is happening.
GPT-Realtime-2: Reasoning Comes to Voice
The headline model is GPT-Realtime-2, and the headline change is reasoning. Until now, the realtime voice stack felt fast but shallow—great at carrying a conversation, weak at handling a request that required a second thought. GPT-Realtime-2 closes that gap with adjustable reasoning effort across minimal, low, medium, high, and xhigh tiers, mirroring the dial OpenAI ships on the text models. Default is low, so simple chat stays cheap and snappy; xhigh is on tap for the calls where the model needs to actually think.
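OpenAI hasn't published a final session schema for the new model, but if the effort dial follows the session.update pattern of the existing Realtime API, setting it looks roughly like the sketch below. The gpt-realtime-2 model id and the reasoning.effort field are assumptions, not confirmed parameter names.

```typescript
import WebSocket from "ws";

// Connect the way the existing Realtime API expects; the model id is assumed.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a concise scheduling assistant.",
      // Assumed field name. Default is low; reserve xhigh for calls that
      // genuinely need multi-step thinking, since effort trades off latency.
      reasoning: { effort: "low" }, // minimal | low | medium | high | xhigh
    },
  }));
});
```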
The model also widens the conversational substrate around that reasoning. The context window grows from 32K to 128K tokens, enough for long sessions and complex agentic workflows. Parallel tool calls and audible "tool transparency" let the agent say things like "checking your calendar" or "looking that up now" while it's actually doing the work. New preambles—short phrases like "let me check that" or "one moment while I look into it"—keep the user oriented during longer thinking. And there's stronger graceful-failure behavior: instead of failing silently, the model can now say "I'm having trouble with that right now" and recover the conversation.
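Tool wiring is where the transparency behavior shows up. A sketch of registering a tool on the connection from the example above, using the function schema of the existing Realtime API; parallel_tool_calls is an assumption borrowed from the text models, and check_calendar is a hypothetical tool for illustration.

```typescript
import WebSocket from "ws";

// Register a tool and catch the model's function-call events. The event and
// schema shapes follow the existing Realtime API.
function wireTools(ws: WebSocket) {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      parallel_tool_calls: true, // assumed flag, carried over from text models
      tools: [{
        type: "function",
        name: "check_calendar", // hypothetical tool for illustration
        description: "Look up free/busy slots on the user's calendar.",
        parameters: {
          type: "object",
          properties: { date: { type: "string", description: "ISO 8601 date" } },
          required: ["date"],
        },
      }],
    },
  }));

  // While the agent audibly says "checking your calendar", the actual call
  // arrives as a server event.
  ws.on("message", (raw) => {
    const event = JSON.parse(raw.toString());
    if (event.type === "response.function_call_arguments.done") {
      const args = JSON.parse(event.arguments);
      console.log("check_calendar requested for", args.date);
      // ...run the calendar lookup and return the result to the session...
    }
  });
}
```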
The benchmark deltas back the framing. At high reasoning effort, GPT-Realtime-2 scores 15.2% higher on Big Bench Audio than GPT-Realtime-1.5; at xhigh, it scores 13.8% higher on Audio MultiChallenge for instruction following. Zillow, an early tester, reported a 26-point lift in call success rate on its hardest adversarial benchmark after prompt optimization (95% vs. 69%) and called out the agentic competence and guardrail strength as the things that finally made production voice viable.
GPT-Realtime-Translate: Live Multilingual Conversations
The second model is built for one job: speech-to-speech translation that keeps pace with a live speaker. It covers 70+ input languages and 13 output languages, with simultaneous transcription so the user can read along. The use cases OpenAI is pitching are obvious: customer support, cross-border sales, education, events, media, creator platforms with global audiences. The early-tester numbers are sharp: BolnaAI reported word error rates 12.5% lower than any other model it tested across Hindi, Tamil, and Telugu, while sustaining natural conversational latency.
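OpenAI hasn't documented the translator's session schema yet, so treat the following as a sketch of the likely shape: the model id, the language fields, and the transcription flag are all assumptions; only the connection and event plumbing mirror the existing Realtime API.

```typescript
import WebSocket from "ws";

// Hypothetical translation session: auto-detected speech in, German out,
// with the read-along transcript streamed alongside the audio.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate", // assumed id
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      input_language: "auto",           // assumed field
      output_language: "de",            // assumed field
      transcription: { enabled: true }, // assumed flag for the transcript
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Transcript deltas use this event name in the current Realtime API.
  if (event.type === "response.audio_transcript.delta") {
    process.stdout.write(event.delta); // feed the read-along UI
  }
});
```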
Vimeo demoed the model translating a product education video live as it played, so global customers could hear updates in their preferred language without waiting for a separately produced version. Deutsche Telekom is using it for multilingual customer support so callers can speak in the language they're most comfortable in while the model translates both directions in real time.
GPT-Realtime-Whisper: Streaming Transcription
The third model is a pure streaming speech-to-text engine optimized for low-latency captions and live transcripts. The targeted workflows: captions for meetings, classrooms, live broadcasts, and events; in-progress meeting notes; and the ASR layer underneath voice agents that need to understand the user continuously rather than waiting for end-of-turn. It's the unsexy plumbing model in the lineup, and it's the one most existing Whisper-based pipelines will probably swap in first.
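The current Realtime API already has a transcription-session mode, and the plumbing below follows it, so swapping in the new model should be roughly a one-line change. The gpt-realtime-whisper model id is the assumption here.

```typescript
import WebSocket from "ws";

// Live caption feed. The transcription-session shape exists in today's
// Realtime API; only the model id is assumed from the announcement.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?intent=transcription",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "transcription_session.update",
    session: {
      input_audio_transcription: { model: "gpt-realtime-whisper" }, // assumed id
    },
  }));
  // Then stream microphone audio as base64 PCM16 chunks:
  // ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: chunk }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Partial transcripts stream continuously; no waiting for end-of-turn.
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    process.stdout.write(event.delta); // render as a live caption
  }
});
```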
Three New Voice Patterns
OpenAI framed the launch around three patterns it's seeing developers build. Voice-to-action: people describe what they need and the agent reasons through it, calls tools, and completes the task—Zillow's example was "find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday." Systems-to-voice: software turns context into proactive spoken guidance—a travel app saying "your flight is delayed but you can still make your connection, here's the new gate." Voice-to-voice: conversations continue across languages and changing context. Priceline is working toward a future where travelers manage entire trips by voice, including searching, rebooking, and translating once on the ground.
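The voice-to-action pattern only closes when the tool's result gets back into the conversation so the agent can speak it. The existing Realtime API does this with a function_call_output item followed by a response.create; a sketch, with a hypothetical schedule_tour result standing in for Zillow's example.

```typescript
import WebSocket from "ws";

// Hand a finished tool result back to the session, then ask the model to
// respond with it in context. These event shapes exist in today's Realtime
// API; the schedule_tour payload is a hypothetical example.
function returnToolResult(ws: WebSocket, callId: string) {
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId,
      output: JSON.stringify({ confirmed: true, tour: "Saturday 10:00 AM" }),
    },
  }));
  // Trigger the spoken follow-up: "You're booked for Saturday at 10."
  ws.send(JSON.stringify({ type: "response.create" }));
}
```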
Pricing and Availability
All three models are live in the Realtime API today. GPT-Realtime-2: $32 per 1M audio input tokens ($0.40 per 1M cached) and $64 per 1M audio output tokens. GPT-Realtime-Translate: $0.034 per minute. GPT-Realtime-Whisper: $0.017 per minute. The Realtime API supports EU Data Residency and is covered by OpenAI's enterprise privacy commitments. There's a Playground for hands-on testing and a Codex prompt that scaffolds a minimal WebRTC voice agent against the new model in one click.
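Translate and Whisper bill per minute, so budgeting is trivial; GPT-Realtime-2 bills by audio token, which takes a little arithmetic. A back-of-envelope sketch using the published rates; the tokens-per-minute figure is an assumption for illustration, so measure your own sessions.

```typescript
// Published rates for GPT-Realtime-2 (uncached input).
const INPUT_PER_M = 32;  // $ per 1M audio input tokens
const OUTPUT_PER_M = 64; // $ per 1M audio output tokens

function estimateCallCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_PER_M + (outputTokens / 1e6) * OUTPUT_PER_M;
}

// A 10-minute call at an assumed ~600 audio tokens/minute in each direction:
console.log(estimateCallCost(6_000, 6_000).toFixed(3)); // ≈ $0.576
```

For comparison, the same ten minutes at the flat per-minute rates comes to $0.34 on Translate and $0.17 on Whisper.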
Why It Matters for Web Developers
Voice as an interface has been the perpetual "next year" of consumer software for a decade, mostly because the underlying models couldn't do what the demos promised. GPT-Realtime-2 is the first version of this stack where the agent can hold a conversation, reason through a multi-step request, run tools in parallel, recover from interruptions, and modulate its tone—all without dropping out of the realtime loop. For web developers building support bots, scheduling agents, accessibility tools, education platforms, or any UI that benefits from being faster to talk to than to click through, this is the version of the API that's worth a real prototype.