Model Drop: GPT-Realtime-2
OpenAI improves realtime voice modeling across speech, translation, and transcription
Seventh edition of Model Drop. Today: GPT-Realtime-2 and the two companion voice models OpenAI shipped through the Realtime API on May 7, 2026. First voice-native release in the series, and the first OpenAI launch where the headline model and its sidekicks are all audio-first.
The Specs
Model: GPT-Realtime-2 (gpt-realtime-2 on the OpenAI API), plus two companion models: GPT-Realtime-Translate (gpt-realtime-translate) and GPT-Realtime-Whisper (gpt-realtime-whisper).
Model type: Speech-to-speech multimodal. gpt-realtime-2 accepts text, audio, and image input and outputs text and audio. gpt-realtime-translate is streaming speech-to-speech translation. gpt-realtime-whisper is streaming speech-to-text only.
Ship date: May 7, 2026
Maker: OpenAI
Pricing: gpt-realtime-2 runs $32 / $64 per million audio input / output tokens, with cached audio input at $0.40 per million. Text input / output on the same model is $4 / $24 per million, image input is $5 per million. Roughly $1.15 / $4.61 per hour of conversation at default settings, unchanged from gpt-realtime-1.5. gpt-realtime-translate is $0.034 per minute. gpt-realtime-whisper is $0.017 per minute.
Available on:
OpenAI’s Realtime API over WebSocket, WebRTC, and SIP
gpt-realtime-2 also runs on the Chat Completions endpoint and inside the OpenAI Agents SDK
ChatGPT Voice Mode has NOT been upgraded as of launch
Headline benchmarks: gpt-realtime-2 scores 96.6% on Big Bench Audio, up from 81.4% on gpt-realtime-1.5 (15.2-point jump, new SOTA). On Audio MultiChallenge, it scores 48.5% vs 34.7% on gpt-realtime-1.5 (13.8-point jump). Scale AI’s instruction-retention pass rate on the same suite went from 36.7% to 70.8%. Time-to-first-audio is 1.12s at minimal reasoning effort, 2.33s at high. Launch-partner numbers: Zillow’s voice agent call-success rate jumped from 69% to 95% (26-point lift) after swapping to gpt-realtime-2; Glean reports a 42.9% relative helpfulness lift; Genspark reports a 26% increase in effective conversation rate. gpt-realtime-whisper claims ~90% fewer hallucinations than Whisper v2 and ~70% fewer than gpt-4o-transcribe. gpt-realtime-translate reports a 12.5% word error rate reduction on Hindi, Tamil, and Telugu versus the prior speech-translation stack, per BolnaAI’s deployment.
Other info: 128K context window (up from 32K on gpt-realtime-1.5), 32K max output. Five reasoning effort levels: minimal, low, medium, high, xhigh (low is the default). Knowledge cutoff September 30, 2024. Two new exclusive voices in the Realtime API, Cedar and Marin, plus a quality refresh on the eight legacy voices (alloy, ash, ballad, coral, echo, sage, shimmer, verse). Parallel tool calls with audible preambles (“one moment, checking your calendar”) so the model can narrate while it works. gpt-realtime-translate covers 70+ input languages and 13 output languages with simultaneous interpretation. gpt-realtime-whisper is streaming-only and ships against the same 70+ language footprint. No fine-tuning, no predicted outputs, no text streaming on Chat Completions yet. System card published with the launch; safety classification stays at Medium under OpenAI’s Preparedness Framework. No watermarking on output audio.
More details: Advancing voice intelligence with new models in the API (OpenAI)
What shipped
OpenAI released three new voice models on Thursday, May 7, and pitched the headline one as “our first voice model with GPT-5-class reasoning that can handle harder requests and carry the conversation forward naturally.” gpt-realtime-2 is a native speech-to-speech model, not a stitched pipeline of separate STT, LLM, and TTS components. It quadruples the context window (32K to 128K), adds the same five reasoning effort levels OpenAI shipped on GPT-5.5 three weeks ago, and gains parallel tool calls with spoken preambles so the agent can say “let me pull that up” out loud while it executes a function. Alongside gpt-realtime-2, OpenAI shipped gpt-realtime-translate for simultaneous live translation across 70+ input languages, and gpt-realtime-whisper for streaming low-latency transcription. All three are live in the Realtime API today over WebSocket, WebRTC, and SIP, which means a developer can point the model at a real phone number and let it pick up.
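To make the “point it at a phone number” claim concrete, here is a minimal sketch of opening a Realtime session with gpt-realtime-2 over WebSocket. It assumes the existing Realtime API handshake (wss://api.openai.com/v1/realtime plus the realtime=v1 beta header) and event names like session.update and response.create carry over unchanged to the new model; treat the session field names as assumptions, not the published schema.

```python
# Minimal sketch: open a gpt-realtime-2 session over raw WebSocket.
# Event names follow today's Realtime API and are assumed to carry over.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # websockets >= 14 uses additional_headers; older releases call it extra_headers.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Configure the session, then ask the model to speak.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "marin", "modalities": ["audio", "text"]},
        }))
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"instructions": "Greet the caller and ask how you can help."},
        }))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())
```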
The pitch the benchmark card is built to support: voice agents that can actually reason while they talk, not the 2024 trade-off between fluency and intelligence. Big Bench Audio at 96.6% is a 15.2-point jump on gpt-realtime-1.5 and the new SOTA for native speech-to-speech reasoning. Audio MultiChallenge at 48.5% (up from 34.7%) measures multi-turn instruction following under realistic speech conditions; Scale AI’s pass rate on the same suite nearly doubled. The launch-partner case studies are unusually load-bearing for an OpenAI release: Zillow’s call-success rate going from 69% to 95% is a 26-point swing on a production deployment, and Glean’s 42.9% helpfulness lift is the kind of number procurement teams will quote in renewal conversations. The visible catches: pricing per minute is identical to gpt-realtime-1.5 (~$1.15 / $4.61 per hour input / output), so anyone hoping the model would get cheaper as it got smarter will have to keep waiting. ChatGPT Voice Mode has NOT been upgraded to gpt-realtime-2 at launch, which Simon Willison flagged within hours. The model also still doesn’t support fine-tuning, predicted outputs, or text streaming on Chat Completions, which limits the production patterns developers can build around it on day one.
What’s new
gpt-realtime-2 is a clean break from the audio model that came before it. The prior generation was a speech wrapper around GPT-4o-era reasoning; this one is built around the GPT-5 family’s reasoning stack with a 4x context expansion and a new audio decoder.
GPT-5-class reasoning in a voice loop. This is the first OpenAI voice model that exposes the same reasoning effort levels (minimal, low, medium, high, xhigh) the text models have shipped with since GPT-5. Big Bench Audio at 96.6% is the headline, but the load-bearing change is that the model can reason about ambiguous user requests during a phone call without the latency-vs-intelligence tradeoff that has hobbled voice agents for two years. Time-to-first-audio runs 1.12s at minimal effort and 2.33s at high.
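As a sketch of how the effort dial might look in practice: the snippet below assumes the reasoning effort rides on the session object the same way the text models expose it in their API. The exact field name is a guess, since the launch post does not publish the Realtime schema for it.

```python
# Hypothetical session.update payload: the "reasoning" field name is an assumption,
# mirroring how the text-model API exposes effort levels.
import json

session_update = {
    "type": "session.update",
    "session": {
        # minimal ~1.12s time-to-first-audio, high ~2.33s, per the launch numbers
        "reasoning": {"effort": "minimal"},
    },
}
print(json.dumps(session_update, indent=2))
```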
Parallel tool calls with spoken preambles. The model can call multiple tools in parallel mid-conversation and narrate what it’s doing while it works (“one moment, I’m checking your account and pulling up the schedule”). That kills the dead-air problem that made every prior voice agent feel broken. Combined with native MCP server support and the Agents SDK, it’s the first OpenAI voice release that’s actually wired for production agent work, not demos.
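A rough sketch of the tool-call loop, using the current Realtime API’s function-calling events (response.function_call_arguments.done, function_call_output items) and assuming they carry over unchanged to gpt-realtime-2; check_calendar is a made-up example function, not anything OpenAI ships.

```python
import json

# Tool schema registered on the session; check_calendar is a hypothetical example.
TOOLS = [{
    "type": "function",
    "name": "check_calendar",
    "description": "Look up the caller's appointments for a given date.",
    "parameters": {
        "type": "object",
        "properties": {"date": {"type": "string", "description": "ISO 8601 date"}},
        "required": ["date"],
    },
}]

def handle_event(event: dict, send) -> None:
    """Feed every server event here; `send` pushes a JSON string back to the socket."""
    # While this fires, the model is already speaking its preamble in parallel.
    if event.get("type") == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = {"date": args["date"], "appointments": []}  # call your real backend here
        send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        # Ask the model to speak the tool result back to the caller.
        send(json.dumps({"type": "response.create"}))
```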
128K context window in a session. Up from 32K on gpt-realtime-1.5. A four-hour customer-service call can sit in context without the model losing the early turns. Sam Altman noted on the launch that voice has become the modality users reach for when they need to “dump” a lot of context into a system, and the 4x context expansion is the architectural answer.
Two new voices, Cedar and Marin, exclusive to the Realtime API. The eight legacy voices got a quality refresh on the new audio stack. Cedar is the warm mid-range male voice; Marin is a brighter female voice. Both ship at the same per-minute price as the rest of the lineup.
A live translation model and a separate streaming Whisper model. gpt-realtime-translate does simultaneous interpretation across 70+ input languages and 13 output languages at $0.034 per minute, and BolnaAI reported a 12.5% word-error-rate reduction on Hindi, Tamil, and Telugu versus their prior stack. gpt-realtime-whisper is a streaming speech-to-text model at $0.017 per minute, with claimed ~90% fewer hallucinations than Whisper v2 and ~70% fewer than gpt-4o-transcribe. Splitting the workloads (S2S, translation, STT) into three purpose-built models is new for OpenAI; prior generations crammed all three into one endpoint.
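For a feel of the per-minute economics, here is a trivial back-of-envelope calculator using only the launch prices quoted above; the eight-hours-a-day workload is an arbitrary example, not a benchmark figure.

```python
# Published per-minute rates for the two companion models.
TRANSLATE_PER_MIN = 0.034   # gpt-realtime-translate
WHISPER_PER_MIN = 0.017     # gpt-realtime-whisper

def monthly_cost(minutes_per_day: float, rate_per_min: float, days: int = 30) -> float:
    return minutes_per_day * rate_per_min * days

# Example workload: 8 hours of live audio per day.
print(f"translate: ${monthly_cost(8 * 60, TRANSLATE_PER_MIN):,.2f}/month")
print(f"whisper:   ${monthly_cost(8 * 60, WHISPER_PER_MIN):,.2f}/month")
```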
How and where to use it
Where it’s available:
Realtime API over WebSocket, WebRTC, and SIP for all three models
gpt-realtime-2 is also available on the Chat Completions endpoint for non-streaming use (no text streaming yet) and runs natively inside the OpenAI Agents SDK with remote MCP server support (see the sketch below)
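A minimal sketch of the Chat Completions path, assuming gpt-realtime-2 accepts the same modalities and audio parameters that gpt-4o-audio-preview does today; the voice name comes from the launch post, everything else is the existing audio-out request shape.

```python
# Sketch: non-streaming audio output over Chat Completions, assuming the
# gpt-4o-audio-preview request shape carries over to gpt-realtime-2.
import base64

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-realtime-2",
    modalities=["text", "audio"],
    audio={"voice": "cedar", "format": "wav"},
    messages=[{"role": "user", "content": "Summarize my last three support tickets."}],
)

# The spoken reply arrives base64-encoded; there is no text streaming on this path yet.
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(completion.choices[0].message.audio.data))
```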
What it’s good at:
Production voice agents that handle real customer-service workflows where the model has to reason, call tools, and not lose the thread of a multi-turn call.
Phone-number-attached agents over SIP (this is the first OpenAI release where you can wire the model to a real phone line without a third-party bridge).
Multilingual support workflows where the agent has to swap languages mid-call or interpret simultaneously (Translate covers 70+ input languages).
Real-time transcription where Whisper-v2-style hallucinations would have been disqualifying (medical scribing, legal depositions, courtroom captioning, classroom captioning); see the transcription sketch after this list.
Long calls that need full context retention (the 128K window is the architectural unlock for four-hour technical-support calls or therapy intake sessions where the early turns matter).
Domain-specific voice agents where terminology retention used to break the prior models, per OpenAI’s tone-and-jargon adjustment claims.
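For the transcription use case above, a sketch of a streaming session against gpt-realtime-whisper, assuming the Realtime API’s existing transcription intent and event names (transcription_session.update, input_audio_buffer.append) carry over to the new model; the model string is from the launch post, the rest is today’s API shape.

```python
# Sketch: streaming transcription with gpt-realtime-whisper over the Realtime API's
# transcription intent, assuming today's handshake and event names carry over.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?intent=transcription"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def transcribe(pcm_chunks):
    """pcm_chunks: an iterable of base64-encoded 16-bit PCM audio frames."""
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        await ws.send(json.dumps({
            "type": "transcription_session.update",
            "session": {"input_audio_transcription": {"model": "gpt-realtime-whisper"}},
        }))
        for chunk in pcm_chunks:
            await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": chunk}))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"].endswith("input_audio_transcription.completed"):
                print(event["transcript"])

# asyncio.run(transcribe(my_chunks))  # supply your own audio frames
```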
What it’s bad at / shouldn’t be used for:
Anything where price-per-minute is the binding constraint, because pricing did not move from gpt-realtime-1.5 even though intelligence jumped. At $1.15 / $4.61 per hour, a 20-minute support call on the model runs roughly $0.40 to $0.80 in inference alone (see the cost sketch after this list), which is fine for high-margin verticals (real estate, fintech, healthcare) and prohibitive for low-margin ones (food delivery, ride-hailing, retail self-service).
Consumer-facing voice apps that expect ChatGPT-app-grade quality, because Voice Mode itself isn’t on gpt-realtime-2 yet and won’t be for an indeterminate window.
Workloads that need fine-tuning, predicted outputs, or text streaming on Chat Completions, because none of those are supported.
Languages outside the 13-output set on Translate, where you’ll fall back to the older speech-translation stack.
Production transcription where consistency across long sessions matters; early developer reports flag the model unexpectedly switching to other languages mid-stream.
Anything safety-critical without belt-and-suspenders monitoring on accidental wake-up and barge-in behavior, which the system card flags as a known voice-specific risk.
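The cost sketch referenced in the pricing item above: a parameterized calculator built on the $1.15 / $4.61 per-hour rates from the spec block. The talk-time split is an assumption you should measure on your own traffic, and actual billing is per audio token rather than per wall-clock hour.

```python
# Per-hour conversation rates from the spec block above.
INPUT_PER_HOUR = 1.15    # caller audio in
OUTPUT_PER_HOUR = 4.61   # model audio out

def call_cost(call_minutes: float, caller_talk_share: float = 0.7) -> float:
    """Rough inference cost of one call; the talk-time split is an assumption."""
    caller_min = call_minutes * caller_talk_share
    model_min = call_minutes * (1 - caller_talk_share)
    return caller_min / 60 * INPUT_PER_HOUR + model_min / 60 * OUTPUT_PER_HOUR

print(f"20-minute call: ~${call_cost(20):.2f}")  # ~$0.73 with the default split
```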
First impressions
Kyle Windland built a multi-tool agent on gpt-realtime-2 within hours of launch and posted what is probably the cleanest practitioner take on the Latent Space rundown:
“First OpenAI speech-to-speech model good enough for real work.”
That framing matters because Windland builds production voice agents for a living, and the prior generations got dismissed as demoware. The benchmark gain from 81.4% to 96.6% on Big Bench Audio is the kind of jump that maps to “the agent stops failing on the third turn,” and the parallel-tool-calls-plus-preamble pattern is what closes the gap between voice-demo quality and voice-product quality. If practitioners are saying this on day one, it’s a real shift.
Artificial Analysis picked up the cost angle around the launch, the same one Vignesh Bhat called “Total realtime victory” on X:
Pricing per minute didn’t change even though Big Bench Audio jumped 15 points, Audio MultiChallenge jumped 14 points, and the context window quadrupled. On a per-minute basis, gpt-realtime-2 at default settings is a meaningful intelligence-per-dollar lift over gpt-realtime-1.5. For the voice-agent market that’s been waiting for a sub-$5-per-hour speech-to-speech model that can actually reason, this is the release.
The MarkTechPost writeup framed the three-model split as the structural change, not the benchmark numbers:
“Splitting the workloads (speech-to-speech, translation, streaming STT) into three purpose-built models lets each one specialize.”
Specialization is the move every other API provider has been waiting for OpenAI to make. Gemini Live API has been one-model-fits-all since launch; ElevenLabs ships separate models per workload and dominates on TTS quality. By splitting Translate and Whisper out, OpenAI gives developers a per-workload best-tool option and undercuts ElevenLabs Scribe and Deepgram Nova-3 on per-minute pricing in one move.
Jake’s take
The 26-point call-success swing Zillow reported (69% to 95%) is the kind of number that, if it generalizes even half as well to your domain, kills the case for staffing first-line voice support with people. Same with Glean’s 42.9% helpfulness lift on internal support questions. The parallel-tool-calls-with-preambles pattern is a true architectural unlock. The agent saying "one moment, checking your appointments" out loud while it executes a function means these systems will start feeling like actual agents and less like a chatbot reading a script.
Unfortunately, ChatGPT Voice Mode is not on gpt-realtime-2 at launch, which means the hundred-plus million people who actually experience OpenAI voice every day get nothing (for now). The pricing didn’t move, which is the OpenAI signature on every recent release: smarter at the same price (never cheaper). For low-margin voice workloads (food delivery support, retail self-service, ride-hailing), Deepgram Nova-3 and ElevenLabs Conversational AI are still the cost-floor options.
My read: gpt-realtime-2 is the new default for English-language production voice agents in high-margin verticals, and the GPT-Realtime-Translate model is going to eat ElevenLabs Scribe’s lunch in live-translation workflows. gpt-realtime-whisper is a sidegrade. The consumer story is on hold until Voice Mode catches up.