Google's text diffusion experiment
Gemini Diffusion breaks from the standard language-model architecture, with surprising results
On stage at I/O last week, Sundar Pichai called Gemini Diffusion an “experiment.” That is a polite understatement. The new model abandons the token-by-token playbook that has defined every large language model since GPT-2 and instead treats language the way Imagen treats pixels: add noise, then learn to remove it. The result? A research prototype that spits out roughly 1.5K tokens per second, edits itself mid-generation, and still matches Gemini 2.0 Flash-Lite on code accuracy (while generating several times faster).
The model, technically
Gemini Diffusion keeps a vanilla Transformer backbone for representation learning but replaces autoregressive decoding with a discrete denoising process. During inference, the model starts from pure noise and runs roughly 16 reverse-diffusion steps to produce a full block of tokens at once, then repeats for longer outputs (if none of this makes sense to you, hang in there, we’ll get to it below).
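To make that loop concrete, here is a minimal sketch of block-wise reverse diffusion at inference time. It is illustrative only: the 128-token block and the 16 steps come from the description above, while the stubbed `denoiser`, the confidence-based commit schedule, and every name in the code are assumptions rather than anything Google has published.

```python
# Conceptual sketch of block-wise reverse diffusion for text. Illustrative
# only: block size and step count are taken from the article; the denoiser
# stub and the commit schedule are assumptions, not Gemini internals.
import numpy as np

MASK, VOCAB, BLOCK, STEPS = -1, 50_000, 128, 16
rng = np.random.default_rng(0)

def denoiser(block):
    """Stand-in for the Transformer denoiser: predicts every slot in parallel."""
    tokens = rng.integers(0, VOCAB, size=block.shape)   # fake token predictions
    confidence = rng.random(block.shape)                # fake per-slot confidence
    return tokens, confidence

def generate_block():
    block = np.full(BLOCK, MASK)                        # "pure noise": every slot hidden
    for step in range(STEPS):
        tokens, confidence = denoiser(block)
        keep = BLOCK * (step + 1) // STEPS              # commit a growing fraction
        commit = np.argsort(-confidence)[:keep]         # most confident slots first
        block[commit] = tokens[commit]                  # later passes may overwrite
    return block

print(generate_block()[:10])  # one finished 128-token block after 16 passes
```

The key property is visible in the loop: every pass predicts the whole block in parallel, and later passes are free to overwrite what earlier passes committed.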
Google’s own benchmark shows 1,479 tokens per second, with under a second of startup overhead. That is about five times faster than Flash-Lite and roughly 40× GPT-4 Turbo’s public numbers. On HumanEval the model hits 89.6% pass@1, statistically tied with Flash-Lite, and it stays within single-digit deltas on every other listed benchmark except GPQA (science reasoning), where big-context retrieval still matters.
It went live last week as a gated demo with an SDK wait-list. Google calls it a research model, but the keynote ran live coding and Gmail-reply demos without a hiccup.
How text diffusion works
Traditional diffusion adds Gaussian noise to continuous vectors, trains a network to undo that corruption, then samples by starting from pure noise and running the learned denoiser step by step. In layman’s terms: diffusion is teaching the model to clean up a noisy page.
Mess it up. Hide or swap random words so the text looks scrambled.
Fix it. Train the network to guess the original words.
Repeat. Judge success by how closely the cleaned-up page matches the real one.
Image diffusion models have been doing this for years. Gemini’s twist is that the scrambling happens on whole words, not on the raw numbers (vectors) under the hood. That makes the task discrete, not continuous.
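A minimal training-step sketch shows what scrambling whole words looks like in code. It follows the masked-diffusion recipe used in open research such as LLaDA rather than Gemini’s unpublished internals; the `model`, the `MASK_ID`, and the shapes here are placeholders.

```python
# One masked-diffusion training step, mirroring the mess-it-up / fix-it /
# repeat loop above. Placeholder model and mask id; not Gemini's recipe.
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of a special [MASK] token

def training_step(model, token_ids):
    """token_ids: (batch, seq_len) clean text. Returns the denoising loss."""
    batch, seq_len = token_ids.shape
    # Mess it up: draw a corruption level per example, hide that share of words.
    t = torch.rand(batch, 1)
    hidden = torch.rand(batch, seq_len) < t
    noisy = torch.where(hidden, torch.full_like(token_ids, MASK_ID), token_ids)
    # Fix it: the network predicts the original word at every hidden slot.
    logits = model(noisy)                    # (batch, seq_len, vocab)
    loss = F.cross_entropy(logits[hidden], token_ids[hidden])
    # Repeat: the optimizer outside this function nudges the weights.
    return loss
```

Because the corruption level is drawn fresh for every example, one network learns to repair anything from a lightly smudged page to one that is almost entirely blanked out. So far this is standard research-grade text diffusion; what sets Gemini’s version apart comes down to three things.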
Bigger scale. Earlier projects, like Block Diffusion and LLaDA, proved the idea on small models. Gemini pushes past 100B parameters.
Block-wise cleanup. It restores about 128 words at a time, polishing rough drafts inside each block until the sentences flow. Classic left-to-right models fake that with add-ons like beam search.
Self-correction. Every cleanup pass can revisit any word, so a slip-up isn’t final. A conventional model can’t revise once a token hits the cache.
Gemini Diffusion cleans, revises, and polishes text in chunks (quickly, and with built-in error fixing), marking the first large-scale demonstration that discrete diffusion can outpace the old left-to-right routine.
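Here is a deliberately tiny, hard-coded illustration of that self-correction loop: a wrong word committed in an early pass is re-masked and repaired in a later one. The scorer and denoiser are stand-ins invented for the demo; a real model would supply both.

```python
# Toy re-masking loop: low-confidence words get hidden again and re-predicted,
# so a pass-one slip-up ("dog") is not final. Everything here is hard-coded.
MASK = "___"

def confidence(word):
    """Pretend scorer: flags the deliberately wrong word as low-confidence."""
    return 0.1 if word == "dog" else 0.95

def denoise(draft):
    """Pretend denoiser: fills masked slots with the word it now believes in."""
    fixes = {3: "code"}  # assumed correction for this demo
    return [fixes.get(i, w) if w == MASK else w for i, w in enumerate(draft)]

draft = ["Gemini", "Diffusion", "writes", "dog", "fast"]   # slip-up from pass one
for _ in range(2):                                         # later polish passes
    draft = [w if confidence(w) > 0.5 else MASK for w in draft]
    draft = denoise(draft)

print(" ".join(draft))  # -> Gemini Diffusion writes code fast
```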
Why this is different from every LLM you know
As is often mentioned, classic language models behave like an advanced version of your phone’s autocomplete: they guess one word at a time, marching left-to-right and scoring themselves on how wrong each guess is. Gemini Diffusion is nothing like this. It first “smudges” an entire paragraph by hiding or jumbling the words, then teaches itself to restore the clean version in a small, fixed number of polish passes.
Video from @_philschmid
Because every pass can revisit every word, a stray hallucination is just more smudge to wipe away later; no error is permanent. Speed also stops being hostage to length: whether the answer is two sentences or two pages, the model runs the same dozen-ish cleanup rounds per block rather than one forward pass per token. Context stays coherent without decode-time hacks like beam search or speculative decoding.
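A back-of-envelope count of sequential model calls makes the length argument concrete. The arithmetic below simply reuses the figures quoted earlier (128-token blocks, roughly 16 passes per block); it is illustrative, not a measured profile.

```python
# Sequential decoder calls needed to emit n tokens, under the figures above.
def autoregressive_passes(n_tokens):
    return n_tokens                        # one forward pass per token

def diffusion_passes(n_tokens, block=128, steps=16):
    blocks = -(-n_tokens // block)         # ceil division: number of blocks
    return blocks * steps                  # fixed number of passes per block

for n in (50, 500, 5_000):
    print(f"{n:>5} tokens: AR {autoregressive_passes(n):>5} vs diffusion {diffusion_passes(n)}")
# 50 tokens: 50 vs 16 · 500 tokens: 500 vs 64 · 5,000 tokens: 5,000 vs 640
```

Even a 5,000-token answer needs roughly an order of magnitude fewer sequential calls than token-by-token decoding under these assumptions, which is where the latency headroom comes from.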
The payoff may scale better, too. Traditional language models get steadily (yet slowly) better as you feed them more data and parameters. Diffusion’s whole-pass cleanup appears to buy outsized returns: doubling the context or the training data yields more than the usual modest bump in quality, and because the number of cleanup passes stays fixed, latency doesn’t grow along with it.
What this means for the future of AI
Latency war. OpenAI’s GPT-5 reportedly chases five-second response times. Gemini Diffusion delivered that on stage last week. The throughput edge will force every vendor to explore diffusion or hybrid AR-diffusion decoders within the year.
Editing-first workflows. Because the model can refine drafts in place, IDE plug-ins and doc editors can stream a rough answer instantly and watch the model polish it, much the way image diffusion tools like Midjourney progressively refine a rough render. Expect “live red-line” text.
Training data relief. Diffusion’s inherent noise augmentation reduces reliance on ever larger raw corpora. That matters as the web’s high-quality text well runs dry, a problem the Wall Street Journal flagged last year. Synthetic-on-synthetic collapse is still a risk, but diffusion’s self-correction may delay the Habsburg spiral the way score-distillation rescued image models.
Hybrid horizons. The likely endgame is a two-stage stack: an AR planner to sketch a logical scaffold, followed by diffusion passes for coherence and style. Speculative Diffusion Decoding papers already show 3-4x speed-ups in that hybrid regime.
Research agenda shake-up. Perplexity and log-prob cease to be the gold standards; sequence-level energy scores and denoising variance become the new leaderboard metrics. That will rewrite evaluation suites and may finally kill the obsession with top-k sampling hacks.
With Gemini Diffusion, Google has signaled that language modelling is no longer synonymous with autoregression. If the experiment holds up outside the keynote bubble, the diffusion era of text has begun.