Handy AI

Model Drop: Gemini Omni

Google expands video generation with omnidirectional input and higher fidelity

Jake Handy
May 19, 2026

Today at Google I/O: Gemini Omni, the “create anything from any input” multimodal family Google DeepMind shipped alongside Gemini 3.5 Flash. This is the first model in the Gemini series whose primary output is video instead of text, and it looks like a change of direction away from Veo.

Model: Gemini Omni (the family). Available today: Gemini Omni Flash. Teased: Omni Pro, no ship date.

Model type: Multimodal input (text, image, audio, video) → video output as the launch capability. Image and audio outputs “in time,” per Google’s own framing. Designed as a single model that collapses the Gemini-intelligence stack and the Veo / Nano Banana generative stack into one.

Ship date: May 19, 2026

Maker: Google DeepMind

Pricing: No per-token or per-second public pricing at launch.

Available on: Gemini app (across AI Plus / Pro / Ultra tiers, globally). Google Flow (Google’s AI creative studio). YouTube Shorts and YouTube Create app (free tier, this week). API on the Gemini API and Vertex AI “in the coming weeks.”

Headline benchmarks: None published. Google did not put Omni on the Artificial Analysis Video Arena leaderboard at launch, where Seedance 2.0 currently leads at 1,269 Elo (text-to-video) and 1,351 Elo (image-to-video), ahead of Kling 3.0, Veo 3.1, and Sora 2. No FVD or human-preference numbers in the launch materials.

Other info: 10-second initial video duration cap (Google says it’s a product decision, not a model limit; longer durations “soon”). All outputs carry an imperceptible SynthID digital watermark, verifiable through the Gemini app, Chrome, and Google Search. Digital avatar feature requires user voice-sample verification before generation. Audio/speech editing capabilities held back from launch “to bring this capability responsibly.” Google ran automated red-teaming on Omni Flash before release; no published system card.

More details: Introducing Gemini Omni (Google blog)

What shipped

Google is framing Gemini Omni Flash as the first model in a new family that collapses generation across modalities into a single system. The pitch from Hassabis on stage was direct: “[Omni] will eventually be able to create any output from any input.” Today that means video out from any combination of text, image, audio, and video in, with image and audio outputs from the same model to follow in the near future.

Basically, Google is folding Veo 3.1 and Nano Banana (its generative image model) into a single multimodal generator that also handles editing in the same conversation.

Prompt: A marble rolling fast on a chain reaction style track, continuous smooth shot.

Google offered only demos, no benchmarks. It showed a chalkboard math proof generated with legible handwritten text (the AI-video text-rendering problem that’s broken every previous model, now apparently solved), a stop-motion claymation explainer about protein folding generated from a single prompt, multi-character cinematic scenes with consistent gaze and timing across cuts, and conversational edits that swap props, change backgrounds, and remove watermarks via plain text.

What’s new

Omni is framed as an architectural reset. The new capabilities don’t have direct analogues in the industry currently.

  • One model, every input modality. Text, image, audio, and video all in. Output is video at launch, image and audio “in time.” This is the first frontier-tier model where image and video generation share a single weight stack instead of two separately-trained models stitched at the product layer.

  • Conversational editing as a first-class output. You don’t render a video and then edit it in a timeline tool; you edit it in the same chat where you generated it. Swap a prop, change the lighting, remove a watermark, or switch the character’s outfit, all from text.

  • Text rendering. The chalkboard demo shows clean handwritten text in English, plus rendered text in Chinese, Japanese, and Korean. If Omni’s text rendering holds up in real-world output (and not just the demos Google picked), that’s where the model will shine.
