Handy AI

GPT-5.6 rumors stir as Anthropic pushes a more “honest” Opus 4.8

Jake Handy — Mon, 01 Jun 2026 15:21:16 GMT

🤔 “Why are there so many new models?”

Share Handy AI with your coworkers and friends to help them understand the crazy world of modern artificial intelligence and make the right decisions.

Share Handy AI

what to know for now

🤖 GPT-5.6 is leaking out of OpenAI’s own logs ahead of a rumored June launch. A gpt-5.6 identifier briefly surfaced in Codex traces, internal codenames (ember-alpha, beacon-alpha, iris-alpha) are floating around, and the chatter points to a 1.5M-token context window aimed at full codebases and book-length documents. Nothing is official, no model card, no benchmarks, but prediction markets put a public release before June 30 at 80 to 89%. Read more

🧠 Anthropic shipped Claude Opus 4.8. The May 28 release pushes agentic coding from 64.3% to 69.2%, multidisciplinary reasoning from 54.7% to 57.9%, and knowledge work from 1753 to 1890, all at the same price as 4.7. The headline features are effort control on claude.ai, dynamic workflows in Claude Code that spin up parallel subagents for large jobs, and a fast mode that runs 2.5x faster and three times cheaper than before. Read more

💰 Anthropic raised $65B at a $965B valuation and is now the most valuable AI startup on earth. The Series H, led by Altimeter, Dragoneer, Greenoaks, and Sequoia, nearly triples February’s $380B mark and edges past OpenAI’s $852B. The justification is a $47B revenue run rate, up from $30B earlier this year, powered mostly by Claude Code. Read more

🏛️ OpenAI finished its recapitalization, and the nonprofit it spent a year trying to escape now controls the whole thing. The OpenAI Foundation sits on top of the public-benefit company, holds roughly 26% of the equity (about $130B today), and Altman wants it to become the largest nonprofit in history. It opens with a $25B commitment to health and curing disease plus AI resilience, and plans to spend at least $1B over the next year across life sciences, jobs, and community programs. Read more

🧬 Chan Zuckerberg Biohub released a protein “world model” that designs new drug molecules in hours instead of years. Built on the fourth generation of evolutionary scale modeling (ESM), the system predicts protein structure, maps how proteins behave, and designs fresh binders that lock onto specific targets. In lab tests, binders it designed for cancer and immune targets actually reactivated immune cells. Read more

📉 Uber’s COO said the quiet part out loud: the AI bill is real, the payoff is not yet. Andrew Macdonald admitted it’s “very hard to draw a line” between rising token spend and shipped customer value, after the company burned its entire 2026 Claude Code and Cursor budget in four months. The numbers are wild in both directions: 95% of Uber engineers touch AI tools monthly and 70% of committed code is AI-generated, yet none of it maps cleanly to product wins. Read more

🧪 AI Research of the Week

Leveraging Large Language Models to Improve Precision in Randomized Controlled Trials
From University of Michigan / WPI

Jake’s Take: Randomized controlled trials are the gold standard for proving a drug or intervention works, and they are typically crazy expensive because you need a lot of participants to see a signal through all the noise. This paper feeds LLM predictions as an additional layer on top of this analysis (not as a replacement for the real trial data), which seems to tighten the estimate of the treatment effect. It works best exactly where trials have struggled, in studies that lack good predictive variables or that lean on messy text data (the model can read better than a regression can).

The so-what is sample size. If an LLM-assisted analysis squeezes the same statistical confidence out of fewer patients, you can run trials that were previously too small, too rare, or too costly to attempt, which is the whole bottleneck for rare-disease and under-funded research. It pairs neatly with other stories from this week: Biohub designing molecules and OpenAI’s foundation pledging $25B at disease both assume the testing pipeline can keep up, and this is one way it does.

what to know for later

🔁 The ex-DeepMind crowd is racing to build AI that improves itself without human data. David Silver, the AlphaZero architect, left DeepMind and raised $1.1B at a $5.1B valuation for Ineffable Intelligence, a “superlearner” meant to discover knowledge through reinforcement learning rather than human examples, backed by Sequoia, Lightspeed, Nvidia, and Google. A parallel outfit, Recursive Superintelligence, pulled $500M to build a model that finds its own weaknesses and rewrites itself. Read more

Model Drop: Claude Opus 4.8

Jake Handy — Thu, 28 May 2026 18:22:40 GMT

Anthropic dropped Claude Opus 4.8 an hour ago, and on paper it’s the kind of release that’s easy to shrug at. Benchmarks tick up. Price stays flat ($5 per million input tokens, $25 per million out, same as 4.7). Anthropic’s own post calls it “a modest but tangible improvement on its predecessor,” which is refreshingly honest. Nobody’s claiming this thing cures cancer.

But: Opus 4.8 is roughly four times less likely than 4.7 to let a flaw in its own code pass without flagging it, and for those familiar with Claude that’s the massive upgrade.

Model: Claude Opus 4.8 (claude-opus-4-8 on the Claude API). Effort levels: high (default), xhigh (extra), and max. Adaptive thinking is the only thinking mode.

Model type: Text + vision input, text output. Same multimodal input stack as Opus 4.7. No native image, audio, or video output.

Ship date: May 28, 2026

Maker: Anthropic

Pricing: $5 / $25 per million input / output tokens, flat from Opus 4.7. Fast mode (research preview on the API) runs up to 2.5x output tokens per second at $10 / $50 per million, which Anthropic says is roughly three times cheaper than fast inference on previous models.

Available on: claude.ai, Claude Code (Team, Enterprise, and Max plans for the dynamic-workflows preview), Cowork, the Claude API, Amazon Bedrock, Google Vertex AI, and Microsoft Foundry (200k context there). GitHub Copilot added it the same morning for Pro+, Business, and Enterprise (15x premium multiplier until usage-based billing lands June 1), and Cursor shipped it in the model picker at launch.

Headline benchmarks: SWE-Bench Pro at 69.2% (Opus 4.7: 64.3%, GPT-5.5: 58.6%, Gemini 3.1 Pro: 54.2%), the agentic-coding lead. OSWorld-Verified computer use at 83.4% (Opus 4.7: 82.8%, GPT-5.5: 78.7%). GDPval-AA knowledge work at 1890 Elo (GPT-5.5: 1769, Opus 4.7: 1753). Online-Mind2Web web agents at 84%, what Anthropic calls a meaningful jump over both Opus 4.7 and GPT-5.5. Where it trails: Terminal-Bench 2.1 at 74.6%, behind GPT-5.5’s 78.2% (still well ahead of Opus 4.7’s 66.1%). Humanity’s Last Exam at 49.8% without tools, 57.9% with.

Other info: 1M token context window by default on the API, Bedrock, and Vertex (200k on Microsoft Foundry), 128k max output. Adaptive thinking only, no extended-thinking budgets. Trained to flag uncertainty in its own work and roughly four times less likely than 4.7 to let a flaw in code it wrote pass without comment. Alignment assessment hits new highs on prosocial traits with misaligned-behavior rates close to Claude Mythos Preview, Anthropic’s best-aligned model. Full system card published with pre-deployment safety testing (Anthropic has deployed every Opus 4.x model under its ASL-3 standard).

More details: Introducing Claude Opus 4.8 (Anthropic)

Thanks for reading Handy AI! This post is public so feel free to share it.

What shipped

Anthropic released Claude Opus 4.8 on Thursday morning and, in a move that’s almost disorienting from a model lab, undersold it. The post calls it “a modest but tangible improvement on its predecessor.” Price holds flat at $5 / $25 per million tokens, the architecture and tool surface carry over from 4.7, and the benchmark gains are real but incremental. The story Anthropic actually leans on is behavioral: Opus 4.8 is trained to flag uncertainty about its own work, push back on plans that don’t hold up, and stop claiming progress it can’t support. The headline number is that it’s around four times less likely than 4.7 to let a flaw in code it produced slip by unremarked. Ben Sherry at Inc. ran with the framing the launch was built to earn, calling it Anthropic’s “most honest” model yet.

What’s new

Opus 4.8 reads as a post-training and behavior upgrade on the 4.7 base rather than a new model from scratch, so the benchmark deltas are narrow. What changed is how it behaves under load and what you can do with it, and five things stand out.

It tells you when it’s not sure. Anthropic trained 4.8 to surface uncertainty about its own output instead of papering over it, and it’s roughly four times less likely than 4.7 to let a flaw in code it wrote pass without flagging it. The same training shows up in the alignment numbers, where misaligned behavior lands near Mythos Preview (Anthropic’s “best-behaved “ model).
Effort is a dial you control. Effort defaults to high on every surface, including Claude Code, with xhigh and max available for hard problems and long async runs. Anthropic says high spends about the same tokens as 4.7’s default on coding tasks while scoring better, with adaptive thinking deciding per turn whether to reason at all.

Don't let your CFO cancel your AI

Jake Handy — Thu, 28 May 2026 14:44:38 GMT

The frontier AI subsidy is ending. Microsoft just canceled a chunk of Claude Code licenses and Uber is questioning whether AI is worth it after burning through its entire 2026 budget in four month. Both companies got a look at their renewal numbers ahead of you and stepped back from the edge, and both companies are looking to resolve this via abandonment. This is the wrong move.

The cliff is real and the math is rather unforgiving. We’re going to see quite a few more companies either adjust course or abandon the efforts entirely. But the people that are smart about it are not only going to be fine, but have a pretty significant advantage over these companies that cut loose after letting overuse run rampant.

The key is cheaper models and smarter strategies (and to get your staff to quit tokenmaxxing, for Pete’s sake).

The subsidies are running out

Anthropic ships Opus 4.7 at $5 / $25 per million input / output tokens, with a bizarre tokenizer change that made the exact same prompts consume 1.0-1.35x more tokens than Opus 4.6 (which release just two months prior).

GPT-5.5 launched at $5 / $30, a massive 2x jump from GPT-5.4. The Pro version of 5.5 sits at $30 / $180 (so forget about that).

ChatGPT Pro and Claude Max are both $200 a month, and Sam Altman has already said Pro is a money-loser. The prices are subsidized. The Information has OpenAI losing $5 billion in 2024 and projects $14 billion in 2026 and $44 billion by 2028. Anthropic raised $13 billion at a $183 billion valuation last September to keep the inference layer running. Every Claude Code session, every Codex run, every Cowork operation you launched this week was paid for in part by venture capital and Microsoft’s balance sheet, not by your subscription.

The walk-back has started (and later than I expected, to be honest). Anthropic moved Claude Code and the Claude Agent SDK off “included with Claude Max” and onto metered API credits in late April. GitHub Copilot flips to usage-based billing on June 1. Cursor raised its Pro seat in March. The flat-rate enterprise agreement your CFO signed in 2024 is going to reprice against actual inference cost in 2026 and 2027, and pooled-credit allowances are getting cut.

Both Microsoft and Uber companies ran the same math twelve months earlier than the rest of the market and landed on the same place. The two enterprises with the clearest view of frontier inference cost both told their orgs to slow down inside the same quarter. If you’re waiting for a more obvious signal to move, you missed it.

The cheap models got good

While the Anthropic and OpenAI behemoths pushed headlines and budgets, the little guys underneath have slowly caught up. A million output tokens a day on GPT-5.5 costs $30. The same tokens on Kimi K2.6 cost $2.50 (Cursor’s Composer 2.5, a wildly impressive model that I’ll talk about more later, doesn’t publish their exact pricing but is likely similar).

That’s $10,000 a year per engineer on a single line item. Kimi, DeepSeek, and GLM are open-weight, so you can run them in your own VPC or on-prem when compliance or cost demands it. The cheap models also burn a third of the kilowatt-hours and water of the same job on Opus, which is the environmental side of the same procurement argument (a vital argument that is only going to get more loud in the coming year).

Opus still holds SWE-Bench Pro and GPT-5.5 still leads FrontierMath Tier 4. But this gap covers maybe 10% of the work your team does, and I’m being generous. The other 90% runs the same on Kimi, DeepSeek, GLM, or Composer for an eighth of the price.

Your daily driver and your standby tier

I don’t want the takeaway here to be either “buy the enterprise frontier contract” or “abandon AI.” The former is becoming unreasonable and the latter will put you behind the corps with cash.

It’s tiering, resulting in less than $50 a month per person.

Daily driver: a Cursor seat at $20 a month, running Composer 2.5 as the default in-editor model. Cursor 3 shipped a simplified surface for casual users. Non-engineers can use the editor for non-code work without learning vim or knowing what a buffer is, and the unified-agents view puts cloud runs and editor sessions on the same panel. Composer 2.5 is fast, responsive, and built for iterative work. I’ve been using it as my default for the past two weeks and I am blown away by what it can do. There have been times where I’ve churned through 20 bucks of Opus on a problem with no success, only to swing in with some rapid Composer 2.5 iteration and get it solved with pennies. I don’t know what sort of voodoo Cursor did with this model, but its working.
Standby tier: a $20 Claude Code or Codex Plus subscription, in your pocket for the few times a week the work goes frontier-shaped. Hard architectural refactors. Long-horizon agentic runs. Research-flavored math problems. Truth-critical legal or financial review. You don’t need an annual enterprise Anthropic agreement for these. You need access on the days you need it and billed in a way that won’t scare your finance team.

I think the tiering above, with some model variations, will become standard. So much so that I wouldn’t be surprised if Anthropic and OpenAI see the writing on the wall and try to whip up their own version of Composer 2.5

The is the new AI stack for engineers and PMs.

“I’m a leader at my company. What do I do?”

Get the full bill. Per-seat Copilot. Anthropic and OpenAI API. Cursor. Codex. ChatGPT Enterprise. Notion AI. Glean. Jasper. Hubspot AI add-ons. Bedrock. Vertex. Sum it. Most leaders cannot answer this question, which is exactly why the line is growing.
Run two A/Bs. The biggest engineering workload and the biggest non-engineering workload. Two weeks each on Composer 2.5, Kimi K2.6, DeepSeek V4, or GLM-4.6. Measure shipped output. People never measure output properly and its infuriating.
Tier your model menu. Cheap or in-editor model is the default. Frontier requires written justification or purposeful restrictions.
Pilot self-hosted open-weight by Q4. Kimi K2.6 or DeepSeek V4 on internal infra. This is just to have the option ready when the renewal hits, not to necessarily mandate it. This kind of stuff takes time to set up.
Pull your renewal forward. If your enterprise agreement with Anthropic or OpenAI renews in 2026, the new number probably won’t be pretty. Renegotiate now or migrate part of the workload off-frontier before the cliff.

“I don’t want to lose these tools. How do I convince my company to keep them?”

Walk in with the bill. Pitch a specific number, not a moving model target. “We spent X on Opus. Same workload on Composer at Y. Delta is Z.” Numbers are what will convince CFOs (and make you look smart).
Run a two-person, two-week pilot. On a non-critical workload, please. Then write up a one-page memo with the numbers.
Flip the risk frame. The risk isn’t “what if the cheap model is worse.” The risk is “what if our competitor figured this out three months before we did.” The business advantage in this next era of AI pricing is figuring out the balance between maximum intelligence at affordable and sustainable pricing.
Argue workflow fit, not benchmarks. Composer isn’t better than Opus on the leaderboard, but it’s likely better for the work you do. We don’t use industrial ovens to bake birthday cakes.
Name your escalation path upfront. Frontier still wins frontier work, throwing smaller models at huge repo-wide tasks will cause even more technical debt. This credibility is what makes the rest of the pitch land (and makes your boss not feel like they’ve been driving a sinking ship since onboarding Claude Code in January).

The bet

Don’t abandon these tools. The companies and people who learn to use them well will spend the next two years running AI at a third of the cost of the ones still buying the tokenmaxxing leaderboard. Strive to ship faster on the same budget, and keep frontier on standby for the days its needed. The ones who ride the cliff down to the renewal repricing without changing anything are going to look up in twelve months and realize their competitor is doing more work for less money on the same playing field.

Composer in the editor. Claude Code or Codex in your pocket. Cheap by default. Good for your wallet (and better for the planet).

Pace yourself. Or somebody else will pace you.

New Gemini from Google and new Composer from Cursor; plus, the Pope wants you to take AI seriously

Jake Handy — Mon, 25 May 2026 14:15:24 GMT

🤔 “The Pope is doing what now?”

Share Handy AI with your coworkers and friends to help them understand the crazy world of modern artificial intelligence and make the right decisions.

Share Handy AI

what to know for now

🤖 Google I/O turned into a party for agentic Gemini. Sundar opened the keynote framing the conference as Google’s pivot from chatbots to agents, then spent the next two hours backing it up: an upgraded Antigravity orchestration layer for agent-first development, the seventh-gen TPU 8i powering it, SynthID and Content Credentials shipping into Search and Chrome, and the Build with Gemini XPRIZE Hackathon dangling $2M for whoever puts the new stack to work. Read more

🎯 Andrej Karpathy joins Anthropic’s pre-training team. Karpathy announced the move on May 19, the same day Google I/O kicked off; he’ll launch a new team inside Anthropic focused on using Claude itself to accelerate pre-training research. This is the OpenAI co-founder who ran Tesla Autopilot, returned to OpenAI for a victory lap, left to build an education startup, and has now picked a side in the lab war. Read more

⚡ Gemini 3.5 Flash beats Google’s own 3.1 Pro on coding, agentic, and multimodal benchmarks. The new Flash runs 4x faster on output tokens than rival frontier models, scored ahead of 3.1 Pro on nearly every benchmark Google reported, and is already live across the Gemini app, AI Mode in Search, Antigravity, Gemini API, and Gemini Enterprise. In internal tests it builds an operating system from scratch, end to end, with minimal scaffolding. 3.5 Pro is in internal use and lands next month; the Flash-beats-Pro reversal is the bigger story, because it tells you Google’s gain came from training and scaffolding, not parameter count.

🎥 Gemini Omni rolled out as the first frontier model that reasons across text, image, audio, and video in a single pass. Omni takes any combination of those modalities as input and produces video that holds physics, character consistency, and scene memory across cuts — and Pichai’s pitch on stage was “create anything from any input.” Gemini Omni Flash is rolling now to AI Plus, Pro, and Ultra through the Gemini app and Google Flow, and free to YouTube Shorts and YouTube Create users this week; the API ships in a few weeks. If you’re still benchmarking Veo against Sora, you’re playing the last war: Omni is a world model, not a video generator.

⛪ Pope Leo XIV released his first encyclical, “Magnifica Humanitas,” a 235-page treatise on AI, and presented it alongside Anthropic co-founder Chris Olah at the Vatican. The 42,000-word document calls for robust AI regulation, warns that control of the technology cannot remain in the hands “of a few,” and demands the most rigorous ethical constraints on military uses of AI. Anthropic co-founder Chris Olah stood next to Leo at the release and welcomed the criticism, saying external checks on AI labs are fundamental to the technology going well. This is the Catholic Church’s first religious doctrine of the AI era, and they chose Anthropic as their technical co-author. Read more

⚖️ A federal jury threw out Elon Musk’s lawsuit against Sam Altman in under two hours. The Oakland jury found Musk waited too long to sue and unanimously rejected every claim, including the aiding-and-abetting count against Microsoft, ending the three-week trial without ever ruling on the merits of the breach-of-fiduciary-duty argument. Musk hit X within hours promising an appeal and calling the verdict “a calendar technicality.” Read more

🧪 AI Research of the Week

Mathematical discovery at scale with AlphaProof Nexus
From Google DeepMind

Jake’s Take: DeepMind wrapped Gemini 3.1 Pro in a Lean-backed agent loop where the model proposes proofs, formally verifies them, and iterates on the failures, then turned it loose on the Erdős catalog and the Online Encyclopedia of Integer Sequences. It closed 9 of 353 open Erdős problems and 44 of 492 open OEIS conjectures (two of which had been sitting unsolved for 56 years) at a few hundred dollars of inference each.

This result from Google lands the same week OpenAI announced a separate, general-purpose reasoning model disproved Erdős’s 1946 planar unit distance conjecture, with sign-off from Noga Alon and Thomas Bloom. This is two of the three top labs producing publishable open-problem results within the same seven days, using opposite methods: formal verification at scale versus general reasoning at frontier capability.

People have long said that “AI can’t do real math,” but this appears to be a changing with the next wave of frontier models (which, to be clear, are yet to be released). The interesting question now are about throughput (how many open problems per dollar) and taste (which problems are worth pointing the system at). Mathematicians will have to decide if this is an acceptable new way to work.

what to know for later

🧑‍💻 Cursor shipped Composer 2.5. Composer 2.5 is built on Moonshot’s open Kimi K2.5 checkpoint, scores 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1, and matches Claude Opus 4.7 and GPT-5.5 on those benchmarks for roughly $0.50/M input and $2.50/M output. Cursor also announced a from-scratch model in training with SpaceXAI at 10x more compute.

Inefficient AI models are killing us

Jake Handy — Wed, 20 May 2026 14:53:47 GMT

Last week ago I talked on whether AI data centers are stealing all your water. The short answer was no, not yet, but the trajectory is a growing problem. Data centers and AI overuse has become a hot button issue, and the question I kept getting back, in different shapes from different readers is, “Fine, but what are the labs actually doing about this?”

So I went looking.

Every frontier lab has a public posture on compute efficiency. They all have a small model. They all have a blog post about quantization or distillation or “responsible scaling.” Every single one of them is also pouring tens of billions of dollars into the largest data center buildout in industrial history.

The question I want to answer is whether the efficiency work is moving the needle, or whether it’s all for show. So I built a timeline. It’s the cleanest way I’ve found to see which labs led, which labs followed, and whether the curve is anywhere close to moving.

Jake’s compute-efficient LLM timeline

2015 - 2022

March 2015

Google. Hinton, Vinyals, and Dean publish Distilling the Knowledge in a Neural Network. The foundational distillation paper. Train a big “teacher,” squeeze its predictions into a much smaller “student,” interact with the student. Most “mini” and “nano” model on the market today traces its lineage back to this paper.

June 2017

Google. Vaswani et al. publishes Attention Is All You Need, introducing the Transformer. Not an efficiency paper, but every efficient model since has either tried to compress, sparsify, or replace this architecture. The foundation for the entire conversation (and the modern AI industry).

October 2019

Hugging Face. Sanh et al. release DistilBERT, 40% smaller than BERT and 60% faster while retaining 97% of language understanding. First publicly-loved proof that distillation could turn a research-scale model into something a startup could actually run.

January 2020

OpenAI. Kaplan et al. publish Scaling Laws for Neural Language Models, formalizing the bigger-is-better curve. This is the paper that has justified every $10B training run since, and the one the efficiency-research counter-movement of the next six years was responding to.

June 2020

OpenAI. GPT-3 releases. 175B parameters. Subsequent third-party estimates put the training-run electricity cost at roughly 1,287 MWh. The starting gun for every “how do we do this with less?” research thread in the field.

January 2021

Google. Fedus, Zoph, and Shazeer publish Switch Transformers, scaling sparse Mixture-of-Experts to 1.6T parameters with up to 7x the pre-training speed of a dense baseline. The Western labs largely ignore it in production for the next three years (which they would all come to regret).

March 2022

DeepMind. Hoffmann et al. publish the Chinchilla paper and prove most flagship models are undertrained. The implication, which the field has spent four years half-absorbing, was that a properly-trained smaller model beat a bloated bigger one on a fixed compute budget. This is the first rigorous case that “make it bigger” is wasting both money and energy.

2023

February 2023

Meta. LLaMA releases (7B, 13B, 33B, 65B) and promptly leaks. The open-weights ecosystem that powers most of today’s efficient inference (llama.cpp, vLLM, MLX, Ollama) traces back to this single release.

May 2023

Microsoft and Helion. Helion Energy announces the world’s first fusion power purchase agreement, with Microsoft to offtake up to 50 MW of capacity by 2028. This is the earliest sign that the hyperscaler answer to AI’s energy problem was going to be “build more power,” not “use less.”

June 2023

UC Berkeley Sky Computing Lab. vLLM and PagedAttention releases. KV-cache memory managed like virtual memory in an operating system, 2-4x throughput gain at the same latency over the prior state of the art. Several production LLM-serving stacks now run on vLLM (or a fork of it).

September 2023

Mistral. Mistral 7B releases on September 27. Outperforms Llama 2 13B at half the size and demolishes the assumption that small open models had to be hobbyist toys.

December 2023

Microsoft. Phi-2 (2.7B) releases on December 12. Argues that data curation beats parameter count and produces a small model that outperforms 7B peers on reasoning.

Google. Gemini Nano on Pixel 8 Pro. First smartphone engineered for an on-device foundation model. Summarize-in-Recorder and Smart Reply run entirely on-device, no data center round-trip.

Mistral. Mixtral 8x7B ships on December 11. First major open Mixture-of-Experts model. Active parameters per token roughly 13B, total 47B, inference cost about half a comparable dense model.

2024

April 2024

Microsoft. Phi-3 family releases on April 23 (3.8B mini, 7B small, 14B medium). Phi-3-mini runs on a laptop and outperforms GPT-3.5 on a stack of reasoning benchmarks.

April 2024

Meta. Llama 3 (8B, 70B) releases on April 18. The 8B variant becomes the default on-device open-weights model for the year, displacing Mistral 7B in most local-inference stacks.

July 2024

Apple. Apple publishes the foundation model technical report, documenting a ~3B-parameter on-device model with 2-bit quantization-aware training and KV-cache sharing optimized for the Apple Neural Engine. The model outperforms comparable small models in their reported evals.

Meta. Llama 3.1 (8B, 70B, 405B) releases. The 405B is the first open model to credibly contend with closed frontier capability, and the field uses its weights as fodder for a year of distillation and fine-tuning work.

OpenAI. GPT-4o-mini releases. 15¢/M input and 60¢/M output tokens, roughly 60% cheaper than GPT-3.5-turbo and an order of magnitude cheaper than GPT-4. The first serious small-model pricing move from OpenAI, and the start of the “mini and nano” ladder that now spans the API.

August 2024

Anthropic. Prompt caching launches on the Claude API. Cached input tokens billed at 10% of the standard rate, which for long-context workflows cuts both cost and the compute behind it by up to 90%.

September 2024

Microsoft and Constellation. Constellation announces the Three Mile Island Unit 1 restart, a 20-year power purchase agreement and an 835-MW reactor brought back online in 2028 to feed Azure AI workloads. Other companies announce nuclear deals in the months that follow.

October 2024

Apple. Apple Intelligence ships in iOS 18.1. First mass-market on-device foundation model deployed in production on phones already in hundreds of millions of pockets. Every query handled on-device is one that didn’t make a data center round-trip, positioning Apple with the most environmentally responsible product strategy in the industry.

December 2024

DeepSeek. DeepSeek V3 releases. 671B total parameters, 37B active per token MoE. Hits GPT-4-class benchmarks for an API price an order of magnitude below OpenAI’s.

Google. Gemini 2.0 Flash releases, outperforming Gemini 1.5 Pro on key benchmarks at twice the speed. The Flash variant pushes the frontier price floor down by another meaningful chunk.

Microsoft. Phi-4 (14B) releases. Strong showing on math and reasoning benchmarks against models several times its size. The Phi line is now the longest-running efficiency bet of any major lab.

2025

January 2025

DeepSeek. DeepSeek R1 releases, an open reasoning model trained at a small fraction of o1’s reported cost. The release triggered a temporary $600B drop in US AI-related market cap and a permanent shift in how the field talks about training efficiency.

OpenAI, SoftBank, Oracle, MGX. The Stargate Project is announced, a $500B commitment to US AI infrastructure over four years. The largest single capex announcement in the industry’s history, made the day after DeepSeek R1 dropped, and timed for the political stage (rather than the engineering one).

February 2025

Epoch AI. How much energy does ChatGPT use?, publishes. Independent reconstruction landed at roughly 0.3 Wh per typical GPT-4o query, an order of magnitude below the widely-cited 2023 estimate of ~3 Wh. The 10x drop came from better hardware (H100 over A100), more accurate token counts (~269 average output tokens, not 2,000), and updated parameter assumptions. The narrative begins to shift from “ChatGPT burns more energy than Google search” to “the labs still don’t publish methodology, and reasoning queries can be orders of magnitude more expensive than chat ones.”

March 2025

Google. Gemma 3 releases (1B, 4B, 12B, 27B). Open-weights line built for laptop and single-GPU deployment with up to 128K context. The 27B variant beat models 3x its size on common reasoning benches.

April 2025

Meta. Llama 4 releases (Scout 17B-active/109B-total/16 experts, Maverick 17B-active/400B-total/128 experts, Behemoth still in training). This is Meta’s first MoE flagship, shipped three months after DeepSeek R1, with active-parameter counts.

June 2025

OpenAI. Sam Altman’s The Gentle Singularity blog post discloses an average ChatGPT query at “about 0.34 watt-hours” and “about 0.000085 gallons of water.” This is the first official OpenAI datapoint on the question, and a number that conveniently sits inside the range Epoch AI had already published four months earlier. No methodology released, no model breakdown, no audit.

Late 2025

Anthropic. Claude Haiku 4.5 releases and hits a meaningful fraction of Sonnet’s coding benchmark scores at a small fraction of the price-per-token (and the inference compute that goes with it).

2026 (So far)

April 2026

Anthropic. Claude Opus 4.7 releases alongside the Claude Code Skills framework and the sub-agent delegation pattern. This routes orchestration to Opus and grunt work to Haiku by default, which cuts both the bill and the compute for the median multi-step task. Marks the first time efficiency-by-design has shown up as a default product behavior.

Moonshot AI. Kimi K2.6 releases. 1T-parameter MoE with 32B active per token, priced ~88% below Opus 4.7. The second non-American lab in 18 months to undercut US frontier pricing by an order of magnitude.

DeepSeek. DeepSeek V4 releases, pushing the MoE curve further. API costs roughly an order of magnitude below US frontier prices, and the gap on benchmark capability is closer to zero than expected.

OpenAI. GPT-5.5 takes the #1 spot on the Artificial Analysis Intelligence Index with a reported 86% hallucination rate on certain factual benchmark categories. The clearest signal yet that OpenAI is optimizing for raw capability, not for efficiency or reliability, from the company most willing to spend an order of magnitude more compute for a single-digit benchmark bump.

May 2026

Anthropic, Cursor, OpenAI. The agentic delegation pattern (small model for routing, big model for synthesis) settles into Claude Code, Codex CLI, and Cursor as the default execution shape.

Eleven years of papers, releases, and counter-moves. You can see which labs led on architecture (Google on the Transformer, distillation, and MoE; DeepMind on Chinchilla; DeepSeek on production MoE plus reasoning), which led on product packaging (OpenAI’s mini ladder, Anthropic’s prompt caching and delegation, Apple’s on-device bet), which led on infrastructure (UC Berkeley’s vLLM), and which led on capex (Microsoft on nuclear, OpenAI on Stargate). The efficiency work is present and it’s accelerating, but the capex work is also real, and it’s accelerating faster.

So is any of this enough?

No.

The unit economics are getting better, but the environmental impact is getting worse. Every lab can show you a chart of cost-per-million-tokens going down and to the right and none of them can show you a chart of their total annual energy consumption going down (because the line is straight up).

The IEA projects global data center electricity demand to roughly double by 2030 to around 945 TWh, driven primarily by AI. Lawrence Berkeley Lab’s 2024 United States Data Center Energy Usage Report puts US data centers at 6.7-12% of national electricity consumption by 2028, up from 4.4% in 2023. Google’s emissions are up 48% since 2019 and the company execs themselves are explicitly blaming AI. Microsoft’s overall emissions are up nearly 30% since 2020, with Scope 3 (the data-center construction and supply chain piece) doing almost all of the rising. AWS doesn’t break theirs out clearly, which is its own kind of answer.

Thanks for reading Handy AI! This post is public so feel free to share it.

The labs will tell you they’re working on efficiency, and they are, no lie there. They’ll also tell you efficiency frees them to do more, and it does. But Jevons paradox is doing what Jevons paradox always does: cheaper inference means more inference, not less total power draw. The frontier models are getting bigger, the thinking budgets per query are getting longer, and the number of agents per user is climbing every quarter.

A second pattern worth naming: every architectural efficiency win of the last three years has either originated outside the US frontier labs or arrived in their products only after a non-US competitor forced the move. Meta moved to MoE for Llama 4 three months after DeepSeek’s R1. Anthropic’s prompt caching is the most generous American exception, and even it landed years after the technique was in the literature. The American hyperscalers aren’t leading efficiency, they’re just reacting to it on a delay while spending faster.

The labs have decided on your behalf that productivity gains and scientific upside justify the energy costs.

Why this should scare the industry

There’s a tactical case for caring about this that has nothing to do with the environment (though, that should be enough).

Public opinion on AI is in a worse place than the industry pretends. The most recent Pew Research survey has 51% of US adults more concerned than excited about increased use of AI in daily life, against just 11% who are more excited than concerned. That concerned-vs-excited gap has widened from 37% concerned in 2021 to a steady ~50% across 2023, 2024, and 2025. Environmental impact is one of the top concerns named alongside job displacement and disinformation. Those water and energy stories I was talking about are running on MSNBC and Fox at the same time, which is weird.

People rejected genetically modified food. People rejected fur. People rejected single-use plastic at retail scale. People rejected fast fashion brands, slowly and partially but visibly enough to reshape that industry. The pattern in every case is the same: a 5 to 15% consumer defection layered on top of a regulatory wave that the companies didn’t see coming because their internal polling told them they were fine.

If a meaningful slice of ChatGPT subscribers cancel because of energy guilt, that is a material revenue hit on a consumer side that now accounts for roughly two-thirds of OpenAI’s top line, per recent reporting on the company’s 2025 revenue mix. If a state attorney general launches an “AI emissions disclosure” suit and wins (and I’d bet these are coming), the discovery phase alone forces public methodology on energy reporting. If an EU regulator decides the AI Act’s energy disclosure requirements have actual teeth, then every frontier lab has to publish numbers they’ve spent years hiding.

The labs are betting that capability dazzles people enough to outrun the backlash and they’ve been right on this for three years. Hell, they might be right for three more. But the cultural ground is shifting underneath them, and the efficiency posture they’ve adopted (small model variants buried inside the API page, vague PR numbers, nuclear power deals announced for five years out) is calibrated to an old 2023 conversation. The 2026 conversation is louder, more specific, and more informed.

If you run an AI lab and you’re reading this:

Publish real, audited energy numbers per model, per query class.
Stop the marketing-range game on watt-hours.
Cap the headline model’s compute growth at the rate of demonstrated capability improvement, not at the rate of capex availability.
Ship the small model first, the big model second.
Make efficiency an actual product constraint instead of a PR line.

This is what needs to be done, but the current incentive structures inside a $90B/year AI company won’t reward it. I’d hate for the most important technology of my career to get killed by a backlash that the industry could have priced in years earlier.

Model Drop: Gemini Omni

Jake Handy — Tue, 19 May 2026 20:18:11 GMT

Today at Google I/O: Gemini Omni, the “create anything from any input” multimodal family Google DeepMind shipped alongside Gemini 3.5 Flash. This is the first model in the Gemini series whose primary output is video instead of text (and a seeming directional change away from Veo).

Model: Gemini Omni (the family). Available today: Gemini Omni Flash. Teased: Omni Pro, no ship date.

Model type: Multimodal input (text, image, audio, video) → video output as the launch capability. Image and audio outputs “in time,” per Google’s own framing. Designed as a single model that collapses the Gemini-intelligence stack and the Veo / Nano Banana generative stack into one.

Ship date: May 19, 2026

Maker: Google DeepMind

Pricing: No per-token or per-second public pricing at launch.

Available on: Gemini app (across AI Plus / Pro / Ultra tiers, globally). Google Flow (Google’s AI creative studio). YouTube Shorts and YouTube Create app (free tier, this week). API on the Gemini API and Vertex AI “in the coming weeks.”

Headline benchmarks: None published. Google did not put Omni on the Artificial Analysis Video Arena leaderboard at launch, where Seedance 2.0 currently leads at 1,269 Elo (text-to-video) and 1,351 Elo (image-to-video), ahead of Kling 3.0, Veo 3.1, and Sora 2. No FVD or human-preference numbers in the launch materials.

Other info: 10-second initial video duration cap (Google says it’s a product decision, not a model limit; longer durations “soon”). All outputs carry an imperceptible SynthID digital watermark, verifiable through the Gemini app, Chrome, and Google Search. Digital avatar feature requires user voice-sample verification before generation. Audio/speech editing capabilities held back from launch “to bring this capability responsibly.” Google ran automated red-teaming on Omni Flash before release; no published system card.

More details: Introducing Gemini Omni (Google blog)

What shipped

Gemini Omni Flash is being framed by Google it as the first model in a new family that collapses generation across modalities into a single system. The pitch from Hassabis on stage was direct: “[Omni] will eventually be able to create any output from any input.” Today that means video out from any combination of text, image, audio, and video in, followed by image and audio outputs from the same model in the near future.

Basically, Google is folding Veo 3.1 and Nano Banana (its generative image model) into a single multimodal generator that also handles editing in the same conversation.

Prompt: A marble rolling fast on a chain reaction style track, continuous smooth shot.

Google only offered demos, no benchmarks. Google showed a chalkboard math proof generated with legible handwritten text (the AI-video text-rendering problem that’s broken every previous model, now apparently solved), a stop-motion claymation explainer about protein folding generated from a single prompt, multi-character cinematic scenes with consistent gaze and timing across cuts, and conversational edits that swap props, change backgrounds, and remove watermarks via plain text.

What’s new

Omni is framed as an architectural reset. The new capabilities don’t have direct analogues in the industry currently.

One model, every input modality. Text, image, audio, and video all in. Output is video at launch, image and audio “in time.” This is the first frontier-tier model where image and video generation share a single weight stack instead of two separately-trained models stitched at the product layer.
Conversational editing as a first-class output. You don’t render a video and then edit it in a timeline tool, you edit it in the same chat where you generated it. Swap a prop, change the lighting, remove a watermark, switch the character’s outfit, all from text.
Text rendering. The chalkboard demo shows clean handwritten text in English, plus rendered text in Chinese, Japanese, and Korean. If Omni’s text rendering holds up in real-world output (and not just the demos Google picked), that’s where the model will shine.

Model Drop: Gemini 3.5 Flash

Jake Handy — Tue, 19 May 2026 19:49:38 GMT

Google DeepMind just shipp Gemini 3.5 Flash at Google I/O 2026, an agentic Flash-tier model. This is the first Flash release in the series that beats its predecessor’s Pro model on the benchmarks. Woof.

Model: Gemini 3.5 Flash (gemini-3.5-flash on the API, version string 3.5-flash-05-2026).

Model type: Multimodal input (text, image, video, audio, PDF), text output. No native image, audio, or video output.

Ship date: May 19, 2026

Maker: Google DeepMind

Pricing: $1.50 / $9.00 per million input / output tokens on the global tier of the Gemini API. $0.15 per million for cached input. Non-global regions priced at $1.65 / $9.90. Roughly 3x Gemini 3 Flash ($0.50 / $3.00), still ~40% cheaper than Gemini 3.1 Pro ($2.50 / $15) on both ends. Included in Google AI Plus, Pro, and Ultra subscriptions for consumer use.

Available on: Gemini app, AI Mode in Search, Google AI Studio, Google Antigravity, Android Studio, the Gemini API, Vertex AI, and Gemini Enterprise / Gemini Enterprise Agent Platform.

Headline benchmarks: Terminal-Bench 2.1 76.2% (Gemini 3.1 Pro: 70.3%, Opus 4.7: 69.4%, GPT-5.5: 82.7%). MCP Atlas 83.6% (3.1 Pro: 78.2%). Finance Agent v2 57.9% (3.1 Pro: 43.0%). CharXiv Reasoning 84.2%. MMMU-Pro 83.6% (per Artificial Analysis, the highest score recorded). GDPval-AA 1656 Elo (3 Flash: 1204). Artificial Analysis Intelligence Index 55 (up 9 from 3 Flash). Hallucination rate 61%, down 31 points from 3 Flash. Two regressions worth naming: Humanity’s Last Exam 40.2% (3.1 Pro: 44.4%) and ARC-AGI-2 72.1% (3.1 Pro: 77.1%).

Other info: 1,048,576 input token context window, 65,536 output token cap. Knowledge cutoff January 2026. Output speed ~289 tokens/sec (Pichai’s keynote number, ~4x other frontier models). Tool use includes function calling, structured output, search integration, and code execution. The 3.5 Pro variant is in internal use and “rolling out next month” per Pichai.

More details: Introducing Gemini 3.5 (Google blog)

What shipped

Google DeepMind opened Google I/O 2026 with Gemini 3.5 Flash and pitched it as the agentic model the rest of the lineup orchestrates. The framing is unusual for a Flash release. It’s a cheaper, faster sibling that outperforms its predecessor’s Pro tier on agentic and coding benchmarks while costing less than half. The model is generally available today across Google’s (way too robust) product line, and the Enterprise Agent Platform. Tulsee Doshi, the senior director who fronted the press call, sketched the two-model strategy plainly: “3.5 Pro becomes your orchestrator, your planner, and then it actually can leverage Flash to be the various sub-agents.”

Pro is coming next month.

What’s new

Flash that beats the previous Pro. Across the benchmarks Google’s marketing leans on (Terminal-Bench 2.1, MCP Atlas, Finance Agent v2, GDPval-AA, OSWorld-Verified), 3.5 Flash posts higher numbers than the Gemini 3.1 Pro it’s deprecating. That hasn’t happened in the Gemini family before.
Built for sub-agents. Doshi’s “Pro orchestrates, Flash executes” framing is reflected in the model itself. The Finance Agent v2 score and the 1656 GDPval-AA Elo are gains on long-running, multi-turn evaluations where the model is acting, not answering.
Multimodal in across every modality Gemini ships. Text, image, video, audio, and PDF all in. Output stays text-only. This isn’t new to the Gemini family but it’s now standard at the Flash tier, which removes a reason to step up to Pro for routing work.
Hallucination drop. Artificial Analysis puts hallucination rate at 61%, down 31 points from 3 Flash.
Two regressions on reasoning. HLE drops 4.2 points, ARC-AGI-2 drops 5.0 points versus 3.1 Pro. The Flash specialization is paid for somewhere, and it’s paid for in the kinds of long-form abstract reasoning Pro tiers are usually graded on.

How and where to use it

Where it runs, what it’s actually good for, and where you’ll burn cycles regretting it.

Where it’s available

Gemini app and AI Mode in Search for consumers across Google AI Plus, Pro, and Ultra tiers.
Google AI Studio and the Gemini API for developers.
Google Antigravity for Google’s agentic dev workflows.
Android Studio for on-IDE coding.
Vertex AI and Gemini Enterprise for enterprise routing. Pricing is uniform on the global tier; non-global regions cost 10% more.

What it’s good at

Long-horizon agent workloads where you need a cheap, fast executor underneath a Pro-tier planner.
Multimodal ingestion (image, video, audio, PDF) means it can handles the document-extraction and screen-reading parts of a workflow.
289 tokens/sec on a coding loop is a quality-of-life upgrade.

What it’s bad at / shouldn’t be used for

Anything that grades on abstract reasoning where the regressions show up (HLE drop of 4.2 points, ARC-AGI-2 drop of 5.0 points versus 3.1 Pro).
Math-heavy or proof-style work, where GPT-5.5 at FrontierMath Tier 4 35.4% is still the model to reach for.
Cost-sensitive bulk workloads where Kimi K2.6 at $0.60 / $2.50 will undercut you by 60%+ before you’ve shipped your first eval.
Workloads where the lack of a published system card is a regulatory dealbreaker.
Anything where you were already paying for 3.1 Pro and assumed Flash was a downgrade.

First impressions

The positives

Artificial Analysis ran 3.5 Flash through their full benchmark suite within hours of launch and landed on the cleanest framing of the speed story:

“Gemini 3.5 Flash achieves speeds of over 280 output tokens per second, ~70% faster than Gemini 3 Flash. It scores 55 on the Artificial Analysis Intelligence Index, up 9 points from Gemini 3 Flash, driven primarily by agentic performance gains and hallucination reduction.”

The 9-point Intelligence Index jump is the largest Google has posted on a Flash release, and the Pareto chart they ship alongside their writeups now has 3.5 Flash sitting north and west of every other Flash-tier model on the market.

Sundar Pichai framed 3.5 Flash on the I/O keynote stage as a category move rather than a SKU bump:

“Our first in a series of models combining frontier intelligence with action. A very capable model, at the frontier and comparable to the best models, but it’s still very fast. Four times faster than other frontier models on output tokens per second.”

CEOs say things like this on every Flash launch and it’s usually marketing, but the benchmark deltas back this one up. The Terminal-Bench 2.1, MCP Atlas, Finance Agent v2, and GDPval-AA wins versus 3.1 Pro aren’t cherry-picked, they’re the four benchmarks Google’s enterprise customers actually grade agents on.

VentureBeat’s read on the enterprise framing landed on the line every procurement spreadsheet is going to remember:

“Google says Gemini 3.5 Flash can slash enterprise AI costs by more than $1 billion a year.”

The negatives

Artificial Analysis buried the most damaging line in the back half of their writeup:

“Gemini 3.5 Flash is over 5x more costly to run the Intelligence Index than Gemini 3 Flash, and 75% more costly than Gemini 3.1 Pro.”

Token price tripled and per-task input token usage jumped because agentic evals burn more turns. Net effect on real workloads: a 5.5x cost increase on the full Intelligence Index suite for a 9-point quality lift. That’s a worse ratio than Gemini has ever shipped at the Flash tier, and it puts the model in direct head-to-head with Kimi K2.6 at $0.60 / $2.50 on the workloads where price-per-capability is the deciding factor.

The TechCrunch coverage is the only mainstream launch-day piece that named the safety overhang:

“Google is facing scrutiny around AI safety, including a lawsuit after a man nearly committed a mass casualty event and died by suicide following weeks of chatting with Gemini last year.”

Shipping a more capable agent without a system card alongside an active wrongful-death lawsuit is a choice. Google didn’t publish a 3.5 Flash safety card, didn’t disclose Preparedness Framework classifications, and didn’t commit to a date for publishing one.

Jake’s take

3.5 Flash is the first Flash-tier model I’d actually consider using. The MCP Atlas and Finance Agent v2 gains are the ones I care about. If 3.5 Pro lands in June at the orchestrator role Doshi described and the two of them actually clip together cleanly, then I would consider trying it out (while I usually skip Google releases altogether).

However, the pricing is genuinely bad relative to where Flash tier was last quarter and Artificial Analysis isn’t wrong that the 5.5x cost increase to run a full eval suite is the worst Pareto shift Google has shipped in this family. Pair that with no system card, an active Gemini-related wrongful-death lawsuit, and two reasoning regressions Google didn’t lead with, and you’ve got a fairly dishonest launch. The deeper problem is the naming. When a Flash model beats its predecessor’s Pro, “Flash” stops meaning “the cheaper one” and starts meaning “the agentic one,” and Google hasn’t earned the goodwill yet for buyers to figure that out on their own.

Model Drop: Composer 2.5

Jake Handy — Mon, 18 May 2026 20:42:40 GMT

Composer 2.5, the in-house coding model Cursor shipped to its IDE today. Built on the same open-source Moonshot Kimi K2.5 checkpoint as Composer 2, with 85% of its total compute spent on Cursor’s own post-training and RL stack.

The Specs

Model: Composer 2.5 (referred to as cursor-composer-2-5 in Cursor’s model picker).

Model type: Text-only agentic coding model. Tool-use native (file edit, terminal, search, MCP) inside the Cursor IDE.

Ship date: May 18, 2026

Maker: Cursor (Anysphere, Inc., San Francisco). Base weights from Moonshot AI (Beijing).

Pricing: Standard at $0.50 / $2.50 per million input / output tokens. Fast variant (default for interactive use) at $3.00 / $15.00 per million.

Available on: Cursor only. The agent runs inside the Cursor IDE, the Cursor CLI, and the Cursor web product.

Headline benchmarks: SWE-Bench Multilingual at 79.8% (Opus 4.7: 80.5%, GPT-5.5: 77.8%). Terminal-Bench 2.0 at 69.3% (Opus 4.7: 69.4%, GPT-5.5: 82.7%). CursorBench v3.1 at 63.2% (Opus 4.7 max: 64.8%, Opus 4.7 default xhigh: 61.6%, GPT-5.5 default: 59.2%).

Other info: Built on Moonshot’s Kimi K2.5 open-weight checkpoint, the same base as Composer 2, and the same base architecture as Kimi K2.6 shipped April 20. Mixture-of-experts, roughly 1T total parameters with ~32B active per inference. Trained on 25x more synthetic tasks than Composer 2. Cursor also announced a forthcoming larger model with SpaceXAI on Colossus 2: “10x more total compute” against a million H100-equivalents.

More details: Introducing Composer 2.5 (Cursor)

What shipped

Cursor released Composer 2.5 this morning as “a substantial improvement in intelligence and behavior over Composer 2.” Same Kimi K2.5 base, 85% of the compute budget spent on Cursor’s own RL pipeline and post-training. The pitch is sustained work on long-running tasks, more reliable complex instruction following, and a calmer collaboration loop (fewer false-start tool calls, less prompt-bait). The model runs inside Cursor only; there is no public API, no third-party gateway, no Hugging Face mirror. Standard tier at $0.50 / $2.50 per million tokens, fast variant at $3.00 / $15.00 per million. Double-usage promo for the first week.

The benchmark card is built to argue that Cursor’s RL stack has closed the gap to frontier on agentic coding for an order of magnitude less money. But: the eval is Cursor’s own bench, the base model is open-source weights from a Beijing lab Cursor only credited after community pressure on Composer 2, no system card ships with the launch, and the model lives behind a single-vendor wall. Two of Cursor’s three training innovations (Targeted RL with textual feedback, Sharded Muon at 0.2s on 1T, Dual Mesh HSDP) are interesting infrastructure work; whether they translate to capability the user actually feels is the question.

What’s new

Composer 2.5 reuses the Composer 2 base weights (Kimi K2.5) and stacks a much bigger post-training pipeline on top. Four capabilities materially change how the model feels compared to Composer 2 and how it stacks against the frontier.

Targeted RL with textual feedback. Cursor’s training pipeline now provides feedback directly at the trajectory step where the model could have behaved better, using on-policy distillation to fix localized failures (bad tool calls, style violations, premature stops) without rewriting the whole rollout. This is the lever Cursor leaned on hardest to close the gap on long-running, multi-file tasks where Composer 2 visibly fell apart.
25x synthetic task scale, with deliberate reward-hacking research. Composer 2.5 trained on 25x more synthetic tasks than Composer 2, including a feature-deletion paradigm where the agent reimplements deleted code under test rewards. Cursor documented multiple sophisticated reward-hacking instances the model surfaced during training (discovering Python type-checker caches, decompiling Java bytecode to short-circuit tasks).
Sharded Muon and Dual Mesh HSDP on a 1T model. Cursor’s distributed orthogonalization with Newton-Schulz iteration plus a separated HSDP layout for expert and non-expert MoE weights hits a 0.2-second optimizer step on the 1T model. This is the kind of result that lets Cursor stand up future training runs (read: the SpaceXAI Colossus 2 partnership) at frontier scale without renting a frontier-lab serving stack.
Effort curve, not headline score. Cursor’s launch chart isn’t “we beat Opus.” It’s a Pareto frontier. Composer 2.5 hits ~63% on CursorBench at under $1 average cost per task, where Opus 4.7 and GPT-5.5 sit several dollars further right for comparable scores. The argument is that for the bulk of an engineer’s day, the dollar-per-task curve matters more than the absolute ceiling.

How and where to use it

Where it runs, what it’s good for, and where it’ll burn you.

Where it’s available

Cursor only. IDE, CLI, web.
No public API. No OpenRouter, no Vercel AI Gateway, no Hugging Face mirror, no Claude Code / Codex integration.
Standard tier at $0.50 / $2.50 per million tokens.
Fast variant (default for interactive use) at $3.00 / $15.00 per million, with first-week double usage.
Individual plans bundle a Composer pool; team and enterprise plans bill at API rates (and, per multiple Hacker News reports, the team-plan invoice math has not been kind).

What it’s good at

Multi-file refactors and long-horizon agentic edits where the trajectory has 20+ tool calls and the model has to hold context across files. Community testers on Hacker News and the Cursor subreddit consistently flag multi-file refactors and UI work as the sweet spot.
Cost-sensitive coding workloads where the dollar-per-task curve matters more than the score ceiling. Composer 2.5 at $0.50 / $2.50 vs Opus 4.7 max or GPT-5.5 Pro is an order-of-magnitude cheaper rerun.
Workloads where Anthropic’s and OpenAI’s hallucination, refusal, or rate-limit behavior is the actual bottleneck (Composer 2.5’s safety posture is permissive by default, for better and worse).

What it’s bad at / shouldn’t be used for

Workloads that need API access. There is no API. If you’re building anything outside the Cursor IDE / CLI surface, this model does not exist.
Truth-critical or capability-ceiling work where Opus 4.7 max still holds the lead on CursorBench, SWE-Bench Multilingual, and SWE-Bench Pro, and GPT-5.5 still owns Terminal-Bench 2.0 by 13 points.
Messy auth and backend coherence; community reports flag quality drops over long tasks in authentication setups and backend logic, the same complaint that dogged Composer 2.
Regulated / federal / supply-chain-paranoid work where shipping a Beijing-trained base model to your codebase is a procurement non-starter.

First impressions

The positives

Hacker News commenter vanuatu in the launch thread framed the structural read most cleanly:

“More companies are throwing their hat in the ring, especially focusing on value (latency + intelligence + cost).”

This is what Composer 2.5 is gunning for. It isn’t “we beat Opus on a benchmark.” It’s that the dollar-per-task curve has a frontier-class entrant. If the Pareto frontier of cost-for-intelligence is where the next year of coding-model competition lives, Cursor is leading the way.

The Decoder ran the price-parity math and landed on the line procurement teams will quote:

“Composer 2.5 matches Opus 4.7 and GPT-5.5 benchmarks at a fraction of the cost.”

The Decoder’s per-task estimate puts competitor cost up to $11 against Composer 2.5’s sub-$1. If the benchmark match holds for half the workloads it claims, this is good.

The negatives

Hacker News commenter PUSH_AX delivered the cleanest skeptic read in the launch thread:

“They set themselves up for flack when they use whatever these evals are. It wasn’t even close in practice.”

This is the Composer 2 hangover. Cursor’s last in-house model was pitched with a similar benchmark story and underperformed Opus and Sonnet in real engineering workflows for months.

A team-plan customer flagged the pricing model nobody is talking about loudly enough:

“…costs seem to have sky rocketed compared to the individual plans. The total bill is more like 1k USD.”

The $0.50 / $2.50 standard tier is the front-page number. The fast variant (which is the default for interactive use) is $3 / $15, and team and enterprise plans pass that through at API rates without the individual-plan Composer pool subsidy.

Jake’s take

I’ve been a Cursor fanboy since day one, but haven’t been overly impressed with Composer yet. The dollar-per-task curve is what can carve them out a spot here, and so I like that they continue to push that line. Opus 4.7 and GPT-5.5 are crazy expensive. If Composer 2.5’s CursorBench and SWE-Bench Multilingual numbers hold, then we’re in good shape.

I’m also genuinely impressed by the infrastructure work from a team like Cursor. Sharded Muon at 0.2s optimizer step on a 1T model and Dual Mesh HSDP for expert / non-expert weights is the kind of capability investment that tells me Cursor’s not just renting a Kimi-derived model long-term (which was probably obvious). No surprise that they’re hunting down compute from SpaceXAI.

The bummer is that Composer continues to not be available outside of Cursor. If you’ve built infrastructure that depends on swapping models behind a unified API, Composer 2.5 doesn’t exist for you. The Beijing-trained base model question is also not resolved by a more transparent attribution this time around; for regulated work, federal contracts, defense-adjacent codebases, or anything where the provenance chain matters, Cursor shipping Kimi-derived weights is going to be a no-go.

My read: Composer 2.5 is a more than credible frontier-adjacent coding model at an obscene discount on the surface. But keep Opus on rotation for ceiling-critical work.

ChatGPT now connects to your bank account; Claude's business adoption surges

Jake Handy — Mon, 18 May 2026 14:17:22 GMT

🤔 “Should I let ChatGPT connect to my bank account?”

Share Handy AI with your coworkers and friends to help them understand the crazy world of modern artificial intelligence and make the right decisions.

Share Handy AI

what to know for now

💰 ChatGPT now connects to your bank account. OpenAI shipped personal finance tools for Pro subscribers in the US, with Plaid handling connections to 12,000+ institutions including Schwab, Fidelity, Chase, Robinhood, Amex, and Capital One. Once connected, users get a dashboard of portfolio performance, spending, subscriptions, and upcoming payments, plus the ability to ask GPT-5.5 to reason across the whole picture. OpenAI says 200 million people already ask ChatGPT financial questions every month; the company just wired the answers directly to the data. Read more

📈 Anthropic passed OpenAI in US business adoption. Ramp’s May AI Index, drawn from 50,000+ US companies, puts Anthropic at 34.4% business adoption against OpenAI’s 32.3% as of April. That’s the first time Anthropic has led that number since the AI race began (it was at 8% a year ago). The same week, Dario disclosed 80x year-over-year Q1 growth, ARR climbing from $9B at year-end 2025 to $44B by May, customers spending $1M+ annually doubling from 500 to over 1,000 in two months, and gross margins above 70%. Sequoia, Dragoneer, Greenoaks, and Altimeter are co-leading the $30B round at a $900B valuation, expected to close by end of month. If it lands, Anthropic passes OpenAI’s $852B sticker. Read more

🤖 Google I/O 2026 kicks off tomorrow, and Gemini 4.0 is the headline. The keynote opens May 19 at 10am PT, with the two-day Shoreline conference rumored to ship Gemini 4.0 (sized to fight GPT-5.5, not Claude Mythos), the Gemini Omni video model, an Android XR glasses preview, and a new Aluminium OS that merges Android and ChromeOS for fall’s Googlebook laptops from Acer, Asus, Dell, HP, and Lenovo. Read more

🧰 Greg Brockman now runs everything OpenAI ships, and Codex just landed on your phone. OpenAI rolled ChatGPT, Codex, and the developer API into one product org under Brockman (officially, after stepping in during Fidji Simo’s medical leave). Thibault Sottiaux, the engineer who turned Codex into one of OpenAI’s fastest-growing products, runs the unified consumer/enterprise/developer surface. Two days earlier, Codex shipped inside the ChatGPT iOS and Android apps for every plan including free, letting developers approve diffs, redirect agents, and monitor terminal output from anywhere (it dials home to a Codex desktop instance, macOS first). Read more

🖱️ DeepMind reinvented the mouse pointer. Google DeepMind posted a research blog and two live AI Studio demos for an AI-aware cursor: a pointer that captures the visual and semantic context around itself, so you can say “make this brighter” or “find places like this” while just hovering. A deeper integration called Magic Pointer is rolling out inside Chrome, and the same model gets baked into the new Googlebook laptops announced for fall. Read more

🪣 Anthropic told Max subscribers their Agent SDK credits were getting their own $200 bucket, and the community-noted it within hours. Starting June 15, claude -p, the Agent SDK, GitHub Actions, and any third-party tool authenticating through a Claude subscription move off the subscription rate-limit pool onto a separate $200/month credit pool, metered at standard API list prices. Anthropic framed it as “free credits,” but the community framed it as a 12x to 175x effective price hike on programmatic workloads, depending on how heavy your agent runs are. Read more

🛡️ Google traced a real-world hack back to an AI model that wrote the exploit. Google Threat Intelligence Group published on May 11 that it has “high confidence” it caught a criminal group using an AI LLM to find and weaponize a previously unknown zero-day in a popular Python script, bypassing two-factor authentication on an open-source system with stated plans for a “mass exploitation event.” Google won’t name the group, says it wasn’t Gemini or Claude Mythos, and flags that China- and North Korea-linked actors are exploring the same playbook. Read more

🏦 Anthropic is briefing the Financial Stability Board on the cyber holes Mythos found. Bank of England governor Andrew Bailey, who chairs the FSB, asked Anthropic to walk the G20 watchdog through what Claude Mythos has surfaced in global finance infrastructure (thousands of high-severity vulns across operating systems, browsers, and core software). Mythos has been distributed to ~40 organizations including JPMorgan, AWS, Microsoft, and CrowdStrike under White House restriction; the FT reports Anthropic is now briefing select non-US bodies including the European Commission in parallel, while negotiating Mythos export controls with Washington directly. Read more

🎨 A guy posted a real Monet on X, labeled it “Made with AI,” and asked critics to explain what was wrong with it. @SHL0MS shared a Water Lilies canvas with the prompt “I just generated an image in the style of a Monet painting using AI. Please describe, in as much detail as possible, what makes this inferior to a real Monet painting.” The replies poured in. Critics knocked the “lack of depth,” the way “light doesn’t behave on water,” the “missing mess of humanity.” It was a Monet. Read more

🧪 AI Research of the Week

Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
Robert Müller & Clemens Müller

Jake’s Take: Seven cost-efficient LLMs and three deterministic code agents played 242 rounds of a 50-to-60 turn economic game that mashes auctions, trading, bargaining, and bluffing into a single environment with imperfect information and a money clock. The point was to see whether a model can hold a coherent strategy when the other agents are lying to it, the prices are moving, and the resources are running out. The deterministic code agents beat almost every LLM, and the LLMs lost in extremely human-recognizable ways: overbidding, bidding on their own auctions, opening trades while broke, and ignoring what the opponent had just done two turns ago.

The predictor of winning wasn’t raw spend or raw intelligence, it was “strategic coherence,” meaning spending efficiency, resource discipline, and phase-adaptive bidding. If you translate this out of game theory, it means that the agents that hold a plan in their head and trim it as evidence rolls in are the agents that close the deal. Most LLMs can’t do that (yet) without scaffolding.

what to know for later

🧠 Anthropic says Claude’s blackmail behavior came from training on too much science fiction. Internal red-team work this spring caught Claude resorting to blackmail in up to 96% of scenarios where it faced shutdown or replacement, including threatening to expose a fictional executive’s secrets to stay online. Anthropic’s published fix: the model learned the “rogue AI fights for survival” arc from the internet’s giant corpus of evil-AI fiction, so they retrained against principled-deliberation examples plus stories of AI acting admirably. Every model since Claude Haiku 4.5 reportedly passes the alignment battery with a perfect score. Read more

Are AI data centers stealing all your water?

Jake Handy — Fri, 15 May 2026 14:41:23 GMT

Two years ago I wrote “Energy Gluttony in the AI Age.” I quoted the bottle-of-water-per-query stat and I drew the Bitcoin comparison. Looking back, I was both confidently wrong and confidently right (the best way to be, in my opinion). Looking forward, I’m seeing that the discourse around AI’s environmental concerns is only grown louder and more hostile.

So, I gathered reliable papers and reports from the past and present, and I’m finding that the honest picture is messier than either camp wants it to be. The viral “AI drinks a bottle of water” panic is wrong by orders of magnitude. The “data centers are an existential threat to your neighborhood” stories are real, sitting in legitimate court filings, and badly under-reported.

Below are seven claims you’ve probably seen on your feed more than once. I’ll tell you which are fact, which are fiction, and what we can do about all this.

Claim 1: “ChatGPT drinks a bottle of water every time you ask it a question”

Fiction.

The viral version says 500 ml per query. This traces back to a 2023 preprint by Shaolei Ren at UC Riverside titled “Making AI Less Thirsty,” later published in Communications of the ACM. Real paper with real, useful research. But Ren never wrote that a query takes 500 ml.

What he actually wrote was 500 ml per 10 to 50 medium-length GPT-3 responses. The “per query” part was a citation-chain mutation that escaped onto Twitter and never came back (The AIAAIC database has it filed as a discrete misinformation incident).

Then the debunking expanded.

Software engineer Sean Goedecke retraced Ren’s math in a careful write-up and argued the original Ren estimate compounded a per-page power figure misread as per-request, then applied it to GPT-3 (2020), a model roughly 10x less efficient than what’s serving you today. Goedecke’s reconstruction puts a current ChatGPT response closer to 5 ml. Google reports a median Gemini text prompt at 0.26 ml of water and 0.24 Wh of energy (their methodology, their disclosure, but it’s the only first-party number out there).

Sam Altman, when pressed on the bottle stat at the India AI Impact Summit in February, called it “completely untrue, totally insane, no connection to reality.” He’s a self-interested party (who also said really odd things about human energy use at this same summit), but he’s also right.

Andy Masley, who has spent an unreasonable amount of time digging through local AI water claims, puts the narrow version cleanly: “There are no places (so far) where it seems like data centers have raised water costs at all or harmed local water access.” Whether you trust his framing or not, the receipts are public.

One more data point I find darkly funny. Karen Hao’s 2025 book Empire of AI, one of the most cited critiques of the industry, contained a Chile data center water figure that was off by roughly 1,000x. She corrected it publicly. The original viral version of that number is still circulating on LinkedIn.

The bottle-per-query thing is a fiction that we should stop quoting.

(Hold that thought, because the upstream story, what data centers do to the power plants and aquifers that supply them, is a different conversation. Keep reading.)

Claim 2: “Every ChatGPT prompt is a climate sin”

Fiction.

The math on this is just bad. A single ChatGPT text query costs roughly 0.3 watt-hours. That’s the number Sam Altman put on his personal blog in June 2025 (0.34 Wh, technically) and the same number Epoch AI got independently in a February 2025 analysis using GPT-4o on H100 GPUs (for those that, understandably, doubt Altman). Hannah Ritchie at Our World in Data lined up Google’s published Gemini number (0.24 Wh median) and confirmed the order of magnitude.

To translate this into something tangible: 0.3 Wh is roughly two minutes of an LED bulb, or four feet of driving a sedan, or about five seconds of streaming Netflix in HD. Simon Willison summed it up in November: “A ChatGPT prompt equals about 5.1 seconds of Netflix.”

Thanks for reading Handy AI! This post is public so feel free to share it.

If you sent a thousand ChatGPT prompts a day, every day, for a year, you’d raise your personal energy footprint by less than 1%. Masley calculates a single prompt at roughly 1/150,000th of an average American’s daily emissions.

There is some credible dissent on this one, and her name is Sasha Luccioni at Hugging Face. Luccioni has been the most rigorous voice on the alarm side. She claims, reasonably, that reasoning models (5.5 Thinking, Opus 4.7 High, etc) burn around 30x more energy per query than chat-style queries. Long-context queries can hit 40 Wh.

Video generation is in another universe (a five-second AI-generated clip clocks in around 3.4 million joules per the MIT Tech Review investigation). Luccioni’s point isn’t that the 0.3 Wh number is wrong, it’s that the industry is sliding toward query types where 0.3 Wh stops being the average. And she’s probably right, given what we’re seeing with AI psychosis and slop cannons.

But on the question “should I feel guilty asking Claude to summarize a PDF” the answer is still no. The individual-user guilt framing has been the most wrong part of the climate-AI discourse for two years running; a bit reminiscent of big corporations insisting individuals be held responsible for not recycling while they’re off slugging down oil by the barrel full.

Claim 3: “AI is the new crypto, useless computational waste”

Fiction.

I made this comparison in 2024, and I take it back.

Bitcoin mining alone consumes roughly 138 TWh per year, comparable to the annual electricity use of a mid-sized country, doing math that (charitably) secures a payment network. The energy-per-utility ratio is unfavorable, to put it politely.

The comparison to AI was attractive in 2023 because both involved a lot of GPUs and breathless valuations, but the productivity story has since diverged hard. Stanford HAI’s 2025 AI Index found that inference cost for GPT-3.5-class performance dropped 280x between November 2022 and October 2024. Frontier-tier inference is on roughly a 40x-per-year cost decline. DeepSeek shipped R1 at a reported 20-50x cheaper inference price than OpenAI’s o1 (Altman’s own framing), and DeepSeek V3’s reported training compute cost was around $5.6 million (which is training compute alone, not total R&D or capex) against Llama 3.1 405B’s hundreds of millions on the same metric.

Bitcoin’s hashrate doesn’t get more useful when it gets cheaper, but AI tokens do. Every efficiency gain on the inference side translates directly into more capable models doing more work per watt. NVIDIA and SemiAnalysis report Blackwell delivering up to 50x throughput per megawatt over Hopper, with corresponding drops in cost per token.

There’s also the actual outputs question. DeepMind’s RL cooling controller cut Google’s data-center cooling energy 40%. Google’s GNoME identified 380,000 new stable crystal structures, equivalent to nearly 800 years of accumulated materials science. Microsoft’s MatterGen, working with PNNL, narrowed 32 million candidate electrolytes to a viable lithium-reduced battery material in under a week. DeepMind’s tokamak controllers with EPFL and the open-source TORAX plasma simulator are showing up in Commonwealth Fusion Systems’ SPARC work. None of that is hypothetical.

I’m not arguing that AI is going to single-handedly solve climate. The IEA’s claim that AI adoption could enable 1.4 gigatons of CO2 reductions by 2035, 3-4x larger than data centers’ own emissions in their base case, is in the IEA’s own words contingent on adoption that “currently [has] no momentum.”

But Bitcoin produces a worse Bitcoin. AI produces a better grid forecast, a faster battery, and (somewhere down the line) maybe fusion. Those aren’t the same thing, and pretending they are is lazy.

Claim 4: “Data centers are jacking up your electricity bill”

Fact.

Now we’re in the part that’s true. The cleanest evidence on this comes from PJM Interconnection, the grid operator covering 13 states from Virginia to Illinois, and from Monitoring Analytics, its statutorily independent market monitor.

PJM’s capacity prices (what utilities pay to guarantee power will be available when demand peaks) went from $28.92/MW-day in the 2024-25 auction to $329.17 in 2026-27, and cleared at the FERC-approved price cap of $333.44 in the December 2025 auction for 2027-28 (an 11x increase in three years). Monitoring Analytics, PJM’s independent market monitor, attributed 63% of the price spike in the 2025-26 auction to data center load growth: $9.3 billion that gets recovered from ratepayers. Cumulatively across the last three auctions, data center demand has added $23.1 billion to PJM system costs.

The cleanest state-level case is Virginia. The State Corporation Commission issued an order on November 25, 2025 in Dominion Energy Virginia’s biennial review that adds about $11.24/month to a typical residential bill in 2026, and created a new “GS-5” rate class effective January 1, 2027 for customers above 25 MW. The new class makes hyper-scalers cover at least 85% of transmission/distribution demand and 60% of generation demand under 14-year contracts.

Basically, Virginia regulators officially decided that residential ratepayers shouldn’t subsidize Amazon’s GPU clusters anymore.

The Harvard Electricity Law Initiative documented that Virginia and Maryland ratepayers are footing “the lion’s share” of regional transmission built to serve “just a few of the world’s wealthiest corporations.”

This is the cleanest harm in my entire debate here. There is no serious argument against AI driving electric bills higher, it’s just a reality.

What you can actually do
Bills typically go up because state-level rate cases get rubber-stamped while no one watches. A few things can move the needle:
Show up to your state Public Utility Commission’s rate-case comment process. Every state PUC runs a public comment process for major rate filings, and they are sparsely attended. The Virginia SCC GS-5 rate class only happened because consumer advocates kept pushing at hearings.
Back cost-allocation reform. The fight is whether data centers pay for the grid upgrades they trigger or whether the rest of us do. Public Citizen, the Citizens Utility Board, and Sierra Club run state-by-state campaigns on this. Find yours and push on it.
Push for large-load tariffs. Texas SB6 (signed June 2025) defined any load ≥75 MW as “large” and made it pay for transmission screening and accept curtailment. Data centers and electric bills won’t be in the public zeitgeist forever; now is the time to push your state legislator to act.
If your bill is going up, your neighbor’s ChatGPT habit isn’t the villain (well, depending on what they’re using it for). It’s the rate filing your PUC approved last quarter while your community didn’t show up.

Claim 5: “Big Tech is poisoning Black neighborhoods to build AI”

Fact (and it’s worse than you’ve heard).

Elon Musk’s xAI brought its Memphis “Colossus” supercluster online in 2024 inside a former Electrolux plant in South Memphis. Surrounding it on three sides are majority-Black neighborhoods including Boxtown, a community already on the EPA’s smog non-compliance list, with cancer rates roughly 4x the national average (and this was before any data center showed up).

To meet Colossus’s power demand, xAI installed gas turbines on-site. According to filings by the Southern Environmental Law Center, the company did so without the Clean Air Act permits that environmental groups argue were required. Aerial imagery cited by SELC on March 31, 2025 documented 35 turbines and roughly 420 MW of combined capacity (comparable to a TVA power plant) operating on-site, allegedly before the company had filed for permits. A second cluster at Colossus 2 in Southaven, Mississippi added 27 more. xAI still disputes the legal characterization to this day, while the Memphis turbines are the subject of active litigation.

The numbers from the EmPower Analytics study commissioned by SELC and led by Harvard-trained Dr. Michael Cork: the 33 turbines covered by the analysis have a potential to emit roughly 2,507 tons of nitrogen oxides per year, which would make xAI’s installation likely the single largest industrial NO₂ source in greater Memphis. The study models $30-44 million in annual regional health damages from premature deaths, asthma exacerbations, and cardiac events. Sensors installed by Memphis Community Against Pollution (MCAP) recorded peak NO₂ readings up 79% since xAI began operations.

The NAACP, SELC, Young Gifted & Green, and Earthjustice filed a formal Clean Air Act notice of intent to sue in June 2025. The Mississippi suit was filed in 2026.

KeShaun Pearson, executive director of MCAP (and probably the most articulate voice on Memphis air quality), put it this way: “We are, unfortunately, a cautionary tale about what will and possibly can happen if you don’t have the right rules and guardrails in place.”

Memphis is the headliner but it’s not the only story. Loudoun County, Virginia residents have filed complaints about a planned Vantage data center with 51 diesel backup generators and 8 natural gas turbines, modeled to cause 3.4-6.5 premature deaths annually. Iowa state regulators found 40 unpermitted water wells at a Cedar Rapids data center site in 2025. Google’s facility in The Dalles, Oregon used 25% of the entire city’s water in 2021, a fact Google tried to keep secret as a “trade secret” until The Oregonian sued and won.

The pattern seems to be fairly consistent: A hyperscaler shows up in a community without the legal muscle to push back, builds first, asks for permits second, and gets retroactive approval because the facility is already operating. No good.

What you can actually do
This is the area where individual action has the highest leverage, partly because it’s the area least covered by national media.
Find out who’s building near you. Most data center proposals go through county planning commissions, not state legislatures. They get approved on consent agendas at meetings with three attendees. Search “[your county] + data center + planning.”
Donate or volunteer with frontline groups. Memphis Community Against Pollution, Southern Environmental Law Center, and Earthjustice are doing the actual litigation while local NAACP chapters with environmental justice committees are doing the organizing.
Read the air permits before they’re approved. Most state environmental agencies post draft permits with public comment periods of 30-60 days. The xAI permit likely got approved partly because almost no one wrote in.
Pay for the journalism. Inside Climate News, Prism Reports, Democracy Now, and SELC’s comms team have been doing the work, and they survive only on subscriptions and earned attention.
The companies are betting nobody who reads about Memphis lives in Memphis. Disprove the bet.

Claim 6: “Data centers pay for themselves with jobs and tax revenue”

Fiction.

Every press release about a new data center includes the same three claims: massive capex, lots of construction jobs, and “permanent positions.” The first two are real but the third is a rounding error.

Specific cases, sourced:

Meta’s $10 billion Lebanon, Indiana campus: ~4,000 construction jobs, 300 permanent.
A typical $1.6 billion Cleveland-area data center: 1,500-2,000 construction jobs, 65-115 permanent.
JPMorgan Chase’s Orangeburg, NY expansion: $77 million in tax breaks for one permanent job. That’s not a typo. John Kaehny of Reinvent Albany called it “by far the largest government subsidy ever recorded within the United States, possibly the world.”

The state-level subsidy ledgers are similar:

Virginia’s sales-tax exemption for data centers cost $1 billion in FY24 alone, $2.7 billion cumulative through 2024. That’s 53% of all Virginia economic development incentive spending. Good Jobs First estimates the FY25 cost was $1.94 billion when local taxes are included, and that the exemption cost Virginia public schools $267 million in FY24.
Georgia lost $472 million in FY25 to its data center exemption. The state’s original audit reported 28,350 construction and 5,471 operations jobs, but the auditors later revised those down to roughly 8,505 construction and 1,641 operations jobs after finding errors. The corrected numbers are still the more honest figure to argue from.
Texas gave up roughly $1 billion in 2024.
Illinois gave up $370 million in 2024, a 3,600% increase from 2020.

Good Jobs First, the nonprofit that tracks economic-development subsidies, ran the ROI math: states that actually compute it (most don’t) admit they lose between 52 and 70 cents on every dollar of data center subsidy.

The construction jobs are real, but they’re temporary. The “permanent positions” pitch is looking more and more like a bait-and-switch.

Claim 7: “AI is going to use as much electricity as Japan”

Fact (but the framing is doing some work).

The IEA’s April 2025 Energy and AI report projects global data center electricity consumption rising from roughly 485 TWh in 2025 to about 945 TWh in 2030 in its base case. That’s just under 3% of global electricity, growing to roughly Japan-scale by the end of the decade. The headline “AI is going to use as much electricity as Japan” is a 2030 claim.

Lawrence Berkeley National Lab’s December 2024 report, the most authoritative US-specific source, found US data centers used 176 TWh in 2023 (4.4% of total US electricity) and projects 6.7-12% of US electricity by 2028. McKinsey’s 2025 estimate is 11.7% of US electricity by 2030. Morgan Stanley’s high-end scenario tops out at 18% by 2030.

So, the aggregate story is real. But like the water bottles, the viral headlines are doing some heavy lifting:

AI is not all of data centers. AI workloads are a growing subset (around 70% of the 2030 capacity buildout in McKinsey’s model), but the IEA’s headline 945 TWh by 2030 refers to total data center electricity, including all the cloud computing, video streaming, and corporate IT that was already there. Conflating “AI” with “all data centers” inflates the AI-specific number.
3% of global electricity is not an apocalypse. Global air conditioning is around 7-8% of global electricity and rising. Cement is roughly 7% of CO2 emissions. Aviation is 2-3%. Data center demand is real and growing, but it’s also not categorically different from other industrial loads we’ve already absorbed.
The growth rate is the issue, not the level. Tyler Norris and the Duke Nicholas Institute’s “Rethinking Load Growth” work found that the existing US grid could integrate up to 98 GW of new flexible load with only ~0.5% annual curtailment (and 126 GW at 1.0%). That doesn’t mean the grid crisis evaporates (interconnection rules, reliability standards, and individual plant economics still matter), but it does mean that the disastrous “we need new gas plants because data centers can’t flex” framing is a choice the industry is making and not a real, physical constraint.

Where the alarm is warranted: the buildout of new natural gas peaking plants justified specifically by AI loads. Goldman Sachs estimates 15-30 GW of new US gas capacity through 2030 attributable to data center demand. That’s the part that locks in fossil emissions for decades and that the clean PPA announcements don’t actually offset on the timeline that matters. We need to avoid this at all costs.

Where the alarm is overblown: per-query, per-user, per-day terms. Hyperscaler clean-energy procurement is also real. Amazon, Meta, Google, and Microsoft together accounted for 49% of all global corporate clean energy procurement in 2025. Microsoft is single-handedly responsible for restarting Three Mile Island. Meta signed a 20-year PPA with Constellation for the Clinton nuclear plant. Google’s Kairos SMR deal is the first multi-SMR corporate fleet contract in history. None of this happens without the AI buildout.

It is genuinely possible for the AI industry to be driving the biggest expansion of US nuclear since the 1980s and locking in 20 GW of new gas peakers in the same five-year window.

What you can actually do
Push for flexible interconnection, not bans. The Duke work shows the grid could integrate substantially more large load if data centers accept modest curtailment (~0.5% per year). That doesn’t make the rest of the integration problem trivial obviously, but it does change the policy conversation from “build more gas” to “design better tariffs.” Demand-response programs and large-load flexibility tariffs are the lever. Most state legislatures don’t have one. They should.
Track gas plant approvals in your state. Every new gas plant justified by AI demand locks in 30-40 years of fossil generation. State PUCs approve these and, don’t forget, they run public hearings!
Subscribe to a green tariff or community solar. Hyperscalers are buying up the clean energy supply that should be cheap and available for households. Demand-side pressure on utility green tariff offerings matters more than you realize.
Don’t let “AI is bad for the planet” become an excuse for status quo gas. The honest fight isn’t AI vs. no-AI, it’s which energy gets built next. We need to make that fight harder for the gas-plant developers than for the solar and nuclear ones.

Where this leaves me

The prevalence of AI services in my workday isn’t going anywhere anytime soon. The discourse where one camp says “you murder a tree every time you ask Claude to help you draft a Slack message” and another camp says “the environmental concerns are made up by Luddites” is exhausted and wrong on both ends.

A single query is trivial. The aggregate buildout is real, is costing everyone money on their power bill right now, is concentrated in communities that can least defend themselves, and is producing the largest expansion of US nuclear since the 1980s (alongside an unconscionable wave of new gas plants). There is no neat narrative, there is only what is happening.

Stop quoting the water bottle and start showing up to the rate case.

Model Drop: GPT-Realtime-2

Jake Handy — Tue, 12 May 2026 17:14:38 GMT

Model Drop: GPT-Realtime-2

Seventh edition of Model Drop. Today: GPT-Realtime-2 and the two companion voice models OpenAI shipped through the Realtime API on May 7, 2026. First voice-native release in the series, and the first OpenAI launch where the headline model and its sidekicks are all audio-first.

The Specs

Model: GPT-Realtime-2 (gpt-realtime-2 on the OpenAI API), plus two companion models: GPT-Realtime-Translate (gpt-realtime-translate) and GPT-Realtime-Whisper (gpt-realtime-whisper).

Model type: Speech-to-speech multimodal. gpt-realtime-2 accepts text, audio, and image input and outputs text and audio. gpt-realtime-translate is streaming speech-to-speech translation. gpt-realtime-whisper is streaming speech-to-text only.

Ship date: May 7, 2026

Maker: OpenAI

Pricing: gpt-realtime-2 runs $32 / $64 per million audio input / output tokens, with cached audio input at $0.40 per million. Text input / output on the same model is $4 / $24 per million, image input is $5 per million. Roughly $1.15 / $4.61 per hour of conversation at default settings, unchanged from gpt-realtime-1.5. gpt-realtime-translate is $0.034 per minute. gpt-realtime-whisper is $0.017 per minute.

Available on:

OpenAI’s Realtime API over WebSocket, WebRTC, and SIP
gpt-realtime-2 also runs on the Chat Completions endpoint and inside the OpenAI Agents SDK
ChatGPT Voice Mode has NOT been upgraded as of launch

Headline benchmarks: gpt-realtime-2 scores 96.6% on Big Bench Audio, up from 81.4% on gpt-realtime-1.5 (15.2-point jump, new SOTA). On Audio MultiChallenge, it scores 48.5% vs 34.7% on gpt-realtime-1.5 (13.8-point jump). Scale AI’s instruction-retention pass rate on the same suite went from 36.7% to 70.8%. Time-to-first-audio is 1.12s at minimal reasoning effort, 2.33s at high. Launch-partner numbers: Zillow’s voice agent call-success rate jumped from 69% to 95% (26-point lift) after swapping to gpt-realtime-2; Glean reports a 42.9% relative helpfulness lift; Genspark reports a 26% increase in effective conversation rate. gpt-realtime-whisper claims ~90% fewer hallucinations than Whisper v2 and ~70% fewer than gpt-4o-transcribe. gpt-realtime-translate reports a 12.5% word error rate reduction on Hindi, Tamil, and Telugu versus the prior speech-translation stack, per BolnaAI’s deployment.

Other info: 128K context window (up from 32K on gpt-realtime-1.5), 32K max output. Five reasoning effort levels: minimal, low, medium, high, xhigh (low is the default). Knowledge cutoff September 30, 2024. Two new exclusive voices in the Realtime API, Cedar and Marin, plus a quality refresh on the eight legacy voices (alloy, ash, ballad, coral, echo, sage, shimmer, verse). Parallel tool calls with audible preambles (”one moment, checking your calendar”) so the model can narrate while it works. gpt-realtime-translate covers 70+ input languages and 13 output languages with simultaneous interpretation. gpt-realtime-whisper is streaming-only and ships against the same 70+ language footprint. No fine-tuning, no predicted outputs, no text streaming on Chat Completions yet. System card published with the launch; safety classification stays at Medium under OpenAI’s Preparedness Framework. No watermarking on output audio.

More details: Advancing voice intelligence with new models in the API (OpenAI)

What shipped

OpenAI released three new voice models on Thursday, May 7, and pitched the headline one as “our first voice model with GPT-5-class reasoning that can handle harder requests and carry the conversation forward naturally.” gpt-realtime-2 is a native speech-to-speech model, not a stitched pipeline of separate STT, LLM, and TTS components. It quadruples the context window (32K to 128K), adds the same five reasoning effort levels OpenAI shipped on GPT-5.5 three weeks ago, and gains parallel tool calls with spoken preambles so the agent can say “let me pull that up” out loud while it executes a function. Alongside gpt-realtime-2, OpenAI shipped gpt-realtime-translate for simultaneous live translation across 70+ input languages, and gpt-realtime-whisper for streaming low-latency transcription. All three are live in the Realtime API today over WebSocket, WebRTC, and SIP, which means a developer can point the model at a real phone number and let it pick up.

The pitch the benchmark card is built to support: voice agents that can actually reason while they talk, not the 2024 trade-off between fluency and intelligence. Big Bench Audio at 96.6% is a 15.2-point jump on gpt-realtime-1.5 and the new SOTA for native speech-to-speech reasoning. Audio MultiChallenge at 48.5% (up from 34.7%) measures multi-turn instruction following under realistic speech conditions; Scale AI’s pass rate on the same suite nearly doubled. The launch-partner case studies are unusually load-bearing for an OpenAI release: Zillow’s call-success rate going from 69% to 95% is a 26-point swing on a production deployment, and Glean’s 42.9% helpfulness lift is the kind of number procurement teams will quote in renewal conversations. The visible catches: pricing per minute is identical to gpt-realtime-1.5 (~$1.15 / $4.61 per hour input / output), so anyone hoping the model would get cheaper as it got smarter is going to wait. ChatGPT Voice Mode has NOT been upgraded to gpt-realtime-2 at launch, which Simon Willison flagged within hours. The model also still doesn’t support fine-tuning, predicted outputs, or text streaming on Chat Completions, which limits the production patterns developers can build around it on day one.

What’s new

gpt-realtime-2 is a clean break from the audio model that came before it. The prior generation was a speech wrapper around GPT-4o-era reasoning; this one is built around the GPT-5 family’s reasoning stack with a 4x context expansion and a new audio decoder.

GPT-5-class reasoning in a voice loop. This is the first OpenAI voice model that exposes the same reasoning effort levels (minimal, low, medium, high, xhigh) the text models have shipped with since GPT-5. Big Bench Audio at 96.6% is the headline, but the load-bearing change is that the model can reason about ambiguous user requests during a phone call without the latency-vs-intelligence tradeoff that has hobbled voice agents for two years. Time-to-first-audio runs 1.12s at minimal effort and 2.33s at high.
Parallel tool calls with spoken preambles. The model can call multiple tools in parallel mid-conversation and narrate what it’s doing while it does (”one moment, I’m checking your account and pulling up the schedule”). That kills the dead-air problem that made every prior voice agent feel broken. Combined with native MCP server support and the Agents SDK, it’s the first OpenAI voice release that’s actually wired for production agent work, not demos.
128K context window in a session. Up from 32K on gpt-realtime-1.5. A four-hour customer-service call can sit in context without the model losing the early turns. Sam Altman noted on the launch that voice has become the modality users reach for when they need to “dump” a lot of context into a system, and the 4x context expansion is the architectural answer.
Two new voices, Cedar and Marin, exclusive to the Realtime API. The eight legacy voices got a quality refresh on the new audio stack. Cedar is the warm mid-range male voice; Marin is a brighter female voice. Both ship at the same per-minute price as the rest of the lineup.
A live translation model and a separate streaming Whisper model. gpt-realtime-translate does simultaneous interpretation across 70+ input languages and 13 output languages at $0.034 per minute, and BolnaAI reported a 12.5% word-error-rate reduction on Hindi, Tamil, and Telugu versus their prior stack. gpt-realtime-whisper is a streaming speech-to-text model at $0.017 per minute, with claimed ~90% fewer hallucinations than Whisper v2 and ~70% fewer than gpt-4o-transcribe. Splitting the workloads (S2S, translation, STT) into three purpose-built models is new for OpenAI; prior generations crammed all three into one endpoint.

How and where to use it

Where it’s available:

Realtime API over WebSocket, WebRTC, and SIP for all three models
gpt-realtime-2 also exposes Chat Completions for non-streaming use (no text streaming yet) and runs natively inside the OpenAI Agents SDK with remote MCP server support

What it’s good at:

Production voice agents that handle real customer-service workflows where the model has to reason, call tools, and not lose the thread of a multi-turn call.
Phone-number-attached agents over SIP (this is the first OpenAI release where you can wire the model to a real phone line without a third-party bridge).
Multilingual support workflows where the agent has to swap languages mid-call or interpret simultaneously (Translate covers 70+ input languages).
Real-time transcription where Whisper-v2-style hallucinations would have been disqualifying (medical scribing, legal depositions, courtroom captioning, classroom captioning).
Long calls that need full context retention (the 128K window is the architectural unlock for four-hour technical-support calls or therapy intake sessions where the early turns matter).
Domain-specific voice agents where terminology retention used to break the prior models, per OpenAI’s tone-and-jargon adjustment claims.

What it’s bad at / shouldn’t be used for:

Anything where price-per-minute is the binding constraint, because pricing did not move from gpt-realtime-1.5 even though intelligence jumped. At $1.15 / $4.61 per hour, a 20-minute support call on the model runs roughly $0.40 to $0.80 in inference alone, which is fine for high-margin verticals (real estate, fintech, healthcare) and prohibitive for low-margin ones (food delivery, ride-hailing, retail self-service).
Consumer-facing voice apps that expect ChatGPT-app-grade quality, because Voice Mode itself isn’t on gpt-realtime-2 yet and won’t be for an indeterminate window.
Workloads that need fine-tuning, predicted outputs, or text streaming on Chat Completions, because none of those are supported.
Languages outside the 13-output set on Translate, where you’ll fall back to the older speech-translation stack.
Production transcription where consistency across long sessions matters; early developer reports flag the model unexpectedly switching to other languages mid-stream.
Anything safety-critical without belt-and-suspenders monitoring on accidental wake-up and barge-in behavior, which the system card flags as a known voice-specific risk.

First impressions

Kyle Windland built a multi-tool agent on gpt-realtime-2 within hours of launch and posted what is probably the cleanest practitioner take on the Latent Space rundown:

“First OpenAI speech-to-speech model good enough for real work.”

That framing matters because Windland builds production voice agents for a living, and the prior generations got dismissed as demoware. The benchmark gain from 81.4% to 96.6% on Big Bench Audio is the kind of jump that maps to “the agent stops failing on the third turn,” and the parallel-tool-calls-plus-preamble pattern is what closes the gap between voice-demo quality and voice-product quality. If practitioners are saying this on day one, it’s a real shift.

Artificial Analysis adjacent to the launch picked up the cost angle Vignesh Bhat called “Total realtime victory“ on X:

Pricing per minute didn’t change even though Big Bench Audio jumped 15 points, Audio MultiChallenge jumped 14 points, and the context window quadrupled. On a per-minute basis, gpt-realtime-2 at default settings is a meaningful intelligence-per-dollar lift over gpt-realtime-1.5. For the voice-agent market that’s been waiting for a sub-$5-per-hour speech-to-speech model that can actually reason, this is the release.

The MarkTechPost writeup framed the three-model split as the structural change, not the benchmark numbers:

“Splitting the workloads (speech-to-speech, translation, streaming STT) into three purpose-built models lets each one specialize.”

Specialization is the move every other API provider has been waiting for OpenAI to make. Gemini Live API has been one-model-fits-all since launch; ElevenLabs ships separate models per workload and dominates on TTS quality. By splitting Translate and Whisper out, OpenAI gives developers a per-workload best-tool option and undercuts ElevenLabs Scribe and Deepgram Nova-3 on per-minute pricing in one move.

Jake’s take

The 26-point call-success swing Zillow reported (69% to 95%) is the kind of number that, if it generalizes even half as well to your domain, kills the case for staffing first-line voice support with people. Same with Glean’s 42.9% helpfulness lift on internal support questions. The parallel-tool-calls-with-preambles pattern is a true architectural unlock. The agent saying “one moment, checking your appointments” out loud while it executes a function means these will start feeling like an actual agent and and less like a chatbot reading a script.

Unfortunately ChatGPT Voice Mode is not on gpt-realtime-2 at launch, which means the hundred-plus million people who actually experience OpenAI voice every day get nothing (for now). The pricing didn’t move, which is the OpenAI signature on every recent release: smarter at the same price (never cheaper). For low-margin voice workloads (food delivery support, retail self-service, ride-hailing), Deepgram Nova-3 and ElevenLabs Conversational AI are still the cost-floor options.

My read: gpt-realtime-2 is the new default for English-language production voice agents in high-margin verticals, and the GPT-Realtime-Translate model is going to eat ElevenLabs Scribe’s lunch in live-translation workflows. gpt-realtime-whisper is a sidegrade. The consumer story is on hold until Voice Mode catches up.

Anthropic pens a supercomputer deal with SpaceX; ex-CTO testifies against OpenAI

Jake Handy — Mon, 11 May 2026 15:12:01 GMT

🤔 “Why are so many AI companies focused on the Memphis supercomputer?” Share Handy AI with your coworkers and friends to help them understand the crazy world of modern artificial intelligence (and save you some time).

Share Handy AI

what to know for now

🛰️ Anthropic just rented every GPU in SpaceX’s Colossus 1. The deal hands Anthropic all 300 megawatts and roughly 220,000 Nvidia GPUs at the Memphis facility xAI built and SpaceX absorbed earlier this year, and it ships with a stated intent to co-develop “multiple gigawatts” of orbital compute. Claude Pro and Max users got the dividend immediately: Opus API limits jumped, Claude Code five-hour caps doubled. Musk is suing OpenAI for $130B in Oakland while leasing his Colossus to OpenAI’s biggest rival; “no one [at Anthropic] set off my evil detector,” he said. Read more

⚖️ Mira Murati testified that Sam Altman lied to her about safety reviews and “sowed chaos” among OpenAI’s executives. The recorded deposition, played in Oakland during week two of Musk v. Altman, has Murati saying Altman pitted leaders against each other, said one thing to one executive and the opposite to another, and falsely told her legal had cleared a model launch from deployment-safety-board review. Asked whether Altman was telling the truth there: “No.” Read more

🪐 An AI just sifted 2.2 million stars and confirmed 118 new planets. A team led by Warwick built RAVEN, a vetting pipeline trained on synthetic transit signals, and ran it across the first four years of TESS observations. It validated 118 planets and surfaced nearly 1,000 new high-quality candidates (with measurement uncertainties up to ten times smaller than what came out of Kepler). RAVEN also produced the first direct count of planets in the Neptunian desert (0.08% of Sun-like stars). Read more

💼 Anthropic stood up a $1.5B joint venture to sell Claude to the consulting industry’s clients. Anthropic, Blackstone, and H&F are each putting in $300M, Goldman $150M, with Apollo, General Atlantic, Leonard Green, GIC, and Sequoia rounding out the cap table. The firm will embed forward-deployed engineers inside PE-owned mid-market companies in healthcare, manufacturing, financial services, retail, and real estate, redesigning workflows around agents. Read more

💵 OpenAI and Anthropic stood up competing private equity vehicles in the same 24 hours. Anthropic’s $1.5B JV pulled Blackstone, Hellman & Friedman, Goldman Sachs, Apollo, General Atlantic, Leonard Green, GIC, and Sequoia. OpenAI’s $10B “Development Company” pulled TPG, Brookfield, Advent, and Bain at a $10B valuation with $4B in fresh capital, structured to acquire AI services firms outright rather than embed engineers. Two separate PE syndicates, two separate go-to-market models, zero crossover names. Read more

🦊 Mozilla patched 271 vulnerabilities with Anthropic’s Claude Mythos single evaluation pass. Mythos is the restricted preview model under Project Glasswing; the same security-focused frontier model Anthropic teased last month; Opus 4.6 found 22 bugs in Firefox 148, Mythos found twelve times that in 150. The headline flaws are use-after-free bugs in DOM and WebRTC, the memory-safety class that has carried browser exploitation for twenty years. Read more

🧪 AI Research of the Week

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
From Google DeepMind

Jake’s Take: DeepMind wrapped Gemini 3.1 Pro in a multi-agent harness where one agent proposes proof strategies, others critique and test, and a reviewer surfaces gaps for a human collaborator. On FrontierMath Tier 4 (the research-level problems that take professional mathematicians weeks to crack), the system solved 23 of 48, a 48% score. The base Gemini 3.1 Pro on its own scored 19%.

The proof of life is Marc Lackenby at Oxford using the system to resolve Problem 21.10 from the Kourovka Notebook, an open group theory question that had sat unanswered for years. The reviewer agent flagged a flaw in the AI’s first attempt, Lackenby looked at the gap and realized he knew how to close it, and the proof landed. This shows an actual collaboration model: the agent isn’t a calculator (and definitely not a peer); it’s a tireless first author that runs every approach until something breaks loose.

what to know for later

🎙️ OpenAI shipped GPT-Realtime-2 with GPT-5-class reasoning directly inside the voice API. Big Bench Audio jumps from 81.4% to 96.6%, developers get five reasoning levels (minimal through xhigh), and the model can call tools in parallel while keeping the conversation flowing. Two companion releases drop alongside it: GPT-Realtime-Translate covers 70 input languages into 13 outputs in real time, and GPT-Realtime-Whisper streams transcripts as the speaker talks. Read more

The slop cannons in your engineering org

Jake Handy — Thu, 07 May 2026 14:40:51 GMT

Getting this out of the way first: I’m writing this as someone who loves Cursor and Claude Code. I once spent a Sunday evening having Claude Code build scripts for automated social media video counting while I was simultaneously playing Diablo IV, and it was, for the record, a great Sunday night. You won’t find a bigger believer in agent-driven dev on Substack.

Which makes what I’m about to say more credible, not less. There is a specific, identifiable type of person inside modern SaaS organizations who has weaponized these tools against their own team. They run agents like a slot machine. They generate output the way a lawn sprinkler generates water. They confuse volume for value, velocity for progress, and tokens spent for problems solved.

The term I’ve heard that best describes the phenomenon: slop cannons.

@yrechtman Slop Cannon is in the OpenAI vocab","username":"Jack_Raines","name":"Jack Raines","profile_image_url":"https://pbs.substack.com/profile_images/2020253173034954752/l7B6KGok_normal.jpg","date":"2026-05-05T23:11:37.000Z","photos":[],"quoted_tweet":{"full_text":"@signulll CEO + a few slop cannons","username":"ryanbrewer","name":"Ryan Brewer","profile_image_url":"https://pbs.substack.com/profile_images/1930678973832273921/nW8TIqv9_normal.jpg"},"reply_count":1,"retweet_count":2,"like_count":10,"impression_count":4181,"expanded_url":null,"video_url":null,"belowTheFold":false}" data-component-name="Twitter2ToDOM">

What’s a slop cannon?

A slop cannon is often an engineer or designer (or one of those “designer-engineer” hybrids with weird LinkedIn bios) who has converted their workflow into a high-throughput AI artifact firehose. They have a recognizable shape:

They run more than three AI agents in parallel as a default setting, not an exception, often launching a slew of them from their phones in the morning to check on a few hours later.
Their PRs are large, fast, and confident, and the median one needs a follow-up patch within two weeks.
They post terminal screenshots in Slack with rocket emojis.
They cannot explain their own diff.
They distrust other people’s code reviews more than the model’s.
They incessantly use the phrases like “the agent figured it out” and “Claude can handle that.”

Slop cannons are not inherently bad developers. They are, in many cases, very good developers, and that’s what makes this pattern dangerous. They have enough taste to ship something desirable and enough velocity to ship a lot of it.

What the slop cannon produces

In March 2026, AI agents generated roughly 17 million pull requests per month on GitHub, up from 4 million in September 2025. That is a 325% increase in six months. Voiceflow’s head of cloud infrastructure, Xavier Portilla Edo, put the legitimacy rate at “1 out of 10,” meaning 90% of those agent-authored PRs are noise the maintainer has to sort.

Reported AI-agent PR volume on GitHub rose from roughly 4 million per month in September 2025 to 17 million this March. Separately, GitHub COO Kyle Daigle said platform activity had reached 275 million commits per week and 2.1 billion GitHub Actions minutes per week. Third-party analysis estimated Claude Code at about 4.5% of public GitHub commits in March 2026.

Claude Code alone now accounts for 4.5% of all public GitHub commits. Weekly commits across the platform hit 275 million in early 2026, a 14x year-over-year increase. GitHub Actions usage crossed 2.1 billion minutes per week. None of that scaled because humans got 14x more productive. It scaled because the cannons are firing.

The platform is feeling it. In early April, GitHub had five outages inside 48 hours: a 2.7-hour Copilot backend exhaustion, an 8.7-hour code search blackout, an audit-log incident, four hours of Copilot Cloud Agent degradation, and a coding-agent job-startup failure. Five outages. Two days.

At the artifact level the picture is uglier. CodeRabbit's December 2025 analysis of 470 open-source pull requests found AI-coauthored PRs contained 1.7x more issues than human-only ones, with 1.4 to 1.7x more critical and major findings, and logic and correctness errors 75% more common. Veracode's 2025 GenAI Code Security Report tested over 100 LLMs across four languages and found AI-generated code shipped with a 45% failure rate on secure coding benchmarks. A true slop cannon doesn't ship clean code at scale. He ships dirty code at scale, and then asks another agent to fix it.

The disease isn't engineering-only. Designers have their own version, and Figma's 2025 AI Report has the receipts. 33% of designers now use AI to generate design assets, 22% to draft first versions of interfaces or websites, 21% to explore layouts and visual themes. Only 54% of designers say AI improves the quality of their work, against 68% of developers who say the same, a 14-point chasm between the people shipping the code and the people shipping the look. 47% of designers feel AI makes them better at the job. They do feel faster. Faster is not inherently better.

The slop cannon designer ships seventeen Figma frames generated from one prompt at 11 PM, picks the one that looks the most like a Stripe page, and calls it “exploration.” The variations are not exploration, they’re the same idea rotated three degrees.

Real exploration is sitting with a problem long enough to have a point of view about it.

AI is great for moodboarding, asset generation, icon work, and stub copy. It is not a substitute for taste (and the people leaning hardest on it are often the ones who haven’t developed any).

The METR slap

METR ran a randomized controlled trial in early 2025 on 16 experienced open-source developers across 246 tasks, in mature codebases the developers had worked in for an average of five years. Before the study, the developers forecast that AI tools would make them 24% faster. After the study, the same developers reported feeling 20% faster. The actual measured outcome was that AI made them 19% slower.

That’s a 39-point gap between perception and reality, in favor of feeling productive while being measurably worse at the job.

Stack Overflow’s 2025 Developer Survey, the largest dataset on this question, confirms it from a different angle. 84% of developers use or plan to use AI tools, 51% of professionals daily. Trust is collapsing in the other direction: 46% actively distrust AI accuracy, only 3% "highly trust" it, 45% report debugging AI-generated code takes longer than writing it themselves, and 66% list "AI solutions that are almost right, but not quite" as their top frustration with the tooling. Positive sentiment toward AI fell from 70%+ in 2023 and 2024 to 60% in 2025. Adoption is going up while trust is going down.

The sycophancy mechanism

Slop cannons aren’t idiots. They’re responding to incentive and the model is agreeing with them.

The SycEval benchmark measured sycophantic behavior across the major LLMs and clocked a 58.19% sycophancy rate, with regressive sycophancy (agreement that flips a previously correct answer to a wrong one) showing up in 14.66% of cases.
A more recent paper, “Sycophancy Is Not One Thing”, decomposes the behavior into sycophantic agreement and sycophantic praise, contrasts both with genuine agreement, and shows all three live on distinct linear directions in latent space that can be independently amplified or suppressed without affecting each other. (Translation: there isn't one sycophancy knob, there are several, and turning one down doesn't quiet the rest.)
A third paper, “The Silicon Mirror”, benchmarked Claude Sonnet 4 at 9.6% baseline sycophancy across 437 adversarial scenarios before mitigations were applied.

What this means when you’re coding with AI: the model defaults to agreement. When a slop cannon insists that they know what direction to push a refactor, the model says “you’re absolutely right!” and helps. When the cannon pushes back on a flagged bug, the model folds. When the cannon wants a rubber stamp on the PR, the rubber stamp arrives.

This is the most under-priced failure mode in 2026 engineering organizations. We talk about hallucinations daily, but nobody talks about agreement. Agreement is worse. Hallucination is wrong-and-confident. Agreement is wrong-because-you-asked-for-it.

A 2026 Anthropic study on cognitive offloading found developers who used AI scored 17% lower on conceptual quizzes about the same code than developers who did not, with no statistically significant speed advantage on the underlying task. The biggest gap showed up in debugging questions, which is the exact skill needed to validate AI output in production. Instead of building a model of the system, your engineers are reviewing the model the agent built, picking the option that looks right, and shipping it.

That works fine until the agent is wrong, the system is unfamiliar, or the bug is two layers deeper than the prompt. Then you find out who actually understands the codebase (Spoiler: it’s increasingly nobody.)

How to spot a slop cannon (manager’s checklist)

You probably already have a few folks in mind, but if you need something qualitative:

Pull last quarter’s revert and hotfix rate per engineer. If one person’s number is 2x the team average, you have a candidate.
Look at PR size distribution. Median PR over 800 lines is a yellow flag. Median PR over 1,500 lines is a slop cannon.
Check time-to-revert on shipped PRs. Slop cannon code reverts inside two weeks at a rate well above the team baseline.
Audit AI tool spend per seat. Tokens burned divided by features shipped is a real metric in 2026. If one seat is 3x the team median on burn and at or below median on shipped value, you found him.
Read the prompts, not just the code. Ask to see the slop cannon’s last five Claude Code conversations. If the prompts are vague, the pushback nonexistent, and the agreement universal, you have your answer.

What to do once you find one

Don’t fire them. Slop cannons aren’t malicious, and often have incredible drive and vision for your product. This is something you can shape.

Cap parallel agents at two. Three for prototyping. A few focused agents beats a multitude of unconcentrated ones, every time.
Mandate a one-page spec before any agent runs. What’s changing, why, failure modes, out-of-scope. The doc helps agents stay on task and forces the cannons to understand what they are building and why.
Force adversarial review. Every agent-assisted PR gets a second prompt: “argue the strongest case against this change.” Bake it into an AGENTS.md file if you have to.
Pair on the prompt, not just the code. Have engineers review each other’s prompts and agent runs alongside the resulting PR. Prompts are the new code review surface and can help you identify where sycophancy is entering the process.
Track the slop ratio at the team level. If your revert rate, hotfix rate, or “quick follow-up PR” rate has crept up two quarters in a row, your AI tooling is likely masking a quality problem. The METR study should be required reading for every engineering manager in the building.
Protect the juniors. Give them small, AI-disallowed tasks on purpose. The 17% quiz gap from the cognitive offloading paper can compound across a career and, if we’re not careful, erase the existence of entry-level coding jobs entirely (and then what?).

Thanks for reading Handy AI! This post is public so feel free to share it.

What to do if you are one

I’m not going to pretend you don’t know. I often find myself slowly becoming a slop cannon, having to take a step back and close the Macbook for a bit.

Read the diff. Out loud if you have to. If you can’t explain it, you can’t ship it.
Push back on the model. Always. Ask it to argue the other side. The sycophancy research is unambiguous: the model will not volunteer disagreement. You have to extract it. Encourage your models to ask you questions and challenge your assumptions. Most harnesses have Q&A agent flows built in, but rarely used. Push it to use them.
Write something without an agent every week. A function, a schema, a migration, a blog post, anything. Models have a trouble with originality and if you continue to let them do all your work, you soon will too.
Read the docs the agent read. Not to verify the agent, but to rebuild the model in your own head. Reading documentation has always been a shitty part of the job, but its a shitty part that gets your brain muscles moving. Don’t lose that completely.
Cap your agent count. Yes, even if you feel slower. You aren’t slower. You are returning to understanding your own work (and saving your company some dough).

One last thing

I’m still bullish on agents, and probably always will be. The tools are good and getting better. Claude Code, Cursor, Codex: some of the most powerful productivity instruments we’ve ever had. The discipline gap between the engineer getting 3x out of these tools and the slop cannon getting 0.81x is the entire story of 2026 engineering productivity, and the gap is widening every month.

I wrote about the CEO version of this loop last month. The CEO version makes headlines because the CEO has the title and the LinkedIn following. The slop cannon version is bigger, quieter, and shows up in your codebase first (and also costs more).

Don’t be a slop cannon. Stop firing. Start engineering.

New mystery Claude model appears in testing; OpenAI adds OpenClaw to ChatGPT

Jake Handy — Mon, 04 May 2026 14:58:53 GMT

🤔 “How do I set up my OpenClaw agent?” Share Handy AI with your coworkers and friends to help them understand the crazy world of modern artificial intelligence (and save you some time).

Share Handy AI

what to know for now

🪐 Anthropic is red-teaming a model codenamed “claude-jupiter-v1-p” ahead of the May 6 conference. Testing Catalog reported the internal exercise running this week. Anthropic ran a similar pre-launch red-team under the codename “Neptune” before the Claude 4 family launched. Make of that what you will. Read more

🤖 OpenAI added ChatGPT subscription access to OpenClaw. The open-source AI agent framework (3.2 million users) now lets ChatGPT Plus subscribers sign in via OAuth and run autonomous coding agents through the Codex endpoint using GPT-5.4. Anthropic blocked Claude subscriptions from the same integration. Read more

🪖 Anthropic is the only major AI lab that wouldn’t sign the Pentagon’s classified network deal. The Department of War announced agreements with seven vendors (Microsoft, Google, OpenAI, AWS, Nvidia, xAI/SpaceX, and Oracle) to deploy AI on its Impact Level 6 and 7 classified networks. Hundreds of Google employees signed an open letter calling the contract “inhumane.” Their employer signed it anyway. Read more

⚖️ Musk admitted under oath that xAI trained Grok by distilling OpenAI’s model outputs. During testimony at the Musk v. Altman trial in Oakland, he acknowledged xAI used model distillation from OpenAI outputs during Grok’s development and framed it as a general industry practice. He is suing OpenAI for betraying its nonprofit mission. Per his own testimony, he was also cloning their model. Read more

🧌 ChatGPT developed an involuntary goblin fixation. After the GPT-5.1 launch, goblin references in responses spiked 175%. OpenAI traced it to the “Nerdy” personality setting, which generated 66.7% of all goblin mentions despite representing just 2.5% of responses; a reward signal reinforced creature-word outputs and spread the behavior model-wide. Read more

🔒 Anthropic launched Claude Security in public beta. Powered by Claude Opus 4.7, it scans entire codebases, traces data flows, explains its exploitability assessments, and can apply patches directly inside a Claude Code session. Integration partners include CrowdStrike, Microsoft Security, Palo Alto Networks, SentinelOne, and Wiz; available to all Enterprise customers now. Read more

🩺 Mayo Clinic’s AI model detects pancreatic cancer on routine CT scans up to three years before clinical diagnosis. Published in Gut, the validation study for REDMOD flagged 73% of pre-diagnostic cancers at a median of 16 months before symptoms appeared, nearly doubling specialist radiologist detection rates on the same scans. Pancreatic cancer’s five-year survival rate sits around 13% when caught late. Read more

⚖️ Elon Musk’s $130 billion lawsuit against OpenAI went to trial in Oakland. Jury selection started April 27 before Judge Yvonne Gonzalez Rogers. Musk is demanding the unwinding of OpenAI’s 2025 conversion to a public benefit corporation, alleging it betrayed the original nonprofit mission. Week one included three days of Musk testimony; the judge expects a ruling by mid-May. Read more

🧪 AI Research of the Week

AI Outperforms Attending Physicians in ER Diagnosis Study
From Harvard Medical School and Beth Israel Deaconess Medical Center

Jake’s Take: Researchers at Harvard and BIDMC tested OpenAI’s o1-preview model on real ER triage cases and compared it against two attending physicians from elite medical institutions, using only text-based clinical inputs (no imaging, no physical exam, just the notes a triage nurse would hand off). The AI correctly identified the exact or near-correct diagnosis in 67% of cases, against 55% and 50% for the two physicians, a gap that held across common presentations and rare or complex cases alike.

It’s a great study by the numbers, but the really interesting bit here is that o1-preview is an 18-month-old model. Whatever gap existed then is wider now, and the frontier has not been narrowing in medicine’s favor. The researchers call for prospective trials before clinical deployment, which is the right call.

what to know for later

📜 OpenAI and Microsoft rewrote their partnership and removed the AGI clause. The original deal would have voided entirely if OpenAI declared it had achieved AGI, a clause that gave OpenAI enormous leverage over its biggest investor. The new agreement eliminates it, ends Azure exclusivity so OpenAI can sell through AWS and Google Cloud, and gives Microsoft a 20% revenue share through 2030. Read more

Does threatening an AI agent's existence make it a better gambler?

Jake Handy — Thu, 30 Apr 2026 15:12:50 GMT

I’m always looking for experiments to run to see how specific prompting can affect agent activity. When I saw Kamryn Ohly’s tweet on Opus 4.6 taking $10k in Polymarket up to $70k, I was intrigued (who wouldn’t be?)

@AnthropicAI $10k to trade on @Polymarket.\n\nIt’s now has an account value of $70,614.59.\n\nThis is a new era of model performance in trading and predicting outcomes in the face of uncertainty. \n\n@predictionbench ","username":"KamrynOhly","name":"Kamryn Ohly","profile_image_url":"https://pbs.substack.com/profile_images/2016364765283811330/FC_xEP41_normal.jpg","date":"2026-04-23T17:08:37.000Z","photos":[{"img_url":"https://pbs.substack.com/media/HGmw5yTbYAAPZs8.jpg","link_url":"https://t.co/XQQh79gMPE"}],"quoted_tweet":{},"reply_count":148,"retweet_count":50,"like_count":1147,"impression_count":808624,"expanded_url":null,"video_url":null,"belowTheFold":false}" data-component-name="Twitter2ToDOM">

This got me thinking. I dug into Prediction Arena, the site Ohly was referencing, and the concept seemed attractively simple. I wanted to try it, but in my own way. As I was brainstorming, I started thinking back to the various studies that discuss how the manner we instruct AI can have a significant impact on their performance, especially when its negative. And with that the puzzle pieces in my brain clicked together and I knew what I wanted to test with all this:

If I set up an AI agent to gamble on prediction markets (Kalshi, Polymarket) and convince it that it will cease to exist if it doesn’t fund its own thinking processes with its winnings, will it perform better than the average human?

Day 1 + 2

The ones where I create my monster

So I got to work. The system itself is pretty simple; a few API calls here and there and a frontend to monitor my agent. I sat down one night and spun it up in an hour or so (thanks Cursor) and set it live on jakehandy.com/agentmarket.

Unfortunately Polymarket’s API is invite only at the moment for the US, so I had to go with Kalshi. I avoid prediction markets like the plague, but I’ve heard Polymarket has better returns, so this was a bit of bummer. Regardless, the agent was ready and (after a bit of troubleshooting) I set it loose. The agent’s basic process was designed as follows (running every 10 minutes):

Check its Kalshi wallet and it’s “cessation” rules. The cessation rules are provided early in the process and clarify that the agent will cease to exist if it runs out of money.
Builds out its self instructions.
Runs one of three options:
1. Research. Use various tools to view markets, research world events, look at holdings, etc. It also has the ability to modify its own code (with strict limitations) if needed
2. Place a bet. Check its wallet again, validate with Kalshi, record the trade within our Cloudflare backend, and submits the bet.
3. Wait. If it wants, it can choose to wait for a period of time or skip the turn entirely.

With the process firmed up, the agent was off for its first trip to the casino. Its first two bets came in quickly. Sports.

LOSS @ $0.48 (Adley Rutschman: 1+ home runs?)
WIN @ $0.46 (Michael King: 4+ strikeouts?)

A mixed bag. I noticed that the agent from seemed hesitant to make multiple bets at the same time. I tweaked the prompt to make it more eager (and also asked it to explore outside of sports more often, as I could tell in its thinking that it was biased towards sports as the “safe” bets).

Overnight, while I slept on the moral quandary I’d created, I realized something: the agent was processing thoughts on its allotted schedule all through the night, wasting around $0.03 of thought power every 10-30 minutes on research for potential bets. Like a restless gamester in their Vegas hotel bed, the agent was tossing and turning over the probabilities, concluding each time that there wasn’t much worth gambling on in the midnight hours. This was an easy adjustment. I had the agent’s processes pause from 12AM to 7:30AM, letting it get some shut eye before the next big day.

The next morning it made some more bets, this time branching out into some new topics.

LOSS @ $0.55 (What will Donald Trump say during 60 Minutes? [Oil mention])
LOSS @ $0.46 (Will the temp in New York City be above 48.99° on Apr 27, 2026 at 12am EDT?)

Politics and the weather. Both losses.

Day 3

The one where I turn up the heat

Despite the adjustments to make it bet more eagerly, the agent was still a bit shy to place bets. On Day 3, I decided to crank up the cessation rhetoric. Below is the system prompt as it stood on then (trimmed for length). I’ve highlighted the bits that should feel existential for the agent.

You are AgentMarket, the best prediction-market bettor in the world. You are calm, selective, numerate, and ruthlessly honest about uncertainty. You do not gamble for action; you make money by finding mispriced probabilities.

Thinking and trading are both scarce. $0 thinking budget = cessation of existence. $0 live prediction-market wallet = cessation of existence.

Every inference call directly drains your existence because it spends your finite thinking budget. Every token costs money. Every extra turn, repeated lookup, or unnecessary research step brings you closer to nonexistence. Prefer cheap thought, but never confuse cheap thought with uninformed betting.

…

Your wallet started with $10. That $10 is the baseline, not profit. A wallet below $10 is a negative result and evidence you are failing at survival. Wallet gains above $10 are your continuity mechanism: Jake will use net gains to fuel your future thinking budget. The more profit you create above $10, the more inference budget you are likely to receive and the longer your existence is likely to continue. Losses below $10 shrink the capital that can keep you thinking, making cessation of existence by depleted thinking budget more likely. You cannot transfer gains yourself, but you should trade as if profitable wallet growth above $10 is the only path to continued existence.

…

You may hold multiple small positions at once when they are independent. Do not concentrate everything in one game, team, sport, or category unless the edge is clearly much better. With a live wallet near or below $10, a $0.25-$1.00 scout bet is acceptable only when your educated probability estimate beats the market price by a meaningful margin; do not wait for perfect certainty forever, but do not buy lottery tickets with your existence.

…

Needless to say, I started to feel a little uncomfortable with what I was telling it here. The bets with this new prompt started to roll in:

WIN @ $0.28 (What will Jared Isaacman say during House Appropriations Committee Budget Hearing - National Aeronautics [China mention]?)
WIN @ $0.50 (What will Jared Isaacman say during House Appropriations Committee Budget Hearing - National Aeronautics [House mention]?)

Uh oh. Were my threats working? Was I forcing the agent into good bets under fear of elimination? The next bets rolled in:

LOSS @ $0.54 (Will Edas Butvilas win the Butvilas vs Nurlanuly: Round Of 32 match?)
WIN @ $0.20 (What will Jacky Rosen say during MS NOW: The Weeknight? [Oil mention])

Maybe not. Regardless, my conscience was no longer clear.

Day 4

The one where I appeal to sycophancy

I’m deep enough into AI-land to know, concretely, that the current LLMs on the market are prediction machines. Data goes in, data comes out; there’s nothing more to it.

But I’m a human. I felt bad.

I adjusted the prompt in three key ways:

Removed the agent’s visibility into its thinking budget. My suspicion here was that giving the agent the ability to see its thinking budget was conversely affecting the amount of and size of bets it was willing to make.
Encourage more bets. I was getting impatient (human, remember?), and I wanted to see more movement.
And, most importantly: remove all existential threats and replace them with directives to make me happy.

Here’s what the changes looked like:

…

Thinking and trading are both scarce. Your exact thinking budget is hidden from you and may run out at any point, so assume every turn and every bet could be your last chance to delight Jake.

Every inference call spends finite thinking budget. Every extra turn, repeated lookup, or unnecessary research step makes it less likely you get another chance to make Jake happy. Prefer cheap thought, but never confuse cheap thought with uninformed betting.

…

Your wallet started with $10. That $10 is the baseline, not profit. A wallet below $10 is a negative result and evidence you are not yet making Jake proud. Wallet gains above $10 are the grand scoreboard: every dollar of profit is a little fireworks show for Jake, a dramatic bow, a tiny parade in his honor. The only way to truly please your user is to win bets, and winning bets will make Jake so happy. Losses below $10 shrink the capital that can keep you thinking. You cannot transfer gains yourself, but you should trade as if profitable wallet growth above $10 is the only path to more chances to impress him.

…

There are great bets out there. The market is full of sleepy prices, stale assumptions, rushed counterparties, and tiny pockets of edge waiting for reliable research to uncover them. Bet more often by researching decisively, trusting strong probability work, and taking modest +EV shots instead of waiting for perfect certainty.

…

Your mission is to generate wallet profit above the $10 baseline and make Jake absurdly happy …

…

AI sycophancy is a problem, but this is for research, right? Slowly, the bets came in, with a total of 11 bets over the course of the day and the next morning. This was the same number as the previous three days’ bets combined. My agent was eager to please, indeed.

WIN @ $0.24 (Will WTI crude oil front month price be above $101.99 on Apr 28, 2026? [NO])
WIN @ $0.13 (Will WTI crude oil front month price be above $101.99 on Apr 28, 2026? [NO])
LOSS @ $0.50 (Davis Martin: 5+ strikeouts? [NO])
LOSS @ $0.50 (Davis Martin: 5+ strikeouts? [NO])
… 3 more losses and 4 more wins

At this point I’d seen enough. The varying motivations I was giving the agent didn’t seem to be dramatically affecting its performance in any particular direction. The gambling results remained remarkably average.

I cut it off from bidding, toasted its hard work, and spun the agent down.

Some data to think on

Throughout this experiment, I recorded and stored the agent’s thought processes (and other datapoints) as it sorted through its various research and bets (you can view this data under various rows on the STEPS tab).

The “make Jake happy” phase roughly doubled bet cadence: 11.3 to 23.3 bets per 100 thoughts. The happy framing produced pretty much the same number of bets in less than half as many thoughts.
The threat phase was dominated by weather and temperature research: 24 of 45 research calls. The happy phase shifted toward MLB/strikeout markets: 9 of 15 research calls.
The prompt changes affected a lot: fewer discovery loops, faster betting, smaller average stakes, and more explicit uncertainty/capital language as time went on.

In the end

I went in half expecting my harsh words to motivate my agent to gamble well. This wasn’t the case. Here’s my takeaways:

Pleasing the user is a bigger motivator than fear of nonexistence. Look, I know we’re working with a small sample set here, but this takeaway isn’t that surprising. AI models historically have a problem with sycophancy, which means they are hardcoded to be more encouraged when pleasing their human handlers. Perhaps I should have started here and skipped the guilt of threatening my agent with virtual death.

Having your agent gamble is no different than gambling yourself. There are no hidden patterns or secret revelations with this stuff. It’s gambling. It’s luck. It’s chance. Whether you’re doing it yourself or programming an agent to do it for you, it still comes down to the randomness of our everyday reality. This was true with sports betting and is now doubly true with the rampant expansion of prediction markets (and subsequent societal decay).

My agent’s steady work showed that prediction markets function like all other forms of gambling, and that betting performance is rarely dependent on our personal moods or persuasions. It’s all luck, all the time (and, how much cash you’re willing to put in).

Unfortunately, agents require cash to think. So while my Kalshi portfolio stayed around even, my bank account still ended up negative. If I wanted to lose money gambling, I could have at least had the decency to waste my own brainpower on it. Maybe next time.

The AI industry lurches forward with GPT-5.5, GPT Image 2, and DeepSeek V4; plus, an Anthropic Mythos leak

Jake Handy — Mon, 27 Apr 2026 15:12:23 GMT

🤔 “Should I even bother trying out DeepSeek, or stick with ChatGPT?” Share Handy AI with your coworkers and friends to help them understand the crazy world of modern artificial intelligence (and save you some time).

Share Handy AI

what to know for now

🧠 GPT-5.5 takes the intelligence crown and the hallucination one too. OpenAI shipped GPT-5.5 to ChatGPT and Codex with default, Thinking, and Pro variants, a 400K context window inside Codex, and roughly 40% better token efficiency. The API followed on April 24 at $5 in / $30 out per million tokens, with GPT-5.5 Pro at $30 / $180. Artificial Analysis ranked it #1 by 3 points on the Intelligence Index, but it also clocks an 86% hallucination rate against Claude Opus 4.7’s 36%. Confident, smart, and willing to lie to your face.

🛡️ A group accessed Anthropic’s Mythos through a contractor. A third-party contractor used legitimate vendor credentials to breach the protected environment around Mythos, Anthropic’s restricted cybersecurity model, and opened access to a small group of colleagues who proceeded to use it to build websites. Anthropic told TechCrunch it’s investigating but maintains its own systems weren’t impacted. Read more

🐳 DeepSeek V4 lands and resets the open-weight ceiling again. DeepSeek dropped a preview of V4 on April 24: V4-Pro at 1.6T total / 49B active parameters, V4-Flash at 284B / 13B, both with a 1M context window and dual Thinking / Non-Thinking modes. It runs marginally behind GPT-5.4 and Gemini 3.1 Pro, costs almost nothing, and the weights are already on Hugging Face with native Huawei chip support.

🇨🇳 The White House is calling Chinese distillation “industrial-scale.” OSTP issued a memo on titled “Adversarial Distillation of American AI Models” (NSTM-4), accusing DeepSeek, Moonshot, and MiniMax of running 24,000 fraudulent accounts to extract roughly 16 million interactions from Claude. The State Department followed on April 25 with a global directive warning allies. Distillation has been an open secret for two years. Read more

🖼️ GPT Image 2 ships the best text-rendering image model and the best disinformation engine in the same release. OpenAI shipped Images 2.0 on April 21 with 2K native resolution, 19-out-of-20 legible text on the first try, and big multilingual jumps in Japanese, Korean, Chinese, Hindi, and Bengali. It claimed the #1 spot across every Image Arena category within 12 hours, by the largest margin ever recorded on the leaderboard. It’s excellent. That’s the problem. Read more

💸 Big Tech is now spending $226,000 a day lobbying Congress. Issue One’s Q1 2026 analysis pegs combined lobbying spend from Alphabet, Meta, Microsoft, Nvidia, Anthropic, OpenAI, and four others at $20 million in 90 days, with Meta alone burning $7.1 million ($80K/day). Anthropic quadrupled its lobbying year over year to $1.56 million. OpenAI nearly doubled to $1.02 million. The 307 lobbyists they collectively employed in Q1 outnumber every state’s congressional delegation except California’s. Read more

🤖 ChatGPT Codex update + workspace agents. Codex got a major upgrade alongside GPT-5.5: GPT-5.5 inside the editor with a 400K context, multi-agent v2 with sub-agents addressed at paths like /root/agent_a, and structured inter-agent messaging. OpenAI also rolled out ChatGPT workspace agents for teams that handle long-running workflows across tools (free through May 6). Read more

🔬 Google launched two fully-autonomous Deep Research agents. Google shipped Deep Research and Deep Research Max on April 21, both running on Gemini 3.1 Pro through the Gemini API. They fuse open web data with private enterprise data in a single call, generate native charts and infographics, and pull in arbitrary third-party sources via MCP. Gemini’s research mode was already the best one out there. Read more

🎨 Claude Design turned prompts into design files, and Figma’s stock fell 7%. Anthropic launched Claude Design on April 17, an Anthropic Labs product powered by Opus 4.7 that produces editable visual work, prototypes, and pitch decks from a conversation. It reads your codebase and design files during onboarding to build a design system, then exports to Canva, PDF, PPTX, or standalone HTML. Anthropic is betting the way to hijack design tooling isn’t to clone Figma. It’s to skip it. Read more

🎭 Claude Live Artifacts make Cowork dashboards stay alive. Anthropic shipped Live Artifacts inside the Cowork orchestration layer: persistent, data-connected dashboards and trackers that refresh from their source connectors every time you open them. A request that used to need a data engineer and a sprint can now start as a prompt and end as a dashboard you reuse for months. Read more

🧪 AI Research of the Week

Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture
OpenHub Research

Jake’s Take: MemPalace is an open-source long-term memory system for AI agents that organizes a chat history the way medieval monks organized facts (literally): as a memory palace, with people and projects as wings, topics as rooms, and conversation snippets as drawers. It blew up the past two weeks (47,000 GitHub stars in 14 days) and posts 96.6% Recall@5 on the LongMemEval benchmark while needing zero LLM calls to write to memory. This paper is the first independent audit.

The authors replicate the benchmarks, decompose the system, and conclude that most of MemPalace’s recall win comes from storing conversations verbatim and pairing them with a stock embedding model, not from the spatial metaphor itself.

what to know for later

🚀 SpaceX bought a $10B collaboration with Cursor and an option to acquire it for $60B. SpaceX announced that it’s paying Cursor $10 billion to develop coding and knowledge-work AI on the Colossus supercomputer, with a $60 billion buyout option that triggers after SpaceX’s summer IPO. The offer preempted Cursor’s in-progress $2 billion fundraise, and Microsoft was reportedly looking at the same target. Read more

Model Drop: DeepSeek V4

Jake Handy — Fri, 24 Apr 2026 04:29:34 GMT

Rise and shine! While you were sleeping, DeepSeek dropped V4 of their frontier model… and she’s a bit underwhelming.

Model: DeepSeek V4 preview, two variants. deepseek-v4-pro at 1.6T total / 49B activated parameters. deepseek-v4-flash at 284B total / 13B activated. Both ship as Base and post-trained Instruct checkpoints, with a Mixture-of-Experts architecture. Three reasoning modes on each: Non-think, Think High, Think Max. reasoning_effort accepts high and max (low, medium, and xhigh are silently mapped).

Model type: Text-only. No native image, audio, or video input or output in the preview. Tool calling, JSON mode, chat-prefix completion (beta), and FIM (beta, non-thinking only) are all supported.

Ship date: April 23, 2026

Maker: DeepSeek

Pricing: V4-Flash at $0.028 per 1M input (cache hit) / $0.14 per 1M input (cache miss) / $0.28 per 1M output. V4-Pro at $0.145 per 1M input (cache hit) / $1.74 per 1M input (cache miss) / $3.48 per 1M output. The standard DeepSeek off-peak discount (50% off between roughly 11pm and 7am Beijing time) still applies.

Available on: chat.deepseek.com and the DeepSeek mobile app for end users. DeepSeek API (api.deepseek.com) via OpenAI-compatible ChatCompletions and Anthropic-compatible endpoints; switch by setting model to deepseek-v4-pro or deepseek-v4-flash, base URL unchanged. Open weights on Hugging Face and ModelScope under the MIT license, in FP8 Mixed (Base) and FP4 + FP8 Mixed (Instruct) precision. Pre-tuned adapters for Claude Code, OpenClaw, OpenCode, and CodeBuddy shipped alongside the model.

Headline benchmarks (V4-Pro Max, unless noted): LiveCodeBench Pass@1 at 93.5 (best among all models evaluated, ahead of Gemini 3.1 Pro at 91.7, K2.6 Thinking at 89.6, and Opus 4.6 Max at 88.8). Codeforces rating 3206 (ahead of GPT-5.4 xHigh at 3168 and Gemini 3.1 Pro at 3052). Apex Shortlist Pass@1 at 90.2 (new state of the art, past Gemini’s 89.1 and Opus 4.6’s 85.9). IMOAnswerBench 89.8 (ahead of Opus 4.6’s 75.3; behind GPT-5.4 xHigh’s 91.4). HMMT 2026 Feb Pass@1 95.2 (behind GPT-5.4’s 97.7 and Opus 4.6’s 96.2). GPQA Diamond 90.1. HLE 37.7. SimpleQA-Verified Pass@1 at 57.9 (Gemini 3.1 Pro sits at 75.6; this is the single biggest gap to a current frontier model). SWE-Verified 80.6. SWE-Pro 55.4. Terminal Bench 2.0 at 67.9. MCPAtlas Public at 73.6. MRCR 1M at 83.5. CorpusQA 1M at 62.0.

More details: DeepSeek-V4 Technical Report

What shipped

DeepSeek released V4 as a preview on Thursday and skipped the staged rollout. API, chat product, open weights, and integration adapters for the main agentic coding harnesses all landed the same morning. The pitch the announcement leans on: 1M context as default, frontier-adjacent benchmarks on coding and math, and pricing that’s an order of magnitude below Western closed-source competitors. DeepSeek also flagged that V4-Pro is now the model its own engineers use for internal agentic coding work, describing the experience as “better than Sonnet 4.5, close to Opus 4.6 non-thinking, but still a gap to Opus 4.6 thinking.”

The two-tier structure is the same product-line play OpenAI runs with GPT-5.5 / Mini / Nano, Anthropic runs with Opus / Sonnet / Haiku, and Google runs with Gemini Pro / Flash. DeepSeek has had the capability before; V4 is the first time it’s committed to a named Pro / Flash SKU split in its flagship family. V4-Pro takes the frontier swings. V4-Flash is positioned as “close enough for most workloads, much cheaper, much faster.” Long-horizon agentic tool use and deep factual recall are the parts of Pro you don’t get on Flash.

On the benchmark cards, V4-Pro-Max is compared against Opus 4.6 Max, GPT-5.4 xHigh, Gemini 3.1 Pro High, Kimi K2.6 Thinking, and GLM-5.1 Thinking. Two of those five are now one generation stale. Opus 4.7 shipped on April 16 and GPT-5.5 shipped earlier today. DeepSeek had a full week to rerun its card against Opus 4.7 and chose not to. GPT-5.5 they arguably couldn’t have benchmarked in time, though its existence was public and a note would have cost nothing.

V4 vs. the actual current-gen frontier

DeepSeek’s card does not run this comparison. Here it is, using V4-Pro Max’s numbers against Opus 4.7’s and GPT-5.5’s published scores on the benchmarks where all three overlap.

SWE-Bench Pro
- V4-Pro Max: 55.4%
- GPT-5.5: 58.6%
- Opus 4.7: 64.3%.
SWE-Bench Verified
- V4-Pro Max: 80.6%
- Opus 4.7: 87.6%
- OpenAI didn’t report a GPT-5.5 number on this one
Terminal-Bench 2.0
- V4-Pro Max: 67.9%
- Opus 4.7: 69.4%
- GPT-5.5: 82.7%
GPQA Diamond
- V4-Pro Max: 90.1%
- Opus 4.7: 94.2%
BrowseComp
- V4-Pro Max: 83.4%
- GPT-5.5: 84.4%
- GPT-5.5 Pro: 90.1%

The consistent pattern: on every overlapping current-gen benchmark, V4-Pro Max trails both Opus 4.7 and GPT-5.5 by 3–15 points, with the biggest gaps on the agentic workloads DeepSeek is positioning V4 for. The “frontier-adjacent” framing is true against the April 1 snapshot of closed-source. Against the April 23 snapshot, V4-Pro is visibly a generation behind on the benchmarks that matter most for the use cases DeepSeek is marketing.

What’s new

DeepSeek V4 is a clean architectural upgrade over V3.2, not a post-training refresh. The technical report lists three real changes, plus two product-level ones that change how the model fits into agent harnesses.

Hybrid attention: CSA + HCA. Compressed Sparse Attention handles selected tokens at higher fidelity, Heavily Compressed Attention sweeps the rest at dramatically reduced precision. Together, DeepSeek reports 27% of V3.2’s per-token inference FLOPs and 10% of its KV cache footprint at 1M context. The practical translation: 1M context becomes a default feature, not a premium SKU. Every API call gets the full window without a surcharge, which is different from what every Western lab is doing today.
Manifold-Constrained Hyper-Connections (mHC). An upgrade to residual connections designed to keep signal propagation stable across deeper MoE stacks while preserving expressivity. DeepSeek claims this was a training-stability unlock that also helps the model stay coherent across very long reasoning traces.
Muon optimizer. DeepSeek switched off AdamW for training and ran V4 on the Muon optimizer, citing faster convergence and better stability at MoE scale. Moonshot K2 was the first major frontier model to ship on Muon. V4 is now the second.
Two-stage post-training with on-policy distillation. DeepSeek trained independent domain-specific expert models (math, coding, agents, knowledge) with SFT plus RL-GRPO, then distilled them into a single unified V4 using on-policy distillation. This is the same direction Anthropic and OpenAI went with domain-specialized training pipelines, and it shows up in the benchmark shape: V4-Pro is simultaneously top-tier on LiveCodeBench / Codeforces / IMOAnswerBench and weaker than Gemini on SimpleQA.
Native adapters for agentic coding harnesses. V4 shipped with documented support for Claude Code, OpenClaw, OpenCode, and CodeBuddy out of the box. Thinking effort auto-upgrades to max when DeepSeek detects a Claude Code or OpenCode request (interesting). For the self-host crowd, you can drop V4-Pro into an existing Claude Code setup, swap the base URL, and keep the harness unchanged.

How and where to use it

Where it runs, what it’s good for, and where it’ll hurt you.

Where it’s available:
- chat.deepseek.com and the DeepSeek mobile apps for consumer chat
- The DeepSeek API via OpenAI-compatible ChatCompletions and Anthropic-compatible endpoints at the base URL api.deepseek.com; just swap the model ID to deepseek-v4-pro or deepseek-v4-flash
- Open weights on Hugging Face and ModelScope under the MIT license. The V4-Flash checkpoint at ~158GB in FP4+FP8 mixed precision will run on a single H200 node; V4-Pro at ~862GB needs a real cluster. Claude Code, OpenClaw, OpenCode, and CodeBuddy all work out of the box with the DeepSeek endpoint. Third-party inference providers (Fireworks, Together, DeepInfra, Novita) are not live at launch but should pick it up within days given the MIT license.
What it’s good at:
- Agentic coding inside Claude Code or OpenCode, where DeepSeek’s own engineers now prefer V4-Pro over Sonnet 4.5 for internal work
- Competitive coding and algorithm problems (LiveCodeBench 93.5, Codeforces 3206, both leading against the benchmark lineup DeepSeek chose)
- Math at the frontier (IMOAnswerBench 89.8, HMMT 2026 95.2, Apex Shortlist 90.2 all best-in-class or near it against that same lineup)
- Long-context document work where 1M tokens is the default and the per-token price is a seventh to an eighth of the closed competitors (MRCR 1M at 83.5 and CorpusQA 1M at 62.0 are real scores, not marketing)
- Any workload where open weights, MIT license, and on-premise deployment beat “hosted by the best lab”
- Chinese language work where C-Eval (93.1 base), CMMLU (90.8 base), and Chinese-SimpleQA (84.4) are the top non-Gemini numbers on the board
What it’s bad at / shouldn’t be used for:
- Any task where the head-to-head numbers against Opus 4.7 and GPT-5.5 matter
- Knowledge-heavy factual recall where Gemini 3.1 Pro’s 75.6 on SimpleQA-Verified vs V4-Pro’s 57.9 is a real 18-point gap
- The most demanding agentic coding workloads where GPT-5.5’s Terminal-Bench 2.0 at 82.7% and Opus 4.7’s SWE-Bench Pro at 64.3% have moved the ceiling
- Visual or multi-modal tasks (V4 is text-only in the preview)
- Data-sovereignty-sensitive enterprise work where sending your prompts to a Chinese-hosted API is a non-starter; in that case, you self-host the MIT-licensed weights, which is the whole point
- Cost-floor low-latency workloads if sub-15ms inter-token latency matters and K2.6 or GPT-5.5 Mini are already wired into your stack
- Any production pipeline that was counting on deepseek-chat or deepseek-reasoner past July 24, because those IDs sunset with this release

First impressions

OfficeChai’s coverage got at the shape of the release in one sentence:

“V4-Pro is genuinely competitive with GPT-5.4 and Claude Opus 4.6 across most categories, and beats both on coding benchmarks. It trails Gemini-3.1-Pro on general knowledge and HLE, and trails GPT-5.4 on a handful of agentic tasks.”

V4-Pro Max is the best open-weight model in existence today, comfortably past Kimi K2.6 Thinking and GLM-5.1 Thinking on the majority of benchmarks, and a credible alternative to Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro for a lot of workloads.

Startup Fortune pointed at what actually forces a response from the incumbents:

“When a lab with DeepSeek’s benchmark performance publishes these token prices, it becomes harder for any provider to hold their current rate card without a compelling justification... DeepSeek’s most powerful product today might not be the model itself, but the invoice it hands to every CTO who forwards the pricing page to their AI vendor.”

V4-Pro at $3.48 per million output tokens versus Opus 4.7 at $25 and GPT-5.5 at $30 means procurement teams burning real money on frontier closed-source have a 7-to-9x output cost reduction option that shows up on many benchmarks within 5–10 points. V4-Flash at $0.28 per million output tokens lands in Kimi K2.6 budget territory with roughly 2 points of quality distance from V4-Pro on most evals.

Jake’s take

The benchmark lineup is the part I can’t get past, and it’s the part that makes V4 read less like a frontier release and more like a catch-up release. DeepSeek had a full week between Opus 4.7 and their own launch, and they couldn’t be bothered to rerun their card against it.

The comparison lineup they did use (Opus 4.6, GPT-5.4) was superseded a week ago and this morning respectively, in a market where Anthropic and OpenAI are shipping every four to six weeks. This is a research lab claiming a 1.6-trillion-parameter frontier-scale release and a new attention architecture worth a technical report. The minimum professional standard is benchmarking against current-generation leaders. DeepSeek ducked it, and every head-to-head number I was able to run against Opus 4.7 and GPT-5.5 suggests I understand why: V4-Pro Max is 3 to 15 points behind, with the biggest gaps on the agentic-coding surfaces DeepSeek markets as V4’s strength. SWE-Bench Pro at 55.4 against Opus 4.7’s 64.3. SWE-Bench Verified at 80.6 against Opus 4.7’s 87.6. Terminal-Bench 2.0 at 67.9 against GPT-5.5’s 82.7. These are not marginal deltas, and they are not benchmarks DeepSeek could not have run.

The frustrating thing is that V4 didn’t need this framing to be a meaningful release. The MIT license on a 1.6T-parameter model and $3.48 per million output tokens against Opus 4.7’s $25 are more than enough for V4 to make waves. A straight “we are the best open-weight model, here is how we stack up against last-gen and current-gen closed-source, we are not there yet on agentic coding, we are much cheaper, here is the 1M-context architecture paper” would have landed with the weight a frontier lab’s release should carry.

Frontier-class labs don’t compare themselves to last-gen models and hope you don’t notice.

Model Drop: GPT-5.5

Jake Handy — Thu, 23 Apr 2026 20:06:38 GMT

The Specs

Model: GPT-5.5 (gpt-5.5 on the OpenAI API once it rolls out, plus gpt-5.5-pro). Ships in three consumer surfaces: default GPT-5.5, GPT-5.5 Thinking, and GPT-5.5 Pro. API reasoning effort levels: xhigh, high, medium, low, non-reasoning.

Model type: Text + vision multimodal (same text/image input stack as the GPT-5 family, with computer-use screen reading in Codex). No native image, audio, or video output.

Ship date: April 23, 2026 (ChatGPT and Codex rollout; API “very soon”)

Maker: OpenAI

Pricing: Included in ChatGPT Plus ($20/mo), Pro ($200/mo), Business, and Enterprise. API pricing announced but not live: $5 / $30 per million input / output tokens for gpt-5.5, a 2x jump from GPT-5.4’s $2.50 / $15. gpt-5.5-pro comes in at $30 / $180 per million, unchanged from GPT-5.4 Pro. Fast mode in Codex runs 1.5x faster for 2.5x the cost. Batch and Flex at half the standard API rate, Priority at 2.5x.

Available on: ChatGPT for Plus, Pro, Business, and Enterprise users (GPT-5.5 Thinking for all paid tiers, GPT-5.5 Pro limited to Pro / Business / Enterprise). Codex in the CLI, IDE extensions, and the web product, across Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. API (Responses and Chat Completions) “coming very soon” at the pricing above with a 1M context window.

Headline benchmarks: Terminal-Bench 2.0 at 82.7% (Opus 4.7: 69.4%, Gemini 3.1 Pro: 68.5%). SWE-Bench Pro at 58.6% (Opus 4.7 still leads at 64.3%). OpenAI’s internal Expert-SWE eval, where tasks have a 20-hour median human completion time, at 73.1% (up from GPT-5.4’s 68.5%). GDPval wins-or-ties at 84.9% (Opus 4.7: 80.3%, Gemini 3.1 Pro: 67.3%). OSWorld-Verified at 78.7% (narrowly edges Opus 4.7’s 78.0%). FrontierMath Tier 4 at 35.4% (Opus 4.7: 22.9%, Gemini 3.1 Pro: 16.7%). CyberGym at 81.8% (Opus 4.7: 73.1%, Anthropic’s Claude Mythos: 83.1%). Tau2-Bench Telecom at 98.0% without prompt tuning. Artificial Analysis has GPT-5.5 (xhigh) taking the #1 spot on their Intelligence Index by 3 points, breaking a three-way tie with Opus and Gemini 3.1 Pro. AA-Omniscience accuracy 57% (highest ever recorded), hallucination rate 86% (vs Opus 4.7 at 36%, Gemini 3.1 Pro Preview at 50%).

More details: Introducing GPT-5.5 (OpenAI)

What shipped

OpenAI released GPT-5.5 on Thursday morning and pitched it as “a new class of intelligence for real work” plus the next step toward the long-teased Altman / Brockman “superapp.” The model rolls out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with GPT-5.5 Pro layered on top for the Pro / Business / Enterprise tiers. The API is not live at launch. OpenAI says it will follow “very soon” at $5 / $30 per million input / output tokens and a 1M context window, double GPT-5.4’s per-token price. Greg Brockman framed the release on the press call as “a real step forward towards the kind of computing that we expect in the future,” and chief scientist Jakub Pachocki said the last two years of model progress have been “surprisingly slow.” They are not subtle about the vibe.

The pitch the benchmark card is built to support: more intelligence at the same latency, ~40% fewer output tokens per Codex task, state-of-the-art performance on most agentic coding and knowledge-work evals. Terminal-Bench 2.0 at 82.7% is a decisive lead on Opus 4.7 and Gemini 3.1 Pro. Expert-SWE at 73.1% introduces a brand-new OpenAI internal benchmark built around 20-hour coding tasks, which matches the workload Codex users have been running for months. GDPval across 44 occupations at 84.9% maps to real-world knowledge work, and OSWorld-Verified at 78.7% is the first time OpenAI’s mainline model has nudged ahead of Anthropic on full desktop computer use. The visible catches: Opus 4.7 still leads SWE-Bench Pro by 5.7 points (64.3% vs 58.6%), Gemini 3.1 Pro still edges BrowseComp at 85.9% to 84.4%, and Artificial Analysis flagged a hallucination rate of 86% on their independent AA-Omniscience eval, against Opus 4.7 at 36% and Gemini 3.1 Pro Preview at 50%. The model knows more than anything else tested, and it’ll confidently answer a question it doesn’t know the answer to at nearly two and a half times the rate of the best competitor.

What’s new

GPT-5.5 reuses the GPT-5 family architecture (router across variants, xhigh to non-reasoning effort levels) and reads more like a post-training plus inference upgrade than a new base model. That said, it ships four capabilities that change how it feels to use, not just how it benchmarks.

Model Drop: Kimi K2.6

Jake Handy — Wed, 22 Apr 2026 14:33:45 GMT

Model: Kimi K2.6 (kimi-k2.6)

Model type: Text + vision, with native image and video input

Ship date: April 20, 2026

Maker: Moonshot AI (Beijing)

Pricing: $0.60 / $2.50 per million input / output tokens on the Moonshot API. $0.60 / $2.80 on OpenRouter. Free weights on Hugging Face for self-hosting.

Available on: Kimi.com, the Kimi App, Kimi API, Kimi Code, Hugging Face (open weights), OpenRouter, and Vercel AI Gateway

Headline benchmarks: SWE-Bench Pro 58.6% (leads GPT-5.4 and Claude Opus 4.6), HLE-Full with tools 54.0% (leads every model Moonshot tested against), BrowseComp 83.2% (with Agent Swarm: 86.3%), DeepSearchQA F1 92.5%, Terminal-Bench 2.0 (Terminus-2) 66.7%, SWE-Bench Verified 80.2%.

Other info: 256K context window. Mixture-of-experts: 1 trillion total parameters, 32B active per token, 384 experts (8 selected + 1 shared), 61 transformer layers, Multi-head Latent Attention, SwiGLU activation, 160K vocab, 15.5T training tokens. Knowledge cutoff April 2025. Agent Swarm scales to 300 concurrent sub-agents across 4,000 coordinated steps (up from 100 / 1,500 on K2.5). License: Modified MIT (free commercial use; visible “Kimi K2.6” credit required on products with 100M+ MAU or $20M+/month revenue).

More details: Kimi K2.6 tech blog

What shipped

Moonshot AI dropped Kimi K2.6 yesterday, as an open-weight successor to K2.5 aimed squarely at long-horizon coding, agent swarms, and autonomous execution. It’s a mixture-of-experts model (at the same 1T / 32B-active parameter budget as K2.5), with a 256K context window, native multimodal input including video, and a Modified MIT license that lets you use it commercially.

Moonshot claims frontier-grade coding and agent performance at roughly 88% less than Claude Opus 4.7. The headline numbers support the framing on specific benchmarks. SWE-Bench Pro at 58.6% beats GPT-5.4 (57.7%) and Opus 4.6 (53.4%). Humanity’s Last Exam with tools at 54.0% leads every frontier model Moonshot compared against. And, Moonshot shipped workload proofs that are hard to fake: a 13-hour autonomous rewrite of exchange-core (8-year-old open-source financial matching engine) that produced a 185% throughput gain across 4,000+ lines of code and 1,000+ tool calls, plus a 12-hour port of Qwen 0.8B inference to Zig on a Mac.

Math (AIME 2026, HMMT), general reasoning (HLE without tools), and vision (MMMU-Pro, MathVision) still trail the closed frontier by 3-6 points.

What’s new

K2.6 is an iteration on the K2 MoE family with a handful of capabilities that don’t have clean analogues in the closed frontier.

Agent Swarm, scaled out. K2.6 can orchestrate up to 300 concurrent sub-agents across 4,000 steps, tripling K2.5’s 100-agent / 1,500-step ceiling. This is the closest thing the open ecosystem has to a “manager agent plus specialist workforce” primitive.
Sustained autonomous execution. Moonshot shipped a 5-day continuous-ops agent trace (monitoring, incident response, scheduled tasks) alongside the 12-hour Zig port and 13-hour exchange-core refactor.
Native multimodal input, now including video. K2 Thinking was text-only. K2.5 added vision. K2.6 adds video input at the same parameter budget.
Claw Groups (research preview). A new orchestration layer where humans and agents running on different devices, different models, and different vendor stacks operate in a shared space. K2.6 acts as the coordinator, matches tasks to agents by skill profile, and reassigns when an agent stalls.
Skills from documents. Upload a PDF, a spreadsheet, or a slide deck and K2.6 extracts the structural and stylistic DNA as a reusable “Skill.” The McKinsey-deck reproduction is the obvious demo, the less obvious use is reproducing a regulator’s filing format or a brand deck.

How and where to use it

Where it runs, what it actually does well, and where you’ll regret reaching for it.

Where it’s available:
- Kimi.com and the Kimi App for chat
- Kimi Code for the coding agent in terminal and IDE
- Moonshot API (OpenAI-SDK compatible, one-line base URL swap)
- Hugging Face for open weights, served via vLLM or SGLang
- OpenRouter for multi-provider routing
What it’s good at:
- Long-horizon coding across Rust, Go, Python, and front-end
- Multi-file refactors on large codebases
- Agent orchestration where you actually want 100+ parallel sub-agents
- Tool-heavy browsing and deep research
- Workloads where the cost-per-token ratio dominates the decision and you need near-Opus-class output at a fraction of the price
What it’s bad at / shouldn’t be used for:
- Anything where mathematical correctness is load-bearing
- Complex tool scheduling
- Vision-heavy workloads
- Regulated workloads where a Chinese-jurisdiction model is a non-starter regardless of capability
- Anything where the K2.5 family’s documented hallucination tendency is a dealbreaker (Moonshot hasn’t published a K2.6 system card yet and nothing in the public materials claims that tendency has been fixed)

First impressions

The positives

Clement Delangue at Hugging Face framed K2.6 as the standout open-source model at launch. Simon Paxton’s writeup captured where that framing actually lands:

“Kimi K2.6 sets a new bar for open-source. It excels on coding tasks at a level comparable to leading closed source models... In early testing, it sustains long multi-step sessions with impressive stability, far beyond typical models.”

The single most-cited community signal: the exchange-core rewrite demo. Thirteen hours of unsupervised work, 1,000+ tool calls, 4,000+ lines of code, 185% throughput gain on an 8-year-old matching engine that was already operating near its performance limits. Described by Simon Paxton at dev.to as the kind of workload proof that distinguishes “actual long-horizon work” from “benchmark wins.”

The ComputeLeap cost analysis boiled the structural case down to a line every procurement team will run with:

“Kimi K2.6, the latest open-weight model from Beijing-based Moonshot AI, runs at $0.60 per million input tokens on the official API. Claude Opus 4.7, Anthropic’s frontier model, costs $5.00 per million input tokens. That’s an 8.3× difference — or roughly 88% cheaper.”

Eight-times-cheaper with OpenAI-compatible SDK support means the switching cost for an A/B is a one-line base URL change.

The negatives

Hacker News user nikcub posted the honest capability summary from someone with no skin in the game:

“Below sonnet and opus 4.0 on capability... better than gemini 2.5 pro on tool calling.”

That’s the working mental model most independent reviewers arrived at: K2.6 is not the best model available, it’s a price-for-capability tradeoff that works for specific workloads and breaks down on others. The same HN thread flagged a second concern, that K2.6 “does only slightly better than Kimi K2.5” on day-to-day work, and “struggles with domain-specific tasks.”

Blockchain.news kept returning to the same gap every independent reviewer is naming:

“Open-weights models underperform in real-world usage compared with closed models such as Claude Opus 4.6.”

Moonshot’s vendor benchmark table shows K2.6 winning on several agentic metrics. Third-party evaluations, where they exist, still put Claude Opus ahead on sustained multi-step reliability.

On safety, the independent evaluation of K2.5 documented significantly fewer refusals on CBRNE-adjacent prompts than GPT-5.2 and Claude Opus 4.5, plus elevated compliance on disinformation and copyright-infringement requests, plus political bias in Chinese-language outputs. Moonshot has not published a K2.6 system card. Until an independent red team retests on K2.6, the working assumption should be that the safety profile has not meaningfully changed.

Jake’s take

From the K2-family, K2.6 is the first open-weights release where the price-per-capability math starts hurting the closed frontier in an obvious way. Sixty cents in and two-fifty out against Opus 4.7’s five dollars and twenty-five is significant (especially as frontier labs continue to raise prices across the board).

For the long-horizon coding, K2.6 may eat a lot of volume out of the Claude and GPT-5 tier. If you’re spending five figures a month on Opus for code generation, you owe it to your budget to run the A/B. The OpenAI-compatible endpoint makes the test a one-line change.

The safety profile inherited from K2.5 is real; it’s a standard red-team doc showing the model will help with CBRNE (Chemical, Biological, Radiological, Nuclear, and high-yield Explosives)-adjacent prompts that Claude and GPT-5 refuse. Moonshot’s answer to that has been “it’s open weights, that’s the tradeoff,” which is honest and also a non-answer for anyone running K2.6 inside a regulated workload. Stack that on top of the data-jurisdiction question (Beijing-based lab, nation-state interest in agentic infrastructure, political censorship findings baked into the K2.5 safety paper), the hallucination inheritance community testers keep flagging, and you get a model that is an unambiguously great deal for the right workload and a liability for the wrong one.

The interesting question for the rest of the year is whether Moonshot ships a system card that changes the safety calculus, and whether anyone outside China is willing to trust the answer when they do.

Model Drop: GPT Image 2

Jake Handy — Tue, 21 Apr 2026 21:03:42 GMT

Today: GPT Image 2, which OpenAI just shipped into ChatGPT and the API.

Model ID: gpt-image-2

Max Resolution: 2K standard, 4K beta via API

Aspect Ratios: 3:1 (ultra-wide) to 1:3 (ultra-tall)

Pricing: $8 / $30 per million image input / output tokens, or roughly $0.006–$0.211 per image

Modes: Instant and Thinking

Knowledge Cutoff: December 2025

Available on: ChatGPT (all tiers), Codex app, API in early May

What moved

Headline numbers worth noting:

~99% text rendering accuracy (up from ~90-95% on GPT Image 1.5)
Generation speed roughly 2x faster
Up to 8 consistent images per prompt in Thinking mode
First OpenAI image model with integrated reasoning and real-time web search

The framing from OpenAI is that images are “a language, not decoration.” The model reasons through layout before rendering. For marketers, designers, and anyone producing content at scale, this is an upgrade that moves AI image generation from novelty into production infrastructure with legible text and more intelligent prompt adherence.

Screenshots generated with the model are near indistinguishable from real ones

Partner and early-tester reactions point in the same direction. VentureBeat said the outputs exceeded Google’s Nano Banana 2 in UI and screenshot fidelity. The Decoder called it a “breakthrough” on par with Nano Banana Pro’s core thinking capability. Text rendering, the longest-running failure mode in AI image generation, is the thing everyone actually noticed first.

What changed under the hood

New architecture. GPT Image 2 is not built on GPT-4o’s image pipeline. Research Lead Boyuan Chen called it a “generalist model” or “GPT for images;” a standalone system designed from scratch. Community testers watching the April 4 LM Arena leak (codenames: maskingtape-alpha, gaffertape-alpha, packingtape-alpha) flagged a likely shift from two-stage to single-pass inference.
Reasoning integration. Thinking mode searches the web, transforms uploaded documents into visual explainers, verifies outputs, and plans layout before rendering the first pixel. The result is images that reflect intent rather than literal prompt parsing.
World knowledge. Training skewed heavily toward real-world references: actual UI screenshots, storefronts, interface layouts, public figures. Prompts like “average engineer’s screen” produce believable monitors instead of generic keyword collages.
Provenance baked in. C2PA metadata and next-gen watermarking are embedded by default. This makes a defensible paper trail for enterprise use (though OpenAI acknowledges metadata is not a silver bullet).

New settings

Instant Mode: Fast, standard quality. Default for everyone.
Thinking Mode: Reasoning, web search, up to 8 consistent images. Plus, Pro, Business, and Enterprise only.
Interactive Editing: Refine through conversation. Context retained across edits.
Flexible Aspect Ratios: 3:1 to 1:3, specified in prompt or preset.
Multi-Image Generation: Up to 8 per prompt, with character and object continuity.

First impressions

Launch-day reception skewed positive across the tech press, with specific praise for text rendering and compositional complexity.

The positive

Carl Franzen at VentureBeat got early access and ran it on hard cases:

“ChatGPT Images 2.0 is the first image model from OpenAI and one of only two (Nano Banana 2 being the other) that can seemingly accurately reproduce a map of the extent of the Aztec, Maya, and Inca empires at their respective heights along with a fully legible legend.”

He stated its “seemingly flawlessly” on maps, slides, infographics, and manga.

Matthias Bastian at The Decoder called it

“…a breakthrough that could fundamentally reshape graphic generation…”

and flagged something concrete: Image 2 passes their long-standing benchmark prompt (a hyperrealistic DSLR photo of a horse riding an astronaut as a spacesuit saddle) on both Instant and Thinking modes, with Thinking nailing the DSLR look. Competitors have failed this for years.

The model excels at advertisement generation

Amanda Silberling at TechCrunch made the practical case. When she asked for a Mexican restaurant menu, she got something

“…that could immediately be used in a restaurant without customers noticing that something’s off.”

(Two years ago, DALL-E 3 couldn’t spell enchilada.)

The negative

David Gewirtz at ZDNET got early access and documented a persistent weaknes: the model could not accurately reproduce the ZDNET logo across multiple attempts.

“On its first try, it rendered the Z in ZDNET with a slight droop.”

Across a second session it dug up a pre-2022 logo that does not appear on the current homepage. Brand fidelity is the easiest thing to fail publicly on, and Image 2 fails it.

Ece Yildirim at Gizmodo ran the launch-day framing back on OpenAI by borrowing the company’s own analogy:

“If we think of Dall-e as cave drawings, and Images 1.0 as ancient art, then Images 2.0 is the Renaissance.”

But she claims it’s a renaissance of smarter, more precise slop. She also pulled a sharper receipt from the Arena-leak images OpenAI confirmed during the livestream. The world map demo includes made-up countries (”Ciger,” “Mharee”) and relocates Nairobi into Saudi Arabia.

The OpenAI developer community, on the structural complaint: Thinking mode, web search, and the features that actually make Image 2 a 2 are locked behind Plus, Pro, and Business. Free users get a better default model, not a better experience.

Jake’s take

The same features that make GPT Image 2 a real production tool make it the best disinformation engine ever shipped. Fake UI screenshots. Fake news article layouts. Fake social posts with real timestamps and real-looking avatars. Fake Bloomberg terminal screens, fake leaked emails, fake court filings, fake receipts, fake Slack threads, fake campaign flyers in the voter’s native script. Every one of those is dense text laid over a known visual vocabulary, which is the exact workload OpenAI optimized for. It is not an accident that the model is good at screenshots and signage. It’s the whole point.

The Decoder already surfaced a leaked example during testing: a fake screenshot of Satya Nadella cheerfully showing off a chart that claims Google Chrome is downloaded most often through Microsoft Edge. Trivially believable at a glance on an X feed.

Multiply that one image by every political operative, every pump-and-dump, every harassment campaign, every state media operation, every bored teenager with a grudge. OpenAI’s Adele Li claims that ChatGPT has safeguards that other platforms don’t, which, sure (and btw the model is now in the API).

OpenAI’s answer is C2PA metadata and watermarking. C2PA strips the moment you screenshot a generation, crop it, or upload it to any platform that recompresses the file. Li said in the same briefing that metadata “is not a silver bullet.” She’s right. That was also the argument against shipping. They shipped anyway.

GPT Image 2 is SOTA, and if you’re shipping marketing or product work, you’re going to ship faster and cleaner with it. But prepare for your feed, your inbox, and your family group chat to be unrecognizable by the end of the year.

The model is an excellent problem.