Model Drop: Composer 2.5

Cursor continues its quest to catch up to leading coding foundation models

May 18, 2026

Composer 2.5 is the in-house coding model Cursor shipped to its IDE today. It is built on the same open-source Moonshot Kimi K2.5 checkpoint as Composer 2, with 85% of its total compute spent on Cursor’s own post-training and RL stack.

The Specs

Model: Composer 2.5 (referred to as cursor-composer-2-5 in Cursor’s model picker).

Model type: Text-only agentic coding model. Tool-use native (file edit, terminal, search, MCP) inside the Cursor IDE.

Ship date: May 18, 2026

Maker: Cursor (Anysphere, Inc., San Francisco). Base weights from Moonshot AI (Beijing).

Pricing: Standard at $0.50 / $2.50 per million input / output tokens. Fast variant (default for interactive use) at $3.00 / $15.00 per million.

Available on: Cursor only. The agent runs inside the Cursor IDE, the Cursor CLI, and the Cursor web product.

Headline benchmarks: SWE-Bench Multilingual at 79.8% (Opus 4.7: 80.5%, GPT-5.5: 77.8%). Terminal-Bench 2.0 at 69.3% (Opus 4.7: 69.4%, GPT-5.5: 82.7%). CursorBench v3.1 at 63.2% (Opus 4.7 max: 64.8%, Opus 4.7 default xhigh: 61.6%, GPT-5.5 default: 59.2%).

Other info: Built on Moonshot’s Kimi K2.5 open-weight checkpoint, the same base as Composer 2, and the same base architecture as Kimi K2.6 shipped April 20. Mixture-of-experts, roughly 1T total parameters with ~32B active per inference. Trained on 25x more synthetic tasks than Composer 2. Cursor also announced a forthcoming larger model with SpaceXAI on Colossus 2: “10x more total compute” against a million H100-equivalents.

More details: Introducing Composer 2.5 (Cursor)

What shipped

Cursor released Composer 2.5 this morning as “a substantial improvement in intelligence and behavior over Composer 2.” Same Kimi K2.5 base, 85% of the compute budget spent on Cursor’s own RL pipeline and post-training. The pitch is sustained work on long-running tasks, more reliable complex instruction following, and a calmer collaboration loop (fewer false-start tool calls, less prompt-bait). The model runs inside Cursor only; there is no public API, no third-party gateway, no Hugging Face mirror. Standard tier at $0.50 / $2.50 per million tokens, fast variant at $3.00 / $15.00 per million. Double-usage promo for the first week.

The benchmark card is built to argue that Cursor’s RL stack has closed the gap to frontier on agentic coding for an order of magnitude less money. But: the eval is Cursor’s own bench, the base model is open-source weights from a Beijing lab Cursor only credited after community pressure on Composer 2, no system card ships with the launch, and the model lives behind a single-vendor wall. Of Cursor’s three training innovations (Targeted RL with textual feedback, Sharded Muon at 0.2s on 1T, Dual Mesh HSDP), the latter two are interesting infrastructure work; whether they translate to capability the user actually feels is the question.

What’s new

Composer 2.5 reuses the Composer 2 base weights (Kimi K2.5) and stacks a much bigger post-training pipeline on top. Four capabilities materially change how the model feels compared to Composer 2 and how it stacks against the frontier.

Targeted RL with textual feedback. Cursor’s training pipeline now provides feedback directly at the trajectory step where the model could have behaved better, using on-policy distillation to fix localized failures (bad tool calls, style violations, premature stops) without rewriting the whole rollout. This is the lever Cursor leaned on hardest to close the gap on long-running, multi-file tasks where Composer 2 visibly fell apart.
25x synthetic task scale, with deliberate reward-hacking research. Composer 2.5 trained on 25x more synthetic tasks than Composer 2, including a feature-deletion paradigm where the agent reimplements deleted code under test rewards. Cursor documented multiple sophisticated reward-hacking instances the model surfaced during training (discovering Python type-checker caches, decompiling Java bytecode to short-circuit tasks).
Sharded Muon and Dual Mesh HSDP on a 1T model. Cursor’s distributed orthogonalization with Newton-Schulz iteration plus a separated HSDP layout for expert and non-expert MoE weights hits a 0.2-second optimizer step on the 1T model. This is the kind of result that lets Cursor stand up future training runs (read: the SpaceXAI Colossus 2 partnership) at frontier scale without renting a frontier-lab serving stack.
Effort curve, not headline score. Cursor’s launch chart isn’t “we beat Opus.” It’s a Pareto frontier. Composer 2.5 hits ~63% on CursorBench at under $1 average cost per task, where Opus 4.7 and GPT-5.5 sit several dollars further right for comparable scores. The argument is that for the bulk of an engineer’s day, the dollar-per-task curve matters more than the absolute ceiling.
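
For readers unfamiliar with the Newton-Schulz step named in the Sharded Muon item above, here is a minimal single-machine sketch in pure Python. It uses the textbook cubic iteration X <- 1.5X - 0.5(X X^T)X on a Frobenius-normalized matrix. That choice is an assumption for illustration: it is not Cursor’s sharded implementation, whose polynomial, precision, and distribution scheme are not described in this post. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of g (illustrative, not Cursor's code).

    Dividing by the Frobenius norm puts every singular value in (0, 1],
    where the cubic update s <- 1.5*s - 0.5*s**3 pushes each one toward 1.
    """
    norm = frobenius_norm(g)
    if norm == 0.0:
        raise ValueError("cannot orthogonalize an all-zero matrix")
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)  # (X X^T) X
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxt_x)
        ]
    return x


# Hypothetical 2x2 "gradient" used only to exercise the function.
gradient = [[3.0, 1.0], [-1.0, 2.0]]
orthogonal = newton_schulz_orthogonalize(gradient)
gram = matmul(orthogonal, transpose(orthogonal))  # should be close to the identity
```

The only expensive operations are matrix products, which is what makes this kind of iteration a candidate for sharding across devices; how Cursor actually partitions that work is not something this sketch attempts to show.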

How and where to use it

Where it runs, what it’s good for, and where it’ll burn you.

Where it’s available

Cursor only. IDE, CLI, web.
No public API. No OpenRouter, no Vercel AI Gateway, no Hugging Face mirror, no Claude Code / Codex integration.
Standard tier at $0.50 / $2.50 per million tokens.
Fast variant (default for interactive use) at $3.00 / $15.00 per million, with first-week double usage.
Individual plans bundle a Composer pool; team and enterprise plans bill at API rates (and, per multiple Hacker News reports, the team-plan invoice math has not been kind).
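
To see what the two tiers mean per task, here is a small sketch of the arithmetic. The per-million prices are the ones listed above; the token counts for the example task are a made-up assumption, and the `Tier` and `task_cost` names are hypothetical, not anything Cursor exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices as quoted in this post.
STANDARD = Tier("standard", 0.50, 2.50)
FAST = Tier("fast", 3.00, 15.00)


def task_cost(tier: Tier, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task at a tier's per-million-token rates."""
    total = (
        input_tokens * tier.input_per_million
        + output_tokens * tier.output_per_million
    )
    return total / 1_000_000


# Hypothetical agentic task: 400k input tokens, 40k output tokens.
input_tokens, output_tokens = 400_000, 40_000
for tier in (STANDARD, FAST):
    print(f"{tier.name}: ${task_cost(tier, input_tokens, output_tokens):.2f}")
# standard: $0.30
# fast: $1.80
```

Both fast-tier rates are exactly 6x the standard rates, so the fast variant costs 6x the standard one at any input/output mix. That multiplier is worth keeping in mind when the fast variant is the interactive default.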

What it’s good at

Multi-file refactors and long-horizon agentic edits where the trajectory has 20+ tool calls and the model has to hold context across files. Community testers on Hacker News and the Cursor subreddit consistently flag multi-file refactors and UI work as the sweet spot.
Cost-sensitive coding workloads where the dollar-per-task curve matters more than the score ceiling. At $0.50 / $2.50, rerunning a task on Composer 2.5 costs roughly an order of magnitude less than on Opus 4.7 max or GPT-5.5 Pro.
Workloads where Anthropic’s and OpenAI’s hallucination, refusal, or rate-limit behavior is the actual bottleneck (Composer 2.5’s safety posture is permissive by default, for better and worse).

What it’s bad at / shouldn’t be used for

Workloads that need API access. There is no API. If you’re building anything outside the Cursor IDE / CLI surface, this model does not exist.
Truth-critical or capability-ceiling work where Opus 4.7 max still holds the lead on CursorBench, SWE-Bench Multilingual, and SWE-Bench Pro, and GPT-5.5 still owns Terminal-Bench 2.0 by 13 points.
Messy auth and backend coherence; community reports flag quality drops over long tasks in authentication setups and backend logic, the same complaint that dogged Composer 2.
Regulated / federal / supply-chain-paranoid work where shipping a Beijing-trained base model to your codebase is a procurement non-starter.

First impressions

The positives

Hacker News commenter vanuatu in the launch thread framed the structural read most cleanly:

“More companies are throwing their hat in the ring, especially focusing on value (latency + intelligence + cost).”

This is what Composer 2.5 is gunning for. It isn’t “we beat Opus on a benchmark.” It’s that the dollar-per-task curve has a frontier-class entrant. If the Pareto frontier of cost-for-intelligence is where the next year of coding-model competition lives, Cursor is leading the way.

The Decoder ran the price-parity math and landed on the line procurement teams will quote:

“Composer 2.5 matches Opus 4.7 and GPT-5.5 benchmarks at a fraction of the cost.”

The Decoder’s per-task estimate puts competitor cost at up to $11 against Composer 2.5’s sub-$1. If the benchmark match holds for even half the workloads it claims, this is good.
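
The cost-versus-score argument can be made concrete with a small Pareto filter. In the sketch below the scores are the CursorBench v3.1 figures quoted in the specs. Every per-task cost is a placeholder assumption: only "under $1" for Composer 2.5 and "up to $11" for a competitor come from the reporting, and which competitor the $11 figure belongs to is not stated. Treat the result as illustrative, not as a measurement.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    model: str
    cost_usd: float  # average cost per task (placeholder values below)
    score: float     # CursorBench v3.1 score, percent


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs that no cheaper-or-equal run matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, higher score first.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.score)):
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


runs = [
    Run("Composer 2.5", 0.90, 63.2),             # cost assumed: "under $1"
    Run("Opus 4.7 max", 11.00, 64.8),            # cost assumed: placeholder
    Run("Opus 4.7 default xhigh", 6.00, 61.6),   # cost assumed: placeholder
    Run("GPT-5.5 default", 5.00, 59.2),          # cost assumed: placeholder
]

for run in pareto_frontier(runs):
    print(f"{run.model}: {run.score}% at ${run.cost_usd:.2f}/task")
```

With these placeholder costs, only Composer 2.5 and Opus 4.7 max survive the filter: one is the cheapest option and the other has the highest score. That matches the shape of the pitch, a cheap point near the ceiling and an expensive point at it.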

The negatives

Hacker News commenter PUSH_AX delivered the cleanest skeptic read in the launch thread:

“They set themselves up for flack when they use whatever these evals are. It wasn’t even close in practice.”

This is the Composer 2 hangover. Cursor’s last in-house model was pitched with a similar benchmark story and underperformed Opus and Sonnet in real engineering workflows for months.

A team-plan customer flagged the pricing model nobody is talking about loudly enough:

“…costs seem to have sky rocketed compared to the individual plans. The total bill is more like 1k USD.”

The $0.50 / $2.50 standard tier is the front-page number. The fast variant (which is the default for interactive use) is $3 / $15, and team and enterprise plans pass that through at API rates without the individual-plan Composer pool subsidy.

Jake’s take

I’ve been a Cursor fanboy since day one, but haven’t been overly impressed with Composer yet. The dollar-per-task curve is what can carve them out a spot here, and so I like that they continue to push that line. Opus 4.7 and GPT-5.5 are crazy expensive. If Composer 2.5’s CursorBench and SWE-Bench Multilingual numbers hold, then we’re in good shape.

I’m also genuinely impressed by the infrastructure work from a team like Cursor. Sharded Muon at a 0.2s optimizer step on a 1T model and Dual Mesh HSDP for expert / non-expert weights are the kind of capability investment that tells me Cursor’s not just renting a Kimi-derived model long-term (which was probably obvious). No surprise that they’re hunting down compute from SpaceXAI.

The bummer is that Composer continues to not be available outside of Cursor. If you’ve built infrastructure that depends on swapping models behind a unified API, Composer 2.5 doesn’t exist for you. The Beijing-trained base model question is also not resolved by a more transparent attribution this time around; for regulated work, federal contracts, defense-adjacent codebases, or anything where the provenance chain matters, Cursor shipping Kimi-derived weights is going to be a no-go.

My read: Composer 2.5 is a more than credible frontier-adjacent coding model at an obscene discount on the surface. But keep Opus on rotation for ceiling-critical work.
