Model Drop: DeepSeek V4
DeepSeek's open-weight heavy hitter arrives just in time to be outclassed by both OpenAI and Anthropic
Rise and shine! While you were sleeping, DeepSeek dropped V4 of their frontier model… and she’s a bit underwhelming.
Model: DeepSeek V4 preview, two variants. deepseek-v4-pro at 1.6T total / 49B activated parameters. deepseek-v4-flash at 284B total / 13B activated. Both ship as Base and post-trained Instruct checkpoints, with a Mixture-of-Experts architecture. Three reasoning modes on each: Non-think, Think High, Think Max. reasoning_effort accepts high and max (low, medium, and xhigh are silently mapped).
Model type: Text-only. No native image, audio, or video input or output in the preview. Tool calling, JSON mode, chat-prefix completion (beta), and FIM (beta, non-thinking only) are all supported.
Ship date: April 23, 2026
Maker: DeepSeek
Pricing: V4-Flash at $0.028 per 1M input (cache hit) / $0.14 per 1M input (cache miss) / $0.28 per 1M output. V4-Pro at $0.145 per 1M input (cache hit) / $1.74 per 1M input (cache miss) / $3.48 per 1M output. The standard DeepSeek off-peak discount (50% off between roughly 11pm and 7am Beijing time) still applies.
Available on: chat.deepseek.com and the DeepSeek mobile app for end users. DeepSeek API (api.deepseek.com) via OpenAI-compatible ChatCompletions and Anthropic-compatible endpoints; switch by setting model to deepseek-v4-pro or deepseek-v4-flash, base URL unchanged. Open weights on Hugging Face and ModelScope under the MIT license, in FP8 Mixed (Base) and FP4 + FP8 Mixed (Instruct) precision. Pre-tuned adapters for Claude Code, OpenClaw, OpenCode, and CodeBuddy shipped alongside the model.
Headline benchmarks (V4-Pro Max, unless noted): LiveCodeBench Pass@1 at 93.5 (best among all models evaluated, ahead of Gemini 3.1 Pro at 91.7, K2.6 Thinking at 89.6, and Opus 4.6 Max at 88.8). Codeforces rating 3206 (ahead of GPT-5.4 xHigh at 3168 and Gemini 3.1 Pro at 3052). Apex Shortlist Pass@1 at 90.2 (new state of the art, past Gemini’s 89.1 and Opus 4.6’s 85.9). IMOAnswerBench 89.8 (ahead of Opus 4.6’s 75.3; behind GPT-5.4 xHigh’s 91.4). HMMT 2026 Feb Pass@1 95.2 (behind GPT-5.4’s 97.7 and Opus 4.6’s 96.2). GPQA Diamond 90.1. HLE 37.7. SimpleQA-Verified Pass@1 at 57.9 (Gemini 3.1 Pro sits at 75.6; this is the single biggest gap to a current frontier model). SWE-Verified 80.6. SWE-Pro 55.4. Terminal Bench 2.0 at 67.9. MCPAtlas Public at 73.6. MRCR 1M at 83.5. CorpusQA 1M at 62.0.
More details: DeepSeek-V4 Technical Report
What shipped
DeepSeek released V4 as a preview on Thursday and skipped the staged rollout. API, chat product, open weights, and integration adapters for the main agentic coding harnesses all landed the same morning. The pitch the announcement leans on: 1M context as default, frontier-adjacent benchmarks on coding and math, and pricing that’s an order of magnitude below Western closed-source competitors. DeepSeek also flagged that V4-Pro is now the model its own engineers use for internal agentic coding work, describing the experience as “better than Sonnet 4.5, close to Opus 4.6 non-thinking, but still a gap to Opus 4.6 thinking.”
The two-tier structure is the same product-line play OpenAI runs with GPT-5.5 / Mini / Nano, Anthropic runs with Opus / Sonnet / Haiku, and Google runs with Gemini Pro / Flash. DeepSeek has had the capability before; V4 is the first time it’s committed to a named Pro / Flash SKU split in its flagship family. V4-Pro takes the frontier swings. V4-Flash is positioned as “close enough for most workloads, much cheaper, much faster.” Long-horizon agentic tool use and deep factual recall are the parts of Pro you don’t get on Flash.
On the benchmark cards, V4-Pro-Max is compared against Opus 4.6 Max, GPT-5.4 xHigh, Gemini 3.1 Pro High, Kimi K2.6 Thinking, and GLM-5.1 Thinking. Two of those five are now one generation stale. Opus 4.7 shipped on April 16 and GPT-5.5 shipped earlier today. DeepSeek had a full week to rerun its card against Opus 4.7 and chose not to. GPT-5.5 they arguably couldn’t have benchmarked in time, though its existence was public and a note would have cost nothing.
V4 vs. the actual current-gen frontier
DeepSeek’s card does not run this comparison. Here it is, using V4-Pro Max’s numbers against Opus 4.7’s and GPT-5.5’s published scores on the benchmarks where all three overlap.
SWE-Bench Pro
V4-Pro Max: 55.4%
GPT-5.5: 58.6%
Opus 4.7: 64.3%.
SWE-Bench Verified
V4-Pro Max: 80.6%
Opus 4.7: 87.6%
OpenAI didn’t report a GPT-5.5 number on this one
Terminal-Bench 2.0
V4-Pro Max: 67.9%
Opus 4.7: 69.4%
GPT-5.5: 82.7%
GPQA Diamond
V4-Pro Max: 90.1%
Opus 4.7: 94.2%
BrowseComp
V4-Pro Max: 83.4%
GPT-5.5: 84.4%
GPT-5.5 Pro: 90.1%
The consistent pattern: on every overlapping current-gen benchmark, V4-Pro Max trails both Opus 4.7 and GPT-5.5 by 3–15 points, with the biggest gaps on the agentic workloads DeepSeek is positioning V4 for. The “frontier-adjacent” framing is true against the April 1 snapshot of closed-source. Against the April 23 snapshot, V4-Pro is visibly a generation behind on the benchmarks that matter most for the use cases DeepSeek is marketing.
What’s new
DeepSeek V4 is a clean architectural upgrade over V3.2, not a post-training refresh. The technical report lists three real changes, plus two product-level ones that change how the model fits into agent harnesses.
Hybrid attention: CSA + HCA. Compressed Sparse Attention handles selected tokens at higher fidelity, Heavily Compressed Attention sweeps the rest at dramatically reduced precision. Together, DeepSeek reports 27% of V3.2’s per-token inference FLOPs and 10% of its KV cache footprint at 1M context. The practical translation: 1M context becomes a default feature, not a premium SKU. Every API call gets the full window without a surcharge, which is different from what every Western lab is doing today.
Manifold-Constrained Hyper-Connections (mHC). An upgrade to residual connections designed to keep signal propagation stable across deeper MoE stacks while preserving expressivity. DeepSeek claims this was a training-stability unlock that also helps the model stay coherent across very long reasoning traces.
Muon optimizer. DeepSeek switched off AdamW for training and ran V4 on the Muon optimizer, citing faster convergence and better stability at MoE scale. Moonshot K2 was the first major frontier model to ship on Muon. V4 is now the second.
Two-stage post-training with on-policy distillation. DeepSeek trained independent domain-specific expert models (math, coding, agents, knowledge) with SFT plus RL-GRPO, then distilled them into a single unified V4 using on-policy distillation. This is the same direction Anthropic and OpenAI went with domain-specialized training pipelines, and it shows up in the benchmark shape: V4-Pro is simultaneously top-tier on LiveCodeBench / Codeforces / IMOAnswerBench and weaker than Gemini on SimpleQA.
Native adapters for agentic coding harnesses. V4 shipped with documented support for Claude Code, OpenClaw, OpenCode, and CodeBuddy out of the box. Thinking effort auto-upgrades to
maxwhen DeepSeek detects a Claude Code or OpenCode request (interesting). For the self-host crowd, you can drop V4-Pro into an existing Claude Code setup, swap the base URL, and keep the harness unchanged.
How and where to use it
Where it runs, what it’s good for, and where it’ll hurt you.
Where it’s available:
chat.deepseek.com and the DeepSeek mobile apps for consumer chat
The DeepSeek API via OpenAI-compatible
ChatCompletionsand Anthropic-compatible endpoints at the base URLapi.deepseek.com; just swap the model ID todeepseek-v4-proordeepseek-v4-flashOpen weights on Hugging Face and ModelScope under the MIT license. The V4-Flash checkpoint at ~158GB in FP4+FP8 mixed precision will run on a single H200 node; V4-Pro at ~862GB needs a real cluster. Claude Code, OpenClaw, OpenCode, and CodeBuddy all work out of the box with the DeepSeek endpoint. Third-party inference providers (Fireworks, Together, DeepInfra, Novita) are not live at launch but should pick it up within days given the MIT license.
What it’s good at:
Agentic coding inside Claude Code or OpenCode, where DeepSeek’s own engineers now prefer V4-Pro over Sonnet 4.5 for internal work
Competitive coding and algorithm problems (LiveCodeBench 93.5, Codeforces 3206, both leading against the benchmark lineup DeepSeek chose)
Math at the frontier (IMOAnswerBench 89.8, HMMT 2026 95.2, Apex Shortlist 90.2 all best-in-class or near it against that same lineup)
Long-context document work where 1M tokens is the default and the per-token price is a seventh to an eighth of the closed competitors (MRCR 1M at 83.5 and CorpusQA 1M at 62.0 are real scores, not marketing)
Any workload where open weights, MIT license, and on-premise deployment beat “hosted by the best lab”
Chinese language work where C-Eval (93.1 base), CMMLU (90.8 base), and Chinese-SimpleQA (84.4) are the top non-Gemini numbers on the board
What it’s bad at / shouldn’t be used for:
Any task where the head-to-head numbers against Opus 4.7 and GPT-5.5 matter
Knowledge-heavy factual recall where Gemini 3.1 Pro’s 75.6 on SimpleQA-Verified vs V4-Pro’s 57.9 is a real 18-point gap
The most demanding agentic coding workloads where GPT-5.5’s Terminal-Bench 2.0 at 82.7% and Opus 4.7’s SWE-Bench Pro at 64.3% have moved the ceiling
Visual or multi-modal tasks (V4 is text-only in the preview)
Data-sovereignty-sensitive enterprise work where sending your prompts to a Chinese-hosted API is a non-starter; in that case, you self-host the MIT-licensed weights, which is the whole point
Cost-floor low-latency workloads if sub-15ms inter-token latency matters and K2.6 or GPT-5.5 Mini are already wired into your stack
Any production pipeline that was counting on
deepseek-chatordeepseek-reasonerpast July 24, because those IDs sunset with this release
First impressions
OfficeChai’s coverage got at the shape of the release in one sentence:
“V4-Pro is genuinely competitive with GPT-5.4 and Claude Opus 4.6 across most categories, and beats both on coding benchmarks. It trails Gemini-3.1-Pro on general knowledge and HLE, and trails GPT-5.4 on a handful of agentic tasks.”
V4-Pro Max is the best open-weight model in existence today, comfortably past Kimi K2.6 Thinking and GLM-5.1 Thinking on the majority of benchmarks, and a credible alternative to Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro for a lot of workloads.
Startup Fortune pointed at what actually forces a response from the incumbents:
“When a lab with DeepSeek’s benchmark performance publishes these token prices, it becomes harder for any provider to hold their current rate card without a compelling justification... DeepSeek’s most powerful product today might not be the model itself, but the invoice it hands to every CTO who forwards the pricing page to their AI vendor.”
V4-Pro at $3.48 per million output tokens versus Opus 4.7 at $25 and GPT-5.5 at $30 means procurement teams burning real money on frontier closed-source have a 7-to-9x output cost reduction option that shows up on many benchmarks within 5–10 points. V4-Flash at $0.28 per million output tokens lands in Kimi K2.6 budget territory with roughly 2 points of quality distance from V4-Pro on most evals.
Jake’s take
The benchmark lineup is the part I can’t get past, and it’s the part that makes V4 read less like a frontier release and more like a catch-up release. DeepSeek had a full week between Opus 4.7 and their own launch, and they couldn’t be bothered to rerun their card against it.
The comparison lineup they did use (Opus 4.6, GPT-5.4) was superseded a week ago and this morning respectively, in a market where Anthropic and OpenAI are shipping every four to six weeks. This is a research lab claiming a 1.6-trillion-parameter frontier-scale release and a new attention architecture worth a technical report. The minimum professional standard is benchmarking against current-generation leaders. DeepSeek ducked it, and every head-to-head number I was able to run against Opus 4.7 and GPT-5.5 suggests I understand why: V4-Pro Max is 3 to 15 points behind, with the biggest gaps on the agentic-coding surfaces DeepSeek markets as V4’s strength. SWE-Bench Pro at 55.4 against Opus 4.7’s 64.3. SWE-Bench Verified at 80.6 against Opus 4.7’s 87.6. Terminal-Bench 2.0 at 67.9 against GPT-5.5’s 82.7. These are not marginal deltas, and they are not benchmarks DeepSeek could not have run.
The frustrating thing is that V4 didn’t need this framing to be a meaningful release. The MIT license on a 1.6T-parameter model and $3.48 per million output tokens against Opus 4.7’s $25 are more than enough for V4 to make waves. A straight “we are the best open-weight model, here is how we stack up against last-gen and current-gen closed-source, we are not there yet on agentic coding, we are much cheaper, here is the 1M-context architecture paper” would have landed with the weight a frontier lab’s release should carry.
Frontier-class labs don’t compare themselves to last-gen models and hope you don’t notice.




"Any task where the head-to-head numbers against Opus 4.7 and GPT-5.5 matter."
As Han Solo said, "Well, that's the real trick, isn't it? And it will cost you extra!"
Which it certainly will. Which is why I'll be using DeepSeek (or Kimi), not Anthropic or OpenAI.
Anyway, it's about damn time they dropped it. Been waiting it seems like ages.