
GPT-5.5, DeepSeek V4 and Claude Opus 4.7: the new benchmark war


The AI race shifted again this week with three launches that do not compete on exactly the same scoreboard, but still tell us a lot about where the market is heading: OpenAI’s GPT-5.5, Anthropic’s Claude Opus 4.7, and DeepSeek V4. The real story is not just who made the loudest announcement. It is who showed the best performance in practical work: coding, long-form reasoning, tool use, large-context handling, and cost per request.

The first editorial caveat matters: these benchmarks are not perfectly apples-to-apples. OpenAI and Anthropic publish their own metrics around agentic tasks, coding, and long-horizon work, while DeepSeek leans on a mix of technical claims, third-party coverage, and a very aggressive efficiency narrative. Even so, when you put them side by side, a clear strategy emerges for each lab.

OpenAI: GPT-5.5 is built for agentic work

OpenAI describes GPT-5.5 as its most intelligent and intuitive model yet, with a clear focus on writing and debugging code, researching online, analyzing data, and moving across tools until a task is finished. In the company’s official post, GPT-5.5 is framed as especially strong in agentic coding, computer use, knowledge work, and early scientific research. In plain English: the pitch is not just better answers, but more work done with less supervision.

The numbers are strong too. OpenAI says GPT-5.5 reaches 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, and 78.7% on OSWorld-Verified. The company also emphasizes a useful operational detail: GPT-5.5 keeps per-token latency close to GPT-5.4, while using fewer tokens to complete the same Codex tasks. That suggests gains in both capability and efficiency.

Claude Opus 4.7: less spectacle, more consistency

Anthropic’s approach is different. Claude Opus 4.7 is presented as a meaningful upgrade over Opus 4.6, especially for advanced software engineering, long-running tasks, and precise instruction following. The company says Opus 4.7 improves on Opus 4.6 by 13% on its 93-task coding benchmark. It also reports a score of 0.715 on its internal research-agent benchmark, with the best performance in long-context work and data discipline.

The technical takeaway is that Claude is not trying to win by hype, but by consistency, precision, and better behavior on tasks where the model must verify its own steps before replying. Anthropic also makes clear that Opus 4.7 remains below its more ambitious Claude Mythos Preview, but it still positions Opus 4.7 as the strongest public model for serious work.

Price is also part of the pitch: Anthropic keeps pricing unchanged from Opus 4.6, which makes the upgrade easier to sell to developers.

DeepSeek V4: the efficiency threat

DeepSeek is not trying to win on prestige. It is trying to win on economics. Its new V4 family uses a Mixture-of-Experts design with 1.6 trillion total parameters, 49 billion active parameters, and a 1 million token context window. TechCrunch reported that DeepSeek says V4 and V4 Pro close the gap with frontier models in several reasoning tasks, and that coding competition benchmarks are comparable to GPT-5.4.
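To make the Mixture-of-Experts figures concrete, here is a minimal sketch that computes what share of the model's weights is active for any one token. It uses only the two parameter counts quoted above; the variable names are illustrative, and the calculation says nothing about routing overhead, memory footprint, or real inference speed.

```python
# Parameter counts as quoted in the article for the DeepSeek V4 family.
TOTAL_PARAMS = 1.6e12   # 1.6 trillion total parameters
ACTIVE_PARAMS = 49e9    # 49 billion parameters active per token

# Share of the weights that takes part in processing a single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Active share per token: {active_fraction:.1%}")  # prints "Active share per token: 3.1%"
```

In other words, roughly 3% of the weights do the work on each token, which is the basis of the efficiency narrative.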

The more aggressive part is pricing. DeepSeek V4 Pro is listed at $0.145 per million input tokens and $3.48 per million output tokens, far below GPT-5.5 and Claude Opus 4.7. That is its real weapon: not necessarily winning every benchmark, but showing that a very capable model can be far cheaper to run. TechCrunch also notes DeepSeek’s own warning that it still trails state-of-the-art frontier models by roughly 3 to 6 months in general knowledge.
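A rough sketch of what those list prices mean per request, assuming simple linear per-token billing with no caching discounts or minimum charges. The function name and the example token counts are hypothetical; only the two prices come from the figures above.

```python
# List prices quoted in the article for DeepSeek V4 Pro (USD per million tokens).
INPUT_PRICE_PER_M = 0.145
OUTPUT_PRICE_PER_M = 3.48


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request under linear per-token pricing."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Hypothetical request: a 20,000-token prompt and a 2,000-token answer.
cost = request_cost(20_000, 2_000)
print(f"${cost:.5f} per request")  # prints "$0.00986 per request"
```

At these prices an output token costs 24 times as much as an input token, so generation-heavy workloads would see less of the headline saving than prompt-heavy ones.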

So who wins?

If the question is who appears to lead the public quality race, OpenAI currently looks strongest with GPT-5.5. If the question is who seems most reliable for long, exacting work, Claude Opus 4.7 remains the model to watch. And if the question is who puts the most pressure on the market through pricing and access, DeepSeek V4 is the biggest disruptor.

The real conclusion is more useful than any single winner: model competition is no longer decided by raw intelligence alone. It is now a blend of benchmark strength, cost, latency, context, trust, and real-world usefulness. On that scoreboard, OpenAI leads on visibility, Claude leads on consistency, and DeepSeek leads on economic pressure.

Verified sources: OpenAI, Anthropic, TechCrunch, and VentureBeat.


ACIAPR AI News
