Leaked DeepSeek V4 benchmarks claim a 1M token context and multimodal support, but sources remain unverified and ...
What if the future of coding weren't human, but powered by an AI so advanced it could outpace even the most skilled developers? Enter Claude Opus 4.5, a model that doesn't just assist with ...
IEEE Spectrum on MSN
Why are large language models so terrible at video games?
AI models code simple games, but struggle to play them ...
A new report today from code quality testing startup SonarSource SA warns that while the latest large language models may be getting better at passing coding benchmarks, they are ...
Microsoft has unveiled a groundbreaking artificial intelligence model, ...
To fix the way we test and measure models, AI is learning tricks from social science. It’s not easy being one of Silicon Valley’s favorite benchmarks. SWE-Bench (pronounced “swee bench”) launched in ...
CodeSignal, which makes skills assessment and AI-powered learning tools, recently released an interesting new benchmark study on the performance of AI code assistance against human developers. The big ...
In this episode of eSpeaks, Jennifer Margles, Director of Product Management at BMC Software, discusses the transition from traditional job scheduling to the era of the autonomous enterprise. eSpeaks’ ...