It’s that AI quality is slippery even for teams that obsess over measurement. For everyone else, vibes are a liability. So ...
Anthropic, of all companies, just shipped three quality regressions in Claude Code that its own evals didn’t catch. Think ...