
I tested Claude Opus 4.6, 4.7, and 4.8 across real production workflows. The newer models score higher on benchmarks but break file creation, Excel parsing, and code execution. 4.6 does what you ask it to do. Newer is not always better when reliability matters more than reasoning scores.
View original source — Hacker Noon ↗



