Introducing FrontierCode

Original: FrontierCode

Cognition introduced FrontierCode, a benchmark for whether AI-written code is truly mergeable in production codebases.

Cognition launched FrontierCode, a coding benchmark focused on mergeability rather than only functional correctness. It evaluates correctness, tests, scope discipline, style, and repository-specific quality standards. Built with open-source maintainers and extensive quality control, it shows current frontier models still struggle: Claude Opus 4.8 scores 13.4% on the hardest Diamond subset, ahead of GPT-5.5 and Gemini 3.1 Pro.

Cognition has released FrontierCode, arguing that existing coding benchmarks no longer adequately measure how AI coding agents perform in production. According to the article, earlier evaluations such as SWE-Bench Verified and SWE-Bench Pro primarily tested functional correctness. Once models can write code that passes tests, the more important question becomes whether a maintainer would actually merge the PR. FrontierCode therefore shifts the focus to mergeability, an end-to-end measure of code quality covering behavioral correctness, regression safety, mechanical checks (build, lint, style), the effectiveness of agent-written tests, whether the scope of the change is appropriately restrained, and whether the code is readable and follows project conventions.
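To make the mergeability dimensions concrete, here is a minimal Python sketch of a per-PR checklist. It is purely illustrative: the class name, field names, and the all-or-nothing rule are assumptions made for this example, since the summary does not describe how FrontierCode actually records or aggregates its scores.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MergeabilityReview:
    """Hypothetical per-PR checklist; not FrontierCode's actual schema."""
    behavior_correct: bool        # change does what the task asks
    no_regressions: bool          # existing behavior still works
    mechanical_checks_pass: bool  # build, lint, style
    tests_effective: bool         # agent-written tests exercise the change
    scope_restrained: bool        # no unrelated edits
    follows_conventions: bool     # readable, matches project conventions


def failed_dimensions(review: MergeabilityReview) -> list[str]:
    """Return the names of the dimensions the PR failed, in field order."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


def is_mergeable(review: MergeabilityReview) -> bool:
    # Assumption: a PR counts as mergeable only if every dimension passes.
    return not failed_dimensions(review)


# A PR that passes its functional checks but has weak tests and unrelated edits.
review = MergeabilityReview(
    behavior_correct=True,
    no_regressions=True,
    mechanical_checks_pass=True,
    tests_effective=False,
    scope_restrained=False,
    follows_conventions=True,
)
print(failed_dimensions(review))  # ['tests_effective', 'scope_restrained']
print(is_mergeable(review))       # False
```

The example PR would pass a benchmark that checks only functional correctness, yet it fails this checklist, which is the gap the article says mergeability is meant to expose.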

Summaries are AI-generated; the original article is authoritative.