Latest in AI

Introducing FrontierCode ★ 78
Hacker News (AI keywords) · 49 days ago · Benchmark
Cognition launched FrontierCode, a coding benchmark focused on mergeability rather than only functional correctness. It evaluates correctness, tests, scope discipline, style, and repository-specific quality standards. Built with open-source maintainers and extensive quality control, it shows current frontier models still struggle: Claude Opus 4.8 scores 13.4% on the hardest Diamond subset, ahead of GPT-5.5 and Gemini 3.1 Pro.
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
Latent Space · 53 days ago · Benchmark
Latent Space talks with Lukas Petersson and Axel Backlund of Andon Labs, the authors behind VendingBench. The episode focuses on evaluating Claude models across a range from Haiku to Mythos. It also discusses how they build frontier evals from scratch, with an emphasis on creating benchmarks that remain useful and meaningful over time.
Harness, Scaffold, and the AI Agent Terms Worth Getting Right ★ 75
Hugging Face Blog · 64 days ago · Tutorial
Hugging Face has published a comprehensive glossary of AI agent terminology to resolve industry-wide confusion. The guide focuses on defining critical concepts such as "scaffold" (the code wrapping the LLM) and "harness" (the evaluation and execution environment). This standardization helps developers and researchers communicate more precisely when building and benchmarking agentic systems.
Vercel Evals Find That Defining Agents with AGENTS.md Outperforms Traditional "Skills" Configuration ★ 78
Vercel Changelog · 182 days ago · Release
In its latest technical blog post, Vercel shared a significant finding regarding AI Agent architecture: in their Agent Evaluations (Agent Evals), using a…
Eval-Driven Development: Build Better AI Applications Faster ★ 80
Vercel Changelog · 649 days ago · Opinion
As generative AI applications become more widespread, one of the biggest challenges developers face is the "non-deterministic" output of large language models…