Latest in AI

Is It Agentic Enough? Benchmarking Open Models on Your Own Tooling
Hugging Face Blog · 40 days ago · Benchmark
Hugging Face published a guide examining whether open-weight models are sufficiently capable for agentic workflows when tested against custom tooling rather than standardized benchmarks. The piece challenges practitioners to move beyond generic leaderboard scores and assess agent performance in the context of their own use cases. It positions open models as viable candidates for production agentic pipelines, provided evaluation is grounded in realistic tool-use scenarios.
FrontierCode: Benchmarking for Code Quality over Slop
Latent Space · 49 days ago · Benchmark
Latent Space briefly announced FrontierCode with the line “We made a thing!” From the title, FrontierCode appears to be a benchmark for frontier coding systems that prioritizes code quality rather than sheer code generation volume. The provided excerpt does not include methodology, model results, datasets, or tooling details, so conclusions should remain cautious.
Introducing Search Toolkit ★ 72
Mistral AI News · 50 days ago · New Tool
Mistral AI introduced Search Toolkit in public preview as a composable framework for AI search infrastructure. It unifies ingestion, retrieval, and evaluation with support for parsing, chunking, embeddings, BM25, dense retrieval, hybrid search, and standard retrieval metrics. The toolkit targets enterprise search, RAG quality improvement, and domain-specific retrieval, with a starter app using Docker, uv, and Vespa.
How to Stop Shipping Low-Quality RL Environments (with Examples)
Latent Space · 52 days ago · Tutorial
The post argues that low-quality RL environments are not harmless infrastructure bugs; they can make models worse by feeding them broken learning signals. Based on years of inspecting trajectories, the author highlights recurring environment and harness failures that teams need to fix. The practical lesson is to debug the training environment, grader, and interaction traces before blaming the model or scaling training.
ITBench-AA: Frontier Models Score Below 50% on Enterprise IT Tasks ★ 72
Hugging Face Blog · 61 days ago · Benchmark
Artificial Analysis and IBM present ITBench-AA, described in the title as the first benchmark for agentic enterprise IT tasks. The headline result is that frontier models score below 50%, suggesting current systems still struggle with enterprise-grade agent workflows. The original article text is unavailable here, so task design, evaluated models, scoring methodology, and rankings cannot be confirmed.
Hugging Face and IBM Jointly Launch Open Agent Leaderboard: A New Benchmark for Evaluating Open-Source AI Agent Performance ★ 80
Hugging Face Blog · 70 days ago · Release
Hugging Face and IBM Research have jointly announced the launch of the "Open Agent Leaderboard," aimed at establishing an objective, standardized, and fully…
Hugging Face Adds an Anti-Gaming Mechanism to the Open ASR Leaderboard, Using Private Test Data to Counter Benchmaxxers ★ 75
Hugging Face Blog · 83 days ago · Release
Hugging Face has recently made a major update to its popular Open ASR (Automatic Speech Recognition) leaderboard, aimed at combating the increasingly serious…
QIMMA ⛰: The First Quality-First Arabic Large Language Model (LLM) Leaderboard
Hugging Face Blog · 98 days ago · Release
The Technology Innovation Institute (TII) of the United Arab Emirates — the organization behind the well-known open-source model Falcon — has officially…
Reading the Current Performance Gap Between Open and Closed AI Models: Beyond the Myth of a Single Evaluation Metric ★ 75
Interconnects (Nathan L.) · 98 days ago · Opinion
In today's AI landscape, the performance gap between open-weights models (such as Meta's Llama family) and closed-source models (such as OpenAI's GPT and…
Ecom-RLVE: Adaptive, Verifiable Reinforcement Learning Environments for E-Commerce Conversational Agents ★ 75
Hugging Face Blog · 103 days ago · Release
As large language models (LLMs) become increasingly widespread, more and more companies are attempting to deploy AI agents in e-commerce customer service and…
Inside VAKRA: IBM Research's New Benchmark for Evaluating AI Agent Reasoning, Tool Calling, and Failure Modes ★ 75
Hugging Face Blog · 103 days ago · Release
As generative AI technology has evolved, the industry's focus has shifted from pure "Large Language Models (LLMs)" to "AI Agents" capable of autonomously…
EVA: ServiceNow AI Launches a New Voice Agent Evaluation Framework ★ 75
Hugging Face Blog · 126 days ago · Release
With the proliferation of GPT-4o, Gemini Live, and various end-to-end voice models, Voice Agents have become an important frontier in AI applications. However…
Google DeepMind Introduces a "Cognitive Framework" for Measuring Progress Toward AGI, Alongside a Kaggle Hackathon to Build New Evaluation Standards ★ 85
Google DeepMind Blog · 132 days ago · Release
As large language models (LLMs) advance rapidly, traditional AI evaluation benchmarks (such as MMLU, GSM8K, and others) are quickly facing the twin challenges…
Import AI 446: Nuclear LLMs, a Large Chinese AI Benchmark, and AI Evaluation and Policy ★ 75
Import AI (Jack Clark) · 154 days ago · Commentary
In this edition of Import AI 446, author Jack Clark explores three highly forward-looking and interconnected topics in current AI development: Nuclear LLMs…
OpenEnv in Practice: Evaluating Tool-Using AI Agents in Real-World Environments ★ 75
Hugging Face Blog · 166 days ago · New Tool
As AI Agent (intelligent agent) technology advances rapidly, evaluating how these agents perform in the real world has become one of the greatest challenges…
Opus 4.6, Codex 5.3, and the Post-Benchmark Era: How Should We Evaluate AI Models in 2026? ★ 80
Interconnects (Nathan L.) · 168 days ago · Opinion
In 2026, with the release of next-generation models such as Anthropic's Opus 4.6 and OpenAI's Codex 5.3, the AI community faces a fundamental challenge…
Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic Large Language Models
Hugging Face Blog · 182 days ago · Release
As Arabic large language models (LLMs) develop rapidly, accurately evaluating model performance across different regional dialects has become a significant…
AssetOpsBench: A New Benchmark Bridging the Gap Between AI Agent Evaluation and Real-World Industrial Use ★ 75
Hugging Face Blog · 188 days ago · Release
In today's era of rapid development in AI Agent technology, how to evaluate the performance of these Agents in real-world settings — particularly in industrial…
An Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator ★ 70
Hugging Face Blog · 222 days ago · Tutorial
As large language models (LLMs) develop in two divergent directions — with extremely large cloud-based models at one end and lightweight "Nano"-scale models…
Google DeepMind Launches the FACTS Benchmark Suite: Systematically Evaluating the Factuality of Large Language Models ★ 80
Google DeepMind Blog · 230 days ago · Release
As large language models (LLMs) are deployed across a wide range of industries, ensuring the "factuality" of model outputs and reducing "hallucination" has…
Give Your AI a Job Interview: How to Evaluate and Test an AI's Real-World Work Ability ★ 80
One Useful Thing (Mollick) · 258 days ago · Opinion
As AI tools (such as ChatGPT, Claude, and others) become more prevalent in the workplace, we are increasingly relying on them for decision-making advice…
Rethinking How We Measure AI Intelligence: Google DeepMind Launches Game Arena, an Open-Source Evaluation Platform ★ 78
Google DeepMind Blog · 277 days ago · New Tool
With the rapid advancement of artificial intelligence, traditional static benchmarks (such as MMLU and GSM8K) are facing serious challenges. Many frontier…
Hugging Face Launches BigCodeArena: End-to-End Code LLM Evaluation Through Actual Code Execution ★ 75
Hugging Face Blog · 294 days ago · Release
Hugging Face and the BigCode community have jointly launched a new code model evaluation platform called "BigCodeArena." As AI-assisted coding (such as Copilot…
Hugging Face Launches RTEB: A New Retrieval Evaluation Standard Offering a More Realistic Test Benchmark for RAG Systems ★ 80
Hugging Face Blog · 300 days ago · Release
As Retrieval-Augmented Generation (RAG) becomes the dominant architecture for enterprises deploying large language models (LLMs), accurately evaluating the…
Hugging Face Launches Gaia2 and ARE: Empowering the Community to Study AI Agents in Depth ★ 85
Hugging Face Blog · 309 days ago · Release
AI agents are currently the hottest research direction in the AI field, but how to objectively, safely, and reproducibly evaluate agent capabilities has long…
FilBench Released: Do Large Language Models Really Understand Filipino? A New Evaluation Benchmark Arrives
Hugging Face Blog · 350 days ago · Release
The Hugging Face team and community have collaborated to launch a new evaluation benchmark called "FilBench," aimed at answering a key question: do large…
📚 3LM: A New Benchmark for the STEM and Coding Abilities of Arabic Large Language Models
Hugging Face Blog · 360 days ago · Release
The Technology Innovation Institute (TII) of the UAE — the organization behind the Falcon models — has announced on the Hugging Face blog the launch of a new…
Back to the Future: Hugging Face Launches FutureBench to Evaluate AI Agents' Ability to Predict Future Events ★ 75
Hugging Face Blog · 376 days ago · Release
What is FutureBench? As large language models (LLMs) and AI agents have rapidly advanced, traditional static benchmarks (such as MMLU and GSM8K) face a…
Announcing the NeurIPS 2025 E2LM Competition: A Focus on Early-Training Evaluation of Language Models
Hugging Face Blog · 388 days ago · Release
Hugging Face and the UAE's Technology Innovation Institute (TII, the organization behind the well-known open-source model Falcon) have jointly announced a new…
Hugging Face Launches ScreenSuite: The Most Comprehensive GUI Agent Evaluation Suite! ★ 80
Hugging Face Blog · 417 days ago · New Tool
As artificial intelligence moves beyond simple "text-based conversation" into the era of Agents (intelligent agents) that actively execute tasks, enabling AI…
