Latest in AI

Claude Fable 5 First-Day Hands-On Tests Draw Strong Reactions
量子位 QbitAI · 48 days ago · Benchmark
QbitAI reports that Anthropic’s Claude Fable 5 quickly drew widespread hands-on testing after release. Examples include Minecraft UI generation, Photoshop-like creative tools, browser games, websites, Three.js scenes, and coding tasks. The article highlights impressive demos and benchmark claims, but also notes failures in large codebase refactoring and high usage costs.
Hugging Face and IBM Jointly Launch Open Agent Leaderboard: A New Benchmark for Evaluating Open-Source AI Agent Performance ★ 80
Hugging Face Blog · 70 days ago · Release
Hugging Face and IBM Research have jointly announced the launch of the "Open Agent Leaderboard," aimed at establishing an objective, standardized, and fully…
Google DeepMind Launches the FACTS Benchmark Suite: Systematically Evaluating the Factuality of Large Language Models ★ 80
Google DeepMind Blog · 231 days ago · Release
As large language models (LLMs) are deployed across a wide range of industries, ensuring the "factuality" of model outputs and reducing "hallucination" has…
TextQuests: How Well Do LLMs Actually Perform in Text Adventure Games? Hugging Face Launches a New Evaluation Benchmark ★ 75
Hugging Face Blog · 350 days ago · Release
Hugging Face has recently introduced a new benchmark called "TextQuests," designed to evaluate the performance of large language models (LLMs) in text-based…
Back to the Future: Hugging Face Launches FutureBench to Evaluate AI Agents' Ability to Predict Future Events ★ 75
Hugging Face Blog · 376 days ago · Release
What is FutureBench? As large language models (LLMs) and AI agents have rapidly advanced, traditional static benchmarks (such as MMLU and GSM8K) face a…
ScreenEnv: Deploy Your Full-Stack Desktop AI Agent Environment ★ 82
Hugging Face Blog · 383 days ago · New Tool
With the rise of Anthropic's Claude 3.5 Sonnet "Computer Use" and various GUI-oriented multimodal models, "desktop agents" have become one of the hottest areas…
Introducing HELMET: A Next-Generation Benchmark for Comprehensively Evaluating Long-Context LLMs ★ 80
Hugging Face Blog · 468 days ago · Release
Background and Pain Points: Moving Beyond the Overly Simple "Needle in a Haystack" Test. In recent years, the context window length supported by large…
Hugging Face Launches DABStep: A New Benchmark for Evaluating the Multi-Step Reasoning of Data Agents ★ 75
Hugging Face Blog · 539 days ago · Release
As large language model (LLM) technology has evolved, AI has transformed from a simple question-answering assistant into an "AI agent" capable of proactively…
Rethinking Arabic LLM Evaluation: The AraGen Benchmark and 3C3H Evaluation Framework Arrive on Hugging Face
Hugging Face Blog · 601 days ago · Release
Background and Challenges: The Difficulty of Evaluating Non-English LLMs. In the current landscape of large language model (LLM) development, evaluating…
Letting Large Models Debate: The First Multilingual LLM Debate Competition ★ 75
Hugging Face Blog · 615 days ago · Release
This article from the Hugging Face blog introduces "The First Multilingual LLM Debate Competition." As large language models (LLMs) have rapidly advanced…
Hugging Face and Atla Launch "Judge Arena": A New Benchmark for Evaluating LLMs as Judges ★ 80
Hugging Face Blog · 616 days ago · Release
As large language models (LLMs) have rapidly advanced, traditional static benchmarks (such as MMLU) have increasingly faced saturation and gaming problems. As…
BigCodeBench: The Next-Generation Code LLM Benchmark and Successor to HumanEval ★ 80
Hugging Face Blog · 770 days ago · Release
As large language models (LLMs) have made tremendous strides in code generation, the long-standing industry gold standard — the HumanEval benchmark — has…
Introducing the LiveCodeBench Leaderboard: Comprehensive, Contamination-Free Evaluation of Code LLMs ★ 75
Hugging Face Blog · 833 days ago · Release
As code large language models (Code LLMs) develop rapidly, fairly and accurately evaluating their capabilities has become a major challenge. Traditional…
Hugging Face Launches the ConTextual Leaderboard: Evaluating Multimodal Models' Joint Image-Text Reasoning in Text-Rich Scenes ★ 75
Hugging Face Blog · 875 days ago · Release
Hugging Face has announced the launch of a new multimodal benchmark and leaderboard called "ConTextual," aimed at addressing the shortcomings of existing…
Hugging Face Launches the NPHardEval Leaderboard: Revealing LLM Reasoning Ability Through Computational Complexity and Dynamic Updates ★ 75
Hugging Face Blog · 907 days ago · Release
Hugging Face has announced the launch of the new **NPHardEval** leaderboard — a benchmark specifically designed to evaluate the reasoning capabilities of large…
Hugging Face Launches the "Enterprise Scenarios Leaderboard": An LLM Benchmark Designed for Real-World Applications ★ 75
Hugging Face Blog · 909 days ago · Release
Hugging Face has partnered with Patronus AI — a startup focused on LLM evaluation and defense — to officially launch the **Enterprise Scenarios Leaderboard**…
