Latest in AI

QIMMA ⛰: The First Quality-First Arabic Large Language Model (LLM) Leaderboard
Hugging Face Blog · 98 days ago · Release
The Technology Innovation Institute (TII) of the United Arab Emirates — the organization behind the well-known open-source model Falcon — has officially…
Interpreting the Current Performance Gap Between Open and Closed AI Models: Beyond the Myth of a Single Evaluation Metric ★ 75
Interconnects (Nathan L.) · 98 days ago · Opinion
In today's AI landscape, the performance gap between open-weights models (such as Meta's Llama family) and closed-source models (such as OpenAI's GPT and…
Google DeepMind Introduces a "Cognitive Framework" for Evaluating Progress Toward AGI and Launches a Kaggle Hackathon to Build New Evaluation Standards ★ 85
Google DeepMind Blog · 132 days ago · Release
As large language models (LLMs) advance rapidly, traditional AI evaluation benchmarks (such as MMLU, GSM8K, and others) are quickly facing the twin challenges…
Import AI 446: Nuclear LLMs, China's Large-Scale AI Benchmark, AI Evaluation and Policy ★ 75
Import AI (Jack Clark) · 155 days ago · Commentary
In this edition of Import AI 446, author Jack Clark explores three highly forward-looking and interconnected topics in current AI development: Nuclear LLMs…
Opus 4.6, Codex 5.3, and the Post-Benchmark Era: How Should We Evaluate AI Models in 2026? ★ 80
Interconnects (Nathan L.) · 168 days ago · Opinion
In 2026, with the release of next-generation models such as Anthropic's Opus 4.6 and OpenAI's Codex 5.3, the AI community faces a fundamental challenge…
AssetOpsBench: A New Benchmark Bridging the Gap Between AI Agent Evaluation Benchmarks and Real-World Industrial Applications ★ 75
Hugging Face Blog · 188 days ago · Release
In today's era of rapid development in AI Agent technology, how to evaluate the performance of these Agents in real-world settings — particularly in industrial…
Google DeepMind Launches the FACTS Benchmark Suite: Systematically Evaluating the Factuality of Large Language Models ★ 80
Google DeepMind Blog · 231 days ago · Release
As large language models (LLMs) are deployed across a wide range of industries, ensuring the "factuality" of model outputs and reducing "hallucination" has…
Give Your AI a Job Interview: How to Evaluate and Test AI's Real Working Ability ★ 80
One Useful Thing (Mollick) · 258 days ago · Opinion
As AI tools (such as ChatGPT, Claude, and others) become more prevalent in the workplace, we are increasingly relying on them for decision-making advice…
Back to the Future: Hugging Face Launches FutureBench to Evaluate AI Agents' Ability to Predict Future Events ★ 75
Hugging Face Blog · 376 days ago · Release
What is FutureBench? As large language models (LLMs) and AI agents have rapidly advanced, traditional static benchmarks (such as MMLU and GSM8K) face a…
Introducing HELMET: A Next-Generation Benchmark for Comprehensively Evaluating Long-Context LLMs ★ 80
Hugging Face Blog · 468 days ago · Release
Background and Pain Points: Moving Beyond the Overly Simple "Needle in a Haystack" Test. In recent years, the context window length supported by large…
Letting Large Models Debate: The First Multilingual LLM Debate Competition ★ 75
Hugging Face Blog · 615 days ago · Release
This article from the Hugging Face blog introduces "The First Multilingual LLM Debate Competition." As large language models (LLMs) have rapidly advanced…
Hugging Face Launches the Open FinLLM Leaderboard: An Open-Source Evaluation Benchmark Built for Financial Large Language Models ★ 75
Hugging Face Blog · 662 days ago · Release
Hugging Face has officially launched the "Open FinLLM Leaderboard" — a new platform dedicated to evaluating and tracking the performance of large language…
Hugging Face and Artificial Analysis Launch the "Text to Image" Leaderboard and Arena ★ 75
Hugging Face Blog · 782 days ago · New Tool
Hugging Face has partnered with independent AI evaluation organization Artificial Analysis to officially launch the "Text to Image Leaderboard & Arena." This…
Hugging Face and Upstage Launch the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem
Hugging Face Blog · 889 days ago · Release
Hugging Face and South Korea's leading AI startup Upstage have jointly announced the launch of the "Open Ko-LLM Leaderboard." This is a brand-new evaluation…
Hugging Face Launches the "Enterprise Scenarios Leaderboard": An LLM Evaluation Benchmark Designed for Real-World Applications ★ 75
Hugging Face Blog · 909 days ago · Release
Hugging Face has partnered with Patronus AI, a startup focused on LLM evaluation and defense, to officially launch the Enterprise Scenarios Leaderboard…
Hugging Face Launches the "Hallucination Leaderboard": Open-Source Quantitative Evaluation of LLM Hallucination Rates ★ 75
Hugging Face Blog · 911 days ago · Release
While large language models (LLMs) have demonstrated remarkable generative capabilities across many domains, "hallucination" — where a model confidently…
Hugging Face Launches the AI Secure LLM Safety Leaderboard: In-Depth Evaluation of LLM Trustworthiness Based on the DecodingTrust Framework ★ 75
Hugging Face Blog · 914 days ago · Release
Introduction: Capability Is Not Safety, a New Benchmark for LLM Safety Evaluation. As large language models (LLMs) are adopted more deeply across…