Latest in AI

Introducing HELMET: A Next-Generation Benchmark for Holistically Evaluating Long-Context LLMs ★ 80
Hugging Face Blog · 468 days ago · Release
Background and Pain Points: Moving Beyond the Overly Simple "Needle in a Haystack" Test. In recent years, the context window length supported by large…
Hugging Face Sets a New Standard for Arabic LLM Evaluation: Introducing Arabic Instruction Following (IFEval) and an Updated AraGen
Hugging Face Blog · 476 days ago · Release
Hugging Face recently announced a major upgrade to its Arabic Large Language Model (LLM) leaderboard, aiming to provide a more credible and comprehensive…
Trace and Evaluate Your smolagents Agents with Arize Phoenix ★ 75
Hugging Face Blog · 515 days ago · Tutorial
The recently launched `smolagents` from Hugging Face is an AI agent framework that emphasizes being lightweight and code-centric (Code Agent). However, as the…
Hugging Face Launches Math-Verify: Correcting Math Evaluation Bias on the Open LLM Leaderboard ★ 78
Hugging Face Blog · 529 days ago · New Tool
Hugging Face's Open LLM Leaderboard has long served as an important barometer for measuring the capabilities of open-source large language models (LLMs)…
Hugging Face Launches the Second-Generation Open Arabic LLM Leaderboard (Open Arabic LLM Leaderboard 2)
Hugging Face Blog · 533 days ago · Release
Hugging Face, in collaboration with its partners, has officially launched the "Open Arabic LLM Leaderboard 2.0." With the explosive growth of Arabic large…
Evaluating Audio Reasoning: Hugging Face Launches the Big Bench Audio Benchmark ★ 75
Hugging Face Blog · 585 days ago · Release
As multimodal large language models (such as GPT-4o, Gemini, and various open-source audio models) continue to proliferate, AI's ability to process audio has…
How Well Can LLMs Self-Correct? Hugging Face Teams Up with Keras and TPUs for an Arena Experiment ★ 75
Hugging Face Blog · 600 days ago · Commentary
As large language models (LLMs) are increasingly applied in software development and logical reasoning, there is growing interest in whether models possess the…
Rethinking Arabic LLM Evaluation: The AraGen Benchmark and 3C3H Evaluation Framework Arrive on Hugging Face
Hugging Face Blog · 601 days ago · Release
Background and Challenges: The Difficulty of Evaluating Non-English LLMs. In the current landscape of large language model (LLM) development, evaluating…
Letting Large Models Debate: The First Multilingual LLM Debate Competition ★ 75
Hugging Face Blog · 615 days ago · Release
This article from the Hugging Face blog introduces "The First Multilingual LLM Debate Competition." As large language models (LLMs) have rapidly advanced…
Hugging Face Launches the New "Open Japanese LLM Leaderboard" to Accelerate Japanese LLM Evaluation ★ 75
Hugging Face Blog · 615 days ago · New Tool
Hugging Face has officially launched the "Open Japanese LLM Leaderboard," a community-driven platform dedicated to evaluating the performance of…
Hugging Face and Atla Launch "Judge Arena": A New Benchmark for Evaluating LLMs as Judges ★ 80
Hugging Face Blog · 616 days ago · Release
As large language models (LLMs) have rapidly advanced, traditional static benchmarks (such as MMLU) have increasingly faced saturation and gaming problems. As…
Expert Support Case Study: Using LLM-as-a-Judge Evaluation to Strengthen Digital Green's RAG Agricultural Q&A Application ★ 75
Hugging Face Blog · 638 days ago · Tutorial
This case study provides a detailed account of how non-profit organization Digital Green, with support from Hugging Face's Expert Support team, optimized its…
CinePile 2.0: Building a Stronger Long-Video Question-Answering Dataset with Adversarial Refinement ★ 75
Hugging Face Blog · 643 days ago · Release
CinePile is a multimodal question-answering dataset focused on movie and long-video understanding. In traditional dataset construction, researchers commonly…
Hugging Face Launches the Open FinLLM Leaderboard: An Open-Source Evaluation Benchmark Built for Financial LLMs ★ 75
Hugging Face Blog · 662 days ago · Release
Hugging Face has officially launched the "Open FinLLM Leaderboard" — a new platform dedicated to evaluating and tracking the performance of large language…
🇨🇿 BenCzechMark: Can Your LLM Understand Czech? New Czech Benchmark Released
Hugging Face Blog · 665 days ago · Release
The Hugging Face team and its collaborators have jointly launched a new benchmark called "BenCzechMark," designed to evaluate the understanding and generation…
Hugging Face and Artificial Analysis Launch the "Text to Image" Leaderboard and Arena ★ 75
Hugging Face Blog · 782 days ago · New Tool
Hugging Face has partnered with independent AI evaluation organization Artificial Analysis to officially launch the "Text to Image Leaderboard & Arena." This…
Meta Launches CyberSecEval 2: A Comprehensive Framework for Evaluating LLM Cybersecurity Risks and Defenses ★ 75
Hugging Face Blog · 795 days ago · Release
As large language models (LLMs) become increasingly prevalent in software development and automated workflows, their "dual-use" risks in the cybersecurity…
Hugging Face Launches the Open Arabic LLM Leaderboard to Accelerate Arabic LLM Evaluation and Development
Hugging Face Blog · 805 days ago · Release
Hugging Face has announced the launch of the "Open Arabic LLM Leaderboard," an important initiative aimed at advancing Arabic natural language processing (NLP)…
Hugging Face Launches an Open Leaderboard for Hebrew LLMs, Advancing Evaluation of Non-English AI Models
Hugging Face Blog · 814 days ago · Release
Hugging Face has officially launched the "Open Leaderboard for Hebrew LLMs," an open-source evaluation platform specifically designed for Hebrew large language…
Hugging Face Launches the Open Chain of Thought (CoT) Leaderboard: Evaluating the Reasoning and Chain-of-Thought Abilities of Open-Source Models ★ 75
Hugging Face Blog · 826 days ago · Release
Hugging Face has announced the launch of the new "Open Chain of Thought (CoT) Leaderboard," a public platform specifically designed to evaluate and compare the…
Hugging Face Launches the Open Medical-LLM Leaderboard: Standardized Evaluation of Large Language Models in Healthcare ★ 75
Hugging Face Blog · 830 days ago · Release
Hugging Face has announced the official launch of the "Open Medical-LLM Leaderboard" in collaboration with researchers from Open Life Science AI and the…
Introducing the LiveCodeBench Leaderboard: Holistic and Contamination-Free Evaluation of Code LLMs ★ 75
Hugging Face Blog · 833 days ago · Release
As code large language models (Code LLMs) develop rapidly, fairly and accurately evaluating their capabilities has become a major challenge. Traditional…
Introducing the Chatbot Guardrails Arena: A New Arena for Evaluating LLM Safety Guardrails ★ 75
Hugging Face Blog · 859 days ago · Release
As large language models (LLMs) have been widely adopted across industries, ensuring AI systems remain safe and compliant while preventing harmful outputs has…
Hugging Face Launches the ConTextual Leaderboard: Evaluating Joint Image-Text Reasoning of Multimodal Models in Text-Rich Scenes ★ 75
Hugging Face Blog · 875 days ago · Release
Hugging Face has announced the launch of a new multimodal benchmark and leaderboard called "ConTextual," aimed at addressing the shortcomings of existing…
Hugging Face and Upstage Launch the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem
Hugging Face Blog · 889 days ago · Release
Hugging Face and South Korea's leading AI startup Upstage have jointly announced the launch of the "Open Ko-LLM Leaderboard." This is a brand-new evaluation…
Hugging Face Launches the NPHardEval Leaderboard: Revealing LLM Reasoning Abilities Through Computational Complexity and Dynamic Updates ★ 75
Hugging Face Blog · 907 days ago · Release
Hugging Face has announced the launch of the new **NPHardEval** leaderboard — a benchmark specifically designed to evaluate the reasoning capabilities of large…
Hugging Face Launches the "Enterprise Scenarios Leaderboard": An LLM Evaluation Benchmark Designed for Real-World Applications ★ 75
Hugging Face Blog · 909 days ago · Release
Hugging Face has partnered with Patronus AI — a startup focused on LLM evaluation and defense — to officially launch the **Enterprise Scenarios Leaderboard**…
Hugging Face Launches the "Hallucination Leaderboard": An Open-Source Effort to Quantify LLM Hallucination Rates ★ 75
Hugging Face Blog · 911 days ago · Release
While large language models (LLMs) have demonstrated remarkable generative capabilities across many domains, "hallucination" — where a model confidently…
Hugging Face Launches the AI Secure LLM Safety Leaderboard: In-Depth Evaluation of LLM Trustworthiness Based on the DecodingTrust Framework ★ 75
Hugging Face Blog · 914 days ago · Release
Introduction. Capability Is Not Safety: A New Benchmark for LLM Safety Evaluation. As large language models (LLMs) are adopted more deeply across…
What's Going On with the Open LLM Leaderboard? A Deep Dive into Evaluation Score Discrepancies ★ 75
Hugging Face Blog · 1,131 days ago · Commentary
Background: The Gap Between Leaderboard Scores and Paper Results. By mid-2023, Hugging Face's Open LLM Leaderboard had become the community's go-to platform…
