Latest in AI

Showing: Benchmark, Researchers, Open-source

GLM-5.2: World's Top Open Frontend Coding Model + IndexShare Speculative Decoding
Latent Space · 41 days ago · Release
GLM-5.2 has claimed the leading position worldwide among open models on frontend coding benchmarks, marking a significant milestone for the open-source AI ecosystem. The release is accompanied by IndexShare, a new method targeting speculative decoding to improve inference throughput and reduce serving latency. Together, the two developments advance both capability and deployment efficiency for teams building with open models.
TTS Benchmark Revamped with Objective Standards and Blind Elo Voting (46 Models)
r/LocalLLaMA top day · 49 days ago · Benchmark
Reddit user UkieTechie has revamped their TTS benchmark platform with objective scoring standards and live blind voting, now covering 46 speech synthesis models. Hosted as a Hugging Face Space, the arena lets users vote on audio quality without knowing the model name, and the votes feed a dynamic Elo leaderboard. The project is open-source on GitHub and welcomes community submissions of new models.
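For readers unfamiliar with how blind pairwise votes can become a leaderboard, here is a minimal sketch of the standard Elo update. It is an illustration only: the post does not describe the platform's actual rating code, and the K-factor of 32, the 400-point scale, and the function names are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one blind vote.

    score_a is 1.0 if the voter preferred A, 0.0 if B, 0.5 for a tie.
    k (assumed here to be 32) controls how far a single vote moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the total rating across both models is conserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


# Two models start level; a voter prefers model A without knowing its name.
print(update_elo(1500.0, 1500.0, 1.0))
```

Because each vote only shifts ratings by the gap between the actual and expected outcome, an upset win over a higher-rated model moves the leaderboard more than an expected win does.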
Jetson Orin NX Build for Hermes Agent + Benchmarking
r/LocalLLaMA top day · 49 days ago · Hardware
The post describes turning an unused Jetson Orin NX into a compact local LLM server for Hermes Agent testing. The goals were low noise, over 10 tok/s generation, 300 tok/s prompt processing, at least 65K context, and a custom case. After testing Gemma 4, Qwen 3.6, and many quant variants, the author reports Gemma 4 26B A4B UD Q2_K_XL reaching 66K context and 10.21 tok/s near 60K context.
Omi Med STT v1: Open-Weight Medical ASR Fine-Tuned from Parakeet 0.6B ★ 72
r/LocalLLaMA top day · 49 days ago · Release
Omi Health’s founder says he fine-tuned NVIDIA Parakeet TDT 0.6B v2 for clinical speech and released Omi Med STT v1 under CC-BY-4.0. The runtime supports Mac, Windows, and Linux, auto-selecting MLX, NeMo, or GGUF/parakeet.cpp backends. In the author’s held-out medical benchmark, it reports 2.37% medical-WER and 145× realtime on local A10 compute.
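As background on the metric, here is a minimal sketch of conventional word error rate (WER): word-level edit distance divided by the number of reference words. The author's "medical-WER" may be defined differently (for example, restricted to medical terms), so this shows only the generic calculation. The function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Generic WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
print(word_error_rate("patient denies chest pain", "patient denies chess pain"))
```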
Hugging Face and IBM Jointly Launch Open Agent Leaderboard: A New Benchmark for Evaluating Open-Source AI Agent Performance ★ 80
Hugging Face Blog · 71 days ago · Release
Hugging Face and IBM Research have jointly announced the launch of the "Open Agent Leaderboard," aimed at establishing an objective, standardized, and fully…
Hugging Face Adds an Anti-Gaming Mechanism to the Open ASR Leaderboard, Using Private Test Data to Counter Benchmaxxers ★ 75
Hugging Face Blog · 83 days ago · Release
Hugging Face has recently made a major update to its popular Open ASR (Automatic Speech Recognition) leaderboard, aimed at combating the increasingly serious…
QIMMA ⛰: The First Quality-First Arabic Large Language Model (LLM) Leaderboard
Hugging Face Blog · 98 days ago · Release
The Technology Innovation Institute (TII) of the United Arab Emirates — the organization behind the well-known open-source model Falcon — has officially…
EVA: ServiceNow AI Launches a New Voice Agent Evaluation Framework ★ 75
Hugging Face Blog · 126 days ago · Release
With the proliferation of GPT-4o, Gemini Live, and various end-to-end voice models, Voice Agents have become an important frontier in AI applications. However…
IBM and UC Berkeley Launch IT-Bench and MAST: A New Benchmark and Framework for Diagnosing Why Enterprise AI Agents Fail ★ 80
Hugging Face Blog · 160 days ago · Release
The Pain Points of Enterprise AI Agents in Production: Why Do They Keep Failing? As large language models (LLMs) have rapidly advanced, enterprises have…
OpenEnv in Practice: Evaluating Tool-Using AI Agents in Real-World Environments ★ 75
Hugging Face Blog · 166 days ago · New Tool
As AI Agent (intelligent agent) technology advances rapidly, evaluating how these agents perform in the real world has become one of the greatest challenges…
Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic Large Language Models
Hugging Face Blog · 182 days ago · Release
As Arabic large language models (LLMs) develop rapidly, accurately evaluating model performance across different regional dialects has become a significant…
AssetOpsBench: A New Benchmark Bridging the Gap Between AI Agent Evaluation and Real-World Industrial Applications ★ 75
Hugging Face Blog · 188 days ago · Release
In today's era of rapid development in AI Agent technology, how to evaluate the performance of these Agents in real-world settings — particularly in industrial…
Google DeepMind Launches the FACTS Benchmark Suite: Systematically Evaluating the Factuality of Large Language Models ★ 80
Google DeepMind Blog · 231 days ago · Release
As large language models (LLMs) are deployed across a wide range of industries, ensuring the "factuality" of model outputs and reducing "hallucination" has…
Hugging Face Launches BigCodeArena: End-to-End Code LLM Evaluation Through Actual Code Execution ★ 75
Hugging Face Blog · 294 days ago · Release
Hugging Face and the BigCode community have jointly launched a new code model evaluation platform called "BigCodeArena." As AI-assisted coding (such as Copilot…
Hugging Face Launches Gaia2 and ARE: Empowering the Community to Study AI Agents in Depth ★ 85
Hugging Face Blog · 309 days ago · Release
AI agents are currently the hottest research direction in the AI field, but how to objectively, safely, and reproducibly evaluate agent capabilities has long…
TextQuests: How Well Do LLMs Really Perform in Text Adventure Games? Hugging Face Launches a New Evaluation Benchmark ★ 75
Hugging Face Blog · 350 days ago · Release
Hugging Face has recently introduced a new benchmark called "TextQuests," designed to evaluate the performance of large language models (LLMs) in text-based…
FilBench Released: Do Large Language Models Really Understand Filipino? A New Evaluation Benchmark Arrives
Hugging Face Blog · 350 days ago · Release
The Hugging Face team and community have collaborated to launch a new evaluation benchmark called "FilBench," aimed at answering a key question: do large…
TimeScope: A New Benchmark for Testing the Limits of Long-Video Understanding in Video Large Multimodal Models (Video LMMs) ★ 75
Hugging Face Blog · 370 days ago · Release
As large multimodal models (LMMs) have achieved breakthroughs in image and short-video understanding, the industry has gradually shifted its attention to the…
ScreenEnv: Deploy Your Full-Stack Desktop AI Agent Environment ★ 82
Hugging Face Blog · 383 days ago · New Tool
With the rise of Anthropic's Claude 3.5 Sonnet "Computer Use" and various GUI-oriented multimodal models, "desktop agents" have become one of the hottest areas…
Introducing HELMET: A Next-Generation Benchmark for Comprehensively Evaluating Long-Context Language Models (Long-context LLMs) ★ 80
Hugging Face Blog · 468 days ago · Release
Background and Pain Points: Moving Beyond the Overly Simple "Needle in a Haystack" Test. In recent years, the context window length supported by large…
Hugging Face Launches DABStep: A New Benchmark for Evaluating the Multi-Step Reasoning of Data Agents ★ 75
Hugging Face Blog · 539 days ago · Release
As large language model (LLM) technology has evolved, AI has transformed from a simple question-answering assistant into an "AI agent" capable of proactively…
Evaluating Audio Reasoning: Hugging Face Launches the Big Bench Audio Benchmark ★ 75
Hugging Face Blog · 585 days ago · Release
As multimodal large language models (such as GPT-4o, Gemini, and various open-source audio models) continue to proliferate, AI's ability to process audio has…
Rethinking Arabic LLM Evaluation: The AraGen Benchmark and 3C3H Evaluation Framework Arrive on Hugging Face
Hugging Face Blog · 601 days ago · Release
Background and Challenges: The Difficulty of Evaluating Non-English LLMs. In the current landscape of large language model (LLM) development, evaluating…
Letting Large Models Debate: The First Multilingual LLM Debate Competition ★ 75
Hugging Face Blog · 615 days ago · Release
This article from the Hugging Face blog introduces "The First Multilingual LLM Debate Competition." As large language models (LLMs) have rapidly advanced…
Hugging Face Launches the New "Open Japanese LLM Leaderboard" to Accelerate Japanese LLM Evaluation ★ 75
Hugging Face Blog · 615 days ago · New Tool
Hugging Face has officially launched the "Open Japanese LLM Leaderboard," a community-driven platform dedicated to evaluating the performance of…
Hugging Face and Atla Launch "Judge Arena": A New Benchmark for Evaluating LLMs as Judges ★ 80
Hugging Face Blog · 616 days ago · Release
As large language models (LLMs) have rapidly advanced, traditional static benchmarks (such as MMLU) have increasingly faced saturation and gaming problems. As…
Hugging Face Launches the Open FinLLM Leaderboard: An Open-Source Evaluation Benchmark Built for Financial LLMs ★ 75
Hugging Face Blog · 662 days ago · Release
Hugging Face has officially launched the "Open FinLLM Leaderboard" — a new platform dedicated to evaluating and tracking the performance of large language…
🇨🇿 BenCzechMark: Can Your LLM Understand Czech? A New Czech Benchmark Is Released
Hugging Face Blog · 665 days ago · Release
The Hugging Face team and its collaborators have jointly launched a new benchmark called "BenCzechMark," designed to evaluate the understanding and generation…
Hugging Face's Transformers Code Agent Sets a New Record on the GAIA Benchmark 🏅 ★ 80
Hugging Face Blog · 757 days ago · Release
The Hugging Face team published a blog post announcing that their Code Agent, developed using the `transformers` library, achieved a breakthrough score on the…
BigCodeBench: The Next-Generation Code LLM Benchmark and Successor to HumanEval ★ 80
Hugging Face Blog · 770 days ago · Release
As large language models (LLMs) have made tremendous strides in code generation, the long-standing industry gold standard — the HumanEval benchmark — has…
