Latest in AI

Showing: benchmarking

Is It Agentic Enough? Benchmarking Open Models on Your Own Tooling
Hugging Face Blog · 40 days ago · Benchmark
Hugging Face published a guide examining whether open-weight models are sufficiently capable for agentic workflows when tested against custom tooling rather than standardized benchmarks. The piece challenges practitioners to move beyond generic leaderboard scores and assess agent performance in the context of their own use cases. It positions open models as viable candidates for production agentic pipelines, provided evaluation is grounded in realistic tool-use scenarios.
Lemonade v10.7 Adds Omni Models, Benchmarks, and Cross-Vendor GPU Support
r/LocalLLaMA top day · 47 days ago · Release
Lemonade v10.7 marks a project-level shift toward working-group-driven development, with 19 contributors involved in the release. The update improves LMX-Omni virtual models for Open WebUI and OpenAI-compatible multimedia clients, introduces the `lemonade bench` CLI, and expands backend support. CUDA, Vulkan, llama.cpp, stable-diffusion.cpp, FastFlowLM, and vLLM are part of the broader push toward cross-vendor local AI performance.
Real-Time LLM Inference on Standard GPUs at 3k Tokens/s per Request
Hacker News (AI keywords) · 60 days ago · Benchmark
The post’s title indicates a performance claim for real-time LLM inference on standard GPUs, reporting 3,000 tokens per second per request. No article body is available, so the underlying model, GPU type, batch size, latency profile, precision, serving stack, and benchmark method are not stated. The item is best treated as an inference-performance benchmark claim rather than a verified deployment guide.
ITBench-AA: Frontier Models Score Below 50% on Enterprise IT Tasks ★ 72
Hugging Face Blog · 61 days ago · Benchmark
Artificial Analysis and IBM present ITBench-AA, described in the title as the first benchmark for agentic enterprise IT tasks. The headline result is that frontier models score below 50%, suggesting current systems still struggle with enterprise-grade agent workflows. The original article text is unavailable here, so task design, evaluated models, scoring methodology, and rankings cannot be confirmed.
Import AI 447: AGI Economics, Testing AI with Generative Games, and the Rise of Agent Ecology ★ 75
Import AI (Jack Clark) · 148 days ago · Opinion
In this edition of Import AI 447, Jack Clark takes readers on a deep exploration of the social and technological transformations that artificial general…
An Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator ★ 70
Hugging Face Blog · 223 days ago · Tutorial
As large language models (LLMs) develop in two divergent directions — with extremely large cloud-based models at one end and lightweight "Nano"-scale models…
Rethinking How to Measure AI Intelligence: Google DeepMind Launches Game Arena, an Open-Source Evaluation Platform ★ 78
Google DeepMind Blog · 277 days ago · New Tool
With the rapid advancement of artificial intelligence, traditional static benchmarks (such as MMLU and GSM8K) are facing serious challenges. Many frontier…
Benchmarking Language Model Performance on 5th Gen Intel Xeon Processors (C4 Instances) on GCP
Hugging Face Blog · 588 days ago · Commentary
This technical blog post from Hugging Face provides a detailed benchmark of running large language models (LLMs) on Google Cloud Platform's (GCP) new C4…
Benchmarking Text Generation Inference (TGI): How to Quantify and Optimize LLM Inference Performance ★ 75
Hugging Face Blog · 790 days ago · Tutorial
This official Hugging Face blog post takes an in-depth look at how to benchmark Text Generation Inference (TGI), Hugging Face's open-source LLM inference and…
Hugging Face Launches the Open Arabic LLM Leaderboard to Accelerate Evaluation and Development of Arabic LLMs
Hugging Face Blog · 805 days ago · Release
Hugging Face has announced the launch of the "Open Arabic LLM Leaderboard," an important initiative aimed at advancing Arabic natural language processing (NLP)…
Hugging Face Partners with Artificial Analysis to Launch an LLM Performance and Cost Leaderboard ★ 75
Hugging Face Blog · 816 days ago · New Tool
Hugging Face has announced a partnership with the independent AI performance analytics firm Artificial Analysis, officially integrating its "LLM Performance…
Benchmarking Llama 2 Deployment Performance on Amazon SageMaker
Hugging Face Blog · 1,036 days ago · Tutorial
This Hugging Face blog post presents detailed performance benchmarks for deploying Meta's open-source large language models — Llama 2 (covering 7B, 13B, and…
What's Going On with the Open LLM Leaderboard? A Deep Dive into Evaluation Score Discrepancies ★ 75
Hugging Face Blog · 1,131 days ago · Commentary
Background: the gap between leaderboard scores and paper results. By mid-2023, Hugging Face's Open LLM Leaderboard had become the community's go-to platform…