Latest in AI

Showing: llm-evaluation

OpenRouter Royale: Last Agent Standing — Claude or Grok?
Hacker News (AI keywords) | 40 days ago | Benchmark
OpenRouter's 'Royale: Last Agent Standing' frames AI model selection as a high-stakes elimination contest for autonomous agents. The post provocatively asks which model — Claude or Grok — you would trust when an AI agent is acting in the real world on your behalf. It positions agentic model choice as a critical, consequential decision rather than a casual preference.
Shall We Play a Game? LLMs Use Tactical Nukes in 95% of Simulations
Hacker News (AI keywords) | 46 days ago | Commentary
The available source metadata points to a provocative post about LLM behavior in simulated conflict scenarios. Based only on the title, the central claim is that language models used tactical nuclear weapons in 95% of simulations. Without the article body, the methodology, models tested, prompt design, controls, and validity of the result cannot be assessed.
Rails testing on autopilot: Building an agent that writes what developers won't
Mistral AI News | 50 days ago | Tutorial
Mistral AI describes an autonomous Rails testing agent built on its open-source Vibe coding assistant. The agent reads Rails files, applies file-type-specific skills, generates or improves RSpec tests, and validates them with RuboCop, RSpec, and SimpleCov. In a 275-file experiment, it reached 100% passing tests, 100% average line coverage, zero RuboCop violations, and a higher LLM-as-a-judge score, while stressing that generated tests must actually run.
If LLMs Have Human-Like Attributes, Then So Does Age of Empires II
Hacker News (AI keywords) | 50 days ago | Paper
The paper argues that claims about LLMs having human-like attributes, such as morality or language understanding, can be methodologically fragile. By building and training a simple neural network on Age of Empires II, the author suggests such attributes may not be empirically unique to LLMs. The key recommendation is to define explicit measurement criteria and use a null assumption of LLM non-uniqueness before drawing anthropomorphic conclusions.
Nathan Lambert's Latest Updates: ATOM Report, Post-Training Course, New Book, and Ongoing AI Research ★ 70
Interconnects (Nathan L.) | 104 days ago | Release
Nathan Lambert, a prominent AI expert, former Alignment Scientist at Hugging Face, and founder of the popular newsletter Interconnects, recently wrote about…
GPT 5.4 Is a Big Step for Codex (but Why the Author Still Chooses Claude) ★ 80
Interconnects (Nathan L.) | 131 days ago | Commentary
In this article from the well-known AI commentary blog Interconnects, author Nathan L. analyzes GPT 5.4, focusing specifically on the significant changes it…
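Open LLM Leaderboard Carbon Emissions and Model Performance Analysis: Lessons on the Trade-off Between Performance and Environmental Impact
Hugging Face Blog | 565 days ago | Commentary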
Hugging Face recently published an in-depth analysis of its well-known Open LLM Leaderboard, examining the carbon dioxide (CO₂) emissions generated during…
BigCodeBench: The Next-Generation Code LLM Benchmark and Successor to HumanEval ★ 80
Hugging Face Blog | 770 days ago | Release
As large language models (LLMs) have made tremendous strides in code generation, the long-standing industry gold standard — the HumanEval benchmark — has…
Using Structured Generation to Improve Prompt Consistency and Output Evaluation ★ 75
Hugging Face Blog | 819 days ago | Tutorial
When developing applications based on large language models (LLMs) — such as AI agents, RAG systems, or automated workflows — one of the biggest challenges…
Hugging Face Launches a Red-Teaming Resistance Leaderboard: Evaluating LLMs' Ability to Withstand Malicious Jailbreaks and Adversarial Attacks ★ 75
Hugging Face Blog | 886 days ago | Release
Background: The Shortcomings of Static Safety Evaluations. As large language models (LLMs) are widely adopted across industries, AI safety has become an…
Open LLM Leaderboard: A Deep Dive into the DROP Benchmark and the Model "Leaderboard Gaming" Phenomenon ★ 75
Hugging Face Blog | 970 days ago | Commentary
The Hugging Face Open LLM Leaderboard has long served as an important benchmark for the community to evaluate the capabilities of open-source models. However…