Latest in AI

Shall We Play a Game? LLMs Use Tactical Nukes in 95% of Simulations
Hacker News (AI keywords) | 46 days ago | Commentary
The available source metadata points to a provocative post about LLM behavior in simulated conflict scenarios. Based only on the title, the central claim is that language models used tactical nuclear weapons in 95% of simulations. Without the article body, the methodology, models tested, prompt design, controls, and validity of the result cannot be assessed.
If LLMs Have Human-Like Attributes, Then So Does Age of Empires II
Hacker News (AI keywords) | 50 days ago | Paper
The paper argues that claims about LLMs having human-like attributes, such as morality or language understanding, can be methodologically fragile. By building and training a simple neural network on Age of Empires II, the author suggests such attributes may not be empirically unique to LLMs. The key recommendation is to define explicit measurement criteria and use a null assumption of LLM non-uniqueness before drawing anthropomorphic conclusions.
GPT 5.4 Is a Big Step for Codex (But Why the Author Still Chooses Claude) ★ 80
Interconnects (Nathan L.) | 132 days ago | Commentary
In this article from the well-known AI commentary blog Interconnects, author Nathan L. analyzes GPT 5.4, focusing specifically on the significant changes it…
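Open LLM Leaderboard Carbon Emissions and Model Performance Analysis: Lessons on the Trade-off Between Performance and Environmental Impact
Hugging Face Blog | 565 days ago | Commentary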
Hugging Face recently published an in-depth analysis of its well-known Open LLM Leaderboard, examining the carbon dioxide (CO₂) emissions generated during…
Hugging Face Launches a Red-Teaming Resistance Leaderboard: Evaluating LLMs' Ability to Withstand Malicious Jailbreaks and Adversarial Attacks ★ 75
Hugging Face Blog | 886 days ago | Release
Background: The Shortcomings of Static Safety Evaluations. As large language models (LLMs) are widely adopted across industries, AI safety has become an…