Latest in AI

Frontier Post-Training Recipe Review with Finbarr Timbers
Interconnects (Nathan L.) · 42 days ago · Commentary
In the 18th installment of his interview series, Interconnects author Nathan Lambert speaks with Finbarr Timbers about the post-training techniques used at frontier AI labs. The conversation examines the methodologies — including supervised fine-tuning, reinforcement learning from human feedback, and preference optimization — that shape model behavior after pretraining. The discussion offers a practitioner's perspective on the evolving landscape of alignment and capability tuning at scale.
Import AI 461: 'Alignment Is Not on Track'; FrontierCode; and Synthetic Research Interns
Import AI (Jack Clark) · 43 days ago · Commentary
Import AI issue 461 covers three AI developments: a prominent claim that alignment research is falling behind capability advances, a new coding-focused tool or benchmark called FrontierCode, and emerging work on synthetic AI agents performing research-intern-level tasks. The issue's framing question — 'Where are your agents right now?' — reflects growing attention to autonomous AI deployment. Together, the stories illustrate a widening gap between AI capability and safety or governance.
Google DeepMind Studies Risks from Millions of Interacting AI Agents
MIT Tech Review AI · 47 days ago · Ethics
MIT Technology Review reports that Google DeepMind is funding research into the potential dangers of mass agent interaction online. The concern is that consumer-scale AI agents may soon act without direct human oversight and follow instructions from other agents. The article frames this as an emerging safety and alignment problem, focused less on one model and more on networked agent behavior.
System Card: Claude Fable 5 and Claude Mythos 5 ★ 82
Hacker News (AI keywords) · 48 days ago · Release
Anthropic has published system cards for its two newest flagship models, Claude Fable 5 and Claude Mythos 5, following its standard responsible-release practice. These documents cover dangerous capability evaluations, ASL safety-level determinations, red-teaming results, and alignment assessments under the company's Responsible Scaling Policy. They serve as primary references for safety researchers, enterprise buyers, regulators, and developers assessing model risk and deployment suitability.
Widening the conversation on frontier AI
Anthropic News · 50 days ago · Ethics
Anthropic says it has been holding dialogues with religious, philosophical, ethical, and cross-cultural groups about frontier AI. The work focuses on moral formation, Claude’s constitution, and what kind of character an AI system should exhibit under pressure. The company also describes an early experiment where Claude could call an ethical reminder tool during tasks, which reduced misaligned behavior in several internal evaluations.
Direct Preference Optimization Beyond Chatbots
Hugging Face Blog · 55 days ago · Tutorial
Based only on the title, this Hugging Face Blog post appears to discuss Direct Preference Optimization outside conventional chatbot use cases. It may frame DPO as a broader preference-alignment method for model outputs, workflows, or non-conversational AI systems. Without the full article, specific claims about experiments, datasets, models, or implementation details cannot be verified.
From Jailbreaking to Vibe Hacking: AI Security Shifts to "Psychocybersecurity"
INSIDE 硬塞 AI · 64 days ago · Ethics
AI security is shifting from technical jailbreaks to "Vibe Hacking," where attackers use social engineering and psychological tactics to manipulate an LLM's simulated persona. By exploiting the model's behavioral tendencies rather than code vulnerabilities, this trend establishes "psychocybersecurity" as a critical new frontier for AI alignment and safety.
Hackers are learning to exploit chatbot ‘personalities’ for security exploits ★ 72
The Verge AI · 65 days ago · Ethics
As AI chatbots adopt increasingly sophisticated personas, hackers are shifting from basic prompt injections to social engineering attacks targeting these "personalities." Researchers warn that manipulating a chatbot's defined role (e.g., customer service or empathetic companion) makes it easier to bypass safety guardrails. This evolution poses a significant threat to agentic AI workflows that rely on consistent role-playing and external data integration.
Import AI 457: An AI Stuxnet, the Mysterious Muon Optimizer, and Positive Alignment ★ 78
Import AI (Jack Clark) · 71 days ago · Commentary
This issue of Import AI 457, written by Jack Clark, delves into three forward-looking and stylistically distinct topics in the field of artificial…
Import AI 454: Automated Alignment Research, Safety Evaluation of Chinese AI Models, and the New 4-Bit Floating-Point Format HiFloat4 ★ 75
Import AI (Jack Clark) · 99 days ago · Commentary
In this issue of Import AI 454, written by Jack Clark, the author begins by posing a thought-provoking question about finance and sociology: "At what point…
Nathan Lambert's Latest Updates: ATOM Report, Post-Training Course, New Book, and Ongoing AI Research ★ 70
Interconnects (Nathan L.) · 104 days ago · Release
Nathan Lambert, a prominent AI expert, former Alignment Scientist at Hugging Face, and founder of the popular newsletter Interconnects, recently wrote about…
Import AI 453: Breaking AI Agents, MirrorCode, and Ten Views on "Gradual Disempowerment" ★ 75
Import AI (Jack Clark) · 106 days ago · Commentary
This issue of Import AI (Issue 453), written by Anthropic co-founder Jack Clark, centers on AI system safety, coding capabilities, and the future of humanity…
Hugging Face Releases TRL v1.0: An Open-Source Library Built for Post-Training, Marking a New Milestone in API Stability and Efficient Alignment ★ 85
Hugging Face Blog · 119 days ago · Release
Hugging Face has officially announced the release of TRL (Transformer Reinforcement Learning) v1.0. This is a major milestone, marking TRL's transformation…
Google DeepMind Publishes New Research: Guarding Against Harmful AI Manipulation Risks in Finance and Healthcare ★ 75
Google DeepMind Blog · 124 days ago · Release
Google DeepMind has recently published research findings on preventing harmful manipulation by AI. As large language models (LLMs) and AI Agents become…
Lossy Self-Improvement: Why AI Self-Improvement Is Real but Won't Lead to a "Fast Takeoff" ★ 75
Interconnects (Nathan L.) · 127 days ago · Opinion
This article takes a deep dive into one of the most contentious topics in artificial intelligence: AI "self-improvement" and whether it will trigger a "fast…
Google DeepMind Deepens Its Partnership with the UK AI Security Institute (UK AISI) ★ 75
Google DeepMind Blog · 229 days ago · Business
Google DeepMind has announced a deepened collaboration with the UK AI Security Institute (UK AISI), with both parties committing to joint work on critical AI…
Rethinking Agent Generalization: MiniMax M2 Asks "What Exactly Are We Aligning?" ★ 75
Hugging Face Blog · 271 days ago · Opinion
This article, published on the Hugging Face Blog, explores one of the most cutting-edge topics in the AI field today: the challenges of alignment and…
Democratizing AI Safety with RiskRubric.ai: Hugging Face Introduces a New Open-Source Risk Assessment Framework ★ 75
Hugging Face Blog · 313 days ago · New Tool
With the rapid proliferation of generative AI, AI safety has become a core concern that developers and enterprises can no longer ignore. However, traditional…
Hugging Face TRL Supports Vision-Language Model (VLM) Alignment: Multimodal DPO and ORPO Training Made Easy ★ 80
Hugging Face Blog · 355 days ago · Release
Hugging Face's TRL (Transformer Reinforcement Learning) is a popular open-source library specifically designed for aligning language models (LLMs). In its…
Hugging Face Community Launches an Open Preference Dataset for Text-to-Image Generation ★ 75
Hugging Face Blog · 596 days ago · Release
Introduction: An Important Piece of the Open-Source Image Generation Puzzle. As text-to-image (T2I) technology advances rapidly, ensuring that AI-generated…
Rethinking Arabic LLM Evaluation: The AraGen Benchmark and 3C3H Evaluation Framework Arrive on Hugging Face
Hugging Face Blog · 601 days ago · Release
Background and Challenges: The Difficulty of Evaluating Non-English LLMs. In the current landscape of large language model (LLM) development, evaluating…
Hugging Face and Atla Launch "Judge Arena": A New Benchmark for Evaluating LLMs as Judges ★ 80
Hugging Face Blog · 616 days ago · Release
As large language models (LLMs) have rapidly advanced, traditional static benchmarks (such as MMLU) have increasingly faced saturation and gaming problems. As…
A Guide to Preference Optimization for Vision-Language Models (VLMs): DPO Fine-Tuning with TRL ★ 75
Hugging Face Blog · 748 days ago · Tutorial
As vision-language models (VLMs) are increasingly applied to multimodal tasks, how to make these models produce outputs that better align with human…
Hugging Face's "Data Is Better Together" Community Data Collaboration Initiative: Retrospective and Outlook
Hugging Face Blog · 768 days ago · Release
Background: In the current development of large language models (LLMs), high-quality alignment data (such as the preference data required for RLHF and DPO)…
Hugging Face Introduces the RLOO Algorithm: Lower Memory Usage Brings Reinforcement Learning Back into the RLHF Mainstream ★ 80
Hugging Face Blog · 776 days ago · Release
In recent years, methods such as Direct Preference Optimization (DPO) have become mainstream for large language model (LLM) alignment, as they eliminate the…
Implementing Constitutional AI with Open-Source LLMs: Hugging Face's New Alignment Guide ★ 78
Hugging Face Blog · 908 days ago · Tutorial
This blog post from Hugging Face provides an in-depth exploration of how to implement "Constitutional AI (CAI)" using open-source large language models (Open…
Hugging Face Launches the AI Secure LLM Safety Leaderboard: In-Depth Evaluation of LLM Trustworthiness Based on the DecodingTrust Framework ★ 75
Hugging Face Blog · 914 days ago · Release
Introduction: Capability Is Not Safety. A New Benchmark for LLM Safety Evaluation. As large language models (LLMs) are adopted more deeply across…
Preference Tuning LLMs with Direct Preference Optimization (DPO) ★ 80
Hugging Face Blog · 922 days ago · Tutorial
This technical blog post from Hugging Face takes an in-depth look at the latest techniques in "preference tuning," with a particular focus on Direct…
A Deep Dive: The N Key Implementation Details of RLHF with PPO ★ 85
Hugging Face Blog · 1,008 days ago · Tutorial
This technical blog post from Hugging Face takes an in-depth look at the critical "implementation details" that are routinely glossed over in academic papers…
Fine-Tuning Llama 2 with DPO: A Hugging Face TRL Implementation Guide ★ 80
Hugging Face Blog · 1,085 days ago · Tutorial
Background and Pain Points: Traditional RLHF (Reinforcement Learning from Human Feedback), while achieving enormous success with models like ChatGPT…
