Latest in AI

Widening the conversation on frontier AI
Anthropic News · 50 days ago · Ethics
Anthropic says it has been holding dialogues with religious, philosophical, ethical, and cross-cultural groups about frontier AI. The work focuses on moral formation, Claude’s constitution, and what kind of character an AI system should exhibit under pressure. The company also describes an early experiment where Claude could call an ethical reminder tool during tasks, which reduced misaligned behavior in several internal evaluations.
Corey Quinn on Anthropic's Influence on the Pope's AI Ethics Encyclical
Simon Willison's Weblog · 63 days ago · Commentary
Cloud commentator Corey Quinn reacted to Anthropic co-founder Christopher Olah's influence on the Pope's new AI ethics encyclical, 'Magnifica Humanitas'. Quinn joked that getting the Pope to canonize a product's technical limitations as a spiritual treatise is the ultimate lobbying feat. The commentary highlights the surreal intersection of AI safety advocacy, corporate branding, and global religious authority.
From Jailbreaking to Vibe Hacking: AI Security Shifts to "Psychocybersecurity"
INSIDE 硬塞 AI · 64 days ago · Ethics
AI security is shifting from technical jailbreaks to "Vibe Hacking," where attackers use social engineering and psychological tactics to manipulate an LLM's simulated persona. By exploiting the model's behavioral tendencies rather than code vulnerabilities, this trend establishes "psychocybersecurity" as a critical new frontier for AI alignment and safety.
Hackers are learning to exploit chatbot ‘personalities’ for security exploits ★ 72
The Verge AI · 65 days ago · Ethics
As AI chatbots adopt increasingly sophisticated personas, hackers are shifting from basic prompt injections to social engineering attacks targeting these "personalities." Researchers warn that manipulating a chatbot's defined role (e.g., customer service or empathetic companion) makes it easier to bypass safety guardrails. This evolution poses a significant threat to agentic AI workflows that rely on consistent role-playing and external data integration.
Import AI 457: An AI Stuxnet, the Mysterious Muon Optimizer, and Positive Alignment ★ 78
Import AI (Jack Clark) · 71 days ago · Commentary
This issue of Import AI 457, written by Jack Clark, delves into three forward-looking and stylistically distinct topics in the field of artificial…
Import AI 454: Automated Alignment Research, Safety Evaluations of Chinese AI Models, and the New 4-Bit Floating-Point Format HiFloat4 ★ 75
Import AI (Jack Clark) · 99 days ago · Commentary
In this issue of Import AI 454, written by Jack Clark, the author begins by posing a thought-provoking question about finance and sociology: "At what point…
Import AI 453: Breaking AI Agents, MirrorCode, and Ten Perspectives on "Gradual Disempowerment" ★ 75
Import AI (Jack Clark) · 106 days ago · Commentary
This issue of Import AI (Issue 453), written by Anthropic co-founder Jack Clark, centers on AI system safety, coding capabilities, and the future of humanity…
Google DeepMind Publishes New Research on Guarding Against Harmful AI Manipulation in Finance and Healthcare ★ 75
Google DeepMind Blog · 124 days ago · Release
Google DeepMind has recently published research findings on preventing harmful manipulation by AI. As large language models (LLMs) and AI Agents become…
Lossy Self-Improvement: Why AI Self-Improvement Is Real but Won't Lead to a "Fast Takeoff" ★ 75
Interconnects (Nathan L.) · 127 days ago · Opinion
This article takes a deep dive into one of the most contentious topics in artificial intelligence: AI "self-improvement" and whether it will trigger a "fast…
Google DeepMind Deepens Its Partnership with the UK AI Security Institute (UK AISI) ★ 75
Google DeepMind Blog · 229 days ago · Business
Google DeepMind has announced a deepened collaboration with the UK AI Security Institute (UK AISI), with both parties committing to joint work on critical AI…
Democratizing AI Safety with RiskRubric.ai: Hugging Face Introduces a New Open-Source Risk Assessment Framework ★ 75
Hugging Face Blog · 313 days ago · New Tool
With the rapid proliferation of generative AI, AI safety has become a core concern that developers and enterprises can no longer ignore. However, traditional…
Hugging Face Community Releases an Open Preference Dataset for Text-to-Image Generation ★ 75
Hugging Face Blog · 596 days ago · Release
An important piece of the open-source image generation puzzle: as text-to-image (T2I) technology advances rapidly, ensuring that AI-generated…
Hugging Face's "Data Is Better Together" Community Data Collaboration Initiative: Looking Back and Looking Ahead
Hugging Face Blog · 768 days ago · Release
In the current development of large language models (LLMs), high-quality alignment data (such as the preference data required for RLHF and DPO)…
Hugging Face Launches the AI Secure LLM Safety Leaderboard: Evaluating the Trustworthiness of Large Models with the DecodingTrust Framework ★ 75
Hugging Face Blog · 914 days ago · Release
Capability is not safety: a new benchmark for LLM safety evaluation. As large language models (LLMs) are adopted more deeply across…
Can Base Models Label Data Like Humans? Hugging Face Explores the Feasibility of AI Labeling for RLHF ★ 75
Hugging Face Blog · 1,142 days ago · Commentary
In the development of large language models (LLMs), RLHF (Reinforcement Learning from Human Feedback) is the critical step for aligning models with human…
Red-Teaming Large Language Models ★ 75
Hugging Face Blog · 1,250 days ago · Tutorial
With the explosive growth of large language models (LLMs) such as ChatGPT, AI safety and ethics have become the most pressing concerns in the industry. This…
What Makes a Dialog Agent Useful? An In-Depth Analysis from Hugging Face ★ 75
Hugging Face Blog · 1,281 days ago · Opinion
Amid the generative AI wave sparked by ChatGPT, Hugging Face published this in-depth article exploring how to transform "base language models", which can only…