Superalignment Fast Grants
The authors of this article argue that superintelligence, meaning AI systems whose capabilities surpass human intelligence, could arrive within the next 10 years. While these systems could bring substantial benefits, they also pose significant risks. Today, AI systems are aligned for safety using reinforcement learning from human feedback (RLHF). Aligning future superhuman AI systems, however, will present new technical challenges, because such systems will have complex and creative abilities that humans cannot fully understand. For instance, if a superhuman model generates a million lines of extremely complicated code, humans will struggle to assess whether it is safe to execute. Existing alignment techniques like RLHF, which rely on human supervision, may therefore no longer be sufficient. The challenge is how to steer and trust AI systems that are much smarter than humans. The authors consider this one of the most critical technical problems in the world, but they believe it can be solved with a concerted effort, and that the machine learning research community and individual researchers can make major progress on it today. The "Superalignment" project aims to gather the best researchers and engineers worldwide to tackle this challenge, with a particular focus on welcoming new participants to the field.
Practices for Governing Agentic AI Systems
This white paper discusses agentic AI systems: AI systems that can pursue complex goals with limited direct supervision. While these systems could help people achieve their goals more efficiently, they also carry risks of harm. The paper proposes a definition of agentic AI systems and identifies the parties involved in their life cycle. It emphasizes the importance of agreeing on baseline responsibilities and safety best practices for each of these parties. The paper offers an initial set of practices for keeping agents' operations safe and accountable, intended as building blocks for agreed baseline best practices, and it lays out the questions and uncertainties involved in implementing them. The paper also highlights the indirect impacts of widespread adoption of agentic AI systems, which may require additional governance frameworks. Overall, the goal is to integrate agentic AI systems into society responsibly, maximizing their benefits while minimizing potential harms.
The Superalignment team has released its first paper, addressing the challenge of aligning superintelligent AI systems with human values. The team believes that superintelligence, meaning AI systems that surpass human intelligence, could be developed in the next decade, but that controlling and steering such systems remains an open problem. Current alignment methods rely on human supervision, yet future AI systems will be capable of complex and creative behaviors that humans cannot supervise effectively. For instance, superhuman models could write large amounts of code that is hard for humans to understand and that may pose risks. This presents a core challenge for AGI alignment: how can weaker supervisors, i.e., humans, trust and control significantly stronger models? The paper introduces a new research direction for empirically aligning superhuman models, with the aim of making future AI systems safe and beneficial to humanity.