Superalignment Fast Grants
The arrival of superintelligence, meaning AI systems with vast capabilities, is believed to be possible within the next decade. While these systems could be highly beneficial, they also carry significant risks. Today, AI systems are aligned and made safe using reinforcement learning from human feedback (RLHF). Aligning future superhuman AI systems, however, is a fundamentally different challenge. Such systems will exhibit complex and creative behaviors that humans cannot fully comprehend. Evaluating their safety becomes extremely difficult: for instance, a human reviewer would have to judge whether a million lines of intricate code generated by a superhuman model are safe to execute. Existing alignment techniques that rely on human oversight may no longer be effective. This raises the crucial question of how humans can steer and trust AI systems that are much smarter than they are. Although this is one of the world's most important unsolved technical problems, it is believed to be solvable with a concentrated effort. The Superalignment project aims to bring the best researchers and engineers together to tackle this challenge and to encourage new people to join the field.
Practices for Governing Agentic AI Systems
This white paper focuses on agentic AI systems: AI systems that can independently pursue complex goals with limited supervision. While these systems could help people achieve goals more efficiently, they also pose risks. The authors propose a definition of agentic AI systems and identify the parties involved in the agentic AI system life-cycle. They emphasize the need to establish baseline responsibilities and safety best practices for each of these parties. The paper offers an initial set of practices for keeping agentic AI systems safe and accountable. However, the authors acknowledge that there are uncertainties and open questions to address before these practices can be fully implemented. The paper also highlights the potential indirect impacts of widespread adoption of agentic AI systems, which may require additional governance frameworks. Overall, the paper aims to promote the responsible integration of agentic AI systems into society.
The Superalignment team has released its first paper, outlining a new research direction for aligning superhuman artificial intelligence (AI) models. Current AI alignment methods rely on human supervision, but future AI systems may exhibit complex and creative behaviors that humans cannot supervise effectively. These advanced AI models could generate millions of lines of code, some of which may be dangerous and difficult for even experts to understand. The core challenge in aligning artificial general intelligence (AGI) is how weak supervisors, humans in this case, can trust and control significantly stronger AI models. The Superalignment team addresses this problem by proposing a new approach to aligning superhuman models empirically. The paper represents a step toward ensuring that future AI systems are safe and beneficial to humanity, and it highlights the urgency of solving the problem of superintelligence alignment within the next decade.