Agent Foundations for Superintelligence-Robust Alignment

This course is a guide to the cluster of thought which expects that solving the alignment problem in a way that scales to superintelligence will require well-specified proposals guided by strong theoretical understanding, and to how one can proceed despite the difficulty this implies.

Course notes

Read at your own pace, letting any reading group you’re part of catch up if necessary. Each module should take much less than a week of dedicated reading, though actually absorbing and understanding the material may take longer. Use your favorite methods for this, whether that is taking notes, explaining it to a friend, or staring into space with an expression of dawning horror.

It's recommended that you do the readings roughly in the order presented. Particularly important readings are in bold font, while some modules also have optional bonus readings at the end.

Approach this not with an air of duty, but with curiosity. Let your gut guide you towards absorbing content which helps you grow, skip over content which you find unengaging, and don’t feel compelled to read everything in a strict order.

Module 0: AI Safety Basics

Skip this module if you're already familiar with the basics of AI safety. Otherwise choose one (or more) of the options below, depending on your preferred learning style.


Module 1: Why Expect Doom?

Why is the transition to superintelligence so fraught with danger?

The basic reasons I expect AGI ruin – Rob Bensinger [2023, 24 mins]
AGI Ruin: A List of Lethalities – Eliezer Yudkowsky [2022, 36 mins]
Cascades, Cycles, Insight… and …Recursion, Magic – Eliezer Yudkowsky [2008, 16 mins]
       • Alternative: AI Self Improvement - Computerphile – Rob Miles [2015, 11 mins]
A central AI alignment problem: capabilities generalization, and the sharp left turn – Nate Soares [2022, 12 mins]
What I mean by "alignment is in large part about making cognition aimable at all" – Nate Soares [2023, 3 mins]

Module 2: Agent Foundations Worldview

Why do some people think that building a mind which deeply values taking care of humanity requires a clear understanding of how agency works, along with foundations that can support a robust and aligned decision-making system?

Why Agent Foundations? An Overly Abstract Explanation – John Wentworth [2022, 9 mins]
AI Alignment: Why It's Hard, and Where to Start – Eliezer Yudkowsky [2016, 1:30 hrs video]
AGI ruin scenarios are likely (and disjunctive) – Nate Soares [2022, 8 mins]
Five theses, two lemmas, and a couple of strategic implications – Eliezer Yudkowsky [2013, 6 mins]

Optional bonus resources

Why Would AI Want to do Bad Things? Instrumental Convergence – Rob Miles [2018, 10 mins video]
Intelligence and Stupidity: The Orthogonality Thesis – Rob Miles [2018, 13 mins]
General AI Won't Want You To Fix its Code - Computerphile – Rob Miles [2017, 23 mins]

Module 3: Predictable Challenges

Some failure modes of sufficiently powerful systems kill you unless you avert them preemptively. To survive the transition to superintelligence, we need to see such failure modes coming before we can observe them via experiment. This module collects some of the threats which have been identified so far; there is no guarantee that humanity’s current knowledge covers every major class of danger.

Nearest unblocked strategy – Eliezer Yudkowsky [2015, 2 mins]
Goodhart's Curse – Eliezer Yudkowsky [2016, 13 mins]
Siren worlds and the perils of over-optimised search – Stuart Armstrong [2014, 8 mins]
Deep Deceptiveness – Nate Soares [2023, 17 mins]
Optimization Daemons – Eliezer Yudkowsky [2016, 3 mins]
       • Alternative: The Other AI Alignment Problem: Mesa-Optimizers and Inner Alignment – Rob Miles [2021, 23 mins]
Distant superintelligences can coerce the most probable environment of your AI – Eliezer Yudkowsky [2015, 3 mins]
Strong cognitive uncontainability – Eliezer Yudkowsky [2015, 3 mins]

Optional bonus resources

Embedded Agency – Scott Garrabrant, Abram Demski [2018, 1:05 hrs]
Context Disaster – Eliezer Yudkowsky [2015, 30 mins]
Risks from Learned Optimization – Evan Hubinger et al. [2019, 58 mins]

Module 4: Mindset

What mindset can we cultivate to stay vigilant in this difficult terrain?

AI safety mindset – Eliezer Yudkowsky [2015, 2 mins]
Method of Foreseeable Difficulties – Eliezer Yudkowsky [2015, 2 mins]
Security Mindset and Ordinary Paranoia – Eliezer Yudkowsky [2017, 35 mins]
Relevantly powerful agent – Eliezer Yudkowsky [2015, 2 mins]
Minimality Principle – Eliezer Yudkowsky [2017, 1 min]
       • Alternative perspective: My current outlook on AI risk mitigation – Tamsin Leake [2022, 14 mins]

Optional bonus resources

Worst-case thinking in AI alignment – Buck Shlegeris [2021, 8 mins]
Looking Deeper at Deconfusion – Adam Shimi [2021, 18 mins]
My research methodology – Paul Christiano [2021, 23 mins]

Module 5: Agendas & Approaches

2018 Update: Our New Research Directions – Nate Soares [2018, 41 mins]
Orthogonal's Formal-Goal Alignment theory of change – Tamsin Leake [2023, 5 mins]
The Plan – John Wentworth [2021, 17 mins]
Autonomous AGI – Eliezer Yudkowsky [2015, 1 min]
Pivotal act – Eliezer Yudkowsky [2015, 9 mins]
Corrigibility – Nate Soares [2015, 9 mins]
Formal alignment: what it is, and some proposals – Tamsin Leake [2023, 2 mins]

Optional bonus resources

From the last reading above, either study one proposal deeply or skim several [1-5 hrs]:
       • Question Answer Counterfactual Interval (QACI) [2023]
       • Physicalist Superimitation (previously called Pre-DCA) [2022]
       • Universal Alignment Test (UAT) [2022]
       • Metaethical AI [2019]
Agent Foundations for Aligning Machine Intelligence with Human Interests – Nate Soares, Benja Fallenstein [2017, 42 mins]
MIRI's Approach – Nate Soares [2015, 20 mins]
The Learning-Theoretic Agenda: Status 2023 – Vanessa Kosoy [2023, 66 mins]

Epilogue: Even more bonus resources

Still eager for more? Here are some assorted extras you can peruse.

Eliezer Yudkowsky – Why AI Will Kill Us, Aligning LLMs, Nature of Intelligence, SciFi, & Rationality – Dwarkesh Patel [2023]
Alignment Research Field Guide – Abram Demski [2019]
AI Safety Info (FAQ with hundreds of answers) – Rob Miles’s Team [updated 2023]
Stampy Chat (LLM summarizations of content from the Alignment Research Dataset) [updated 2023]
AI alignment on Arbital (wiki) – Eliezer Yudkowsky [2015]
"Why Not Just..." – John Wentsworth [2022]
Intro to Agent Foundations (Understanding Infra-Bayesianism Part 4) – Jack Parker [2022]
The idea with agent foundations – Eliezer Yudkowsky [2023, 1 min]

Other things worth knowing about

The central hub of the AI safety field. See especially the job board and the section on funding sources.

Helps new people navigate the AI Safety ecosystem, connect with like-minded people and find projects that are a good fit for their skills.

Provides an overview of all the key organizations, projects, and programs currently operating in the AI safety space.

List of all available training programs, conferences, hackathons and events.

Repository of AI safety communities – both local and online.

The standard 3-month-long introductory courses on alignment and governance.

Feedback?

Let us know your thoughts on the course and how you think we could improve it.

© Alignment Ecosystem Development. All rights reserved.
