Agent Foundations for Superintelligence-Robust Alignment

This course is a guide to the cluster of thought which expects that solving the alignment problem in a way that scales to superintelligence will require well-specified proposals guided by strong theoretical understanding, and to how one can proceed despite the difficulty this implies.

Course notes

Read at your own pace, letting any reading group you’re part of catch up if necessary. Each module should take much less than a week of dedicated reading, though actually absorbing and understanding the material may take longer. Use your favorite methods for this, whether that is taking notes, explaining it to a friend, or staring into space with an expression of dawning horror.

It's recommended that you do the readings roughly in the order presented. Particularly important readings are in bold font, while some modules also have optional bonus readings at the end.

Approach this not with an air of duty, but with curiosity. Let your gut guide you towards absorbing content which helps you grow, skip over content which you find unengaging, and don’t feel compelled to read everything in a strict order.

Module 0: AI Safety Basics

Skip this module if you're already familiar with the basics of AI safety. Otherwise choose one (or more) of the options below, depending on your preferred learning style.


Module 1: Why Expect Doom?

Why is the transition to superintelligence so fraught with danger?

The basic reasons I expect AGI ruin – Rob Bensinger [2023, 24 mins]
AGI Ruin: A List of Lethalities – Eliezer Yudkowsky [2022, 36 mins]
Cascades, Cycles, Insight… and …Recursion, Magic – Eliezer Yudkowsky [2008, 16 mins]
       • Alternative: AI Self Improvement - Computerphile – Rob Miles [2015, 11 mins]
A central AI alignment problem: capabilities generalization, and the sharp left turn – Nate Soares [2022, 12 mins]
What I mean by "alignment is in large part about making cognition aimable at all" – Nate Soares [2023, 3 mins]

Module 2: Agent Foundations Worldview

Why do some people think that building a mind which deeply values taking care of humanity requires a clear understanding of how agency works, along with foundations that can support a robust and aligned decision-making system?

Why Agent Foundations? An Overly Abstract Explanation – John Wentworth [2022, 9 mins]
AI Alignment: Why It's Hard, and Where to Start – Eliezer Yudkowsky [2016, 1:30 hrs video]
AGI ruin scenarios are likely (and disjunctive) – Nate Soares [2022, 8 mins]
Five theses, two lemmas, and a couple of strategic implications – Eliezer Yudkowsky [2013, 6 mins]

Optional bonus resources

Why Would AI Want to do Bad Things? Instrumental Convergence – Rob Miles [2018, 10 mins video]
Intelligence and Stupidity: The Orthogonality Thesis – Rob Miles [2018, 13 mins]
General AI Won't Want You To Fix its Code - Computerphile – Rob Miles [2017, 23 mins]

Module 3: Predictable Challenges

Some failure modes of sufficiently powerful systems kill you unless you avert them preemptively. To survive the transition to superintelligence, we need to see such failure modes coming before we can observe them via experiment. This module collects some of the threats which have been identified so far; there is no guarantee that humanity’s current knowledge covers every major class of danger.

Nearest unblocked strategy – Eliezer Yudkowsky [2015, 2 mins]
Goodhart's Curse – Eliezer Yudkowsky [2016, 13 mins]
Siren worlds and the perils of over-optimised search – Stuart Armstrong [2014, 8 mins]
Deep Deceptiveness – Nate Soares [2023, 17 mins]
Optimization Daemons – Eliezer Yudkowsky [2016, 3 mins]
       • Alternative: The Other AI Alignment Problem: Mesa-Optimizers and Inner Alignment – Rob Miles [2021, 23 mins]
Distant superintelligences can coerce the most probable environment of your AI – Eliezer Yudkowsky [2015, 3 mins]
Strong cognitive uncontainability – Eliezer Yudkowsky [2015, 3 mins]

Optional bonus resources

Embedded Agency – Scott Garrabrant, Abram Demski [2018, 1:05 hrs]
Context Disaster – Eliezer Yudkowsky [2015, 30 mins]
Risks from Learned Optimization – Evan Hubinger et al. [2019, 58 mins]

Module 4: Mindset

What mindset can we cultivate to stay vigilant in this difficult terrain?

AI safety mindset – Eliezer Yudkowsky [2015, 2 mins]
Method of Foreseeable Difficulties – Eliezer Yudkowsky [2015, 2 mins]
Security Mindset and Ordinary Paranoia – Eliezer Yudkowsky [2017, 35 mins]
Relevantly powerful agent – Eliezer Yudkowsky [2015, 2 mins]
Minimality Principle – Eliezer Yudkowsky [2017, 1 min]
       • Alternative perspective: My current outlook on AI risk mitigation – Tamsin Leake [2022, 14 mins]

Optional bonus resources

Worst-case thinking in AI alignment – Buck Shlegeris [2021, 8 mins]
Looking Deeper at Deconfusion – Adam Shimi [2021, 18 mins]
My research methodology – Paul Christiano [2021, 23 mins]

Module 5: Agendas & Approaches

2018 Update: Our New Research Directions – Nate Soares [2018, 41 mins]
Orthogonal's Formal-Goal Alignment theory of change – Tamsin Leake [2023, 5 mins]
The Plan – John Wentworth [2021, 17 mins]
Autonomous AGI – Eliezer Yudkowsky [2015, 1 min]
Pivotal act – Eliezer Yudkowsky [2015, 9 mins]
Corrigibility – Nate Soares [2015, 9 mins]
Formal alignment: what it is, and some proposals – Tamsin Leake [2023, 2 mins]

Optional bonus resources

From the last reading above, either study one proposal deeply or skim several [1-5 hrs]:
       • Question Answer Counterfactual Interval (QACI) [2023]
       • Physicalist Superimitation (previously called Pre-DCA) [2022]
       • Universal Alignment Test (UAT) [2022]
       • Metaethical AI [2019]
Agent Foundations for Aligning Machine Intelligence with Human Interests – Nate Soares, Benja Fallenstein [2017, 42 mins]
MIRI's Approach – Nate Soares [2015, 20 mins]
The Learning-Theoretic Agenda: Status 2023 – Vanessa Kosoy [2023, 66 mins]

Epilogue: Even more bonus resources

Still eager for more? Here are some assorted extras you can peruse.

Eliezer Yudkowsky – Why AI Will Kill Us, Aligning LLMs, Nature of Intelligence, SciFi, & Rationality – Dwarkesh Patel [2023]
Alignment Research Field Guide – Abram Demski [2019]
AI Safety Info (FAQ with hundreds of answers) – Rob Miles’s Team [updated 2023]
Stampy Chat (LLM summarizations of content from the Alignment Research Dataset) [updated 2023]
AI alignment on Arbital (wiki) – Eliezer Yudkowsky [2015]
"Why Not Just..." – John Wentsworth [2022]
Intro to Agent Foundations (Understanding Infra-Bayesianism Part 4) – Jack Parker [2022]
The idea with agent foundations – Eliezer Yudkowsky [2023, 1 min]

Other things worth knowing about

The central hub of the AI safety field. See especially the job board and the section on funding sources.

Helps new people navigate the AI Safety ecosystem, connect with like-minded people and find projects that are a good fit for their skills.

Provides an overview of all the key organizations, projects, and programs currently operating in the AI safety space.

List of all available training programs, conferences, hackathons and events.

Repository of AI safety communities – both local and online.

The standard 3-month-long introductory courses on alignment and governance.

Feedback?

Let us know your thoughts on the course and how you think we could improve it.

© Alignment Ecosystem Development. All rights reserved.
