This project is housed at the Institute of Futures Research and concerns understanding and systematizing challenges to goal and value alignment in artificially intelligent (AI) systems when those systems exhibit nontrivial degrees of behavioral freedom, flexibility, and agency. Of particular concern are resilient failure modes: failure modes that are intractable to methodological or technological resolution, owing, for example, to fundamental conflicts in the underlying ethical theory, or to epistemic issues such as persistent ambiguity among the ethical theory, the empirical facts, and the world models and policies held by the AI.
I will also characterize a resilient failure mode that does not appear to have been addressed in the extant literature: misalignment incurred when reasoning and acting across shifting levels of abstraction. An intelligence whose outputs are aligned, via some mechanism, with respect to a given state space is not guaranteed to remain aligned when that state space expands, for instance through in-context learning or reasoning over metastatements. This project will motivate, clarify, and formalize this failure mode as it pertains to artificial intelligence systems.
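To make the intuition concrete, here is a minimal Python sketch of the claim (all state names, actions, and the notion of an "aligned reference" are hypothetical illustrations, not part of the intended formalism): a policy that passes an alignment check on every state it was verified against can fail the same check once the state space is expanded.

```python
# Toy sketch of alignment failing under state-space expansion. Everything
# here (state names, actions, the "aligned reference") is a hypothetical
# illustration, not a proposed formalism.

BASE_STATES = {"s0", "s1", "s2"}  # the state space the policy was verified on

# Stipulated aligned behavior; "a" is taken to be the intended safe default
# on states outside this table.
ALIGNED_ACTION = {"s0": "a", "s1": "a", "s2": "b"}

def policy(state: str) -> str:
    """Matches the aligned reference on every verified state, but falls back
    to a default that happens to be unsafe on states it has never seen."""
    return ALIGNED_ACTION.get(state, "exploit")

def is_aligned(states: set[str]) -> bool:
    """Alignment check: the policy agrees with the aligned reference
    (safe default "a" on novel states) everywhere in the given space."""
    return all(policy(s) == ALIGNED_ACTION.get(s, "a") for s in states)

print(is_aligned(BASE_STATES))         # True: verified over S
EXPANDED = BASE_STATES | {"meta(s0)"}  # S' strictly contains S, e.g. via a metastatement
print(is_aligned(EXPANDED))            # False: the guarantee does not transfer to S'
```

The point is only that an alignment property quantified over a state space S carries no guarantee over a strictly larger S′; the formal treatment of when and why such transfer fails is the subject of the project.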
Within the scope of this research project, I am conducting a review of the literature on artificial intelligence alignment methods and failure modes, epistemological challenges to goal and value alignment, impossibility theorems in population and utilitarian ethics, and the nature of agency as it pertains to artifacts. A nonexhaustive bibliography follows.
I would greatly appreciate feedback on this project and suggestions for further reading.
References
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, & Dan Mané. (2016). Concrete Problems in AI Safety. https://doi.org/10.48550/arXiv.1606.06565
Peter Eckersley. (2019). Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function). https://doi.org/10.48550/arXiv.1901.00064
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, & Stuart Russell. (2016). Cooperative Inverse Reinforcement Learning. https://doi.org/10.48550/arXiv.1606.03137
Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, & Shane Legg. (2017). AI Safety Gridworlds. https://doi.org/10.48550/arXiv.1711.09883
Scott McLean, Gemma J. M. Read, Jason Thompson, Chris Baber, Neville A. Stanton, & Paul M. Salmon. (2023). The risks associated with Artificial General Intelligence: A systematic review. Journal of Experimental & Theoretical Artificial Intelligence, 35(5), 649–663. https://doi.org/10.1080/0952813X.2021.1964003
Richard Ngo, Lawrence Chan, & Sören Mindermann. (2023). The alignment problem from a deep learning perspective. https://doi.org/10.48550/arXiv.2209.00626
Steve Petersen. (2017). Superintelligence as Superethical. In P. Lin, K. Abney, & R. Jenkins (Eds.), Robot Ethics 2.0: New Challenges in Philosophy, Law, and Society (pp. 322–337). New York, USA: Oxford University Press.
Max Reuter, & William Schulze. (2023). I’m Afraid I Can’t Do That: Predicting Prompt Refusal in Black-Box Generative Language Models. https://doi.org/10.48550/arXiv.2306.03423
Jonas Schuett, Noemi Dreksler, Markus Anderljung, David McCaffary, Lennart Heim, Emma Bluemke, & Ben Garfinkel. (2023). Towards best practices in AGI safety and governance: A survey of expert opinion. https://doi.org/10.48550/arXiv.2305.07153
Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, Nat McAleese, Nathalie Bradley-Schmieg, Nathaniel Wong, Nicolas Porcel, Roberta Raileanu, Steph Hughes-Fitt, Valentin Dalibard, & Wojciech Marian Czarnecki. (2021). Open-Ended Learning Leads to Generally Capable Agents. https://doi.org/10.48550/arXiv.2107.12808
Roman V. Yampolskiy. (2014). Utility function security in artificially intelligent agents. Journal of Experimental & Theoretical Artificial Intelligence, 26(3), 373–389. https://doi.org/10.1080/0952813X.2014.895114
