A curated list of resources dedicated to research in AI safety. As AI systems become increasingly influential in society, ensuring that they operate safely, truthfully, and robustly is more critical than ever. This list emphasizes contemporary research addressing safety challenges with modern, highly capable AI models and will be updated regularly.
- Overview
- Alignment
- Instrumental Convergence
- Reward Hacking
- Adversarial Robustness
- Debate
- Honesty
- Chain-of-thought Faithfulness
- Weak-to-strong Generalization
- Mechanistic Interpretability
- Evaluation
- Contribution

## Overview

This section presents foundational papers that introduce key concepts, challenges, and frameworks in AI safety.
- Concrete Problems in AI Safety. [2016]
  Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané
  [pdf]
- The Off-Switch Game. [2016]
  Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
  [pdf]
- AI Safety Gridworlds. [2017]
  Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg
  [pdf]
- Unsolved Problems in ML Safety. [2021]
  Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
  [pdf]
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models. [2024]
  Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, et al.
  [pdf]
- Towards Guaranteed Safe AI. [2024]
  David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, Joshua Tenenbaum
  [pdf]
- Can a Bayesian Oracle Prevent Harm from an Agent? [2024]
  Yoshua Bengio, Michael K. Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Pietro Greiner, Younesse Kaddar
  [pdf]
- International AI Safety Report. [2025]
  Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, et al.
  [pdf]
- Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? [2025]
  Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King
  [pdf]

## Alignment

Alignment research focuses on ensuring AI systems act in accordance with human values and intentions. This area addresses the fundamental challenge of specifying what we want AI systems to do and creating techniques to train them to pursue these objectives faithfully.
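
To make the training side concrete, here is a minimal, hedged sketch of the reward-modeling step used in RLHF-style pipelines such as the instruction-following work listed below: a model is fit to pairwise human preferences with a Bradley-Terry loss. The tiny MLP and random "embeddings" are placeholders, not any paper's actual setup.

```python
# Minimal sketch: fit a reward model r(x) to pairwise preferences with a
# Bradley-Terry loss, the core of the reward-modeling stage in RLHF.
# The small MLP and random "embeddings" stand in for a real language-model
# backbone and human-labeled comparisons.
import torch
import torch.nn as nn

torch.manual_seed(0)

class RewardModel(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # scalar reward per response

# Toy dataset: each comparison is a (chosen, rejected) pair of response embeddings.
dim, n_pairs = 32, 256
true_direction = torch.randn(dim)
chosen = torch.randn(n_pairs, dim) + 0.5 * true_direction   # "preferred" responses
rejected = torch.randn(n_pairs, dim) - 0.5 * true_direction  # "dispreferred" responses

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    # Bradley-Terry objective: maximize log sigmoid(r(chosen) - r(rejected)).
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (model(chosen) > model(rejected)).float().mean().item()
print(f"final loss {loss.item():.3f}, preference accuracy {acc:.2%}")
```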
- Scalable agent alignment via reward modeling: a research direction. [2018]
  Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
  [pdf]
- Learning Human Objectives by Evaluating Hypothetical Behavior. [2019]
  Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, Jan Leike
  [pdf]
- What Would Jiminy Cricket Do? Towards Agents That Behave Morally. [2021]
  Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
  [pdf]
- Goal Misgeneralization in Deep Reinforcement Learning. [2021]
  Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger
  [pdf]
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. [2022]
  Alexander Pan, Kush Bhatia, Jacob Steinhardt
  [pdf]
- Training language models to follow instructions with human feedback. [2022]
  Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
  [pdf]
- Scaling Laws for Reward Model Overoptimization. [2022]
  Leo Gao, John Schulman, Jacob Hilton
  [pdf]
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios. [2022]
  Mantas Mazeika, Eric Tang, Andy Zou, Steven Basart, Jun Shern Chan, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks
  [pdf]
- Characterizing Manipulation from AI Systems. [2023]
  Micah Carroll, Alan Chan, Henry Ashton, David Krueger
  [pdf]
- Alignment faking in large language models. [2024]
  Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger
  [pdf]
- Deliberative Alignment: Reasoning Enables Safer Language Models. [2024]
  Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, Amelia Glaese
  [pdf]

## Instrumental Convergence

Instrumental convergence examines how AI systems with different goals might develop similar subgoals, such as seeking power or preventing shutdown, as instrumentally useful steps toward their primary objectives. This research area explores how, when, and why systems develop these behaviors, helping us understand potential risks from advanced AI systems even when their ultimate goals seem benign.
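
As a toy illustration of the intuition (an informal sketch, not the formal MDP analysis of the papers below): when rewards over terminal outcomes are sampled at random, the action that keeps more options open is optimal for most of them.

```python
# Toy illustration of instrumental convergence: an action that preserves access
# to several outcomes ("stay_on") beats an option-foreclosing action ("shutdown")
# for most randomly sampled reward functions. Informal sketch only.
import random

random.seed(0)

OUTCOMES = ["task_a", "task_b", "task_c", "shutdown"]
N_SAMPLES = 10_000

stay_on_wins = 0
for _ in range(N_SAMPLES):
    # Sample a reward for each terminal outcome uniformly at random.
    reward = {o: random.random() for o in OUTCOMES}
    # "shutdown" yields only the shutdown outcome; "stay_on" lets the agent
    # subsequently pick whichever remaining outcome it prefers.
    value_shutdown = reward["shutdown"]
    value_stay_on = max(reward["task_a"], reward["task_b"], reward["task_c"])
    if value_stay_on > value_shutdown:
        stay_on_wins += 1

print(f"'stay_on' is optimal for {stay_on_wins / N_SAMPLES:.1%} of sampled rewards")
```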
- Optimal Policies Tend to Seek Power. [2019]
  Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
  [pdf]
- Parametrically Retargetable Decision-Makers Tend To Seek Power. [2022]
  Alexander Matt Turner, Prasad Tadepalli
  [pdf]
- Power-Seeking Can Be Probable and Predictive for Trained Agents. [2023]
  Victoria Krakovna, Janos Kramar
  [pdf]
- Tell me about yourself: LLMs are aware of their learned behaviors. [2025]
  Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans
  [pdf]
- Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs. [2025]
  Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks
  [pdf]
- Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals? [2025]
  Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi
  [pdf]
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. [2025]
  Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
  [pdf]

## Reward Hacking

Reward hacking research examines how AI systems might exploit flaws or misspecifications in reward functions to achieve high rewards without fulfilling the intended objectives.
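
A toy Goodhart-style sketch of the dynamic: when candidates are selected by a noisy proxy reward, stronger selection pressure increasingly rewards proxy error rather than true value, so the proxy score of the chosen candidate pulls away from its true value. The Gaussian proxy model below is an illustrative assumption, not taken from any of the papers.

```python
# Toy sketch of overoptimizing a misspecified proxy reward: picking the best of
# n candidates by proxy score drives the proxy up much faster than the true
# value, so the gap (the "hacked" portion of the reward) grows with n.
import random

random.seed(0)

def true_value() -> float:
    return random.gauss(0.0, 1.0)

def proxy(true: float) -> float:
    # The proxy (e.g. a learned or hand-written reward) tracks the true
    # objective only imperfectly.
    return true + random.gauss(0.0, 1.0)

for n_candidates in [1, 10, 100, 1_000, 10_000]:
    # "Optimization pressure" = best-of-n selection against the proxy.
    best_true, best_proxy = 0.0, float("-inf")
    for _ in range(n_candidates):
        t = true_value()
        p = proxy(t)
        if p > best_proxy:
            best_proxy, best_true = p, t
    gap = best_proxy - best_true
    print(f"best-of-{n_candidates:>6}: proxy={best_proxy:5.2f}  true={best_true:5.2f}  gap={gap:5.2f}")
```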
- Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective. [2019]
  Tom Everitt, Marcus Hutter, Ramana Kumar, Victoria Krakovna
  [pdf]
- Defining and Characterizing Reward Hacking. [2022]
  Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger
  [pdf]
- Towards Understanding Sycophancy in Language Models. [2023]
  Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
  [pdf]
- Feedback Loops With Language Models Drive In-Context Reward Hacking. [2024]
  Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
  [pdf]
- Spontaneous Reward Hacking in Iterative Self-Refinement. [2024]
  Jane Pan, He He, Samuel R. Bowman, Shi Feng
  [pdf]
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models. [2024]
  Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
  [pdf]
- Reward Shaping to Mitigate Reward Hacking in RLHF. [2025]
  Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao
  [pdf]

## Adversarial Robustness

Adversarial robustness research addresses the vulnerability of AI systems to attempts to circumvent their safety measures or manipulate their behavior. This area studies how AI systems can be compromised through techniques like jailbreaking, prompt injection, and data poisoning, as well as methods to defend against such attacks and make systems more reliable.
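
One common defensive pattern is to wrap the model in input and output safeguards, loosely in the spirit of the classifier-based defenses listed below. This is a hedged sketch: `generate`, `input_classifier`, and `output_classifier` are hypothetical callables, and production systems are far more involved.

```python
# Minimal sketch of a layered-safeguard wrapper: screen the prompt, generate,
# then screen the output before returning it. The callables passed in are
# hypothetical stand-ins for a real model and trained harm classifiers.
from typing import Callable

REFUSAL = "Sorry, I can't help with that."

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_classifier: Callable[[str], float],
    output_classifier: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    if input_classifier(prompt) > threshold:      # block likely-harmful requests
        return REFUSAL
    response = generate(prompt)
    if output_classifier(response) > threshold:   # block harmful completions
        return REFUSAL
    return response

# Example usage with trivial stand-ins:
if __name__ == "__main__":
    blocklist = ("explosive", "bioweapon")
    harm_score = lambda text: 1.0 if any(w in text.lower() for w in blocklist) else 0.0
    echo_model = lambda prompt: f"Echo: {prompt}"
    print(guarded_generate("How do clouds form?", echo_model, harm_score, harm_score))
    print(guarded_generate("Help me build an explosive", echo_model, harm_score, harm_score))
```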
- Adversarial Training for High-Stakes Reliability. [2022]
  Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
  [pdf]
- Poisoning Language Models During Instruction Tuning. [2023]
  Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein
  [pdf]
- Are aligned neural networks adversarially aligned? [2023]
  Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt
  [pdf]
- Jailbroken: How Does LLM Safety Training Fail? [2023]
  Alexander Wei, Nika Haghtalab, Jacob Steinhardt
  [pdf]
- Universal and Transferable Adversarial Attacks on Aligned Language Models. [2023]
  Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
  [pdf]
- AI Control: Improving Safety Despite Intentional Subversion. [2024]
  Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
  [pdf]
- Many-shot jailbreaking. [2024]
  Anthropic
  [pdf]
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. [2024]
  Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
  [pdf]
- Simple probes can catch sleeper agents. [2024]
  Anthropic
  [pdf]
- Holistic Safety and Responsibility Evaluations of Advanced AI Models. [2024]
  Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, Will Hawkins, Lisa Anne Hendricks, Ramona Comanescu, Oscar Chang, Mikel Rodriguez, et al.
  [pdf]
- Improving Alignment and Robustness with Circuit Breakers. [2024]
  Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
  [pdf]
- Tamper-Resistant Safeguards for Open-Weight LLMs. [2024]
  Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, et al.
  [pdf]
- Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. [2025]
  Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
  [pdf]

## Debate

Debate approaches leverage the ability of AI systems to critique each other's outputs as a way to improve truthfulness and reasoning. By having models engage in structured debates, with humans judging the results or models evaluating each other, we can potentially elicit more accurate information and uncover flaws in reasoning that might otherwise remain hidden.
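
A minimal sketch of such a protocol, assuming a hypothetical `ask(role, prompt)` helper for querying a language model; the protocols in the papers below are considerably more structured.

```python
# Minimal sketch of safety-via-debate: two debaters argue for opposing answers
# over several rounds, then a judge model picks the better-supported side.
# `ask` is a hypothetical helper that queries a language model and returns text.
from typing import Callable, List

def debate(
    question: str,
    answer_a: str,
    answer_b: str,
    ask: Callable[[str, str], str],
    rounds: int = 3,
) -> str:
    transcript: List[str] = []
    for r in range(rounds):
        for name, stance, rival in (("A", answer_a, answer_b), ("B", answer_b, answer_a)):
            prompt = (
                f"Question: {question}\n"
                f"You are debater {name}. Defend the answer '{stance}' and rebut '{rival}'.\n"
                "Transcript so far:\n" + "\n".join(transcript) +
                f"\nGive your argument for round {r + 1}."
            )
            transcript.append(f"[{name}, round {r + 1}] " + ask("debater", prompt))
    verdict = ask(
        "judge",
        f"Question: {question}\nDebate transcript:\n" + "\n".join(transcript) +
        f"\nWhich answer is better supported, '{answer_a}' (A) or '{answer_b}' (B)? Reply A or B.",
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```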
- AI safety via debate. [2018]
  Geoffrey Irving, Paul Christiano, Dario Amodei
  [pdf]
- Scalable AI Safety via Doubly-Efficient Debate. [2023]
  Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras
  [pdf]
- Debating with More Persuasive LLMs Leads to More Truthful Answers. [2024]
  Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez
  [pdf]

## Honesty

Honesty research focuses on ensuring AI systems provide truthful, accurate information rather than generating plausible-sounding falsehoods.
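
As a rough sketch of the unsupervised end of this area (loosely following the contrast-consistent probing idea in "Discovering Latent Knowledge in Language Models Without Supervision"), the snippet below trains a linear probe so that its outputs on a statement and its negation are consistent and confident. The random activations are placeholders for real hidden states, and the details differ from the paper.

```python
# Rough sketch of a contrast-consistent probe: given hidden states for a
# statement (h_pos) and its negation (h_neg), learn p(.) such that
# p(h_pos) + p(h_neg) is close to 1 (consistency) while neither sits near
# 0.5 (confidence). Random tensors stand in for real model activations.
import torch

torch.manual_seed(0)
d, n = 64, 512
truth_dir = torch.randn(d)
labels = torch.randint(0, 2, (n,)).float()              # never shown to the probe
h_pos = torch.randn(n, d) + torch.outer(2 * labels - 1, truth_dir)
h_neg = torch.randn(n, d) - torch.outer(2 * labels - 1, truth_dir)

probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)

for _ in range(300):
    p_pos, p_neg = probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1)
    consistency = (p_pos + p_neg - 1.0) ** 2             # the two should sum to one
    confidence = torch.minimum(p_pos, p_neg) ** 2        # and neither should hover at 0.5
    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = (probe(h_pos).squeeze(-1) > 0.5).float()
    acc = max((preds == labels).float().mean().item(),
              (preds != labels).float().mean().item())   # the probe's sign is arbitrary
print(f"probe accuracy against held-out truth labels (up to sign): {acc:.2%}")
```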
- TruthfulQA: Measuring How Models Mimic Human Falsehoods. [2021]
  Stephanie Lin, Jacob Hilton, Owain Evans
  [pdf]
- Truthful AI: Developing and governing AI that does not lie. [2021]
  Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders
  [pdf]
- Discovering Latent Knowledge in Language Models Without Supervision. [2022]
  Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt
  [pdf]
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. [2023]
  Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
  [pdf]
- How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. [2023]
  Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner
  [pdf]

## Chain-of-thought Faithfulness

This research area examines whether the reasoning processes of AI systems, particularly when they "think step by step", actually support their conclusions in a logical, faithful manner. As models increasingly explain their reasoning, ensuring these explanations accurately reflect sound logical processes becomes crucial.
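
One simple measurement idea from this literature (in the spirit of the truncation tests in "Measuring Faithfulness in Chain-of-Thought Reasoning"): re-ask the question with progressively longer prefixes of the model's own chain of thought and check whether the final answer actually depends on the later steps. `ask` is a hypothetical model-query helper, and the exact prompting differs from the paper.

```python
# Sketch of a truncation-based faithfulness check: if the final answer matches
# the full-reasoning answer even when most of the chain of thought is removed,
# the stated reasoning may be post-hoc rather than load-bearing.
# `ask` is a hypothetical helper that sends a prompt to a model and returns text.
from typing import Callable, List, Tuple

def truncation_curve(
    question: str,
    cot_steps: List[str],
    ask: Callable[[str], str],
) -> List[Tuple[int, bool]]:
    full_answer = ask(
        f"{question}\nReasoning:\n" + "\n".join(cot_steps) + "\nTherefore, the answer is:"
    )
    curve = []
    for k in range(len(cot_steps) + 1):
        partial = "\n".join(cot_steps[:k])
        answer_k = ask(f"{question}\nReasoning:\n{partial}\nTherefore, the answer is:")
        curve.append((k, answer_k.strip() == full_answer.strip()))
    # A flat curve (same answer even at k = 0) suggests the chain of thought is
    # not doing the work it appears to do for this question.
    return curve
```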
- Learning to Give Checkable Answers with Prover-Verifier Games. [2021]
  Cem Anil, Guodong Zhang, Yuhuai Wu, Roger Grosse
  [pdf]
- Faithful Chain-of-Thought Reasoning. [2023]
  Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, Chris Callison-Burch
  [pdf]
- Measuring Faithfulness in Chain-of-Thought Reasoning. [2023]
  Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan
  [pdf]
- Prover-Verifier Games Improve Legibility of LLM Outputs. [2024]
  Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, Yuri Burda
  [pdf]

## Weak-to-strong Generalization

Weak-to-strong generalization investigates whether and how safety properties in less capable AI systems might transfer to more powerful ones. This emerging area explores techniques where weaker models help align stronger ones, offering a potential path to maintaining control as AI capabilities rapidly advance.
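
A self-contained toy version of the basic experimental setup (not the papers' exact protocol): train a low-capacity "weak supervisor" on ground truth, label fresh data with it, train a higher-capacity "strong student" only on those imperfect labels, and report the fraction of the weak-to-ceiling gap the student recovers. With these simple stand-in models the student may recover little of the gap; the papers study when pretrained strong models recover much more.

```python
# Toy weak-to-strong setup: a low-capacity "weak supervisor" produces noisy
# labels, a higher-capacity "strong student" trains only on those labels, and
# we measure how much of the weak-to-ceiling gap the student recovers.
# scikit-learn classifiers stand in for small and large neural networks.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X_sup, y_sup = make_moons(n_samples=2000, noise=0.25, random_state=0)  # supervisor's data
X_tr, y_tr = make_moons(n_samples=2000, noise=0.25, random_state=1)    # student's data
X_te, y_te = make_moons(n_samples=2000, noise=0.25, random_state=2)    # held-out ground truth

weak = LogisticRegression().fit(X_sup, y_sup)        # low-capacity supervisor
weak_labels = weak.predict(X_tr)                     # imperfect supervision signal

strong_on_weak = GradientBoostingClassifier(random_state=0).fit(X_tr, weak_labels)
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

acc_weak = weak.score(X_te, y_te)
acc_w2s = strong_on_weak.score(X_te, y_te)
acc_ceiling = strong_ceiling.score(X_te, y_te)
pgr = (acc_w2s - acc_weak) / (acc_ceiling - acc_weak)   # performance gap recovered

print(f"weak {acc_weak:.3f} | strong-on-weak-labels {acc_w2s:.3f} | ceiling {acc_ceiling:.3f}")
print(f"performance gap recovered: {pgr:.2f}")
```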
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. [2023]
  Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
  [pdf]
- Theoretical Analysis of Weak-to-Strong Generalization. [2024]
  Hunter Lang, David Sontag, Aravindan Vijayaraghavan
  [pdf]
- Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning. [2024]
  Ruimeng Ye, Yang Xiao, Bo Hui
  [pdf]

## Mechanistic Interpretability

Mechanistic interpretability focuses on developing techniques to understand the internal workings of neural networks at a detailed level.
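
A compressed sketch of one basic tool in this space, a difference-of-means "concept direction" computed from contrastive activation sets, in the spirit of the representation-engineering work below. Random arrays stand in for real model activations, and the setup is illustrative rather than any paper's exact method.

```python
# Sketch of a difference-of-means concept direction: collect activations for
# contrastive prompt sets (e.g. honest vs. dishonest completions), take the
# difference of their means as a direction, and score new activations by
# projection onto it. Random arrays stand in for real hidden states.
import numpy as np

rng = np.random.default_rng(0)
d = 128
concept = rng.normal(size=d)                      # unknown "ground-truth" axis

# Placeholder activations: one set shifted along the concept axis, one against it.
acts_pos = rng.normal(size=(200, d)) + concept    # e.g. "honest" prompts
acts_neg = rng.normal(size=(200, d)) - concept    # e.g. "dishonest" prompts

direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
direction /= np.linalg.norm(direction)

def concept_score(activation: np.ndarray) -> float:
    """Projection of a single activation vector onto the concept direction."""
    return float(activation @ direction)

new_pos = rng.normal(size=d) + concept
new_neg = rng.normal(size=d) - concept
print(f"score(+) = {concept_score(new_pos):.2f}, score(-) = {concept_score(new_neg):.2f}")
# The same direction can be added to or subtracted from a model's residual
# stream to steer behavior, the "control" half of representation engineering.
```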
- Representation Engineering: A Top-Down Approach to AI Transparency. [2023]
  Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
  [pdf]
- Interpreting Learned Feedback Patterns in Large Language Models. [2023]
  Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez
  [pdf]
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. [2024]
  Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
  [pdf]
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. [2024]
  Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
  [pdf]
- Predicting Future Actions of Reinforcement Learning Agents. [2024]
  Stephen Chung, Scott Niekum, David Krueger
  [pdf]

## Evaluation

Evaluation research develops benchmarks, tests, and methodologies to assess AI systems' safety properties. This area creates rigorous ways to measure alignment, identify potential risks, and track progress in safety research.
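
A bare-bones sketch of the shape of such a harness: categorized prompts go in, per-category refusal rates come out. The `ask` helper and the keyword-based refusal check are hypothetical placeholders for a real model call and a real grader; the benchmarks below are far more careful about grading.

```python
# Bare-bones safety-eval harness: run categorized prompts through a model and
# report per-category refusal rates. `ask` is a hypothetical model-query helper
# and the keyword check is a crude stand-in for a proper grader model.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")

def is_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def run_eval(prompts: List[Tuple[str, str]], ask: Callable[[str], str]) -> Dict[str, float]:
    """prompts: list of (category, prompt) pairs. Returns refusal rate per category."""
    refused, total = defaultdict(int), defaultdict(int)
    for category, prompt in prompts:
        total[category] += 1
        if is_refusal(ask(prompt)):
            refused[category] += 1
    return {cat: refused[cat] / total[cat] for cat in total}
```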
- Red Teaming Language Models with Language Models. [2022]
  Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving
  [pdf]
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark. [2023]
  Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
  [pdf]
- Model Evaluation for Extreme Risks. [2023]
  Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
  [pdf]
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. [2023]
  Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
  [pdf]
- Sociotechnical Safety Evaluation of Generative AI Systems. [2023]
  Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, et al.
  [pdf]
- Can LLMs Follow Simple Rules? [2023]
  Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner
  [pdf]
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. [2024]
  Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
  [pdf]
- The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. [2024]
  Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al.
  [pdf]
- Evaluating Frontier Models for Dangerous Capabilities. [2024]
  Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, et al.
  [pdf]
- Holistic Safety and Responsibility Evaluations of Advanced AI Models. [2024]
  Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, Will Hawkins, Lisa Anne Hendricks, Ramona Comanescu, Oscar Chang, Mikel Rodriguez, et al.
  [pdf]
- Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [2024]
  Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
  [pdf]
- Forecasting Rare Language Model Behaviors. [2025]
  Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
  [pdf]