Algorithmic Alignment Group

Researching frameworks for human-aligned AI @ MIT CSAIL.

Research

Find us on GitHub.

2024

Papers

Casper, S., Yun, J., Baek, J., Jung, Y., Kim, M., Kwon, K., … & Hadfield-Menell, D. (2024). The SaTML'24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability. arXiv preprint arXiv:2404.02949. BibTeX.

Casper, S., Schulze, L., Patel, O., & Hadfield-Menell, D. (2024). Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv preprint arXiv:2403.05030. BibTeX.

Lynch, A., Guo, P., Ewart, A.*, Casper, S., & Hadfield-Menell, D. (2024). Eight Methods to Evaluate Robust Unlearning in LLMs. arXiv preprint arXiv:2402.16835. BibTeX.

Casper, S., Ezell, C., Siegmann, C., Kolt, N., Curtis, T., Bucknall, B., Haupt, A., Wei, K., Scheurer, J., Hobbhahn, M., Sharkey, L., Krishna, S., Von Hagen, M., Alberti, S., Chan, A., Sun, Q., Gerovitch, M., Bau, D., Tegmark, M., Krueger, D., & Hadfield-Menell, D. (2024). Black-Box Access is Insufficient for Rigorous AI Audits. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. BibTeX.

2023

Papers

Liu, K., Casper, S., Hadfield-Menell, D., & Andreas, J. (2023). Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? EMNLP, 2023. BibTeX.

Casper, S., Davies, X., Shi, C., Gilbert, T., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., Wang, T., Marks, S., Segerie, C., Carroll, M., Peng, A., Christoffersen, P., Damani, M., Slocum, S., Anwar, U., Siththaranjan, A., Nadeau, M., Michaud, E., Pfau, J., Krasheninnikov, D., Chen, X., Langosco, L., Hase, P., Bıyık, E., Dragan, A., Krueger, D., Sadigh, D., & Hadfield-Menell, D. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2307.15217. BibTeX.

Yew, R. J. & Hadfield-Menell, D. (2023). Break It Till You Make It: Limitations of Copyright Liability Under A Pre-training Paradigm of AI Development. Featured at the first annual Generative AI + Law Workshop at ICML 2023. BibTeX. ***Spotlight paper award***

Casper, S., Guo, Z., Mogulothu, S., Marinov, Z., Deshpande, C., Yew, R. J., Dai, Z., & Hadfield-Menell, D. (2023). Measuring the Success of Diffusion Models at Imitating Human Artists. Featured at the first annual Generative AI + Law Workshop at ICML 2023. BibTeX. ***Spotlight paper award***

Casper, S., Lin, J., Kwon, J., Culp, G., & Hadfield-Menell, D. (2023). Explore, Establish, Exploit: Red Teaming Language Models from Scratch. arXiv preprint arXiv:2306.09442. BibTeX.

Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., … & Sandholm, T. (2023). Steering No-Regret Learners to Optimal Equilibria. arXiv preprint arXiv:2306.05221. BibTeX.

Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., … & Sandholm, T. (2023). Computing Optimal Equilibria and Mechanisms via Learning in Zero-Sum Extensive-Form Games. Advances in Neural Information Processing Systems, 36. BibTeX.

Yew, R.J., Curtis, T.L., Leake, M., Podimata, C., Hadfield-Menell, D. (2023). Policy Paths Toward an Understanding of AI Interfaces: A Case Study on Recommendation Platforms. 2023 ACM CHI Designing Technology and Policy Simultaneously Workshop.

Casper, S., Li, Y., Li, J., Bu, T., Zhang, K., Hariharan, K., & Hadfield-Menell, D. (2023). Red Teaming Deep Neural Networks with Feature Synthesis Tools. Advances in Neural Information Processing Systems, 36. BibTeX.

Haupt, A., Hadfield-Menell, D., & Podimata, C. (2023). Recommending to Strategic Users. arXiv preprint arXiv:2302.06559. BibTeX.

Resources

Contracts Library for contract-based multi-agent RL. Christoffersen, P. J. K., Haupt, A. A., & Damani, M.

CommonClaim Dataset of 20k statements labeled by humans as common-knowledge true, common-knowledge false, or neither. Casper, S., Lin, J., Kwon, J., Culp, G., & Hadfield-Menell, D.

2022

Papers

Casper, S., Hariharan, K., & Hadfield-Menell, D. (2022). Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks. arXiv preprint. BibTeX. *** Best paper award at the 2022 NeurIPS Machine Learning Safety Workshop ***

Curmei, M., Haupt, A. A., Recht, B., & Hadfield-Menell, D. (2022, September). Towards Psychologically-Grounded Dynamic Preference Models. In Proceedings of the 16th ACM Conference on Recommender Systems (pp. 35-48). BibTeX.

Räuker, T., Ho, A., Casper, S., & Hadfield-Menell, D. (2022). Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. SaTML 2023. BibTeX.

Casper, S., Hadfield-Menell, D., & Kreiman, G. (2022). White-Box Adversarial Policies in Deep Reinforcement Learning. BibTeX.

Christoffersen, P. J. K., Haupt, A. A., & Hadfield-Menell, D. (2022). Get It in Writing: Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL. BibTeX.

Yew, R. J., & Hadfield-Menell, D. (2022). A Penalty Default Approach to Preemptive Harm Disclosure and Mitigation for AI Systems. In Proceedings of the 5th AAAI/ACM Conference on AI, Ethics, and Society. BibTeX.

Casper, S., Nadeau, M., Hadfield-Menell, D., & Kreiman, G. (2022). Robust Feature-Level Adversaries are Interpretability Tools. Advances in Neural Information Processing Systems, 35, 33093-33106. BibTeX.