Research
Find us on GitHub.
2024
Papers
Ma, R., Qu, J., Bobu, A., & Hadfield-Menell, D. (2024). Goal inference from open-ended dialog. arXiv preprint arXiv:2410.13957.
Sheshadri, A., Ewart, A., Guo, P., Lynch, A., Wu, C., Hebbar, V., Sleight, H., Cooper Stickland, A., Perez, E., Hadfield-Menell, D., & Casper, S. (2024). Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. arXiv preprint arXiv:2407.15549. BibTeX
Casper, S., Yun, J., Baek, J., Jung, Y., Kim, M., Kwon, K., … & Hadfield-Menell, D. (2024). The SaTML’24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability. arXiv preprint arXiv:2404.02949. BibTeX
Casper, S., Schulze, L., Patel, O., Hadfield-Menell, D. (2024). Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv preprint arXiv:2403.05030. BibTeX
Lynch, A., Guo, P., Ewart, A.*, Casper, S., Hadfield-Menell, D. (2024). Eight Methods to Evaluate Robust Unlearning in LLMs. arXiv preprint arXiv:2402.16835. BibTeX
Casper, S., Ezell, C., Siegmann, C., Kolt, N., Curtis, T., Bucknall, B., Haupt, A., Wei, K., Scheurer, J., Hobbhahn, M., Sharkey, L., Krishna, S., Von Hagen, M., Alberti, S., Chan, A., Sun, Q., Gerovitch, M., Bau, D., Tegmark, M., Krueger, D., Hadfield-Menell, D. (2024). Black-Box Access is Insufficient for Rigorous AI Audits. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. BibTeX
2023
Papers
Liu, K., Casper, S., Hadfield-Menell, D., Andreas, J. (2023). Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? EMNLP, 2023. BibTeX
Casper, S., Davies, X., Shi, C., Gilbert, T., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., Wang, T., Marks, S., Segerie, C., Carroll, M., Peng, A., Christoffersen, P., Damani, M., Slocum, S., Anwar, U., Siththaranjan, A., Nadeau, M., Michaud, E., Pfau, J., Krasheninnikov, D., Chen, X., Langosco, L., Hase, P., Bıyık, E., Dragan, A., Krueger, D., Sadigh, D., & Hadfield-Menell, D. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2307.15217. BibTeX.
Yew, R. J. & Hadfield-Menell, D. (2023). Break It Till You Make It: Limitations of Copyright Liability Under A Pre-training Paradigm of AI Development. Featured at the first annual Generative AI + Law Workshop at ICML 2023. BibTeX. ***Spotlight paper award***
Casper, S., Guo, Z., Mogulothu, S., Marinov, Z., Deshpande, C., Yew, R. J., Dai, Z., & Hadfield-Menell, D. (2023). Measuring the Success of Diffusion Models at Imitating Human Artists. Featured at the first annual Generative AI + Law Workshop at ICML 2023. BibTeX. ***Spotlight paper award***
Casper, S., Lin, J., Kwon, J., Culp, G., & Hadfield-Menell, D. (2023). Explore, Establish, Exploit: Red Teaming Language Models from Scratch. arXiv preprint arXiv:2306.09442. BibTeX.
Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., … & Sandholm, T. (2023). Steering No-Regret Learners to Optimal Equilibria. arXiv preprint arXiv:2306.05221. BibTeX.
Zhang, B. H., Farina, G., Anagnostides, I., Cacciamani, F., McAleer, S. M., Haupt, A. A., … & Sandholm, T. (2023). Computing Optimal Equilibria and Mechanisms via Learning in Zero-Sum Extensive-Form Games. Advances in Neural Information Processing Systems, 36. BibTeX.
Yew, R.J., Curtis, T.L., Leake, M., Podimata, C., Hadfield-Menell, D. (2023). Policy Paths Toward an Understanding of AI Interfaces: A Case Study on Recommendation Platforms. 2023 ACM CHI Designing Technology and Policy Simultaneously Workshop.
Casper, S., Li, Y., Li, J., Bu, T., Zhang, K., Hariharan, K., Hadfield-Menell, D. (2023). Red Teaming Deep Neural Networks with Feature Synthesis Tools. Advances in Neural Information Processing Systems, 36. BibTeX.
Haupt, A., Hadfield-Menell, D., & Podimata, C. (2023). Recommending to Strategic Users. arXiv preprint arXiv:2302.06559. BibTeX.
Resources
Contracts Library for contract-based multiagent RL. Christoffersen, P.J.K., Haupt, A.A., Damani, M.
CommonClaim Dataset of 20k statements labeled by humans as common-knowledge true, common-knowledge false, and neither. Casper, S., Lin, J., Kwon, J., Culp, G., & Hadfield-Menell, D.
2022
Papers
Casper, S., Hariharan, K., Hadfield-Menell, D. (2022). Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks. arXiv preprint. BibTeX. ***Best paper award, 2022 NeurIPS Machine Learning Safety Workshop***
Curmei, M., Haupt, A. A., Recht, B., & Hadfield-Menell, D. (2022, September). Towards Psychologically-Grounded Dynamic Preference Models. In Proceedings of the 16th ACM Conference on Recommender Systems (pp. 35-48). BibTeX.
Räuker, T., Ho, A., Casper, S., & Hadfield-Menell, D. (2022). Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. SaTML 2023. BibTeX.
Casper, S., Hadfield-Menell, D., Kreiman, G. (2022). White-Box Adversarial Policies in Deep Reinforcement Learning. BibTeX.
Christoffersen, P.J.K., Haupt, A.A., Hadfield-Menell, D. (2022). Get It in Writing: Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL. BibTeX.
Yew, R.J. and Hadfield-Menell, D. (2022). A Penalty Default Approach to Preemptive Harm Disclosure and Mitigation for AI Systems. In Proceedings of the 5th AAAI/ACM Conference on AI, Ethics, and Society. BibTeX.
Casper, S., Nadeau, M., Hadfield-Menell, D., & Kreiman, G. (2022). Robust Feature-Level Adversaries are Interpretability Tools. Advances in Neural Information Processing Systems, 35, 33093-33106. BibTeX.