
Algorithmic Alignment Group

Researching frameworks for human-aligned AI @ MIT CSAIL.


Find us on GitHub.


Räuker, T., Ho, A., Casper, S., & Hadfield-Menell, D. (2022). Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. arXiv preprint arXiv:2207.13243.

Casper, S., Hadfield-Menell, D., & Kreiman, G. (2022). White-Box Adversarial Policies in Deep Reinforcement Learning.

Christoffersen, P.J.K., Haupt, A.A., & Hadfield-Menell, D. (2022). Get It in Writing: Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL.

Yew, R.J., & Hadfield-Menell, D. (2022). A Penalty Default Approach to Preemptive Harm Disclosure and Mitigation for AI Systems. In Proceedings of the 5th AAAI/ACM Conference on AI, Ethics, and Society.

Casper, S., Nadeau, M., Hadfield-Menell, D., & Kreiman, G. (2022). Robust Feature-Level Adversaries are Interpretability Tools.