Causal and Language-Based RL Interpretability

This project area studies how to inspect learned RL policies beyond aggregate reward. It covers two lines of work: causal state distillation, which targets local explanations of an agent's behavior, and language-model mental modeling, which probes whether an agent's interaction history gives a language model enough to reason usefully about that agent.

Related publications

Causal State Distillation for Explainable Reinforcement Learning, CLeaR 2024.
Mental Modeling of Reinforcement Learning Agents by Language Models, TMLR 2025.

TODO

TODO: add project image or diagnostic example.
TODO: add code link if public.