Causal and Language-Based RL Interpretability

Analyzing RL agents with causal state distillation and language-model mental modeling to inspect behavior learned from trajectories.

This project area studies how to inspect learned reinforcement learning (RL) policies beyond aggregate reward. It covers two lines of work: causal state distillation, which targets local explanations of agent behavior, and language-model mental modeling, which probes whether interaction histories support useful reasoning about RL agents.

Related publications

  • Causal State Distillation for Explainable Reinforcement Learning, CLeaR 2024.
  • Mental Modeling of Reinforcement Learning Agents by Language Models, TMLR 2025.

TODO

  • Add project image or diagnostic example.
  • Add code link if public.