Critical Pitfall in Reward Learning from Human Feedback

In Shaheen et al. (2026), we show that human feedback in reward learning can reflect not only what people want but also what they believe about the environment dynamics. If a feedback provider misunderstands how the world works, standard reward learning methods can mistake that misconception for a true preference.
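The toy sketch below illustrates the idea; it is not the experimental setup or the analysis from the paper. It assumes a hypothetical two-route navigation task, made-up reward values, and a Bradley-Terry choice model (a common modelling assumption in preference-based reward learning). The human's reward is held fixed and only their believed probability of slipping into a hazard on the short route changes, yet the stated preference flips.

```python
import math

# Hypothetical reward held by the human (illustrative numbers only).
GOAL_REWARD = 10.0
HAZARD_PENALTY = -10.0
STEP_COST = -1.0


def expected_return(num_steps, slip_prob):
    """Expected return of a route with `num_steps` moves.

    With probability `slip_prob` the route ends in the hazard,
    otherwise it ends at the goal.
    """
    step_total = STEP_COST * num_steps
    outcome = (1.0 - slip_prob) * GOAL_REWARD + slip_prob * HAZARD_PENALTY
    return step_total + outcome


def preference_prob(return_a, return_b):
    """Bradley-Terry probability that route A is preferred over route B."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def prob_prefers_short(believed_slip_prob):
    """Probability that the human prefers the short route, given their belief."""
    # Short route: 2 steps, passes the hazard. Long route: 5 steps, always safe.
    short_route = expected_return(num_steps=2, slip_prob=believed_slip_prob)
    long_route = expected_return(num_steps=5, slip_prob=0.0)
    return preference_prob(short_route, long_route)


# Same reward in both cases; only the believed dynamics differ.
accurate = prob_prefers_short(believed_slip_prob=0.0)  # matches the true dynamics
mistaken = prob_prefers_short(believed_slip_prob=0.5)  # human wrongly expects slipping

print(f"P(prefers short route | accurate belief) = {accurate:.3f}")
print(f"P(prefers short route | mistaken belief) = {mistaken:.3f}")
```

A learner that assumes the human shares the true (non-slippery) dynamics would have to explain the second preference by changing the reward, for example by treating the short route as intrinsically undesirable, even though the reward never changed.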

More details coming soon…

Acknowledgements

This research was supported in part by NSF grant 2047186 and the 2025 ASU Graduate Student Government JumpStart Grant. The study was approved by the Arizona State University Institutional Review Board.

References

2026

IJCAI
Empirical Evidence and Analysis of a Critical Pitfall in Reward Learning from Human Feedback

Taha Shaheen, Stephen G. West, and Yu Zhang

In Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026), Aug 2026


Reward learning via human feedback is a crucial capability for beneficial AI. Current methods are built on decision-making theories that assume a matched dynamics model between the learning agent and the feedback provider. However, humans often form imperfect internal dynamics models, and their feedback reflects these misconceptions. While this relationship has long been hypothesised, its manifestation in sequential decision-making remains largely an assumption. Our work provides the first comprehensive empirical investigation of this relationship through a randomized controlled trial (N=211). We followed a two-stage design where we first initialized the participants’ understanding of the dynamics in a grid-world navigation domain and then manipulated it using text-based instructions. Causal mediation analysis revealed that humans’ internal models play a mediating role in feedback behaviour. We show that this relationship is invariant across visual contexts and is robust to three common feedback types: pairwise preferences, trajectory corrections, and off-switch interventions. These findings confirm a critical limitation of current reward learning methods and establish the missing psychological foundation for approaches that incorporate dynamics understanding.
@inproceedings{shaheen2026criticalpitfall,
  title = {Empirical Evidence and Analysis of a Critical Pitfall in Reward Learning from Human Feedback},
  author = {Shaheen, Taha and West, Stephen G. and Zhang, Yu},
  booktitle = {Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026)},
  year = {2026},
  month = aug,
  url = {https://par.nsf.gov/servlets/purl/10683099},
}