Reward Processing Biases in Humans and RL Agents

Abstract

Drawing inspiration from studies of human behavior, we propose a general and flexible parametric framework for sequential decision-making based on a two-stream mechanism for processing positive and negative rewards. Our framework extends standard problem settings such as multi-armed bandits (MAB), contextual bandits (CB), and general reinforcement learning (RL), allowing us to incorporate a wide range of reward-processing biases. Such biases are an important component of human decision-making; modeling them can help us better understand a wide spectrum of multi-agent interactions in complex real-world socioeconomic systems, as well as various neuropsychiatric conditions associated with disruptions in normal reward processing. The reward-processing biases are modeled by a combination of weights on the incoming two-stream rewards and on memories of the prior reward history, resulting in more flexible parametric methods that can outperform standard algorithms for sequential decision-making, such as Q-Learning and SARSA, as well as the recently proposed Double Q-Learning, on a variety of simulated and realistic tasks.
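
To make the two-stream idea concrete, below is a minimal sketch of a tabular agent that processes positive and negative rewards in separate value tables, each with its own weight on incoming rewards and on the memory of previously stored values. This is an illustrative approximation of the framework described in the abstract, not the authors' exact algorithm; the class name `TwoStreamQAgent` and the parameter names (`w_pos`, `w_neg`, `lambda_pos`, `lambda_neg`) are hypothetical. Setting all four weights to 1 recovers standard Q-Learning on the combined value.

```python
import numpy as np

class TwoStreamQAgent:
    """Hypothetical two-stream tabular Q-learning agent.

    Positive and negative rewards are routed to separate value tables.
    Each stream has its own weight on incoming rewards (w_pos / w_neg)
    and on the memory of past values (lambda_pos / lambda_neg),
    so reward-processing biases can be expressed by varying these weights.
    """

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 epsilon=0.1, w_pos=1.0, w_neg=1.0,
                 lambda_pos=1.0, lambda_neg=1.0):
        self.q_pos = np.zeros((n_states, n_actions))  # values from positive rewards
        self.q_neg = np.zeros((n_states, n_actions))  # values from negative rewards
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.w_pos, self.w_neg = w_pos, w_neg
        self.lambda_pos, self.lambda_neg = lambda_pos, lambda_neg

    def act(self, state):
        # Epsilon-greedy action selection on the combined value of both streams.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.q_pos.shape[1])
        return int(np.argmax(self.q_pos[state] + self.q_neg[state]))

    def update(self, state, action, reward, next_state):
        # Split the scalar reward into its positive and negative parts.
        r_pos, r_neg = max(reward, 0.0), min(reward, 0.0)
        # Each stream bootstraps from its own table; the lambda weights
        # discount (or amplify) the memory of the previously stored value.
        target_pos = self.w_pos * r_pos + self.gamma * self.q_pos[next_state].max()
        target_neg = self.w_neg * r_neg + self.gamma * self.q_neg[next_state].max()
        self.q_pos[state, action] = (self.lambda_pos * self.q_pos[state, action]
                                     + self.alpha * (target_pos - self.q_pos[state, action]))
        self.q_neg[state, action] = (self.lambda_neg * self.q_neg[state, action]
                                     + self.alpha * (target_neg - self.q_neg[state, action]))
```

Biases can then be simulated by skewing the weights, e.g. `w_neg > w_pos` for an agent that over-weights losses, or a small `lambda_pos` for one that quickly forgets past positive outcomes.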

Speaker

Irina Rish

Class material

Slides