Introduction
This dataset consists of human preferences over different trajectories in a game that can be framed as a Markov decision process. The game is grid-based: a car must move to a goal while avoiding obstacles, minimizing costs (e.g., gas costs), and collecting coins. Trajectories in this game consist of sequences of states and actions, and the dataset collects human preferences over segments of these trajectories. For example, do humans prefer that the car drive out of its way to collect a coin, or that it drive directly to the goal?
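For concreteness, the sketch below shows one way a trajectory segment and a pairwise preference could be represented in Python. The class and field names are illustrative only and do not reflect the dataset's actual file format, which is documented in the data report.

    # Illustrative only: a minimal representation of a segment and a preference.
    from dataclasses import dataclass
    from typing import List, Tuple

    State = Tuple[int, int]    # (row, col) grid cell occupied by the car
    Action = int               # one of the four cardinal moves, e.g., 0-3

    @dataclass
    class Segment:
        states: List[State]    # cells visited, in order
        actions: List[Action]  # action taken from each state

    @dataclass
    class Preference:
        first: Segment
        second: Segment
        label: int             # 0 if the human preferred `first`, 1 for `second`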
The data was collected to study how to learn a reward function from human subjects' preferences for use with reinforcement learning (RL). RL is a powerful tool that allows robots and other software agents to learn new behaviors through trial and error. Recent advancements in RL have significantly improved its effectiveness, making it increasingly applicable to real-world robotics challenges such as quadrupedal locomotion and autonomous driving. To increase the utility and alignment of RL agents, we use this game to study how to learn a reward function from human preferences between pairs of trajectory segments. We provide the game code, which we created for this study, in our corresponding codebase. Also included is a data report file, entitled A_data_report.pdf, which gives a detailed account of how the dataset was obtained and what it contains.
Data Collection
Subjects were shown various pairs of behaviors (i.e., player trajectories) in this game and asked to label which one they preferred. The game is designed so that its objective is easy to understand, but identifying optimal behavior is difficult for players. This serves as a non-trivial test bed for preference learning algorithms, since a good reward function learned from preferences must correctly balance various reward features.
Game Design
We designed a simple grid-world-style game to show subjects when eliciting preferences. The game consists of a grid of cells, each with a specific road surface type. The player can move one cell in any of the four cardinal directions, and the player's goal is to maximize the sum of rewards. The game can terminate either at the destination, for +50 reward, or in failure at a sheep, for −50 reward. Cells may contain other items that yield positive or negative reward, and the player is penalized −1 for every move they make. The implementation of this game is in our accompanying codebase. We chose one instantiation of this game for gathering our dataset of human preferences. This specific instantiation has a 10 × 10 grid, and from every state, the highest return possible involves reaching the goal rather than hitting a sheep or perpetually avoiding termination. Figure 1 shows this task.
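As a rough illustration of this reward structure, the sketch below computes the return (sum of rewards) of a trajectory. The item reward values are placeholders; the exact values and cell types used in the game are defined in the accompanying codebase.

    # Illustrative only: return of a trajectory under the reward structure above.
    # Item values are placeholders; see the codebase for the actual ones.
    ITEM_REWARD = {"coin": 1}          # hypothetical item-to-reward mapping

    def trajectory_return(cells_entered):
        """cells_entered: list of dicts describing each cell the player enters."""
        total = 0
        for cell in cells_entered:
            total += -1                                    # per-move penalty
            total += ITEM_REWARD.get(cell.get("item"), 0)  # item bonus/penalty
            if cell.get("terminal") == "goal":
                total += 50                                # reached the destination
                break
            if cell.get("terminal") == "sheep":
                total -= 50                                # hit a sheep
                break
        return total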
Human Subjects
A total of 143 subjects were recruited via Amazon Mechanical Turk. We filtered workers based on task comprehension (see the data report for more details) and required that all workers were located in the United States, had an approval rating of at least 99%, and had completed at least 100 other MTurk HITs. The resulting dataset comprises data from 50 subjects: 1812 preferences over 1245 unique segment pairs. This data collection was IRB-approved.
Dataset Organization and Contents
The full dataset is organized in two directories. The directory deliver_mdp contains all collected human preferences, as well as the corresponding game that subjects were shown and all additional data needed to learn a reward function from these preferences. The directory entitled random_mdps contains 200 additional game instantiations, as well as synthetically generated preferences for each of these games. For further information on the specific files and what they contain, refer to the data report located in this repository.
Results Summary
A preference model is a mathematical representation of a person's preferences over different trajectory segments. Preferences are expressed as pairwise comparisons between segments: the preference model takes in two segments and outputs the probability that a human would prefer one segment to the other. Given a preference model and a dataset of preferences assumed to be generated by that model, one can then learn a reward function for an RL task. This dataset was used to evaluate two preference models: the ubiquitously assumed partial return model and our proposed regret model. Our corresponding paper shows that the regret model is a better predictor of human preferences than the partial return model, and that the reward function learned under the regret model, when optimized, induces behavior that performs better under the game's true reward function.
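For intuition, the sketch below shows a logistic formulation of the partial return preference model, in which the probability of preferring one segment is a function of the difference in summed rewards over the two segments. The regret model has an analogous logistic form with regret in place of partial return; its exact definition is given in the paper and is not reproduced here.

    # Illustrative sketch of the partial return preference model: the probability
    # that segment 1 is preferred over segment 2 is a logistic function of the
    # difference in their summed rewards.
    import math

    def partial_return(segment_rewards):
        return sum(segment_rewards)

    def p_prefer_first(rewards_1, rewards_2):
        diff = partial_return(rewards_1) - partial_return(rewards_2)
        return 1.0 / (1.0 + math.exp(-diff))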
Code
We provide scripts for reproducing the experiments, which include learning and evaluating reward functions from the provided preference datasets. The code accompanying this dataset can be found here. The dataset also contains a script entitled example.py, which provides a bare-bones example of loading a preference dataset and learning a reward function from it. All code is open source.
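For readers who want a starting point before opening example.py, the sketch below (which is not the contents of example.py) fits a linear reward function over per-segment feature counts by gradient descent on the cross-entropy loss of a logistic preference model; the feature representation and learning procedure here are generic assumptions, not the exact method used in the paper.

    # Generic sketch: learn a linear reward over segment features from preferences.
    import numpy as np

    def learn_reward(features_1, features_2, labels, lr=0.1, epochs=1000):
        """
        features_1, features_2: (n_prefs, n_features) summed feature counts per segment
        labels: (n_prefs,) 1 if the first segment was preferred, else 0
        Returns a weight vector w such that reward(segment) ~= w @ features.
        """
        w = np.zeros(features_1.shape[1])
        for _ in range(epochs):
            diff = (features_1 - features_2) @ w           # predicted return difference
            p = 1.0 / (1.0 + np.exp(-diff))                # P(first segment preferred)
            grad = (features_1 - features_2).T @ (p - labels) / len(labels)
            w -= lr * grad                                 # gradient descent step
        return w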
Data Reuse
This dataset can be used to reproduce the analysis in the paper Models of Human Preference for Learning Reward Functions (see the Related Publication referenced in the metadata), as well as to train and test new reward learning algorithms.
Bulk Data Download
A script named download_data.py is provided for bulk data download.
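Assuming a standard Python 3 environment, it can be invoked from the repository root as shown below; any arguments it accepts are documented in the script itself.

    python download_data.py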