Introduction
This dataset consists of human preferences over different trajectories in a game that can be framed as a Markov decision process. The game is grid-based: a car must move to a goal while avoiding obstacles, minimizing costs (e.g., gas costs), and collecting coins. Trajectories in this game consist of sequences of states and actions, and the dataset collects human preferences over segments of these trajectories. For example, do humans prefer that the car drive out of its way to collect a coin, or that it drive directly to the goal?
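For concreteness, the sketch below shows one way a trajectory segment and a pairwise preference could be represented in Python. The class and field names are illustrative only and do not reflect the dataset's actual file format, which is documented in the data report.

    # Illustrative only: a minimal representation of a segment and a preference.
    from dataclasses import dataclass
    from typing import List, Tuple

    State = Tuple[int, int]    # (row, col) grid cell occupied by the car
    Action = int               # one of the four cardinal moves, e.g., 0-3

    @dataclass
    class Segment:
        states: List[State]    # cells visited, in order
        actions: List[Action]  # action taken from each state

    @dataclass
    class Preference:
        first: Segment
        second: Segment
        label: int             # 0 if the human preferred `first`, 1 for `second`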
The data was collected to study how to learn a reward function from human subjects' preferences for use with reinforcement learning (RL). RL is a powerful tool that allows robots and other software agents to learn new behaviors through trial and error. Recent advancements in RL have significantly improved its effectiveness, making it increasingly applicable to real-world robotics challenges such as quadrupedal locomotion and autonomous driving. To increase the utility and alignment of RL agents, we use this game to study how to learn a reward function from human preferences between pairs of trajectory segments. We provide the game code, which we created for this study, in our corresponding codebase. Also included is a data report file, entitled A_data_report.pdf, which gives a detailed account of how the dataset was obtained and what it contains.
Data Collection
Subjects were shown various pairs of behaviors (i.e., player trajectories) in this game and asked to label which one they preferred. The game is designed so that its objective is easy to understand, but identifying optimal behavior is difficult for players. This serves as a non-trivial test bed for preference learning algorithms, since a good reward function learned from preferences must correctly balance various reward features.
Game Design
We designed a simple grid-world-style game to show subjects when eliciting preferences. The game consists of a grid of cells, each with a specific road surface type. The player can move one cell in any of the four cardinal directions, and the player's goal is to maximize the sum of rewards. The game can terminate either at the destination, for +50 reward, or in failure at a sheep, for −50 reward. Cells may contain other items that yield positive or negative reward, and the player is penalized −1 for every move they make. The implementation of this game is in our accompanying codebase. We chose one instantiation of this game for gathering our dataset of human preferences. This specific instantiation has a 10 × 10 grid, and from every state, the highest return possible involves reaching the goal rather than hitting a sheep or perpetually avoiding termination. Figure 1 shows this task.
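As a rough illustration of this reward structure, the sketch below computes the return (sum of rewards) of a trajectory. The item reward values are placeholders; the exact values and cell types used in the game are defined in the accompanying codebase.

    # Illustrative only: return of a trajectory under the reward structure above.
    # Item values are placeholders; see the codebase for the actual ones.
    ITEM_REWARD = {"coin": 1}          # hypothetical item-to-reward mapping

    def trajectory_return(cells_entered):
        """cells_entered: list of dicts describing each cell the player enters."""
        total = 0
        for cell in cells_entered:
            total += -1                                    # per-move penalty
            total += ITEM_REWARD.get(cell.get("item"), 0)  # item bonus/penalty
            if cell.get("terminal") == "goal":
                total += 50                                # reached the destination
                break
            if cell.get("terminal") == "sheep":
                total -= 50                                # hit a sheep
                break
        return total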
Human Subjects
A total of 143 subjects were recruited via Amazon Mechanical Turk. We filtered workers based on task comprehension (see the data report for more details) and required that all workers were located in the United States, had an approval rating of at least 99%, and had completed at least 100 other MTurk HITs. The resulting dataset comprises data from 50 subjects: 1812 preferences over 1245 unique segment pairs. This data collection was IRB-approved.
Dataset Organization and Contents
The full dataset is organized in two directories. The directory deliver_mdp contains all collected human preferences, as well as the corresponding game that subjects were shown and all additional data needed to learn a reward function from these preferences. The directory entitled random_mdps contains 200 additional game instantiations, as well as synthetically generated preferences for each of these games. For further information on the specific files and what they contain, refer to the data report located in this repository.
Results Summary
A preference model is a mathematical representation of a person's preferences over different trajectory segments. Preferences are expressed as pairwise comparisons between segments: the preference model takes in two segments and outputs the probability that a human would prefer one segment to the other. Given a preference model and a dataset of preferences assumed to be generated by that model, one can then learn a reward function for an RL task. This dataset was used to evaluate two preference models: the ubiquitously assumed partial return model and our proposed regret model. Our corresponding paper shows that the regret model is a better predictor of human preferences than the partial return model, and that the reward function learned under the regret model, when optimized, induces behavior that performs better under the game's true reward function.
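For intuition, the sketch below shows a logistic formulation of the partial return preference model, in which the probability of preferring one segment is a function of the difference in summed rewards over the two segments. The regret model has an analogous logistic form with regret in place of partial return; its exact definition is given in the paper and is not reproduced here.

    # Illustrative sketch of the partial return preference model: the probability
    # that segment 1 is preferred over segment 2 is a logistic function of the
    # difference in their summed rewards.
    import math

    def partial_return(segment_rewards):
        return sum(segment_rewards)

    def p_prefer_first(rewards_1, rewards_2):
        diff = partial_return(rewards_1) - partial_return(rewards_2)
        return 1.0 / (1.0 + math.exp(-diff))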
Code
We provide scripts for reproducing the experiments, which include learning and evaluating reward functions from the provided preference datasets. The code accompanying this dataset can be found here. The dataset also contains a script entitled example.py, which provides a bare-bones example of loading a preference dataset and learning a reward function from it. All code is open source.
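For readers who want a starting point before opening example.py, the sketch below (which is not the contents of example.py) fits a linear reward function over per-segment feature counts by gradient descent on the cross-entropy loss of a logistic preference model; the feature representation and learning procedure here are generic assumptions, not the exact method used in the paper.

    # Generic sketch: learn a linear reward over segment features from preferences.
    import numpy as np

    def learn_reward(features_1, features_2, labels, lr=0.1, epochs=1000):
        """
        features_1, features_2: (n_prefs, n_features) summed feature counts per segment
        labels: (n_prefs,) 1 if the first segment was preferred, else 0
        Returns a weight vector w such that reward(segment) ~= w @ features.
        """
        w = np.zeros(features_1.shape[1])
        for _ in range(epochs):
            diff = (features_1 - features_2) @ w           # predicted return difference
            p = 1.0 / (1.0 + np.exp(-diff))                # P(first segment preferred)
            grad = (features_1 - features_2).T @ (p - labels) / len(labels)
            w -= lr * grad                                 # gradient descent step
        return w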
Data Reuse
This dataset can be used to reproduce the analysis in the paper Models of Human Preference for Learning Reward Functions (see the Related Publication referenced in the metadata), as well as to train and test new reward learning algorithms.
Bulk Data Download
A script named download_data.py is provided for bulk data download.
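Assuming a standard Python 3 environment, it can be invoked from the repository root as shown below; any arguments it accepts are documented in the script itself.

    python download_data.py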