Introduction
As mobile service robots increasingly operate in human-centered environments, they must learn to use elevators without modifying elevator hardware. This task traditionally involves processing an image of an elevator control panel with instance segmentation of the buttons and labels, reading the text on the labels, and associating each button with its corresponding label. In addition to this standard approach, our project implements a further segmentation step in which buttons and labels missed by the first feature detection pass are recovered. In a robust system, the training data for both the first-pass segmentation model and the recovery model requires pixel-level annotations of buttons and labels, while the label-reading step needs annotations of the text on the labels. Current elevator panel feature datasets, however, either do not provide segmentation annotations or do not distinguish between buttons and labels.
The “Living With Robots Elevator Button Dataset” was assembled to train segmentation and scene text recognition models on realistic scenarios with varying conditions such as lighting, blur, and camera position relative to the elevator control panel. Each button is labeled with the same action as its corresponding label so that a button-label association model can be trained. A pipeline covering all steps of the task was trained and evaluated on this dataset, producing state-of-the-art accuracy and precision results.
Dataset Contents
- 400 JPEG images of elevator panels.
  - 292 taken of 25 different elevators across 24 buildings on the University of Texas at Austin campus.
  - 108 sourced from the internet, with varying lighting, quality, and perspective conditions.
- JSON files containing border annotations, button and label distinctions, and text on labels for the Campus and Internet sub-datasets.
- PyTorch files containing state dictionaries with network weights (see the loading sketch after this list) for:
  - The first-pass segmentation model, a transformer-based model trained to segment buttons and labels in a full-color image: “segmentation_vit_model.pth”.
  - The feature-recovery segmentation model, a transformer-based model trained to segment masks of missed buttons and labels from the class map output of the first pass: “recovery_vit_model.pth”.
  - The scene text recognition model, fine-tuned from PARSeq to read the special characters present on elevator panel labels: “parseq_str.ckpt”.
- Links to the data loader, training, and evaluation scripts for the segmentation models, hosted on GitHub.
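As an orientation aid, the sketch below shows one way the released weight files might be loaded with PyTorch. The model class name is a placeholder for an architecture defined in the linked repository, not part of the dataset itself.

```python
import torch

# Hedged sketch: loading the released state dictionaries onto the CPU.
# "SegmentationViT" is a placeholder for the model class defined in the
# linked GitHub repository; substitute the actual class when using it.
seg_state = torch.load("Analysis/segmentation_vit_model.pth", map_location="cpu")
rec_state = torch.load("Analysis/recovery_vit_model.pth", map_location="cpu")

# Attaching the weights to a model instance from the repository would look like:
# seg_model = SegmentationViT()
# seg_model.load_state_dict(seg_state)
# seg_model.eval()
```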
Both subsets consist of JPEG images collected through two different means. The Campus subset images were taken in buildings on and around the University of Texas at Austin campus. All pictures were taken facing the elevator panel’s wall roughly straight-on, with the camera positioned at each of nine locations in a 3x3 grid relative to the panel: top left, top middle, top right, middle left, center, middle right, bottom left, bottom middle, and bottom right. A subset of these also includes versions of each image with the elevator door closed or open, varying the lighting and background conditions. All of these images are 3024 × 4032 pixels and were taken with either an iPhone 12 or an iPhone 12 Pro Max.
The Internet subset deliberately features user-shared photos with irregular or uncommon panel characteristics. Images in this subset vary widely in resolution, clarity, button and label shape, and camera angle, adding variety to the dataset and robustness to any models trained with it.
Data Segmentation
The segmentation annotations for this dataset served two training purposes. First, they were used to identify the pixels that comprise the elevator buttons and labels in the images, so that a segmentation model could be trained to recognize buttons and labels at the pixel level. The second use, and the one that most distinguishes our approach, was training a separate model to recover missed button and label detections. The annotations were used to generate a class map for each image, which was then procedurally masked to provide an input (the remaining masks) and a target (the hidden masks) for the recovery model.
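To make the recovery setup concrete, the following is a minimal sketch of the procedural masking idea. The class encoding and hide probability are assumptions for illustration; this is not the exact routine used in the released training scripts.

```python
import numpy as np

# Illustrative sketch of building one training example for the recovery model:
# start from a per-pixel class map derived from the annotations (assumed here
# as 0 = background, 1 = button, 2 = label) and a list of boolean instance
# masks, then hide a random subset of features. The remaining class map is the
# recovery model's input; the hidden masks form its target.

def make_recovery_example(class_map, instance_masks, hide_prob=0.3, rng=None):
    rng = rng or np.random.default_rng()
    visible = class_map.copy()
    target = np.zeros_like(class_map)
    for mask in instance_masks:              # one boolean mask per button/label
        if rng.random() < hide_prob:         # simulate a missed detection
            target[mask] = class_map[mask]   # the recovery model must predict it
            visible[mask] = 0                # erase it from the model input
    return visible, target
```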
Data Annotation Method
All annotations were made with the VGG Image Annotator (VIA), published by the Visual Geometry Group at the University of Oxford. Every image has its own set of annotations, identified by the file naming convention. For the segmentation annotations, any button that was largely in view was segmented with whichever shape most closely fit the feature: rectangle, ellipse, or polygon. In the annotation JSONs, these appear either as the coordinates of each vertex of a polygon or as the parameters of an ellipse (center coordinates, radii, and angle of rotation). Each feature was also designated as a “button” or “label”. For retraining the model that reads text on labels, each label and its corresponding button carry a “pair” attribute containing the action of that pair. Special symbols were given descriptive text: the emergency stop sign was labeled “stop”, the alarm bell “alarm”, and the phone “call”. Most standard buttons, such as floor numbers (“4” for “Floor 4”) or letters (“DH” for “Door Hold”, “G” for “Ground”), were labeled with the text observed directly on the label (“4”, “DH”, or “G” in these examples).
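A rough example of how these annotation files can be read is shown below. It assumes the standard VIA export structure; the file path and the attribute keys holding the button/label designation and the “pair” action are illustrative guesses, so check the released JSONs for the exact names.

```python
import json

# Sketch of reading one annotation file, assuming the standard VIA export
# layout ("regions" with "shape_attributes" and "region_attributes"). The
# file name and attribute keys here are assumptions.
with open("Panel Data/ut_west_campus/annotations.json") as f:
    via = json.load(f)

for image_key, entry in via.items():
    for region in entry.get("regions", []):
        shape = region["shape_attributes"]
        attrs = region["region_attributes"]   # e.g. button/label type and "pair" text
        if shape["name"] == "polygon":
            points = list(zip(shape["all_points_x"], shape["all_points_y"]))
        elif shape["name"] == "ellipse":
            cx, cy, rx, ry = shape["cx"], shape["cy"], shape["rx"], shape["ry"]
            theta = shape.get("theta", 0.0)   # angle of rotation
        elif shape["name"] == "rect":
            x, y, w, h = shape["x"], shape["y"], shape["width"], shape["height"]
```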
Dataset Organization and Contents
The files are organized into two directories, “Analysis” and “Panel Data”. “Analysis” contains the PyTorch model weights: “segmentation_vit_model.pth”, “recovery_vit_model.pth”, and “parseq_str.ckpt”. “Panel Data” contains the subfolders “mixed” and “ut_west_campus”, which hold the annotations and dataset pictures for the Internet and Campus sub-datasets, respectively.
The Campus sub-dataset comprises 292 images taken of 25 different elevators across 24 buildings. These are titled with the format “{building name}_{photo number of that building}.jpg” – as an example, “ahg_2.jpg”, “ahg_3.jpg”, “utc_1.jpg”, “utc_2.jpg”, etc.
The Internet sub-dataset includes 108 images sourced from the Internet. These are titled with the format “mixed_{number}.jpg,” with the number going from 0 to 107.
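Given these conventions, the images of each sub-dataset can be gathered by filename alone, as in the short sketch below; the paths follow the directory layout described in this section.

```python
from pathlib import Path

# Sketch: collecting the two sub-datasets purely from the directory layout
# and naming conventions described above.
panel_dir = Path("Panel Data")
campus_images = sorted((panel_dir / "ut_west_campus").glob("*.jpg"))
internet_images = sorted((panel_dir / "mixed").glob("mixed_*.jpg"))

print(f"{len(campus_images)} campus images, {len(internet_images)} internet images")
```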
Network Weights/Result Summary
The complete elevator feature detection solution incorporates two segmentation models trained on the dataset, which are assessed on both accuracy and Average Precision at 50% Intersection over Union (AP50). The full pipeline's performance is then evaluated by its accuracy in recovering requested buttons. The test partition of this dataset was used to evaluate all of these metrics. Our paper demonstrates that the models trained with this dataset consistently outperform state-of-the-art models on all relevant metrics, while the complete pipeline recovers the segmentations of buttons corresponding to the desired floor at a higher rate than prior systems.
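For readers unfamiliar with AP50, the sketch below spells out the matching criterion that underlies it; it is an illustration only, not the released evaluation code.

```python
import numpy as np

# Minimal illustration of the 50% Intersection-over-Union criterion behind
# AP50: a predicted mask counts as a true positive if it overlaps a
# same-class ground-truth mask with IoU >= 0.5.

def mask_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, truth).sum() / union

def is_true_positive_at_50(pred: np.ndarray, truth: np.ndarray) -> bool:
    return mask_iou(pred, truth) >= 0.5
```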
Data Quality
This dataset was compiled by a team of researchers. The images were assembled as a group effort, and each member verified that all data is representative of an actual perspective a robot might capture in the field. No pre-processing or cleaning steps were applied to the JPEGs. Because the annotations for segmentation training data need to be exact, two members verified that they were as precise and accurate as possible, down to the pixel where the border of a button begins. To train the first-pass detector model, the team found that sufficient robustness was achieved after applying PyTorch transformations, such as rotations and resolution changes, to vary the data. When using this dataset to train the feature-recovery network, the team applied procedural masks to parts of the data during different training iterations to provide a variety of initially detected feature sets.
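As an example of the kind of augmentation described above, the sketch below uses standard torchvision transforms; the particular transforms and parameters are assumptions rather than the team's exact settings. For segmentation training, the same geometric transforms would also need to be applied to the image and its annotation mask together.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline in the spirit of the variations mentioned
# above (rotation and resolution changes). The specific transforms, angles,
# and sizes are assumptions, not the team's exact training settings.
augment = T.Compose([
    T.RandomRotation(degrees=10),
    T.RandomResizedCrop(size=512, scale=(0.6, 1.0)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])
```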
The Americans with Disabilities Act (ADA) guidelines state that elevators in the United States must follow a particular grid-like layout, with labels directly to the left of the buttons, though label and button shapes are allotted far more flexibility. Under these rules, this dataset represents the vast majority of elevator control panel styles and layouts across the U.S., as well as varying button and label shapes. That said, some elevators do not abide by these strictures; for example, on some panels the labels and buttons are one and the same. Although the dataset does not include such non-typical examples, some independent tests have shown that models trained on the existing data are still capable of detecting or recovering these abnormal features.
Code
The GitHub repository with the data loader, training, and inference code can be found here.
Data Reuse
The dataset was curated for training elevator button and label segmentation, action-reading, or association models, and this is where it is best reused. Researchers who would like to implement an elevator button and label recognition system for human-centered environments, such as office buildings or museums, will find this dataset useful.
Bulk Data Download
A script named download_data.py is provided for bulk data download.