
Bringing it On During DIUx xView 2018 Detection Challenge

Deep Learning Object Detection Models in Satellite Images

Wovenware’s Data Science team has been working with state-of-the-art Convolutional Neural Networks (CNNs) for some time now as part of our deep learning and machine learning AI projects. To share this experience with others in the industry, we recently participated in the DIUx xView 2018 Detection Challenge. The contest asked participants to push the boundaries of computer vision and develop new solutions for national security and disaster response. It revolved around xView, one of the largest publicly available datasets of overhead imagery.

The xView dataset contains images from different regions of the world, annotated with bounding boxes. It was chipped from imagery collected by DigitalGlobe’s WorldView-3 satellite. Comprising 846 high-resolution satellite images in TIF format and more than 600K bounding box annotations across 60 categories, xView is considered a dense, imbalanced dataset because each category can have anywhere from dozens to hundreds of thousands of annotations.

As part of the challenge, which took place from April to August 2018, we at Wovenware trained object detection models, packaged them inside Docker containers, and submitted the containers to a private validation cluster provided by DIUx. We did this iteratively for the duration of the challenge.

While this challenge lived up to its name, we learned some very important lessons from our efforts. Below are a few of those lessons, along with the strategy we followed through the competition and the challenges we overcame along the way.

Tools and Techniques for Dataset Exploration, Cleanup, and Preparation Need to be Methodically Planned

Our first decision was what strategy to use for dataset exploration, cleanup and preparation. We decided to perform a visual inspection of the data using QGIS, geopandas and Jupyter Notebook. Next, we chipped out a custom dataset with chip sizes ranging from 300×300 to 1024×1024 pixels using the chipper code published by DIUx as part of its data utilities repository. Once the dataset was created we were eager to immediately start training our baseline models, yet we noticed some anomalies in the dataset.
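As a rough illustration of what that chipping step does, here is a minimal sketch of grid-chipping a TIF into fixed-size tiles. The use of PIL and the function name are our own simplifications for this post, not the actual DIUx chipper code.

```python
# Minimal sketch of grid-chipping a large satellite TIF into fixed-size tiles.
# PIL and the function name are illustrative simplifications, not the DIUx chipper API.
import numpy as np
from PIL import Image

def chip_image(tif_path, chip_size=512):
    """Split an image into non-overlapping chip_size x chip_size tiles."""
    image = np.array(Image.open(tif_path))
    height, width = image.shape[:2]
    chips = []
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            chips.append(image[row:row + chip_size, col:col + chip_size])
    return chips  # any leftover "tail" narrower than chip_size is silently dropped
```

Note that, like the vanilla chipper, this sketch silently drops any leftover tail that is smaller than a full chip, which is exactly the behavior we had to work around later.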

First, we encountered problems related to the integrity and availability of the dataset. Even though the dataset passed our local checksum tests, we were missing data, and a clean download did not become available for more than a week. This forced us to stop training our first round of models until we could get our hands on the complete dataset.

Before creating new datasets, we also decided to enhance the chipper code provided by DIUx to include the tail of each TIF, which otherwise was ignored, resulting in chipped datasets that were missing features present in the original TIFs. Figure 1 below shows an example of a set of 512×512-pixel chips produced by the vanilla DIUx chipper. The image shows 81 cells in a 9×9 grid, where each cell would become a chip, plus a tail that is not large enough to accommodate another row or column of chips. For dataset creation we added a sliding window over the tail that overlapped with chips from the last row and column. For inference, we padded TIFs so their dimensions were a multiple of the chip size. We took this latter approach because we were short on time to implement non-maximum suppression on overlapping chips at inference time.

Figure 1: Visualization of DIUx chipper output missing tail features.
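To make the two workarounds concrete, the sketch below shows, under our own simplified assumptions (a NumPy H×W×C image array, illustrative function names), how padding to a multiple of the chip size and an overlapping tail window can be computed. It is not the exact code we shipped.

```python
# Minimal sketch of the two tail-handling strategies described above,
# assuming a NumPy H x W x C image array; function names are illustrative.
import numpy as np

def pad_to_multiple(image, chip_size=512):
    """Pad the bottom/right edges so height and width are multiples of chip_size
    (the approach we used at inference time)."""
    height, width = image.shape[:2]
    pad_h = (-height) % chip_size
    pad_w = (-width) % chip_size
    return np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")

def tail_window_origins(length, chip_size=512):
    """Chip origins along one axis, with a final window slid back so it overlaps
    the last full chip and still covers the tail (the approach used for training)."""
    origins = list(range(0, length - chip_size + 1, chip_size))
    if length % chip_size:                  # there is a leftover tail
        origins.append(length - chip_size)  # overlapping window that covers it
    return origins
```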

We then decided to perform in-depth manual and code-based exploration of the dataset to find other anomalies before resuming model training. We did this with custom scripts that removed corrupted annotations, such as those with negative coordinates or coordinates that exceeded the TIF dimensions, which would place the object outside the image. From this work we found many faulty bounding boxes, as did other xView participants. Unfortunately, mis-categorized training samples were impossible to detect without a full QA pass over all 600K+ annotations. After cleaning up the dataset we were finally ready to resume training.
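For illustration, a minimal sketch of that kind of filtering is shown below. The geopandas usage and the field names (bounds_imcoords as a comma-separated string, image_id) reflect our reading of the public xView GeoJSON annotations and may not match our internal scripts exactly.

```python
# Minimal sketch of dropping corrupted bounding boxes from the xView annotations.
# Field names/formats are assumptions based on the public GeoJSON, not our exact scripts.
import geopandas as gpd

def drop_corrupted_boxes(geojson_path, tif_sizes):
    """Remove boxes with negative coordinates or coordinates beyond the TIF extent.

    tif_sizes maps an image id to its (width, height) in pixels.
    """
    annotations = gpd.read_file(geojson_path)

    def is_valid(row):
        width, height = tif_sizes[row["image_id"]]
        xmin, ymin, xmax, ymax = map(float, str(row["bounds_imcoords"]).split(","))
        return 0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height

    return annotations[annotations.apply(is_valid, axis=1)]
```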

Our eagerness to begin CNN training got the best of us: we relied on checksums as the only way to verify data integrity and began training before performing the full exploration, which proved to be a mistake. We also learned that it is hard to systematically detect all anomalous annotations in such a large dataset, and we had to live with the fact that we could not detect nor remove incorrectly categorized bounding boxes from our training data.

The first lesson learned aligns with the old saying that the early bird catches the worm, although in our case the worm is not much of a reward. Still, early data exploration will most certainly reward you with early bug discoveries.

Working with a Dense, Imbalanced Dataset Requires Time Management and Planning

As I mentioned before, xView is a dense, imbalanced dataset, and our proposed solution heavily considered those factors. At first, we considered three approaches: model ensembles; multi-headed object detectors covering multiple subsets of the original 60 categories; and Focal Loss, which Facebook AI Research (FAIR) has used to improve state-of-the-art results of single-shot detectors on dense, imbalanced datasets. In the end we only tried model ensembles and Focal Loss with RetinaNet.
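For reference, the focal loss from the FAIR paper is FL(p_t) = -α_t (1 - p_t)^γ log(p_t), which down-weights easy, well-classified examples. Below is a minimal NumPy sketch of the binary form, not the RetinaNet training code we actually ran.

```python
# Minimal NumPy sketch of the binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
# This is an illustration of the formula, not our training code.
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-anchor binary focal loss.

    probs   -- predicted foreground probabilities in [0, 1]
    targets -- 1 for foreground anchors, 0 for background
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    p_t = np.where(targets == 1, probs, 1.0 - probs)      # probability of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)  # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)  # small loss for easy examples
```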

Our model ensembles were based on various lists of object classes grouped by object type, object size, physical appearance, or per-class bounding box count. As you can imagine, we ended up with a long list of potential experiments, numbering in the hundreds, that only kept growing to ensure we always had more possible solutions to try. Experiment priorities were assigned based on estimated training times and the incremental nature of most of our experiments. The number of models submitted as part of any given ensemble was limited by their total execution time, because DIUx’s execution time constraints for containerized submissions required that 282 TIFs be processed on a single CPU with 8 GB of RAM in under 72 hours.
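To put that constraint in perspective, a quick back-of-the-envelope calculation (our own, using only the numbers above) shows how little time each ensemble member gets per TIF, which is why ensemble size was capped by total execution time.

```python
# Back-of-the-envelope time budget implied by the submission constraint:
# 282 TIFs on one CPU in under 72 hours, split across an ensemble of N models.
TIF_COUNT = 282
WALL_CLOCK_HOURS = 72

minutes_per_tif = WALL_CLOCK_HOURS * 60 / TIF_COUNT  # roughly 15.3 minutes per TIF
for n_models in (1, 2, 3, 4):
    per_model = minutes_per_tif / n_models
    print(f"{n_models} model(s): ~{per_model:.1f} min per TIF per model")
```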

We learned that planning a wide range of experiments ahead of time helped us prioritize our efforts in accordance with checkpoint deadlines and execution time constraints. Time management and planning were imperative because, as Benjamin Franklin once said: “You may delay, but time will not.”

This is the first in a two-part series that explores the valuable lessons we learned as we competed in the DIUx xView Detection Challenge. Leslie collaborated in the writing of these posts. Stay tuned for the next post, which explores how we achieved semi-automated model training, validation and results documentation.
