Comparing object detectors fairly can be difficult. After all, no single object detector fits every dataset. Understanding the data gives us a sense of direction about which architectures to use. Recently, I set out to test which models or strategies could improve the detection of small-scale objects. I ran some experiments with an implementation of RetinaNet, the model from the paper Focal Loss for Dense Object Detection. After experimenting with it, I can see why RetinaNet is a popular one-stage object detection model that works well with small-scale objects. In this post, I describe my experience training RetinaNet to detect small cars in satellite imagery.
RetinaNet introduced two improvements over existing single-stage object detection models: the use of Feature Pyramid Networks (FPN) and Focal Loss. For a more in-depth explanation, I suggest reading the blog post The intuition behind RetinaNet. Below we can see a figure from the paper Focal Loss for Dense Object Detection explaining the network architecture.
For the experiments, I used the Keras implementation of RetinaNet that can be found in Fizyr's GitHub repository. A great advantage of this approach is that it can be downloaded and installed quickly by following the instructions in the repository. In my case, I created a container with this repository beforehand. I had access to the following commands on the terminal: retinanet-debug, retinanet-train, retinanet-convert-model, and retinanet-evaluate.
Later in the blog, I will explain how I used each of these commands. After the environment was set up, I needed to gather the following to train RetinaNet:
- The pre-processed dataset
- A backbone model from the ResNet family.
I used a small-car dataset whose images were a subset of the satellite images in the xView dataset. The training set had 104,069 annotations in a CSV file, with each annotation containing the image path, the bounding box coordinates, and the class name. A second CSV file contained the mapping from class names to IDs, with each line following the format class_name,id.
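Before training, it can help to sanity-check both CSV files. The sketch below assumes the layout this implementation documents, one `path,x1,y1,x2,y2,class_name` row per annotation and one `class_name,id` row per class, with an all-empty row after the path marking an image without objects. The class name `small_car`, the image paths, and the helper names are hypothetical.

```python
import csv


def load_classes(lines):
    """Parse `class_name,id` rows into a {name: id} dict."""
    classes = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        name, class_id = row
        classes[name] = int(class_id)
    return classes


def check_annotations(lines, classes):
    """Return a list of (line_number, problem) for rows that look wrong."""
    problems = []
    for number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 6:
            problems.append((number, "expected 6 columns"))
            continue
        path, x1, y1, x2, y2, name = row
        # A row with only the path filled in marks an image with no objects.
        if (x1, y1, x2, y2, name) == ("", "", "", "", ""):
            continue
        try:
            x1, y1, x2, y2 = (float(v) for v in (x1, y1, x2, y2))
        except ValueError:
            problems.append((number, "non-numeric coordinates"))
            continue
        if x2 <= x1 or y2 <= y1:
            problems.append((number, "x2 <= x1 or y2 <= y1"))
        if name not in classes:
            problems.append((number, f"unknown class '{name}'"))
    return problems


# Hypothetical sample rows; real files would be opened with open(path, newline="").
classes = load_classes(["small_car,0"])
rows = [
    "images/a.jpg,10,12,30,28,small_car",
    "images/b.jpg,30,12,10,28,small_car",  # x2 is left of x1
    "images/c.jpg,,,,,",                   # image without annotations
]
print(check_annotations(rows, classes))
```

Running a check like this first is cheaper than discovering a malformed row partway through a long training run.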
And lastly, in the repository, there were available ResNet model weights, from which I used the default ResNet50.
Before starting the training, I checked whether the annotations were correctly formatted for the model. I did this with the retinanet-debug command as follows:
retinanet-debug csv data-volume/train/train_cars.csv data-volume/train/classes.csv
This command displays the annotations: green means that anchors are available for an annotation, and red means that no anchors are available. Annotations without anchors do not contribute to training. The default anchor parameters work well for most objects, but since I was working with aerial images, some objects were smaller. I noticed red annotations, which indicated that my dataset had objects smaller than the default anchor configuration covers and that I needed to change it. To choose the anchor box parameters I used the Anchor Optimization repository. It calculated anchor parameters suited to the dataset and recommended the following ratios and scales:
Final best anchor configuration
Ratios: [0.544, 1.0, 1.837]
Scales: [0.4, 0.519, 0.636]
Number of labels that don’t have any matching anchor: 190
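To see why small objects end up without anchors, here is a rough best-case sketch. It assumes an annotation counts as matched when its IoU with some anchor reaches 0.5, that anchors have area (size × scale)² with ratio = height / width, and that an anchor happens to be centered exactly on the object. The 12×12 pixel car is a hypothetical example, and the "default" ratios and scales are what I understand the implementation's defaults to be.

```python
import math


def centered_anchors(cx, cy, base_size, ratios, scales):
    """Build anchors centered on (cx, cy) as (x1, y1, x2, y2) tuples.

    Each anchor has area (base_size * scale) ** 2 and ratio = height / width.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            width = math.sqrt(area / ratio)
            height = width * ratio
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def best_overlap(box, anchors):
    """Highest IoU between the box and any of the anchors."""
    return max(iou(box, anchor) for anchor in anchors)


POSITIVE_OVERLAP = 0.5  # assumed matching threshold
car = (94.0, 94.0, 106.0, 106.0)  # hypothetical 12x12 px car centered at (100, 100)

# Smallest pyramid level (size 32) with the assumed default and the tuned parameters.
default_anchors = centered_anchors(100, 100, 32, [0.5, 1.0, 2.0],
                                   [1.0, 2 ** (1 / 3), 2 ** (2 / 3)])
tuned_anchors = centered_anchors(100, 100, 32, [0.544, 1.0, 1.837],
                                 [0.4, 0.519, 0.636])

print(best_overlap(car, default_anchors) >= POSITIVE_OVERLAP)  # False
print(best_overlap(car, tuned_anchors) >= POSITIVE_OVERLAP)    # True
```

In this best case the smallest default anchor covers about seven times the area of the car, so the overlap stays near 0.14, while the smallest tuned anchor (12.8 pixels per side) nearly coincides with it.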
I looked up the default anchor configuration and saw that it has other parameters besides ratios and scales, namely sizes and strides, but changing them is not advised in this RetinaNet implementation. Sizes should correlate with how the network processes an image, and strides with how the network strides over an image. I only needed to change the ratios and scales used at each anchor location, so I saved the new configuration in a config.ini file like this:
[anchor_parameters]
sizes = 32 64 128 256 512
strides = 8 16 32 64 128
ratios = 0.544 1.0 1.837
scales = 0.4 0.519 0.636
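As a quick check of what this configuration implies, the sketch below parses the file with Python's standard configparser and lists the anchor side lengths (size × scale) per pyramid level. It assumes the values live in an `[anchor_parameters]` section, which is the section name this implementation reads; the helper names are mine.

```python
import configparser

CONFIG_TEXT = """
[anchor_parameters]
sizes = 32 64 128 256 512
strides = 8 16 32 64 128
ratios = 0.544 1.0 1.837
scales = 0.4 0.519 0.636
"""


def read_anchor_parameters(text):
    """Parse the space-separated lists of the anchor section into floats."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["anchor_parameters"]
    keys = ("sizes", "strides", "ratios", "scales")
    return {key: [float(value) for value in section[key].split()] for key in keys}


def anchor_sides(params):
    """Side length of the square (ratio 1) anchor per level: size * scale."""
    return {
        int(size): [round(size * scale, 1) for scale in params["scales"]]
        for size in params["sizes"]
    }


params = read_anchor_parameters(CONFIG_TEXT)
for size, sides in anchor_sides(params).items():
    print(size, sides)
```

With these scales the smallest level produces anchors of roughly 13 to 20 pixels per side, which is much closer to the size of a car in satellite imagery than a 32-pixel base anchor.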
After that, I was ready to start training, but I first needed to understand the retinanet-train command parameters. The parameters I decided to use were:
- snapshot-path – path to store snapshots of models during training
- tensorboard-dir – log directory for Tensorboard output
- config – path of the .ini file with the configuration parameters
- csv – the dataset type, followed by the training CSV path and the classes CSV path
Other parameters could be changed, including epochs, weights, and backbone, but I kept the defaults for those. Since I was working on a server with multiple GPUs, I first had to specify which GPU to use with CUDA_VISIBLE_DEVICES, and I redirected the output to a file:
CUDA_VISIBLE_DEVICES=2 retinanet-train --snapshot-path data-volume/outputs/snapshots-cars/ --tensorboard-dir data-volume/outputs/tensorboard/dir --config data-volume/train/config.ini csv data-volume/train/train_cars.csv data-volume/train/classes.csv &> data-volume/outputs/output_retinanet.txt
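If you prefer to launch the same run from a script, a minimal sketch with Python's subprocess module could look like this. It only reassembles the command shown above; the helper names are hypothetical, and `run_training` assumes retinanet-train is on the PATH.

```python
import os
import subprocess


def build_train_command(snapshot_path, tensorboard_dir, config, annotations, classes):
    """Assemble the retinanet-train argument list used in this post."""
    return [
        "retinanet-train",
        "--snapshot-path", snapshot_path,
        "--tensorboard-dir", tensorboard_dir,
        "--config", config,
        "csv", annotations, classes,
    ]


def gpu_environment(gpu_index, base_env=None):
    """Copy the environment and restrict CUDA to a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_training(command, gpu_index, log_path):
    """Run the command, sending stdout and stderr to one log file (like &>)."""
    with open(log_path, "w") as log:
        return subprocess.run(command, env=gpu_environment(gpu_index),
                              stdout=log, stderr=subprocess.STDOUT).returncode


command = build_train_command(
    "data-volume/outputs/snapshots-cars/",
    "data-volume/outputs/tensorboard/dir",
    "data-volume/train/config.ini",
    "data-volume/train/train_cars.csv",
    "data-volume/train/classes.csv",
)
print(" ".join(command))
```

Calling `run_training(command, 2, "data-volume/outputs/output_retinanet.txt")` would then mirror the shell line above.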
After launching the training task I monitored its progress. Once I saw that it was training successfully, I confirmed that it had the correct configuration. The model was slow to train, running overnight for almost 18 hours. After training completed, I converted the model for evaluation using the retinanet-convert-model command. To convert the model I needed:
- Path of snapshot to convert
- Path to save model
- Path of configuration file
After I gathered those paths, I ran the following command:
retinanet-convert-model data-volume/outputs/snapshots-april24/resnet50_csv_50.h5 --config data-volume/train/config.ini data-volume/outputs/models/model_cars.h5
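When a snapshot directory holds one file per epoch, picking the one to convert can be scripted. The sketch below assumes the snapshots are named like the file above (`resnet50_csv_50.h5`, that is, backbone, `csv`, and epoch number); the function names and the sample file list are hypothetical.

```python
import re

SNAPSHOT_PATTERN = re.compile(r"^(?P<backbone>.+)_csv_(?P<epoch>\d+)\.h5$")


def latest_snapshot(filenames):
    """Return the snapshot file name with the highest epoch, or None."""
    best_name, best_epoch = None, -1
    for name in filenames:
        match = SNAPSHOT_PATTERN.match(name)
        if match and int(match.group("epoch")) > best_epoch:
            best_name, best_epoch = name, int(match.group("epoch"))
    return best_name


def build_convert_command(snapshot, config, output):
    """Assemble the retinanet-convert-model call, always passing the config."""
    return ["retinanet-convert-model", snapshot, "--config", config, output]


# Hypothetical directory listing; in practice this would come from os.listdir().
files = ["resnet50_csv_09.h5", "resnet50_csv_50.h5", "resnet50_csv_10.h5", "notes.txt"]
snapshot = latest_snapshot(files)
print(build_convert_command(
    "data-volume/outputs/snapshots-april24/" + snapshot,
    "data-volume/train/config.ini",
    "data-volume/outputs/models/model_cars.h5",
))
```

Comparing epochs as integers avoids mistakes that plain string sorting would make if the epoch numbers were not zero-padded.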
While figuring out how to convert the model I ran into errors along the way. I also had to convert the model more than once because I had not included the config file, so remember to pass the config.ini file. Once the model was saved, it was ready to be evaluated. For this I needed:
- The path to save the inference results
- The path of the configuration file
- The path of the CSV files with the annotations and classes of the dataset to evaluate
- The path of the model to evaluate
After gathering those paths I first evaluated the model on the training set using the retinanet-evaluate command as follows:
CUDA_VISIBLE_DEVICES=2 retinanet-evaluate --save-path data-volume/outputs/inference_results-optimization/ --config data-volume/train/config.ini csv data-volume/train/train_cars.csv data-volume/train/classes.csv data-volume/outputs/models/model_cars.h5
Evaluating on the training set resulted in a mean average precision (mAP) of 0.8015. I then ran a similar command with the CSV path of the test set, which resulted in an mAP of 0.5090. This indicated that the model generalized reasonably to unseen data, although the gap to the training score suggests some overfitting.
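For readers unfamiliar with the metric, here is a minimal sketch of how an average precision value can be computed for one class. It assumes the common VOC-style definition (area under the precision-recall curve after making precision monotonically decreasing); the evaluation code in the repository may differ in details, and the detections below are made up. With a single class, such as the cars here, mAP equals that class's AP.

```python
def precision_recall(is_true_positive, num_ground_truth):
    """Cumulative precision and recall for detections sorted by descending score."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    true_pos = false_pos = 0
    precision, recall = [], []
    for hit in is_true_positive:
        if hit:
            true_pos += 1
        else:
            false_pos += 1
        precision.append(true_pos / (true_pos + false_pos))
        recall.append(true_pos / num_ground_truth)
    return precision, recall


def average_precision(recall, precision):
    """Area under the precision-recall curve with a monotone precision envelope."""
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall increases.
    area = 0.0
    for i in range(1, len(mrec)):
        area += (mrec[i] - mrec[i - 1]) * mpre[i]
    return area


# Made-up example: 4 detections sorted by score, 4 ground-truth cars.
hits = [True, True, False, True]
precision, recall = precision_recall(hits, num_ground_truth=4)
print(average_precision(recall, precision))  # 0.6875
```

One missed car and one false positive are enough to pull the score from 1.0 down to about 0.69, which helps put values such as 0.80 and 0.51 in context.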
In summary, I learned that knowing the data helps when training RetinaNet, that the initialization of the network also plays an important role, and that having the correct anchor parameters improves performance significantly. For comparison, I also trained RetinaNet on this dataset with the default anchor parameters and got an mAP of approximately 0.2843. I recommend this RetinaNet implementation because it is simple to use and I obtained good results without much customization.