Anchor Boxes in Object Detection: When, Where and How to Propose Them for Deep Learning Apps

June 15, 2020

Perhaps you’re well on your way to learning about computer vision and have studied image classification and sliding window detectors in depth. After grasping these concepts, the leap to understanding state-of-the-art (SOTA) object detection can become daunting, especially when it comes to anchor boxes. Diving into SOTA papers such as YOLO, SSD, R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, and RetinaNet to learn about anchor boxes is an uphill endeavor, and it becomes especially difficult when you have limited insight into what happens in the actual code.

What if I told you that you could harness the intuition behind anchor boxes for object detection using deep learning today?

This article sheds light on the obscure information about anchor boxes that can be found online. My goal is to help readers understand the following, as they continue on their object detection journeys:

  • What an anchor box is
  • How and where anchor boxes may be proposed over an image for object detection training
  • When anchor boxes can be proposed
  • How selected anchor boxes can be corrected during training to achieve a trained object detection model

So what is an anchor box?

The term anchor boxes refers to a predefined collection of boxes with widths and heights chosen to match the widths and heights of objects in a dataset. The proposed anchor boxes should cover the combinations of object sizes found in the dataset, including the varying aspect ratios and scales present in the data. It is typical to select between 4 and 10 anchor boxes to use as proposals over various locations in the image.

Within the realm of computer vision, deep learning neural networks have excelled at image classification and object detection. First there were sliding window detectors that localize single objects in a forward pass. Sliding window detectors have been replaced with single-shot and two-stage detectors, which are able to process entire images and output multiple detections. These object detectors rely heavily on the concept of anchor boxes to optimize the speed and efficiency of sliding window detection. This is because sliding window detectors require lots of forward passes to process an image, where many forward passes only process background pixels. See Figure 1 below for an illustration of a sliding window detector.

Figure 1: Sliding window detector

The typical task of training an object detection network consists of proposing anchor boxes (or searching for potential anchors with traditional computer vision techniques), pairing proposed anchors with possible ground truth boxes, assigning the rest to a background category, and training to correct the input proposals. It is important to note that the concept of anchor boxes can be applied to predict a fixed number of boxes.

How and where are anchor boxes proposed over an image?

In essence, proposing anchors is about determining a collection of appropriate boxes that could fit the majority of objects in your data, placing hypothetical, evenly spaced boxes over an image, and creating a rule to map the outputs of a convolutional feature map to each position in the image.

To understand how anchor boxes are proposed, consider an object detection dataset of 256px x 256px images containing small objects, where most objects measure either 40px x 40px or 80px x 40px. Additional data wrangling could reveal that ground truth boxes are mostly squares with a 1:1 ratio of width to height or rectangles with a 2:1 ratio of width to height. Given this, the anchor boxes for this example dataset should be proposed considering at least both aspect ratios (1:1 and 2:1).

The scale of an object refers to its length or width in pixels as a proportion of the total length or width in pixels of its containing image. For example, if the width of an image is 256px = 1 unit, then a 40px wide object occupies 40px / 256px = 0.15625 units of width, that is, about 15.6% of the total image width. To choose a set of scales that best represents the data, we could consider the most extreme object side measures, i.e. the smallest and largest values among all widths and heights of all objects in the dataset. If the smallest and largest scales in our example dataset are 0.15625 and 0.3125 and we were to pick, say, three scales for anchor box proposal, then three potential scales could be 0.15625, 0.234375, and 0.3125. If anchor boxes for this example dataset are proposed using the two aspect ratios mentioned above (1:1 and 2:1) and those three scales (0.15625, 0.234375, and 0.3125), we would have a total of six anchor boxes to propose over multiple positions in any input image.
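As a rough sketch of the arithmetic above, the six anchor shapes can be generated from the three scales and two aspect ratios. The function name `anchor_shapes` is hypothetical, and the way a scale and a ratio are turned into a width and height (keeping the box area equal to the squared scale) is one common convention assumed here, not something this article prescribes.

```python
import math

def anchor_shapes(min_scale, max_scale, num_scales, aspect_ratios):
    """Return the scales and the (width, height) pairs in relative image units.

    Assumes num_scales >= 2 and that each ratio is width / height.
    """
    # Evenly spaced scales between the smallest and largest object scale.
    step = (max_scale - min_scale) / (num_scales - 1)
    scales = [min_scale + i * step for i in range(num_scales)]

    shapes = []
    for scale in scales:
        for ratio in aspect_ratios:
            # Assumed convention: the box area stays equal to scale ** 2.
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            shapes.append((width, height))
    return scales, shapes

# 40px / 256px = 0.15625 and 80px / 256px = 0.3125
scales, shapes = anchor_shapes(0.15625, 0.3125, 3, [1.0, 2.0])
print(scales)       # [0.15625, 0.234375, 0.3125]
print(len(shapes))  # 6
```

Other conventions exist, for example fixing the height to the scale and multiplying the width by the ratio. What matters is that the same convention is used when anchors are proposed and when predictions are interpreted.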

To grasp where these positions are, take a look at Figure 2, which shows an evenly spaced 8×8 grid over an image. Anchor boxes are proposed over each cell center. At every position we propose one box for each combination of aspect ratio and scale, which gives six boxes per grid center and 8 × 8 × 6 = 384 boxes in total. Boxes of varying aspect ratios and scales are proposed at each position to cover all possibilities. This is how object detectors exhaustively propose anchors.

Figure 2: Image with 8×8 grid over it
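The tiling over the 8×8 grid can be sketched in a few lines. This is a minimal illustration, assuming boxes are stored as (center x, center y, width, height) in relative image units. The helper names `grid_centers` and `tile_anchors` are hypothetical, and the six shapes are built with an assumed area-preserving convention.

```python
import math

def grid_centers(grid_size):
    """Centers of an evenly spaced grid_size x grid_size grid, in relative units."""
    cell = 1.0 / grid_size
    return [((col + 0.5) * cell, (row + 0.5) * cell)
            for row in range(grid_size)
            for col in range(grid_size)]

def tile_anchors(grid_size, shapes):
    """Place every (width, height) shape at every cell center.

    Returns a list of (cx, cy, w, h) boxes ordered by row, column, shape.
    """
    return [(cx, cy, w, h)
            for cx, cy in grid_centers(grid_size)
            for w, h in shapes]

# Six shapes: three scales times two aspect ratios (width / height).
shapes = [(s * math.sqrt(r), s / math.sqrt(r))
          for s in (0.15625, 0.234375, 0.3125)
          for r in (1.0, 2.0)]

anchors = tile_anchors(8, shapes)
print(len(anchors))  # 384
print(anchors[0])    # (0.0625, 0.0625, 0.15625, 0.15625)
```

Because the result depends only on the grid size and the chosen shapes, the same list can be reused for every image.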

To obtain the convolutional neural net predictions made for every position in the grid from Figure 2, consider a 4-channel 8×8 feature map, where the four channels hold the x, y, width, and height values for one box at each position. For six boxes at each position, consider a 24-channel (4 × 6) 8×8 feature map. SOTA architectures that make use of anchor boxes usually contain feature maps whose dimensions are multiples of 8. This is possible because convolutional neural nets essentially down-sample inputs while preserving important spatial features through 2D convolutions and pooling operations, and fully convolutional architectures output dense feature maps, as shown in Figure 3.

Figure 3: Conv layers showing how down-sampling occurs and how each progressive feature map is smaller.
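To make the 24-channel 8×8 feature map more concrete, here is a small sketch of how its values could be regrouped into one 4-value vector per proposed box. The anchor-major channel layout (channels 0 to 3 for the first box, 4 to 7 for the second, and so on) is an assumption for illustration; real detectors choose their own layout, and the function names are hypothetical.

```python
GRID = 8          # 8x8 feature map
NUM_ANCHORS = 6   # boxes proposed per position
NUM_COORDS = 4    # x, y, width, height

def channel_index(anchor, coord):
    """Channel holding `coord` (0..3) of `anchor` (0..5), assuming anchor-major order."""
    return anchor * NUM_COORDS + coord

def split_feature_map(feature_map):
    """Regroup a [channel][row][col] nested list into per-box 4-vectors.

    The output is ordered by row, then column, then anchor.
    """
    boxes = []
    for row in range(GRID):
        for col in range(GRID):
            for anchor in range(NUM_ANCHORS):
                boxes.append([feature_map[channel_index(anchor, c)][row][col]
                              for c in range(NUM_COORDS)])
    return boxes

# Dummy feature map whose values record their own (channel, row, col) position.
feature_map = [[[(ch, r, c) for c in range(GRID)] for r in range(GRID)]
               for ch in range(NUM_ANCHORS * NUM_COORDS)]

boxes = split_feature_map(feature_map)
print(len(boxes))  # 384
print(boxes[7])    # [(4, 0, 1), (5, 0, 1), (6, 0, 1), (7, 0, 1)]
```

Entry 7 is the second box at row 0, column 1, so its four values come from channels 4 to 7 at that position.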

Now let’s talk about detecting objects that are smaller than a grid cell, that is, when your proposal grid is so coarse that single cells contain multiple small objects. This can be solved by proposing a finer grid and adjusting feature map output shapes accordingly. Better yet, you could use multiple grids and map them to different convolutional layers in the convolutional hierarchy, as is the case with SSD and the Feature Pyramid Networks used by RetinaNet’s predictor heads.

In the next section we will discuss how the notion of proposing anchors over an image at various positions is needed when producing ground truth batches or interpreting predictions at inference time.

When are anchor boxes proposed over an image?

As mentioned above, a set of anchor boxes that describes your data is proposed once. This can be done at any moment before predicted offsets are applied to each proposed anchor at each feature map position. Detectors don’t predict boxes directly; instead they predict a set of values for each proposed anchor box, mainly coordinate offsets and confidence scores for each category being learned. This means that the same anchors will always be proposed over every image, and predicted offsets from a forward pass will be used to correct those proposals. The net has no notion of matching a feature map coordinate to a position within the image, nor that its output corresponds to an anchor box, until the output is interpreted.

Theoretically, since every image will always be associated with the same set of fixed anchor proposals and ground truth doesn’t change during training, there is no actual need to propose anchors or match them with ground truth or background categories more than once. Of course, whether you exploit this depends on how much you want to optimize your code and how smart you can make your batch generator. Proposal and ground truth matching often both happen within a batch generator. Sometimes proposal generating layers are added to the actual net to add anchor data to the net’s output tensor, but the logic is supposed to be the same as for generating and tiling proposals over an image within a batch generator.

Knowing this makes it easier to grasp when the actual proposal of a box over the image takes place, or in other words, when the system actually needs to have this data structure initialized and available in memory: at batch generation time, to match proposals with ground truth, and at inference time, to apply predicted offsets to proposals. By these points the anchors must already have been proposed.

Why are offsets learned instead of actual values?

Theoretically, if a convolutional filter shines its receptive field on the same type of object twice, it should output roughly the same values twice, regardless of where in the image the filter is shining its receptive field. This means that if an image contains two cars and output feature maps contain absolute coordinates, then the net would have predicted roughly the same coordinates for both cars. Learning anchor offsets allows for feature map outputs with similar offset outputs for those two cars, but the offsets are applied to anchors which are mappable to different positions in the input image. This is the main reason behind learning anchor box offsets during bounding-box regression.
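The two-car argument can be illustrated with a tiny sketch: identical predicted offsets applied to two different anchors produce two different boxes. The offset parameterization used here (center shifts scaled by the anchor size, and log-scale width and height changes) is the one popularized by the R-CNN family and is an assumption, since this article does not commit to a specific formula. The function name `decode` is hypothetical.

```python
import math

def decode(anchor, offsets):
    """Apply (tx, ty, tw, th) offsets to a (cx, cy, w, h) anchor."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (ax + tx * aw,       # shift the center by a fraction of the anchor width
            ay + ty * ah,       # shift the center by a fraction of the anchor height
            aw * math.exp(tw),  # rescale the width
            ah * math.exp(th))  # rescale the height

# Two anchors of the same shape at different image positions.
anchor_left = (0.25, 0.5, 0.25, 0.125)
anchor_right = (0.75, 0.5, 0.25, 0.125)

# The net outputs roughly the same offsets for both cars.
same_offsets = (0.5, 0.0, 0.0, 0.0)

print(decode(anchor_left, same_offsets))   # (0.375, 0.5, 0.25, 0.125)
print(decode(anchor_right, same_offsets))  # (0.875, 0.5, 0.25, 0.125)
```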

Ground truth – matching anchors and generating batches

Ground truth batches must contain target offsets to learn and should contain the proposed anchors. The anchors are not used during training, but including them avoids having to associate anchors with offset predictions at inference time through an additional data structure and accompanying code. Target offsets should be the precise amount needed to move a proposal exactly over its matched ground truth box, or zeros if the proposal is a background box, since a background box does not need correcting.
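As a hedged sketch of what "the precise amount needed to move a proposal exactly over a matched ground truth box" can look like, the snippet below computes target offsets for one matched anchor. It assumes boxes in (center x, center y, width, height) form and the R-CNN style parameterization, which is a common choice rather than something this article specifies. The names `encode` and `BACKGROUND_TARGET` are hypothetical.

```python
import math

# Background boxes keep zero offsets because they do not need correcting.
BACKGROUND_TARGET = (0.0, 0.0, 0.0, 0.0)

def encode(anchor, gt_box):
    """Target offsets (tx, ty, tw, th) that move `anchor` exactly onto `gt_box`."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt_box
    return ((gx - ax) / aw,      # horizontal shift, as a fraction of anchor width
            (gy - ay) / ah,      # vertical shift, as a fraction of anchor height
            math.log(gw / aw),   # log ratio of widths
            math.log(gh / ah))   # log ratio of heights

anchor = (0.5, 0.5, 0.25, 0.25)
gt_box = (0.5625, 0.5, 0.25, 0.125)

target = encode(anchor, gt_box)
print(target)  # shift right by a quarter of the anchor width and halve the height
```

An anchor that already sits exactly on its ground truth box gets all-zero offsets, the same values used for background boxes.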

To recap, anchor-based batch generators construct a learning target where every proposed anchor for an image will be accounted for during training, regardless of whether it has been assigned to the foreground or background category. Following our example, a batch would start out with our 6 anchors in 64 positions totaling 384 anchor boxes. Each proposed anchor is possibly matched to a ground truth box with the following or a variant of these basic steps:

  • For each anchor find which ground truth box has the highest intersection over union (IOU) score
  • Anchors with an IOU greater than 50% are matched to the corresponding ground truth box
  • Anchors with an IOU between 40% and 50% are considered ambiguous and ignored
  • Anchors with an IOU less than 40% are assigned to the background category
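The matching steps above can be sketched as follows. This is a minimal illustration, assuming boxes in corner form (x_min, y_min, x_max, y_max) and treating IOU values from 40% to 50% as ambiguous. The label constants and function names are hypothetical.

```python
BACKGROUND = -1  # label for anchors assigned to the background category
IGNORED = -2     # label for ambiguous anchors

def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_boxes, fg_thresh=0.5, bg_thresh=0.4):
    """Return one label per anchor: a ground truth index, BACKGROUND, or IGNORED."""
    labels = []
    for anchor in anchors:
        scores = [iou(anchor, gt) for gt in gt_boxes]
        best_iou = max(scores, default=0.0)
        if best_iou > fg_thresh:
            labels.append(scores.index(best_iou))  # best-overlapping ground truth box
        elif best_iou < bg_thresh:
            labels.append(BACKGROUND)
        else:
            labels.append(IGNORED)
    return labels

gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
anchors = [(0.0, 0.0, 10.0, 8.0),     # IOU 0.80 -> matched to ground truth 0
           (0.0, 0.0, 10.0, 4.5),     # IOU 0.45 -> ambiguous, ignored
           (20.0, 20.0, 30.0, 30.0)]  # IOU 0.00 -> background
print(match_anchors(anchors, gt_boxes))  # [0, -2, -1]
```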

Let’s explore how the batch ends up looking. Starting with the collection of all proposed anchors (384 in our example) a box matched to a ground truth box will contain its category and updated offsets to correct or move that anchor. Offsets to background and ambiguous/ignored boxes remain at their initial zero offset values. Again these offsets are the truth, the values we want to approximate with our neural net. These are the actual values being learned during our bounding box regression task. Deciding which background offsets will be considered for weight optimization and discarding unmatched boxes commonly occurs within the loss function.

Anchor boxes and calculating detector losses – how are anchor boxes corrected during training

Loss calculations don’t apply offsets to boxes. At that point the batch generator has already encoded the offsets needed to “move” anchors exactly where the ground truth is and, as stated above, this is the location learning target for each proposed box, whether or not it was matched to a ground truth box. Ambiguous anchor boxes, which were neither matched to a ground truth box nor assigned to the background, should not contribute to the loss and are commonly ignored.

Recall that the net predicts an offset for all proposals at each feature map position. This means that ground truth data contains real offsets for anchors that were matched to a ground truth box, while ground truth offsets for background boxes are kept at zero. Once more, this is because once the pixel space within a proposed anchor is wholly considered background, the proposed anchor does not need coordinate adjustments. In addition, these zero values will be ignored because background anchor offsets do not contribute to the regression loss. This is due to the fact that object detection is about learning to find foreground objects, and the bounding-box regression loss (between predicted offsets and correct offsets) is usually minimized for foreground objects only. In other words, since anchors assigned to the background category are not supposed to be moved or corrected at all, there is no offset to be predicted and no significant values that could represent background boxes in the bounding-box regression loss.
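A small sketch of a regression loss that only counts foreground anchors may help here. Smooth L1 is used as the per-coordinate penalty because it is a common choice in detectors; that choice, the normalization by the number of foreground anchors, and the function names are all assumptions for illustration.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth L1 penalty for a single coordinate difference."""
    diff = abs(diff)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta

def regression_loss(predicted, targets, is_foreground):
    """Average smooth L1 loss over foreground anchors only."""
    total, num_foreground = 0.0, 0
    for pred, target, foreground in zip(predicted, targets, is_foreground):
        if not foreground:
            continue  # background and ignored anchors add nothing to this loss
        num_foreground += 1
        total += sum(smooth_l1(p - t) for p, t in zip(pred, target))
    return total / max(num_foreground, 1)

predicted = [(0.5, 0.0, 0.0, 0.0),   # foreground anchor, slightly off target
             (3.0, 3.0, 3.0, 3.0)]   # background anchor, prediction is irrelevant
targets = [(0.25, 0.0, 0.0, 0.0),
           (0.0, 0.0, 0.0, 0.0)]
is_foreground = [True, False]

print(regression_loss(predicted, targets, is_foreground))  # 0.03125
```

The large predicted offsets for the background anchor have no effect on the result, which is the behavior described above.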

A classification loss is commonly minimized using a subset of the total background boxes present in the ground truth to handle class imbalance. Remember there were six boxes per position for a total of 384 proposals in our example? Well, most of those would be background boxes, and that creates a significant class imbalance. A popular solution to this class imbalance problem is known as hard negative mining: choosing which background boxes will contribute to the loss according to a predetermined ratio (usually 1:3, foreground to background). Another popular approach for handling class imbalance within the classification loss involves down-weighting the loss contributions of easily classifiable examples. Such is the case for RetinaNet’s Focal Loss.
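Hard negative mining with a 1:3 foreground to background ratio can be sketched as below. This version keeps the background boxes with the highest classification loss, which is how the technique is commonly implemented (for example in SSD). The function name and the per-box loss values are made up for illustration.

```python
def hard_negative_mining(losses, is_foreground, neg_pos_ratio=3):
    """Return the indices of boxes that contribute to the classification loss.

    All foreground boxes are kept, plus the hardest background boxes,
    up to neg_pos_ratio background boxes per foreground box.
    """
    positives = [i for i, fg in enumerate(is_foreground) if fg]
    negatives = [i for i, fg in enumerate(is_foreground) if not fg]
    # The hardest negatives are the background boxes with the highest loss.
    negatives.sort(key=lambda i: losses[i], reverse=True)
    num_negatives = neg_pos_ratio * len(positives)
    return positives + negatives[:num_negatives]

# Hypothetical per-box classification losses for one foreground and six background boxes.
losses = [0.9, 0.1, 2.0, 0.05, 1.5, 0.3, 0.7]
is_foreground = [True, False, False, False, False, False, False]

print(hard_negative_mining(losses, is_foreground))  # [0, 2, 4, 6]
```

One simplification to note: with no foreground boxes in an image, this sketch keeps no background boxes either, and real implementations often handle that case separately.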

To obtain a final set of object detections, the net’s predicted offsets are applied to their corresponding anchor boxes. There could be hundreds or thousands of proposed boxes, but in the end, current SOTA detectors ignore all boxes predicted as background, keep foreground detections that pass certain criteria, and apply non-maximum suppression to remove overlapping predictions of the same object.
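The last filtering step can be sketched as a greedy procedure: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. This is a minimal version assuming corner-form boxes (x_min, y_min, x_max, y_max) and an IOU threshold of 0.5, both of which are illustrative choices. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Greedy suppression: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0.0, 0.0, 10.0, 10.0),    # best detection of the first object
         (1.0, 1.0, 11.0, 11.0),    # overlapping detection of the same object
         (20.0, 20.0, 30.0, 30.0)]  # a different object
scores = [0.9, 0.8, 0.7]

print(non_max_suppression(boxes, scores))  # [0, 2]
```

In practice this is usually run per category, after background predictions and low-confidence detections have been discarded.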

As mentioned at the beginning of this article, the leap to understanding state-of-the-art (SOTA) object detection can often become daunting and obscure, but once you understand the role of anchor boxes, object detection takes on a whole new meaning.

Do you have questions about how Wovenware’s expertise in advanced object detection technologies can be applied to your project? Feel free to reach out to us at 877-249-0090 or info @ wovenware.com
