• Original Paper

  • Date: May 26 2020

  • Authors: Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko

  • Code

Abstract

A new approach that performs end-to-end object detection as a direct set prediction problem. This streamlines the detection pipeline and removes the need for several hand-designed pre- and post-processing steps used in other detection methods, such as NMS and anchor/proposal generation.

The novel thing about this new approach is that it uses a set-based global loss and a transformer encoder-decoder architecture. The authors state that “Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.”

The authors reframe the problem as an image-to-set problem, which is well suited to Transformers. They combine a CNN that extracts local image features with a Transformer encoder-decoder architecture to get a full deep learning detector.

Introduction

A new take on object detection that shifts away from the traditional approach of creating surrogate regression and classification problems run on some sort of proposal regions (or anchors). The authors adopt an encoder-decoder architecture based on transformers. We have seen transformers advance translation and speech recognition, but not yet object detection.

Inspiration for the use of transformers seems to come from the fact that they are good at explicitly modeling all pairwise interactions between elements in a sequence (think of words being scanned for attention in a translation model). This could make these kinds of architectures well suited to set prediction, where duplicates need to be removed.

Ease of use: this method can be implemented in any framework that provides standard CNN and transformer layers. It can also be easily extended to panoptic segmentation tasks with minimal changes.

Accuracy: performs better on large objects than the Faster R-CNN baseline, but worse on small objects.

Set prediction

As promised in the Introduction, one thing this method lets the detector skip is NMS. The authors propose a loss based on bipartite matching, computed with the Hungarian algorithm (also used in people tracking), to uniquely match ground-truth objects to predictions.
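
As a rough illustration (not the authors' exact cost function), the bipartite matching step can be computed with `scipy.optimize.linear_sum_assignment` on a pairwise cost matrix. DETR's real matching cost mixes class probabilities and box terms; the sketch below uses only a simple L1 box cost to show the idea:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy example: 3 ground-truth boxes and 5 predicted boxes, format [x1, y1, x2, y2].
gt_boxes = np.array([[10, 10, 50, 50], [60, 60, 90, 90], [20, 70, 40, 95]], dtype=float)
pred_boxes = np.array([[12, 11, 48, 52], [58, 61, 92, 88], [0, 0, 5, 5],
                       [21, 69, 41, 96], [70, 10, 80, 20]], dtype=float)

# Pairwise L1 cost between every ground truth and every prediction.
cost = np.abs(gt_boxes[:, None, :] - pred_boxes[None, :, :]).sum(-1)

# Hungarian algorithm: each ground truth is assigned exactly one prediction.
gt_idx, pred_idx = linear_sum_assignment(cost)
for g, p in zip(gt_idx, pred_idx):
    print(f"ground truth {g} <-> prediction {p} (cost {cost[g, p]:.1f})")
```

Predictions left unmatched are treated as “no object” when DETR computes its loss.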

Transformers and Parallel Decoding

Transformers have started to replace RNN-based methods for sequence-to-sequence problems. They have a couple of properties that make them great for tasks such as speech translation, the main one being self-attention, which aggregates information over the entire sequence. One drawback, however, is the inference speed of autoregressive decoding and the difficulty of batching predictions. Combining transformers with parallel decoding helps increase prediction speed.

The DETR model

According to the authors, there are two essential things needed for this entire architecture to work.

  1. A “set prediction loss”. This needs to force a unique matching between the ground-truth boxes and the boxes predicted by the model. Why is this important? When you train your detector, it will output, say, 5 boxes where it finds 5 cars. The loss has to be calculated by asking “how far off was this particular box from its ground truth”, but there are no box IDs and no way for the model to know which prediction goes with which ground truth; the bipartite matching from the Set prediction section above supplies that pairing.

  2. An “architecture that predicts (in a single pass) a set of objects and models their relation.” A minimal sketch of such an architecture follows this list.
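
A minimal PyTorch sketch of such a single-pass architecture (loosely in the spirit of the simplified demo in the paper; positional encodings and auxiliary losses are omitted, and the exact sizes here are illustrative assumptions):

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    def __init__(self, num_classes, num_queries=100, hidden_dim=256, nheads=8, num_layers=6):
        super().__init__()
        # CNN backbone: keep everything up to the final feature map (drop avgpool/fc).
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])
        self.proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)    # shrink channels for the transformer
        self.transformer = nn.Transformer(hidden_dim, nheads, num_layers, num_layers)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))  # learned object queries
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for the "no object" class
        self.bbox_head = nn.Linear(hidden_dim, 4)                 # (cx, cy, w, h), normalized to [0, 1]

    def forward(self, images):
        feats = self.proj(self.backbone(images))                  # (B, hidden_dim, H/32, W/32)
        src = feats.flatten(2).permute(2, 0, 1)                   # flatten to a (HW, B, hidden_dim) sequence
        tgt = self.query_embed.unsqueeze(1).repeat(1, images.size(0), 1)  # (num_queries, B, hidden_dim)
        hs = self.transformer(src, tgt)                           # all queries are decoded in parallel
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

# Example: 91 COCO classes, one 800x800 image -> 100 class predictions and 100 boxes.
logits, boxes = MiniDETR(num_classes=91)(torch.rand(1, 3, 800, 800))
```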

Object Detection

Concepts to keep in mind/review for this paper

NMS (Non-maximum suppression)

  • Technique used in some object detectors to remove duplicate boxes. Many detectors generate anchors or region proposals and then need a way to remove boxes that are essentially finding the same object. The algorithm is basically the following (a short code sketch follows the list):

For a given class:

  1. Take the list of all boxes the detector finds, along with their confidence scores. Move the box with the highest confidence from the proposed list to the final list.

  2. Calculate the IoU between that box and every remaining box in the proposed list, and remove any proposed box whose IoU with it is higher than the desired threshold (0.7, for example).

  3. Repeat steps 1 and 2 until there are no more boxes in the proposed list.
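
A minimal single-class NumPy sketch of those steps (the box format and threshold are illustrative assumptions):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS for a single class; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]              # proposed list, highest confidence first
    keep = []                                     # final list
    while order.size > 0:
        best = order[0]                           # highest-confidence remaining box
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]   # drop boxes that overlap the kept box too much
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [80, 80, 120, 120]], dtype=float)
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))                         # -> [0, 2]; the near-duplicate box 1 is removed
```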

Panoptic Segmentation

Source: https://arxiv.org/abs/1801.00868

From the abstract: “We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance).”

Transformers

Source: https://arxiv.org/pdf/1706.03762.pdf

Architecture that showed that attention-only mechanisms (without RNNs like LSTMs) can improve on previous results in translation and other seq2seq tasks.

Attention

Source: https://arxiv.org/pdf/1706.03762.pdf

The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are most “important”. This information gets passed along in translation encoder-decoders to help the seq2seq model.
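
At its core this is the scaled dot-product attention from the Transformer paper; a minimal single-head NumPy sketch (no masking or multi-head projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each query position takes a weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query/key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence ("importance" weights)
    return weights @ V                              # aggregate information from the whole sequence

# Example: a sequence of 5 tokens with 64-dim embeddings attending to itself.
x = np.random.rand(5, 64)
out = scaled_dot_product_attention(x, x, x)         # self-attention, output shape (5, 64)
```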

Key Takeaways

  • DETR demonstrates accuracy and run-time performance on par with Faster R-CNN.

  • Removes the need for hand-designed components like NMS and anchor/proposal generation in detectors.