YOLO - You only look once 10647 times

Authors: Christian Limberg, Andrew Melnik, Helge Ritter, Helmut Prendinger

(currently under review for publication at the ICONIP conference)

Abstract

In this article we reveal that the "You Only Look Once" (YOLO) single-stage object detection approach can be understood as a parallel classification of 10647 fixed region proposals. We support this view with two complementary approaches, showing that each of YOLO's output pixels is attentive to a specific sub-region of the previous layers, comparable to a local region proposal. This understanding narrows the conceptual gap between YOLO-like single-stage object detection models, R-CNN-like two-stage region-proposal-based models, and ResNet-like image classification models. This page provides interactive exploration tools and exported media for a better visual understanding of YOLO's information processing streams.
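
As a quick sanity check of the number in the title, here is a minimal sketch (assuming the standard 416x416 YOLOv4 input resolution) that counts the fixed anchor boxes across the three output grids:

```python
# Where 10647 comes from: with a 416x416 input, YOLOv4 predicts on three grids
# (strides 32, 16, 8), and every grid cell proposes 3 anchor boxes.
input_size = 416
strides = [32, 16, 8]      # large-, medium- and small-object pathways
anchors_per_cell = 3

total = sum(anchors_per_cell * (input_size // s) ** 2 for s in strides)
print(total)  # 3 * (13*13 + 26*26 + 52*52) = 10647
```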

Figure 1

A simplified schematic of the YOLOv4 network architecture.

Figure 2

With our interactive visualization, the full output grids of the YOLOv4 network can be inspected for several images. The YOLO architecture has 3 different pathways for recognizing objects of different sizes. The recognition heads are arranged in 2D grids of different resolutions. Each grid element can detect underlying objects based on 3 possible anchor box shapes. For each anchor box, the network refines estimates of the x- and y-position, the width and the height, a confidence value, and a probability vector over the classes used for training. The bounding boxes are labeled with the predicted class, the confidence value, and the index of the displayed anchor box (we show only the most confident of the 3 possible anchor boxes). Object proposals with high confidence are colored blue. Using the top button rows, the pathway and the input image to be visualized can be selected. Hovering over the figure highlights the respective output pixel and the detected bounding box of its most confident anchor box.
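
To make the per-cell outputs concrete, the following sketch decodes one grid cell of a YOLO-style head into its 3 candidate boxes. The tensor layout (tx, ty, tw, th, objectness, class logits per anchor) and the anchor sizes follow the common YOLOv3/v4 convention and are assumptions here, not taken verbatim from the paper:

```python
import numpy as np

def decode_cell(raw, anchors, cell_x, cell_y, stride, num_classes=80):
    """Decode one grid cell of a YOLO-style head into 3 candidate boxes.

    `raw` is the (3 * (5 + num_classes),) vector for that cell; the exact
    layout depends on the detector implementation (assumed here).
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    per_anchor = 5 + num_classes
    boxes = []
    for a, (aw, ah) in enumerate(anchors):
        tx, ty, tw, th, tobj = raw[a * per_anchor: a * per_anchor + 5]
        cls = raw[a * per_anchor + 5: (a + 1) * per_anchor]
        boxes.append({
            "x": (cell_x + sigmoid(tx)) * stride,   # box centre in input pixels
            "y": (cell_y + sigmoid(ty)) * stride,
            "w": aw * np.exp(tw),                   # anchor width refined by the network
            "h": ah * np.exp(th),                   # anchor height refined by the network
            "confidence": sigmoid(tobj),
            "class_probs": sigmoid(cls),
        })
    return boxes
```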

Figure 3

Shifting the input image below the output pixels makes it apparent how the anchor box confidences (blue means high confidence) shift along with it and how neighboring cells become activated. Using the top button rows, the pathway and the input image to be visualized can be selected. Hovering over the figure shifts the input image.
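
The shift experiment can be reproduced offline roughly as follows. Both the `model` wrapper and the head tensor layout (H, W, 3, 5 + num_classes) are hypothetical stand-ins for whatever detector implementation is used:

```python
import numpy as np

def confidence_map_under_shift(model, image, dx, dy, head_index=0):
    """Shift the input by (dx, dy) pixels and return the objectness map of one head."""
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))    # wrap-around spatial shift
    head = model(shifted)[head_index]                        # assumed shape: (H, W, 3, 5 + C)
    objectness = 1.0 / (1.0 + np.exp(-head[..., 4]))         # sigmoid of the objectness logit
    return objectness.max(axis=2)                            # best of the 3 anchors per cell
```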

Figure 4

Our adapted Detection Grad-CAM visualizes the saliency map of a single output neuron. The columns show the saliency maps of the x-shift, y-shift, width, height, confidence, and class probability neurons of a selected output pixel of the large-object pathway (13x13 output pixels). The rows show saliency maps for convolutional layers 75, 103, 104, and 105. Each plot is the saliency map averaged over 15 images (15x13x13) containing the class "person" at the corresponding output pixel's position. Hovering over the left area selects a spatial output pixel; the saliency maps for the different layers and output neurons of the selected output pixel are then visualized.
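
A Grad-CAM-style saliency map for a single detection output neuron could be computed roughly as sketched below. The `model`, the choice of `layer`, and the head tensor layout (1, 3, H, W, 5 + num_classes) are assumptions; the paper's exact adaptation may pool or weight the gradients differently:

```python
import torch

def detection_grad_cam(model, image, layer, head, gy, gx, anchor, channel):
    """Grad-CAM-style saliency of one output neuron w.r.t. one conv layer's activations."""
    activations, gradients = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

    out = model(image)[head]                      # assumed shape: (1, 3, H, W, 5 + C)
    neuron = out[0, anchor, gy, gx, channel]      # e.g. channel 4 = confidence neuron
    model.zero_grad()
    neuron.backward()
    h1.remove(); h2.remove()

    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = torch.relu((weights * activations["a"]).sum(dim=1))  # weighted sum over channels
    return cam[0]
```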

Figure 5

Video demonstrating the optimization process for several classes of the COCO data set using our proposed deep detection dream approach.

Figure 6

With our proposed deep detection dream approach, we can generate objects in specific spatial regions of the image. Hovering over the image selects a spatial output pixel; the image is then optimized to contain an object at this position.
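
In the spirit of this idea, a gradient-ascent sketch might look as follows: starting from noise, the input is optimized so that a selected output pixel predicts a confident object of the target class. The objective, the regularization, and the head tensor layout (1, 3, H, W, 5 + num_classes) are assumptions, not the paper's exact recipe:

```python
import torch

def detection_dream(model, gy, gx, class_idx, head=0, steps=200, lr=0.05,
                    size=(1, 3, 416, 416), device="cpu"):
    """Optimize an input image so that one output pixel detects `class_idx`."""
    image = torch.rand(size, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        out = model(image)[head]                                  # assumed: (1, 3, H, W, 5 + C)
        conf = torch.sigmoid(out[0, :, gy, gx, 4])                # confidence of all 3 anchors
        prob = torch.sigmoid(out[0, :, gy, gx, 5 + class_idx])    # target class probability
        loss = -(conf * prob).max()                               # maximize the best anchor
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)                                # keep pixels in a valid range
    return image.detach()
```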

Figure 7

With our proposed deep detection dream approach, we can generate objects with specific attributes, in this case an adjustable height. Dragging the slider below the image adjusts the generated object of class "person" to have a height between 10% and 100% of the image height.