Facebook AI Research's Kaiming He et al. propose the SlowFast network for video recognition
This article is reprinted from Heart of the Machine.
[Introduction] In this article, Kaiming He et al. from FAIR present the SlowFast network for video recognition, proposing to treat spatial structures and temporal events separately. The model is powerful in video action classification and detection: without any pre-training, it achieves state-of-the-art results on the Kinetics dataset, and it also sets a new state of the art of 28.3 mAP for action detection on the AVA dataset.
Treating the two spatial dimensions x and y of an image I(x, y) symmetrically is a convention in image recognition, justified by the statistics of natural images: to a first approximation, natural images are isotropic (all orientations are equally likely) and shift-invariant [38, 23]. But what about the video signal I(x, y, t)? Motion is the spatiotemporal counterpart of orientation, yet not all spatiotemporal directions are equally likely. Slow motion is more likely than fast motion (indeed, the world we see is mostly static at any given moment), and this prior has been exploited in Bayesian accounts of how humans perceive motion stimuli. For example, if we see an isolated moving edge, we perceive it as moving perpendicular to itself, even though in principle it could also have an arbitrary component of motion tangent to itself (the aperture problem in optical flow). This percept is rational if the prior favors slow motion.
If not all spatiotemporal directions are equally likely, then there is no reason to treat space and time symmetrically, as is done by video recognition methods based on spatiotemporal convolutions [44, 3]. Instead, we should "decompose" the architecture and handle spatial structures and temporal events separately. Consider this idea in the context of recognition. The categorical semantics of visual content usually change slowly. For example, waving does not change the identity of "hands" over the span of the action, and a person always belongs to the category "person" even when switching from walking to running. Thus the recognition of categorical semantics (as well as color, texture, lighting, etc.) can be refreshed relatively slowly. On the other hand, the motion being performed can change much faster than the identity of its subject, as in clapping, waving, shaking the head, walking, or jumping. Fast refreshing frames (high temporal resolution) are needed to effectively model such potentially fast-changing motion.
Based on this intuition, the study presents a two-pathway SlowFast model for video recognition (see Figure 1). One pathway is designed to capture the semantic information provided by images or a few sparse frames; it operates at a low frame rate with a slow refresh rate. The other pathway captures rapidly changing motion, operating at a fast refresh rate and high temporal resolution. Despite its high temporal rate, this pathway is very lightweight, accounting for only about 20% of the total computation. This is because it has fewer channels and a weaker ability to process spatial information; that information can be supplied by the first pathway in a less redundant way. Based on their different temporal speeds, the researchers name them the Slow pathway and the Fast pathway, and the two are fused via lateral connections.
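The two-rate sampling described above can be sketched in a few lines of plain Python. This is an illustrative helper, not the authors' code; the clip length and the stride/ratio parameters (`tau`, `alpha`) are hypothetical values chosen to match the article's description of a sparse Slow pathway and a much denser Fast pathway.

```python
def sample_indices(num_frames, stride):
    """Return the frame indices a pathway reads from a raw clip."""
    return list(range(0, num_frames, stride))

def slowfast_sampling(num_frames=64, tau=16, alpha=8):
    """Sketch of SlowFast-style sampling (hypothetical parameters).

    The Slow pathway samples one frame every `tau` frames; the Fast
    pathway samples `alpha` times more densely. In the full model the
    Fast pathway compensates by using far fewer channels, which is why
    it stays lightweight despite its high temporal resolution.
    """
    slow = sample_indices(num_frames, tau)           # low temporal rate
    fast = sample_indices(num_frames, tau // alpha)  # alpha x denser
    return slow, fast

slow, fast = slowfast_sampling()
print(len(slow), len(fast))  # prints "4 32": Fast sees 8x more frames
```

Both pathways read the same raw clip; only the sampling rate differs, which is what lets each pathway model the video in its own way before the lateral connections fuse them.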
This concept brings flexibility and efficiency to video model design. Being lightweight, the Fast pathway does not need any temporal pooling: it can run at a high frame rate through all intermediate layers and maintain temporal fidelity. Meanwhile, thanks to its lower temporal rate, the Slow pathway can focus more on spatial and semantic information. By processing the raw video at different temporal rates, the method allows the two pathways to model the video each in its own way. The researchers evaluate the method extensively on the Kinetics [27, 2] and AVA datasets. On the Kinetics action classification dataset, the method achieves 79% accuracy without any pre-training (e.g., on ImageNet), surpassing the best previously reported result by 5.1%. Ablation experiments demonstrate the improvements contributed by the SlowFast concept. On the AVA action detection dataset, the SlowFast model reaches a new state of the art of 28.3% mAP.
The approach is partly inspired by biological studies of retinal ganglion cells in the primate visual system [24, 34, 6, 11, 46], though the analogy is admittedly rough and premature. Those studies found that about 80% of these cells are small cells (P-cells) and about 15-20% are large cells (M-cells). M-cells operate at a higher temporal frequency and are more sensitive to temporal changes, but not to spatial detail or color; P-cells provide fine spatial detail and color, but lower temporal resolution. The SlowFast framework is analogous: i) the model has two pathways working at low and high temporal resolution, respectively; ii) the Fast pathway captures fast-changing motion but with less spatial detail, similar to M-cells; and iii) the Fast pathway is lightweight, similar to the small proportion of M-cells. The researchers hope these correspondences will inspire more computer vision models for video recognition.
Paper: SlowFast Networks for Video Recognition