Facebook AI Research's Kaiming He et al. propose the SlowFast network for video recognition
This article is reprinted from Heart of the Machine.
【Introduction】 In this article, Kaiming He et al. from FAIR present the SlowFast network for video recognition, proposing to treat spatial structures and temporal events separately. The model performs strongly in video action classification and detection: without using any pre-training, it achieves state-of-the-art results on the Kinetics dataset, and on the AVA action detection dataset it reaches a state-of-the-art 28.3 mAP.
Introduction
Treating the spatial dimensions x and y of an image I(x, y) symmetrically is a convention in image recognition, justified by the statistics of natural images: to a first approximation, natural images are isotropic (all orientations are equally likely) and shift-invariant [38, 23]. But what about the video signal I(x, y, t)? Motion is the spatiotemporal counterpart of orientation [1], yet not all spatiotemporal orientations are equally likely. Slow motion is more likely than fast motion (indeed, most of the world we see is at rest at any given moment), and this prior has been exploited in Bayesian accounts of how humans perceive motion stimuli [51]. For example, if we see an isolated moving edge, we perceive it as moving perpendicular to itself, even though in principle it could also have an arbitrary motion component tangent to itself (the aperture problem in optical flow). This perception is rational if the prior favors slow movement.
If not all spatiotemporal orientations are equally likely, then there is no reason to treat space and time symmetrically, as video recognition methods based on spatiotemporal convolution do [44, 3]. Instead, we should 'factorize' the architecture to treat spatial structures and temporal events separately. Consider this idea in the context of recognition. The categorical semantics of visual content usually evolve slowly. For example, waving does not change the identity of 'hand' over the span of the action, and someone remains in the 'person' category even while switching from walking to running. Thus the recognition of categorical semantics (as well as color, texture, lighting, etc.) can be refreshed relatively slowly. In contrast, the motion being performed can evolve much faster than its subject's identity, e.g., clapping, waving, shaking the head, walking, or jumping. Fast refreshing frames (high temporal resolution) are needed to effectively model such potentially fast-changing motion.
Based on this intuition, this study presents a two-pathway SlowFast model for video recognition (see Figure 1). One path is designed to capture the semantic information provided by images or a few sparse frames; it operates at a low frame rate and slow refresh rate. The other path is responsible for capturing fast-changing motion; it operates at a fast refresh rate and high temporal resolution. Yet this path is very lightweight, accounting for only about 20% of the total computational overhead. This is because it has fewer channels and a weaker ability to process spatial information; that information can be supplied by the first path in a less redundant manner. Based on their different temporal speeds, the researchers name them the Slow path and the Fast path, respectively. The two are fused via lateral connections.
This concept brings flexibility and efficiency to video model design. Because it is lightweight, the Fast path does not need to perform any temporal pooling: it can operate at a high frame rate across all intermediate layers and maintain temporal fidelity. Meanwhile, thanks to its lower temporal rate, the Slow path can focus more on the spatial domain and semantics. By processing the raw video at different temporal rates, the method allows the two paths to model the video each in its own way. The researchers thoroughly evaluated the method on the Kinetics [27, 2] and AVA [17] datasets. On the Kinetics action classification dataset, the method achieves 79.0% accuracy without any pre-training (e.g., on ImageNet), greatly exceeding the previous best result in the literature (by 5.1%). Ablation experiments demonstrate the improvements brought by the SlowFast concept. On the AVA action detection dataset, the SlowFast model reaches a new state of the art of 28.3 mAP.
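To make the two-pathway design concrete, below is a minimal PyTorch-style sketch of the stem of such a network. This is an illustrative sketch, not the paper's released code: the names (SlowFastSketch, slow_stem, fast_stem, lateral) are hypothetical, and only the fusion scheme follows the paper's description, in which a time-strided convolution projects Fast features onto the Slow path's coarser temporal grid before channel-wise concatenation.

    import torch
    import torch.nn as nn

    class SlowFastSketch(nn.Module):
        # Minimal two-pathway stem; not the paper's full ResNet-50 backbone.
        # alpha: speed ratio (Fast path sees alpha x more frames).
        # beta: channel ratio (Fast path uses beta x the Slow path's channels).
        def __init__(self, slow_channels=64, alpha=8, beta=0.125):
            super().__init__()
            fast_channels = int(slow_channels * beta)
            # Slow path stem: temporally degenerate kernel (size 1 in time).
            self.slow_stem = nn.Conv3d(3, slow_channels, kernel_size=(1, 7, 7),
                                       stride=(1, 2, 2), padding=(0, 3, 3))
            # Fast path stem: non-degenerate temporal kernel, few channels.
            self.fast_stem = nn.Conv3d(3, fast_channels, kernel_size=(5, 7, 7),
                                       stride=(1, 2, 2), padding=(2, 3, 3))
            # Lateral connection: 5x1x1 conv with temporal stride alpha maps
            # Fast features (alpha*T frames) onto the Slow grid (T frames).
            self.lateral = nn.Conv3d(fast_channels, 2 * fast_channels,
                                     kernel_size=(5, 1, 1), stride=(alpha, 1, 1),
                                     padding=(2, 0, 0))

        def forward(self, slow_clip, fast_clip):
            # slow_clip: (N, 3, T, H, W); fast_clip: (N, 3, alpha*T, H, W).
            slow = self.slow_stem(slow_clip)
            fast = self.fast_stem(fast_clip)
            # Fuse by concatenating the projected Fast features into Slow.
            slow = torch.cat([slow, self.lateral(fast)], dim=1)
            return slow, fast

With alpha = 8 and T = 4, a (1, 3, 4, 224, 224) Slow clip and a (1, 3, 32, 224, 224) Fast clip fuse into a (1, 80, 4, 112, 112) Slow feature map; the paper performs such lateral fusion after each network stage.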
The approach is partly inspired by studies of retinal ganglion cells in the primate visual system [24, 34, 6, 11, 46], although the analogy is admittedly crude. Those studies found that ~80% of these cells are parvocellular (P-cells) and ~15-20% are magnocellular (M-cells). M-cells operate at a higher temporal frequency and are sensitive to rapid temporal changes, but not to spatial detail or color; P-cells provide fine spatial detail and color, but have lower temporal resolution. The SlowFast framework is analogous: (i) the model has two paths that operate at low and high temporal resolution, respectively; (ii) the Fast path captures fast-changing motion but with less spatial detail, similar to M-cells; and (iii) the Fast path is lightweight, similar to the small proportion of M-cells. The researchers hope these relationships will inspire more computer vision models for video recognition.
Paper: SlowFast Networks for Video Recognition
Paper link:
http://www.zhuanzhi.ai/document/32700b9cccb12c754d7d2206906a64d6
Abstract: We present SlowFast networks for video recognition. Our model involves (1) a Slow path, operating at a low frame rate, to capture spatial semantics, and (2) a Fast path, operating at a high frame rate, to capture motion at fine temporal resolution. The Fast path can be made very lightweight by reducing its channel capacity, yet it can still learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and the large improvements are clearly attributable to our SlowFast concept. We report 79.0% accuracy on the Kinetics dataset without using any pre-training, greatly surpassing the previous best results of this kind. On AVA action detection we achieve a new state of the art of 28.3 mAP. The code will be made publicly available.
SlowFast Network
This generic architecture consists of a Slow path and a Fast path, fused by lateral connections. See Figure 1 for details.
Figure 1: A SlowFast network consists of a Slow path operating at a low frame rate and low temporal resolution, and a Fast path operating at a high frame rate and high temporal resolution (α times that of the Slow path). The Fast path is made lightweight by using only a fraction (β, e.g., β = 1/8) of the channels. The two paths are fused by lateral connections. The sample is from the AVA dataset [17] (label: hand wave).
Table 1: An example instantiation of a SlowFast network. Kernel dimensions are denoted {T × S^2, C}, where T is the temporal size, S the spatial size, and C the number of channels. Strides are denoted {temporal stride, spatial stride^2}. Here the speed ratio is α = 8, the channel ratio is β = 1/8, and τ = 16. Green marks the higher temporal resolution of the Fast path, and orange its lower channel count. Underlining marks non-degenerate temporal filters. Residual blocks are shown in square brackets. The backbone network is ResNet-50.
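As a quick sanity check of these numbers, assuming a 64-frame raw clip as in the paper's instantiation, the Slow path keeps one frame out of every τ = 16 (4 frames) while the Fast path keeps one out of every τ/α = 2 (32 frames). A small sketch of this sampling (the function name sample_two_rates is made up for illustration):

    import torch

    def sample_two_rates(video, tau=16, alpha=8):
        # video: (N, 3, T_raw, H, W) raw clip, e.g. T_raw = 64.
        # Slow path keeps every tau-th frame; Fast keeps alpha x as many.
        slow = video[:, :, ::tau]
        fast = video[:, :, ::tau // alpha]
        return slow, fast

    clip = torch.randn(1, 3, 64, 224, 224)  # a 64-frame raw clip
    slow, fast = sample_two_rates(clip)
    print(slow.shape[2], fast.shape[2])  # -> 4 32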
Experiment: Kinetics Action Classification
Table 2: Ablation experiments on the Kinetics-400 action classification task. The table shows top-1 and top-5 classification accuracy (%) and computational complexity (in GFLOPs) for a single clip input with a spatial size of 224^2.
Figure 2: Training curves of the Slow-only (blue) and SlowFast (green) networks on the Kinetics dataset, showing the top-1 training error (dashed lines) and validation error (solid lines). All curves are single-crop errors; the corresponding video-level accuracy is 72.6% vs. 75.6% (see Table 2c).
Table 3: Comparison of SlowFast networks with the current best models on the Kinetics-400 dataset.
Table 4: Comparison of SlowFast networks with the current best models on the Kinetics-600 dataset.
Experiment: AVA Action Detection
Figure 3: Per-category AP on the AVA dataset: the Slow-only baseline (19.0 mAP) vs. its SlowFast counterpart (24.2 mAP). The categories in bold black are the 5 with the highest absolute gains; the categories in orange are the 5 with the highest relative gains among those with Slow-only AP > 1.0. Categories are sorted by sample size. Note that the SlowFast instance in this ablation is not our best-performing model.
Table 5: AVA action detection baselines: Slow-only vs. SlowFast.
Table 6: More instantiations of the SlowFast model on the AVA dataset.
Table 7: Comparison of SlowFast with the current best models on the AVA dataset. ++ indicates a SlowFast network version tested with multi-scale and horizontal-flip augmentation.
Figure 4: Visualization of SlowFast network performance on the AVA dataset: predictions of the SlowFast network on the AVA validation set (green, confidence > 0.5) vs. ground-truth labels (red). Only predictions/labels for the middle frame are shown. The model shown is the T × τ = 8 × 8 SlowFast model, which achieves 26.8 mAP.
-END-