A First Look at Attention Mechanisms in Deep Learning

In recent years, deep learning has outperformed traditional models on many computer vision problems. This is largely thanks to million-scale image databases such as ImageNet, which make it practical to train deep neural networks. On a database of that size, the hand-designed features used in traditional models are too shallow to classify the whole dataset effectively, whereas a deep neural network can learn very different feature styles depending on the image class, and these features serve visual tasks such as classification. To learn well-targeted features, we need to focus attention on what matters in the data. No one can browse and annotate every image in the world, but we can get a deep learning model to actively focus on the places of interest. That is the attention mechanism I am going to cover today.

The attention mechanism is actually quite intuitive. Let's do a quick test with the following picture:

The first thing I noticed was the tree in the middle of the picture, and only then the text above it. The main reason the tree draws my attention is that it carries much richer information than the ground, the ocean, and the sky around it. The attention mechanism automatically draws our eyes to the tree's location. Specifically, attention involves two main factors: one is "content" (what) and the other is "location" (where). In the image above, we mainly see "a tree", positioned slightly below the middle of the image.

In deep learning, there are likewise many formulations of attention mechanisms. I recently read a paper published at this year's ECCV, "CBAM: Convolutional Block Attention Module". It proposes a very simple and effective module that combines the two themes of attention, "what" and "where". Next, let me describe in detail how this is done. Consider the following diagram:

This diagram depicts the process by which the input features are refined. Assuming we extract features with a deep network, we can refine the features produced at each layer.

As shown above, we start by refining the features of the different channels in each layer of the network. For the feature map in each channel (one channel corresponds to one matrix), we compute simple statistics such as the mean and the maximum; the mean corresponds to average pooling and the maximum to max pooling, so each channel's matrix is reduced to one mean value and one maximum value. The pooled descriptors from all channels are fed into a shared multilayer perceptron (MLP), trained, and the outputs for the mean and maximum descriptors are finally fused to obtain an attention mechanism that addresses "what". That is, this mechanism reinforces the content information in the input features. The process is shown in the following diagram.
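To make the channel-attention step concrete, here is a minimal NumPy sketch of the idea: pool each channel to a mean and a maximum, pass both descriptors through a shared two-layer MLP, then fuse them with a sigmoid into per-channel weights. This is an illustration, not the paper's actual implementation; the weight shapes (a reduction to `C//r` hidden units) and the sum-then-sigmoid fusion follow the description above, but all names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W) feature map.
    w1: (C//r, C) and w2: (C, C//r) are the shared MLP weights
    (r is the reduction ratio)."""
    avg = feat.mean(axis=(1, 2))                   # average pooling -> (C,)
    mx = feat.max(axis=(1, 2))                     # max pooling -> (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # shared MLP with ReLU
    scale = sigmoid(mlp(avg) + mlp(mx))            # fuse both descriptors -> (C,)
    return feat * scale[:, None, None]             # reweight each channel
```

Because the sigmoid keeps every weight in (0, 1), the module can only rescale channels, never flip or inflate them, which is what makes it cheap to bolt onto an existing network.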

After addressing the refinement of the "content" feature, we can move on to the refinement of the "location" feature, as shown in the figure below.

We can compute the mean and the maximum over all channels at each pixel position of the feature map, which gives us two matrices, equivalent to two single-channel position features. But how should these two feature maps be merged? The authors place a convolutional layer here and let training find the best way to fuse the two matrices. In effect: "I don't care how you fuse them, as long as the final result is good, that's how you fuse them." This addresses the refinement of the "location" feature.
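The spatial step can be sketched the same way: collapse the channels into a per-pixel mean map and max map, then let a learned convolution fuse them into a single attention map. This is a hedged illustration, assuming a naive zero-padded "same" convolution and a pair of single-channel kernels standing in for the paper's fusion convolution; `conv2d_same` and the kernel shapes are my own simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, k):
    """Naive 2-D convolution with zero 'same' padding. x: (H, W), k: (kh, kw)."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def spatial_attention(feat, kernels):
    """feat: (C, H, W); kernels: (2, k, k) learned filters that fuse the two maps."""
    avg_map = feat.mean(axis=0)        # per-pixel mean over channels -> (H, W)
    max_map = feat.max(axis=0)         # per-pixel max over channels -> (H, W)
    resp = conv2d_same(avg_map, kernels[0]) + conv2d_same(max_map, kernels[1])
    scale = sigmoid(resp)              # spatial attention map -> (H, W)
    return feat * scale[None, :, :]    # reweight every pixel position
```

Training the fusion kernels end-to-end is exactly the "you fuse them however gives the best result" idea from the paragraph above.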

After these two refinements, the features learned by the deep neural network are greatly improved. So how do we fuse these two refinements into an existing neural network? See the following diagram.

We can seamlessly embed improvements in "content" and "location" features into existing mainstream networks, such as residual networks.
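A minimal sketch of that embedding, under the same simplifying assumptions as above (the spatial fusion convolution is replaced by a plain sum for brevity, so this is illustrative, not the paper's exact module): the module refines only the residual branch, so the block's input/output shapes are unchanged and it drops into a ResNet as-is.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w1, w2):
    """Channel attention first, then spatial attention (the order the paper found best).
    The learned fusion conv is replaced by a simple sum here for brevity."""
    avg = feat.mean(axis=(1, 2))
    mx = feat.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    feat = feat * sigmoid(mlp(avg) + mlp(mx))[:, None, None]   # "what"
    smap = sigmoid(feat.mean(axis=0) + feat.max(axis=0))       # "where"
    return feat * smap[None, :, :]

def residual_block_with_cbam(x, residual_fn, w1, w2):
    """out = x + CBAM(F(x)): only the residual branch F(x) is refined,
    so the skip connection and the block's shape are untouched."""
    return x + cbam(residual_fn(x), w1, w2)
```

Keeping the skip connection outside the attention module is what makes the embedding "seamless": a plain residual block and a CBAM-augmented one are drop-in replacements for each other.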

This mechanism does bring great benefits. The authors confirmed its validity through numerous classification experiments.

First, the validity of the "content" feature is verified.

We can see that using only average pooling or only max pooling over the channels already improves the network to some extent: average pooling reduced the error rate from 24.56% to 23.14%, and max pooling reduced it from 24.56% to 23.20%. Using both together reduces the classification error rate further, from 24.56% to 22.80%.

The validity of the "location" feature is then verified.

Building on the refined "content" feature, we proceed to refine the "location" feature. As the table above shows, the error rate can eventually be reduced from 23.14% to 22.66%.

So how should the order of the "content" and "location" refinements be determined? In other words, should "content" come before "location", or the other way around? Not to worry: the authors ran the corresponding experiments, as shown in the following table.

It can be seen that improving the "content" feature first and then the "location" feature is the most helpful in reducing the error rate.

The paper also applies this mechanism to various current state-of-the-art networks and compares classification error rates on the ImageNet dataset, finding that the various networks all achieve lower error rates thanks to these feature-refinement mechanisms.

Finally, the authors also show what actually happens to the underlying image as the "content" and "location" features are refined (as shown in the figure below). Comparing the two rows of images, ResNet50 and ResNet50+CBAM, we can see that these feature refinements really do make the deep learning model pay more attention to the important objects and important locations in the images.

The authors, perhaps feeling that classification alone was not convincing enough, also ran object detection experiments on both the MS COCO and VOC2007 databases.

Let's first look at the experimental results on MS COCO, shown in the figure below. Comparing the two rows of results, ResNet and ResNet+CBAM, we see that introducing the "content" and "location" feature refinements does yield a measurable improvement in detection.

The same conclusion holds on the VOC2007 database.

This paper demonstrates, through a simple and effective design and extensive experimental validation, that refining "content" and "location" features yields better results in both image classification and object detection, and that the design can be seamlessly embedded into existing mainstream networks.

Postscript: this post is just my initial exposure to the attention mechanism. I am also currently using attention mechanisms on some computer vision problems of my own, and I'll follow up with a more in-depth article next time. Stay tuned.

