As can be seen from the schematic, the regions selected by selective search now act on the features extracted by the convolutional neural network. This way, feature extraction is performed only once on the original input image, which saves a lot of repeated computation.
Fast R-CNN proposes a Region of Interest (RoI) pooling layer. It takes as input the features and a series of regions; it divides each region uniformly into an n × m grid of small regions and applies max pooling to each small region, producing an n × m output. Thus, regardless of the size of the input region, the RoI pooling layer pools it into a fixed-size output.
Here we take a closer look at how the RoI pooling layer works. Suppose that for an image we extract a 4 × 4 feature map with a single channel.
[[[[ 0.  1.  2.  3.]
   [ 4.  5.  6.  7.]
   [ 8.  9. 10. 11.]
   [12. 13. 14. 15.]]]]
We then create two regions, each represented by a vector of length 5. The first element is the label of the corresponding object, followed by the x and y coordinates of the region's upper-left and lower-right corners. Here we generate two regions, of size 3 × 3 and 4 × 3 respectively.
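To make the computation concrete, here is a minimal NumPy sketch of RoI max pooling on the 4 × 4 feature map above (the function name and the inclusive-coordinate convention are our own choices for illustration):

```python
import numpy as np

def roi_pool(feat, roi, out_h=2, out_w=2):
    """Max-pool one region of `feat` into an out_h x out_w grid.
    `roi` = (x1, y1, x2, y2), inclusive coordinates on the feature map."""
    x1, y1, x2, y2 = roi
    region = feat[y1:y2 + 1, x1:x2 + 1]
    h, w = region.shape
    pooled = np.empty((out_h, out_w))
    for i in range(out_h):
        # bin i covers rows [floor(i*h/out_h), ceil((i+1)*h/out_h))
        r0, r1 = h * i // out_h, -(-h * (i + 1) // out_h)
        for j in range(out_w):
            c0, c1 = w * j // out_w, -(-w * (j + 1) // out_w)
            pooled[i, j] = region[r0:r1, c0:c1].max()
    return pooled

X = np.arange(16, dtype=float).reshape(4, 4)   # the 4 x 4 feature map above
print(roi_pool(X, (0, 0, 2, 2)))  # 3x3 region pools to values [[5, 6], [9, 10]]
print(roi_pool(X, (0, 1, 3, 3)))  # 4x3 region pools to values [[9, 11], [13, 15]]
```

Note that even though the two regions have different sizes, both are pooled into the same fixed 2 × 2 output.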
The output of the RoI pooling layer has shape (2, 1, 2, 2). It can be fed into subsequent layers for training as an ordinary batch with a sample size of 2.
[[[[ 5.  6.]
   [ 9. 10.]]]

 [[[ 9. 11.]
   [13. 15.]]]]

Faster R-CNN: A faster region-based convolutional neural network
Fast R-CNN still follows R-CNN's selective search method to propose regions, which is usually very slow. The main improvement of Faster R-CNN is to replace selective search with a region proposal network (RPN). Here is how it works.
A 3 × 3 convolution with a padding of 1 and 256 output channels is applied to the input features. Thus each pixel, together with its surrounding 8 pixels, is mapped into a vector of length 256.
Centered on each pixel, k default bounding boxes of pre-designed sizes and aspect ratios are generated; these are often called anchor boxes.
For each anchor box, using the 256-dimensional vector corresponding to its center pixel as the feature, the RPN trains a 2-class classifier to determine whether the region contains an object of interest or just background, and a regressor with 4-dimensional output to predict a more accurate bounding box.
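The two RPN heads can be sketched as plain matrix multiplies (the weights below are random placeholders, not trained values; only the shapes matter for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each anchor box's 256-d feature feeds a 2-way objectness classifier
# and a 4-dimensional bounding-box regressor.
features = rng.normal(size=(9, 256))   # features for 9 anchor boxes
W_cls = rng.normal(size=(256, 2))      # object vs. background scores
W_reg = rng.normal(size=(256, 4))      # bounding-box corrections

objectness = features @ W_cls          # shape (9, 2)
borders = features @ W_reg             # shape (9, 4)
print(objectness.shape, borders.shape)
```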
If the input size is n × m, the total number of anchor boxes is nmk. From these, the anchor boxes judged to contain an object are selected, and the bounding boxes predicted by their corresponding regressors are passed as input into the subsequent RoI pooling layer.
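A minimal sketch of anchor-box generation makes the nmk count explicit; the (size, aspect-ratio) pairs below are illustrative choices, not values from the text:

```python
def anchors(n, m, pairs=((1.0, 1.0), (0.5, 1.0), (1.0, 2.0))):
    """Generate k = len(pairs) anchor boxes, centered on every pixel of
    an n x m feature map, as (center_x, center_y, width, height)."""
    boxes = []
    for y in range(n):
        for x in range(m):
            for size, ratio in pairs:
                w = size * ratio ** 0.5   # width grows with the ratio
                h = size / ratio ** 0.5   # height shrinks with the ratio
                boxes.append((x + 0.5, y + 0.5, w, h))
    return boxes

print(len(anchors(4, 4)))  # n*m*k = 4 * 4 * 3 = 48 anchor boxes
```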
Although it looks a bit complicated, the idea behind the RPN is very intuitive: first propose some pre-configured regions, then use a neural network to determine whether these regions are of interest, and if so, predict a more accurate bounding box for each. This effectively reduces the cost of searching for bounding boxes of arbitrary shape.
Mask R-CNN

Mask R-CNN adds a new pixel-level prediction layer on top of Faster R-CNN. For each anchor box it not only predicts the corresponding class and the true bounding box, but also determines, for each pixel inside the anchor box, which object it belongs to or whether it is just background. The latter is the problem addressed by semantic segmentation. Mask R-CNN uses the fully convolutional network (FCN), which we will introduce later, to make this prediction. This of course means that the training data must carry pixel-level annotations rather than simple bounding boxes.
Since the FCN accurately predicts the class of each pixel, each pixel of the input image corresponds to a category in the annotation. For an anchor box in the input image, we can match it exactly to the corresponding region in the pixel annotation. However, RoI pooling is applied to the features after convolution, and by default it rounds the anchor box coordinates. For example, suppose the selected anchor box is (x, y, w, h) and feature extraction shrinks the image by a factor of 16; that is, if the original image is 256 × 256, the feature map is 16 × 16. On the feature map the anchor box becomes (x/16, y/16, w/16, h/16). If any of x, y, w, h is not exactly divisible by 16, a misalignment can occur. For the same reason, as we saw in the example above, if the anchor box's height and width are not divisible by the pooled size, the coordinates are likewise rounded, again causing misalignment.
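A quick worked example of this rounding: a coordinate x = 50 on a 256-wide image maps to 50 / 16 = 3.125 on the 16 × 16 feature map, which RoI pooling truncates to 3; column 3 of the feature map corresponds back to original pixel 48, a 2-pixel misalignment.

```python
# Rounding in RoI pooling: 50 / 16 is not an integer, so the anchor
# coordinate gets truncated and no longer lines up with the original pixel.
x, stride = 50, 16
print(x / stride)             # 3.125
print(x // stride)            # 3
print(x // stride * stride)   # 48, not 50
```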
Usually such a misalignment is only a few pixels and has little effect on classification and bounding box prediction. But for pixel-level predictions, such a misalignment can cause big problems. Mask R-CNN proposes an RoI Align layer, which is similar to the RoI pooling layer but with the rounding step removed, that is, with all ⌊·⌋ operations removed. If a computed coordinate does not fall exactly on a pixel, we linearly interpolate from the surrounding pixels to obtain the value at that point.
For the one-dimensional case, suppose we want to compute the value f(x) at point x. Then we can interpolate using the values at the integer points around x:

f(x) ≈ f(⌊x⌋)(⌈x⌉ − x) + f(⌈x⌉)(x − ⌊x⌋).

What we actually need is two-dimensional (bilinear) interpolation to estimate f(x, y): we first interpolate along the x-axis to obtain f(x, ⌊y⌋) and f(x, ⌈y⌉), and then interpolate between these two values to obtain f(x, y).
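This two-step interpolation can be sketched in a few lines of NumPy (the helper names are our own):

```python
import numpy as np
from math import floor, ceil

def interp1(f, x):
    """Linear interpolation between the integer points around x."""
    x0, x1 = floor(x), ceil(x)
    if x0 == x1:
        return f(x0)
    return f(x0) * (x1 - x) + f(x1) * (x - x0)

def bilinear(grid, x, y):
    """Interpolate along x at rows floor(y) and ceil(y), then
    interpolate the two results along y."""
    y0, y1 = floor(y), ceil(y)
    fx0 = interp1(lambda i: grid[y0, i], x)
    if y0 == y1:
        return fx0
    fx1 = interp1(lambda i: grid[y1, i], x)
    return fx0 * (y1 - y) + fx1 * (y - y0)

X = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear(X, 0.5, 0.5))  # midpoint of 0, 1, 4, 5 -> 2.5
```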
SSD: Single shot multibox detector
In the R-CNN family of models, region proposal and classification are done in two stages. SSD unifies them into a single step to make the model simpler and faster, which is why it is called "single shot".
It differs from Faster R-CNN in two main ways.

For the anchor boxes, we no longer first determine whether they contain an object of interest and then classify the positive anchor boxes into real object classes. Instead, SSD directly uses a multi-class classifier to determine which class of object an anchor box corresponds to, or whether it is just background. We also no longer use an additional regressor to further refine the bounding boxes; a single regressor directly predicts the true bounding box.
SSD does not make predictions only on the features output by the convolutional neural network; it further shrinks the features with additional convolution and pooling layers and makes predictions on them as well. This achieves the effect of multi-scale prediction.
The specific implementation of SSD will be detailed in the next chapter.
YOLO: You only look once
Whether with Faster R-CNN or SSD, a large number of the generated anchor boxes still overlap one another, so many regions are still computed repeatedly. YOLO tries to tackle this problem. It cuts the image features evenly into S × S blocks, each of which serves as an anchor box. Each anchor box predicts B bounding boxes, as well as which object the block mainly contains.
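The resulting output size is easy to count: each of the S × S cells predicts B boxes (4 coordinates plus 1 confidence each) and C class scores. With S = 7, B = 2, and C = 20, the values used in the original YOLO paper, this gives a 7 × 7 × 30 output tensor.

```python
# Size of a YOLO-style prediction tensor: S x S grid cells, each
# predicting B boxes (4 coords + 1 confidence) plus C class scores.
S, B, C = 7, 2, 20
per_cell = B * 5 + C
print((S, S, per_cell))  # (7, 7, 30)
```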