
The big bully teaches you object detection using 7 types of convolutional neural networks!


When we have talked about making predictions about images so far, we have talked about classification. We've asked which digit from 0 to 9 an image shows, whether a picture is of a shoe or a shirt, or whether the picture below shows a cat or a dog.

But realistic pictures can be more complex and may contain more than one main object. Object detection is proposed for this type of problem: it must not only identify what is in the picture but also where each object is located. We use the image discussed in the Introduction to Machine Learning chapter as a sample and label its main objects and their locations.

Several differences can be seen between object detection and picture classification.

Image classifiers usually only need to output a classification of the main object in the image. Object detection, however, must be able to identify multiple objects, even though some of them may not dominate the picture. Technically speaking, this task is generally called multi-class object detection, but since the vast majority of studies address the multi-class setting, we omit "multi-class" here for simplicity.

The image classifier only needs to output the probability that the object in the image belongs to a certain class, but object detection must output not only the recognition probability but also the location of the object in the image. This is usually a box that encloses the object, commonly referred to as a bounding box.

However, object detection also has similarities to image classification, in that both determine the main object contained in a region of an image. It is therefore conceivable that the convolutional neural network-based image classification we introduced earlier can be applied here.

In this chapter we present the ideas behind several convolutional neural network-based object detection algorithms: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, SSD, YOLO, and YOLO v2.




R-CNN: Region-based Convolutional Neural Network

This is the seminal work on object detection based on convolutional neural networks. The core idea is to select multiple regions from each image, feed each region as a sample into a convolutional neural network to extract features, and finally use a classifier to classify the regions and a regressor to obtain accurate bounding boxes.


Specifically, the algorithm has the following steps.

Use a rule-based "selective search" algorithm on each input image to select multiple proposed regions.

As in fine-tuning for transfer learning, select a pre-trained convolutional neural network and remove its final output layer. Resize each region to the input size required by this network and compute the output, which is used as the feature of this region.

Use these region features to train multiple SVMs for object recognition, with each SVM predicting whether a region contains a certain class of object.

Use these region features to train a linear regressor that refines the bounding box of each proposed region.

Intuitively R-CNN is easy to understand, but it can be particularly slow. For a single image we may select thousands of regions, so thousands of predictions have to be made for that one image. Unlike fine-tuning, training here does not need to update the convolutional neural network used to extract features, so we can compute the features of each region beforehand and save them. For prediction, however, this cost cannot be avoided, which makes R-CNN difficult to use in practice.

Fast R-CNN: A Fast Region-based Convolutional Neural Network

Fast R-CNN makes two main improvements to R-CNN to improve performance.

Considering that many of the regions in R-CNN may overlap one another, re-extracting features for each region is too wasteful. Fast R-CNN therefore first extracts features from the whole input image and then selects regions.

Whereas R-CNN uses multiple SVMs for classification, Fast R-CNN uses a single multi-class logistic regression (softmax) classifier, which is what the previous tutorials used by default.

Fast R-CNN

As can be seen from the schematic, the regions selected by selective search are applied to the features extracted by the convolutional neural network. This way we only need to run feature extraction once on the original input image, which saves a lot of repeated computation.

Fast R-CNN proposes a Region of Interest (RoI) pooling layer. Its inputs are the features and a set of regions. It divides each region uniformly into n × m small regions and applies max pooling to each small region, producing an output of size n × m. Thus, regardless of the size of the input region, the RoI pooling layer pools it into a fixed-size output.

Here we take a closer look at how the RoI pooling layer works. Assume that for an image we extract a 4 × 4 feature with 1 channel.

[[[[  0.   1.   2.   3.]
   [  4.   5.   6.   7.]
   [  8.   9.  10.  11.]
   [ 12.  13.  14.  15.]]]]

We then create two regions, each represented by a vector of length 5. The first element is the label of the corresponding object, followed by x_min, y_min, x_max, and y_max respectively. Here we generate two regions of size 3 × 3 and 4 × 3.

The output of the RoI pooling layer has size (number of regions) × (number of channels) × n × m. It can be fed into subsequent layers for training as an ordinary batch whose sample count is the number of regions.
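The setup above can be reproduced with a minimal NumPy sketch of RoI max pooling. It is an illustration under stated assumptions, not the code that produced this document's outputs: the function name `roi_max_pool` is hypothetical, the leading label element of each region is omitted, region corners are assumed to be inclusive (x_min, y_min, x_max, y_max) coordinates, and bin boundaries are assumed to use floor and ceil, so neighbouring bins may share a row or column.

```python
import math
import numpy as np

def roi_max_pool(features, roi, pooled_size):
    """Max-pool one region of a (channels, height, width) feature map.

    roi is (x_min, y_min, x_max, y_max) in feature pixels, corners inclusive
    (an assumed convention). pooled_size is (n, m), the fixed output size.
    """
    x_min, y_min, x_max, y_max = roi
    n, m = pooled_size
    roi_h = y_max - y_min + 1
    roi_w = x_max - x_min + 1
    out = np.zeros((features.shape[0], n, m), dtype=features.dtype)
    for i in range(n):
        # Rows covered by output row i; floor/ceil leaves no input row uncovered.
        r0 = y_min + math.floor(i * roi_h / n)
        r1 = y_min + math.ceil((i + 1) * roi_h / n)
        for j in range(m):
            c0 = x_min + math.floor(j * roi_w / m)
            c1 = x_min + math.ceil((j + 1) * roi_w / m)
            out[:, i, j] = features[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

# The 4 x 4 single-channel feature from the text.
features = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
small = roi_max_pool(features, (0, 0, 2, 2), (2, 2))  # 3 x 3 region
wide = roi_max_pool(features, (0, 1, 3, 3), (2, 2))   # 4 x 3 region
print(small.tolist())  # [[[5.0, 6.0], [9.0, 10.0]]]
print(wide.tolist())   # [[[9.0, 11.0], [13.0, 15.0]]]
```

Both regions come out as 2 × 2 even though their input sizes differ, which is the point of the layer.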

Pooling both regions to a 2 × 2 output gives:

[[[[  5.   6.]
   [  9.  10.]]]
 [[[  9.  11.]
   [ 13.  15.]]]]

Faster R-CNN: A Faster Region-based Convolutional Neural Network

Fast R-CNN still uses R-CNN's selective search to select regions, which is usually very slow. The main improvement of Faster R-CNN is the region proposal network (RPN), which replaces selective search. Here's how it works.

A 3 × 3 convolution with padding 1 and 256 output channels is applied to the input features. Thus each pixel, together with its 8 surrounding pixels, is mapped to a vector of length 256.

Centered on each pixel, k default boxes of pre-designed sizes and aspect ratios are generated. These are often called anchor boxes.

For each anchor box, using the 256-dimensional vector of its center pixel as the feature, RPN trains a 2-class classifier to determine whether the region contains an object of interest or only background, and a regressor with a 4-dimensional output to predict a more accurate bounding box.

Among all anchor boxes (nmk in total if the input size is n × m), those judged to contain an object are selected, and the bounding boxes predicted by their corresponding regressors are passed as input to the subsequent RoI pooling layer.
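To make the anchor count concrete, here is a minimal NumPy sketch that places k anchor boxes on every pixel of an n × m feature map. The function name `make_anchors`, the choice of building the k shapes from every (size, ratio) pair, and the example sizes and ratios are all assumptions for illustration; the text only says the sizes and aspect ratios are pre-designed.

```python
import numpy as np

def make_anchors(n, m, sizes, ratios):
    """Return an (n * m * k, 4) array of (x_min, y_min, x_max, y_max) boxes.

    One anchor per (size, ratio) pair is centered on every pixel of an
    n x m feature map, so k = len(sizes) * len(ratios).
    """
    # Width and height of the k anchor shapes; ratio means width / height,
    # and each shape has area size ** 2.
    shapes = [(s * np.sqrt(r), s / np.sqrt(r)) for s in sizes for r in ratios]
    anchors = []
    for row in range(n):
        for col in range(m):
            cx, cy = col + 0.5, row + 0.5  # center of this pixel
            for w, h in shapes:
                anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# Hypothetical settings: 2 sizes x 3 ratios gives k = 6 anchors per pixel.
anchors = make_anchors(n=4, m=6, sizes=[2.0, 4.0], ratios=[0.5, 1.0, 2.0])
print(anchors.shape)  # (144, 4), i.e. n * m * k = 4 * 6 * 6 boxes
```

In an RPN, each of these boxes would then be scored as object or background and refined by the regressor; that learned part is not shown here.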

Faster R-CNN

Although it looks a bit complicated, the idea behind RPN is intuitive. First, some pre-configured regions are proposed. Then a neural network determines whether each of these regions is of interest and, if so, predicts a more accurate bounding box. This effectively reduces the cost of searching for bounding boxes of arbitrary shape.

Mask R-CNN

Mask R-CNN adds a new pixel-level prediction layer on top of Faster R-CNN. For each anchor box it not only predicts the corresponding class and the true bounding box, but also determines, for each pixel inside the anchor box, which object the pixel belongs to or whether it is just background. The latter is the problem addressed by semantic segmentation. Mask R-CNN uses the fully convolutional network (FCN), which we will introduce later, to make this prediction. This of course means that the training data must have pixel-level annotations rather than just bounding boxes.

Mask R-CNN

Because FCN predicts the class of each pixel precisely, each pixel in the input image corresponds to a category in the annotation. For an anchor box in the input image, we can match it exactly to the corresponding region in the pixel annotation. However, RoI pooling is applied to the features after convolution, and by default it rounds the anchor box to integer coordinates. For example, suppose the chosen anchor box is (x, y, w, h) and feature extraction shrinks the image by a factor of 16, so that a 256 × 256 original image yields a 16 × 16 feature. The anchor box in the feature then becomes (x/16, y/16, w/16, h/16). If any of x, y, w, h is not divisible by 16, misalignment may occur. Likewise, as in the example above, if the width and height of the anchor box are not divisible by the pooled size, rounding is applied as well, which introduces further misalignment.
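The arithmetic behind this misalignment can be checked with a short Python sketch. The box values and the helper names `to_feature_coords` and `rounded` are hypothetical, and rounding down (floor) is assumed as the rounding rule.

```python
import math

def to_feature_coords(box, stride=16):
    """Scale an (x, y, w, h) box from image pixels to feature pixels."""
    return tuple(v / stride for v in box)

def rounded(box):
    """Round every coordinate down to an integer (assumed rounding rule)."""
    return tuple(math.floor(v) for v in box)

box = (40, 72, 100, 120)        # hypothetical anchor box in a 256 x 256 image
exact = to_feature_coords(box)  # (2.5, 4.5, 6.25, 7.5)
snapped = rounded(exact)        # (2, 4, 6, 7)
# Map the rounded box back to the image to see the error in image pixels.
shift = tuple(v - 16 * s for v, s in zip(box, snapped))
print(exact, snapped, shift)    # shift is (8, 8, 4, 8)
```

With a stride of 16, the rounding error in this example is up to 8 pixels in the original image, which matters little for a box but a great deal for a per-pixel mask.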

Usually such misalignment is only a few pixels and has little effect on classification and bounding box prediction. For pixel-level prediction, however, it can cause serious problems. Mask R-CNN proposes an RoI Align layer, which is similar to the RoI pooling layer but removes the rounding step, that is, all floor operations ⌊·⌋ are removed. If a computed box boundary does not fall exactly on pixel positions, the surrounding pixels are linearly interpolated to obtain the value at that point.

For the one-dimensional case, suppose we want to compute the value f(x) at point x. Then we can interpolate between the values at the integer points on either side of x.

What we actually want is two-dimensional interpolation to estimate f(x, y). We first interpolate along the x-axis to get f(x, ⌊y⌋) and f(x, ⌊y⌋ + 1), and then interpolate between these two values to get f(x, y).
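The two interpolation steps can be sketched in NumPy as follows. The function names `interp1d` and `interp2d` are hypothetical, and clamping at the array edge is an added assumption so the sketch does not index out of bounds.

```python
import math
import numpy as np

def interp1d(f, x):
    """Linearly interpolate the 1-D array f at a fractional position x.

    f(x) = (1 - t) * f[floor(x)] + t * f[floor(x) + 1], with t = x - floor(x).
    """
    x0 = math.floor(x)
    x1 = min(x0 + 1, len(f) - 1)  # clamp at the right edge (assumption)
    t = x - x0
    return (1 - t) * f[x0] + t * f[x1]

def interp2d(f, x, y):
    """Bilinearly interpolate f[y, x]: first along x, then along y."""
    y0 = math.floor(y)
    y1 = min(y0 + 1, f.shape[0] - 1)
    top = interp1d(f[y0], x)      # f(x, floor(y))
    bottom = interp1d(f[y1], x)   # f(x, floor(y) + 1)
    t = y - y0
    return (1 - t) * top + t * bottom

grid = np.arange(16, dtype=np.float64).reshape(4, 4)  # grid[y, x] = 4 * y + x
print(interp2d(grid, 1.5, 2.25))  # 10.5
```

Because the example grid is a linear function of x and y, the interpolated value equals 4 × 2.25 + 1.5 exactly, which makes the result easy to check by hand.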

SSD: Single Shot MultiBox Detector

In the R-CNN family of models, region proposal and classification are done in two separate steps. SSD unifies them into a single step, making the model simpler and faster, which is why it is called single shot.

It differs from Faster R-CNN in two main ways:

For an anchor box, we no longer first determine whether it contains an object of interest and then classify the positive anchor boxes into actual object classes. Instead, SSD directly uses a single multi-class classifier to determine which class of object the anchor box corresponds to, or whether it is just background. We also no longer use an additional regressor to refine the bounding box further; a single regressor directly predicts the true bounding box.

SSD does not only make predictions on the features output by the convolutional neural network. It further shrinks the features with convolution and pooling layers and makes predictions on them as well. This achieves the effect of multi-scale prediction.
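The multi-scale idea in the second point can be illustrated with a small NumPy sketch that repeatedly shrinks a feature map and counts the anchor boxes predicted at each scale. This is only an illustration under assumptions: the feature size, the value of `k`, and the stopping size are hypothetical, and plain 2 × 2 max pooling stands in for the convolution and pooling layers a real SSD would use.

```python
import numpy as np

def max_pool_2x2(features):
    """Halve the height and width of a (channels, h, w) array by max pooling."""
    c, h, w = features.shape
    return features.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

features = np.random.rand(8, 32, 32)  # hypothetical base feature map
k = 4                                  # hypothetical anchor boxes per pixel
scales = []
total_anchors = 0
while features.shape[1] >= 4:
    _, h, w = features.shape
    scales.append((h, w))
    total_anchors += h * w * k         # predictions are made at every scale
    features = max_pool_2x2(features)
print(scales)         # [(32, 32), (16, 16), (8, 8), (4, 4)]
print(total_anchors)  # 5440
```

The same anchor shape covers a larger part of the original image on the smaller feature maps, so coarse scales pick up large objects and fine scales pick up small ones.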


The specific implementation of SSD will be detailed in the next chapter.

YOLO: You Only Look Once

Whether in Faster R-CNN or SSD, a large number of the generated anchor boxes still overlap one another, so many regions are still computed repeatedly. YOLO tries to address this problem. It cuts the image features evenly into S × S blocks, each of which serves as an anchor box. Each anchor box predicts B bounding boxes and which object it primarily contains.
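A small Python sketch of the grid layout follows. The function name `grid_cell`, the values of S and B, and the example object are hypothetical, and assigning an object to the block that contains its center is an assumption made here for illustration.

```python
def grid_cell(cx, cy, image_w, image_h, S):
    """Return the (row, col) of the S x S block containing the point (cx, cy)."""
    col = min(int(cx / image_w * S), S - 1)
    row = min(int(cy / image_h * S), S - 1)
    return row, col

S, B = 7, 2  # hypothetical grid size and bounding boxes per block
# A hypothetical object centered at (300, 120) in a 448 x 448 image.
row, col = grid_cell(300, 120, 448, 448, S)
print(row, col)   # 1 4
print(S * S * B)  # 98 bounding boxes predicted for the whole image
```

Since the blocks do not overlap, the number of predicted boxes is fixed at S × S × B regardless of image content.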


YOLO v2: Better, Faster, Stronger

YOLO v2 improves on YOLO in several respects, notably:

It uses a better convolutional neural network for feature extraction, and a larger 448 × 448 input image increases the feature output size to 13 × 13.

Instead of using uniformly cut anchor boxes, it clusters the ground-truth bounding boxes in the training data and uses the cluster centers as anchor boxes. The number of anchor boxes can be significantly reduced relative to SSD and Faster R-CNN.

It no longer uses YOLO's fully connected layer for prediction but uses convolution, as SSD does. For example, suppose anchor boxes of 5 sizes are used (5 clusters). Then object classification uses a 1 × 1 convolution with 5 × (1 + number of classes) channels, and bounding box regression uses one with 5 × 4 channels.
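The clustering of ground-truth boxes described above can be sketched as a k-means loop over (width, height) pairs. The text does not say which distance is used; this sketch assumes boxes are compared by IoU after aligning them at a common corner, and the function names `iou_wh` and `cluster_anchors` and the example boxes are hypothetical.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (width, height) pairs, with all boxes aligned at one corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def cluster_anchors(boxes, k, iters=20, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor shapes."""
    rng = np.random.default_rng(seed)
    # Start from k distinct ground-truth boxes chosen at random.
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the center it overlaps most.
        assign = iou_wh(boxes, centers).argmax(axis=1)
        for j in range(k):
            members = boxes[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

# Hypothetical ground-truth box sizes: a few small boxes and a few wide ones.
boxes = np.array([[10, 12], [11, 13], [12, 11],
                  [50, 30], [52, 28], [48, 31]], dtype=float)
anchors = cluster_anchors(boxes, k=2)
print(anchors)
```

Each row of the result is one anchor shape. Using overlap rather than Euclidean distance keeps large boxes from dominating the clustering.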


We have described several object detection algorithms based on convolutional neural networks. What they have in common is that they first propose anchor boxes, then use a convolutional neural network to extract features, and finally predict the main object each box contains along with a more accurate bounding box. They differ in how anchor boxes are chosen and how predictions are made, which leads to different trade-offs between computational cost and accuracy.
