Why do convolutional neural networks dominate the field of computer vision?

Source: Turing Artificial Intelligence

Abstract: Convolutional neural networks have achieved the best current results in machine vision and many other problems, and their success prompts us to ask: why are convolutional neural networks so effective? This article analyzes the mystery behind convolutional neural networks.

**Origin of the idea**

Among the various deep neural network architectures, the convolutional neural network is the most widely used. It was proposed by LeCun in 1989 [1] and was successfully applied to handwritten character image recognition in its early days [1][2][3]. The deeper AlexNet network [4] succeeded in 2012, and since then convolutional neural networks have flourished, being used in a wide range of fields and achieving the best current performance on many problems.

Convolutional neural networks automatically learn the features of an image at each level through convolution and pooling operations, which is in line with our common sense for understanding images. Humans perceive images in layers of abstraction, first understanding color and brightness, then local detail features such as edges, corner points, and straight lines, followed by more complex information and structures such as textures and geometric shapes, and finally forming the concept of the whole object.

Research in visual neuroscience on the mechanisms of vision supports this view: the visual cortex of the animal brain has a hierarchical structure. The eye projects what it sees onto the retina, which converts optical signals into electrical signals that are transmitted to the visual cortex, the part of the brain responsible for processing visual signals. In 1959, Hubel and Wiesel conducted an experiment [5] in which they inserted electrodes into the primary visual cortex of a cat's brain, displayed bands of light of various shapes, spatial positions, and angles in front of the cat's eyes, and measured the electrical signals emitted by the cat's brain neurons. They found that the electrical signal was strongest when the light band was at a certain position and angle, and that different neurons preferred different spatial positions and orientations. This work later earned them a Nobel Prize.

It has now been shown that the visual cortex has a hierarchical structure. Signals from the retina first reach the primary visual cortex, or V1. Simple neurons in V1 are sensitive to fine, direction-specific image signals. After processing in V1, the signal is conducted to V2, which represents edge and contour information as simple shapes; these are then processed by neurons in V4, which are sensitive to color information. Complex objects are ultimately represented in the IT cortex (inferior temporal cortex).

visual cortex structure

Convolutional neural networks can be seen as a simple imitation of this mechanism. A network consists of multiple convolutional layers, and each convolutional layer contains multiple convolution kernels. These kernels scan the entire image from left to right and top to bottom, and their output data are called feature maps. The early convolutional layers capture local, detailed image information and have small receptive fields; that is, each pixel of the output uses only a very small region of the input image. The receptive fields of later convolutional layers grow layer by layer, capturing more complex and more abstract information. After several convolutional layers, the network finally obtains abstract representations of the image at different scales.

**The convolution operation**

Convolution of one-dimensional signals is a classical tool in digital signal processing, and convolution is also a common operation in image processing. It is used for image denoising, enhancement, edge detection, and similar problems, and it can also extract features from an image. The convolution operation slides a matrix called the convolution kernel over the image from top to bottom and left to right; at each position, the elements of the kernel matrix are multiplied by the image elements at the corresponding positions they cover, and the products are summed to produce the output pixel value. Take the Sobel edge detection operator as an example; its (vertical-edge) convolution kernel matrix is:

$$\begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}$$

Suppose the $3\times 3$ subimage of the input image centered at $(x, y)$ is

$$\begin{pmatrix} f(x-1,y-1) & f(x,y-1) & f(x+1,y-1) \\ f(x-1,y) & f(x,y) & f(x+1,y) \\ f(x-1,y+1) & f(x,y+1) & f(x+1,y+1) \end{pmatrix}$$

The convolution result at that point is then computed as

$$g(x,y) = \sum_{i=-1}^{1} \sum_{j=-1}^{1} k(i,j)\, f(x+i, y+j)$$

where $k$ is the convolution kernel matrix, indexed from $-1$ to $1$.

That is, the subimage centered at $(x, y)$ is multiplied element-wise by the convolution kernel and the products are summed. By applying the kernel at every position of the input image, we obtain the edge map of the image. The edge map has large values at edge locations and values close to zero at non-edge locations. The following figures show the result of convolving an image with the Sobel operator, with the input image on the left and the convolved result on the right.

Convolution results for the Sobel operator

As the figure shows, convolution brings out the edge information of the image. Besides the Sobel operator, the Roberts and Prewitt operators are also commonly used; they implement convolution in the same way but with different kernel matrices. With other kernels, more general image features can be extracted. In classical image processing, the values of these kernel matrices are designed by hand. Machine learning lets us generate such kernels automatically to describe many different types of features, and convolutional neural networks obtain their many useful convolution kernels through exactly this kind of automatic learning.
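The sliding multiply-and-sum described above takes only a few lines of NumPy. The sketch below uses a hypothetical `conv2d` helper (valid cross-correlation, which is what CNN literature calls convolution) and a synthetic two-tone image, so the vertical-edge Sobel kernel responds strongly at the brightness boundary:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation, as used in CNNs (no kernel flip)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# Vertical-edge Sobel kernel from the text
sobel = np.array([[-1., 0., 1.],
                  [-2., 0., 2.],
                  [-1., 0., 1.]])

# Synthetic image: dark left half, bright right half -> one vertical edge
img = np.zeros((5, 5))
img[:, 3:] = 1.0

edges = conv2d(img, sobel)
print(edges)  # zero in flat regions, strong response at the boundary
```

The output is large only where the window straddles the dark/bright boundary, matching the description of the edge map above.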

**The convolutional layer**

The convolutional layer is the core of a convolutional neural network. Let us understand the convolution operation through a practical example. Suppose the image to be convolved is:

The convolution kernel is:

First, multiply the subimage at the first position of the image, i.e., the top-left corner, element-wise with the convolution kernel and sum the products, where the subimage is:

The convolution result is:

Next, slide one column to the right on the image to be convolved; the subimage at the second position is:

Convolving it with the kernel gives 154. Sliding one more column to the right and convolving the subimage at the third position with the kernel gives 166. After the first row is processed, slide down one row and repeat the process. Continuing in this way, the final convolution result image is obtained as

After the convolution operation, the image becomes smaller. We can also pad the image first, e.g., by surrounding it with zeros, and then convolve the enlarged image, which ensures that the convolved result is the same size as the original image. Also, in the examples above the sliding step (stride) is 1 both horizontally and vertically; other strides can be used as well.
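The size rules above can be checked with a short sketch. The `conv2d` helper here is hypothetical, following the common convention that the output size is (input + 2·padding − kernel) / stride + 1:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Cross-correlation with zero padding and a configurable stride."""
    if padding > 0:
        image = np.pad(image, padding)  # surround the image with zeros
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = image[y*stride:y*stride+kh, x*stride:x*stride+kw]
            out[y, x] = np.sum(patch * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3))
print(conv2d(img, k).shape)                       # no padding: (3, 3)
print(conv2d(img, k, padding=1).shape)            # "same" padding: (5, 5)
print(conv2d(img, k, stride=2, padding=1).shape)  # stride 2: (3, 3)
```

With padding of 1 a 3x3 kernel keeps the 5x5 input size unchanged, as described above; a stride of 2 halves each output dimension (rounded up).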

Convolution is clearly a linear operation, while neural networks must fit nonlinear functions, so as with fully connected networks we need to add an activation function; commonly used ones are the sigmoid, tanh, and ReLU functions. Why activation functions are needed, and what kinds of functions can serve as activation functions, will be explained in subsequent articles.

So far we have described the convolution of a single-channel image, with a two-dimensional array as input. In practice we often encounter multi-channel images, such as RGB color images with three channels; in addition, since each layer can have multiple convolution kernels, the output is also a multi-channel feature image, and the corresponding convolution kernels are multi-channel as well. In this case, each channel of the input image is convolved with the corresponding channel of the kernel, and the pixel values at corresponding positions are then summed across channels.

Since each layer may have multiple convolution kernels, the convolution operation outputs multiple feature images. The number of channels of each convolution kernel in layer L must therefore match the number of channels of its input feature image, i.e., it equals the number of convolution kernels in layer L-1.

A simple example is shown in the following figure.

multichannel convolution

In the figure above, the input image to the convolutional layer has 3 channels (column 1 in the figure). Correspondingly, each convolution kernel also has 3 channels. In the convolution operation, each channel of the image is convolved with the corresponding channel of the kernel, and the values at the same location across channels are summed to produce a single-channel image. The figure contains 4 convolution kernels; each produces one single-channel output image, so the 4 kernels together produce a 4-channel output image.
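The multi-channel procedure just described can be sketched directly. The helper below is hypothetical and uses the common (channels, height, width) layout; note the assertion enforcing the channel-matching rule from the text:

```python
import numpy as np

def conv2d_multichannel(image, kernels):
    """image: (C_in, H, W); kernels: (C_out, C_in, kh, kw).
    Each output channel sums the per-channel convolutions."""
    c_out, c_in, kh, kw = kernels.shape
    # Kernel channel count must equal the input channel count (rule above)
    assert image.shape[0] == c_in
    oh = image.shape[1] - kh + 1
    ow = image.shape[2] - kw + 1
    out = np.zeros((c_out, oh, ow))
    for o in range(c_out):          # one output channel per kernel
        for c in range(c_in):       # accumulate over input channels
            for y in range(oh):
                for x in range(ow):
                    out[o, y, x] += np.sum(
                        image[c, y:y+kh, x:x+kw] * kernels[o, c])
    return out

img = np.random.rand(3, 6, 6)         # 3-channel input, e.g. an RGB patch
kernels = np.random.rand(4, 3, 3, 3)  # 4 kernels, each with 3 channels
out = conv2d_multichannel(img, kernels)
print(out.shape)  # (4, 4, 4): 4 output channels, one per kernel
```

As in the figure, 4 three-channel kernels applied to a 3-channel input yield a 4-channel feature image.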

**The pooling layer**

With the convolution operation, we have completed dimensionality reduction and feature extraction on the input image, but the dimensionality of the feature image is still very high. High dimensionality is not only time-consuming to compute but also prone to overfitting. For this purpose the technique of downsampling, also known as pooling, was introduced.

Pooling replaces a region of the image with a single value, such as its maximum or average. Using the maximum gives max pooling; using the mean gives mean pooling. Besides reducing the image size, another benefit of downsampling is a degree of translation and rotation invariance, since the output value is computed from a region of the image and is not sensitive to small translations and rotations.

Let us understand the downsampling operation through a practical example. The input image is:

Non-overlapping 2x2 max pooling is performed here, and the resulting image is

The first element, 11, of the resulting image is the maximum of the 2x2 subimage in the upper-left corner of the original image. The second element, 9, is the maximum of the second 2x2 subimage, and so on for the rest. If mean downsampling is used instead, the result is

Concretely, the pooling layer partitions the feature image produced by the convolution operation into disjoint blocks and computes the maximum or average value within each block to obtain the pooled image.

Both mean pooling and max pooling accomplish the downsampling operation; the former is a linear function while the latter is nonlinear, and in general max pooling gives better results.
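Both variants can be written compactly with a reshape. The input values below are made up (the original example's matrices are not shown here), but chosen so the first two block maxima are 11 and 9, as in the text:

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Non-overlapping pooling over size x size blocks."""
    h, w = image.shape
    # Crop to a multiple of the block size, then split into blocks
    blocks = image[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))  # mean pooling

img = np.array([[1., 11., 2., 9.],
                [5.,  3., 4., 6.],
                [7.,  0., 8., 2.],
                [1.,  4., 3., 5.]])
print(pool2d(img, 2, "max"))   # [[11. 9.] [7. 8.]]
print(pool2d(img, 2, "mean"))  # [[5. 5.25] [3. 4.5]]
```

Each 2x2 block collapses to one value, so a 4x4 image becomes 2x2: a fourfold reduction in the number of values.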

**Network structure**

A typical convolutional neural network consists of convolutional layers, pooling layers, and fully connected layers. We illustrate with the LeNet-5 network; the following diagram shows its structure:

The input to the network is a grayscale image. The network consists of 3 convolutional layers, 2 pooling layers, and 1 fully connected layer; each of the first two convolutional layers is followed by a pooling layer. The output layer has 10 neurons, representing the 10 digits 0-9.
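A quick shape walk-through makes the layer sequence concrete. This sketch assumes the standard LeNet-5 configuration (32x32 input, 5x5 kernels, 2x2 pooling), which the diagram above does not spell out:

```python
def conv_out(size, k, stride=1, pad=0):
    """Spatial output size of a convolution (common convention)."""
    return (size + 2 * pad - k) // stride + 1

# Assumed classic LeNet-5 dimensions: 32x32 grayscale input,
# 5x5 convolution kernels, non-overlapping 2x2 pooling.
size = 32
size = conv_out(size, 5)   # C1 convolution: 28
size = size // 2           # S2 pooling:     14
size = conv_out(size, 5)   # C3 convolution: 10
size = size // 2           # S4 pooling:      5
size = conv_out(size, 5)   # C5 convolution:  1
print(size)  # 1 -> each C5 kernel yields a single value, feeding the
             #      fully connected part that ends in the 10 outputs
```

By the third convolutional layer the spatial extent has shrunk to 1x1, so the remaining layers operate on plain vectors.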

**Applications**

Machine vision was the first area where deep learning techniques made breakthroughs, and it remains the area with the widest range of applications. After AlexNet appeared, convolutional neural networks were quickly applied to all kinds of machine vision tasks, including general object detection, pedestrian detection, face detection, face recognition, semantic segmentation of images, edge detection, object tracking, and video classification, with success in all of them.

Most problems in natural language processing are sequence problems, at which recurrent neural networks excel. For some problems, however, convolutional networks can also model the data and give good results; typical examples are text classification and machine translation.

Beyond this, convolutional neural networks also find applications in other areas such as speech recognition and computer graphics.

**Convolutional layer visualization**

Convolutional networks were designed so that the convolutional and pooling layers extract image features level by level, at increasing levels of abstraction. A natural question is: is this really what happens in practice?

First, look at the result after the image is convolved. Here is an image of a truck.

Truck images

After processing by the AlexNet network, the output of the first convolutional layer (with the results of the individual convolution kernels arranged in order) looks like this:

Results of convolutional layer 1

Some extracted edge information can be seen here. The output of the second convolutional layer looks like this:

Results of convolutional layer 2

It extracts features from larger regions. The last few convolutional layers give results like this:

Results of convolutional layers 3-5

The results of convolutional layers 3-5 are arranged in order in the figure above. Next we look at the fully connected layers; the following figure shows the outputs of the 3 fully connected layers, from top to bottom.

Results for the fully connected layer

Let us now look at visualizations of the convolution kernels themselves. The kernels of the first convolutional layer are shown below.

Convolution kernels of layer 1

It can be seen that these kernels are indeed extracting edge and direction information. Looking at the kernels of the second convolutional layer:

Convolution kernels of layer 2

These look cluttered and do not reveal much information. Is there a better way? The answer is yes, and a number of papers have addressed the problem of convolutional layer visualization. Here, we present a typical approach: visualizing the effect of convolution kernels through the deconvolution operation.

The literature [6] devised a scheme to visualize the convolutional layers with a deconvolution operation. The feature images learned by the convolutional network are left-multiplied by the transpose of the convolution kernel matrices that produced them, projecting the features from feature-image space back to pixel space in order to discover which pixels activate particular feature images, for the purpose of analyzing and understanding the convolutional network. This operation is called deconvolution, or transposed convolution.

For a convolutional layer, the deconvolution operation convolves the feature image with the transpose of the kernel used during forward propagation, mapping the feature image back to the original pixel space to obtain a reconstructed image. The kernel visualizations obtained by deconvolution are shown in the following figure.

Visualization by deconvolution

The figure shows that the features extracted by the earlier layers are relatively simple: colors and edges. The further back a convolutional layer is, the more complex the features it extracts, such as complex geometric shapes. This is in line with the original design intent of convolutional neural networks: to accomplish layer-by-layer feature extraction and abstraction of images through multiple convolutional layers.

Another way to analyze the mechanism of a convolutional network is to reconstruct the original input image directly from the convolution results; if the original input can be reconstructed, the network retains the image's information to a large extent. The literature [7] devised a method to observe the expressive power of a convolutional network by inverting its representation, i.e., approximately reconstructing the original input image from the vector the network encodes it into. Given a vector encoded by the network, one finds the image that, once encoded by the network, best matches the given vector; this is achieved by solving an optimization problem. The following figure shows images reconstructed from the convolution outputs.

Convolutional Image Reconstruction

The top row shows the original images and the bottom row the reconstructed images. From this result, it can be seen that the convolutional neural network does extract useful information from the image.

**Theoretical analysis**

Theoretical explanation and analysis of convolutional neural networks comes from two directions. The first is mathematical: analyzing the representational capability and mapping properties of the network. The second is the study of the relationship between convolutional networks and the animal visual system; analyzing this relationship helps us understand and design better methods, and also contributes to progress in neuroscience.

**Mathematical properties**

Neural networks represent the connectionist idea in artificial intelligence, a bionic approach seen as a simulation of the nervous system of an animal brain. In implementation, however, they differ from the structure of the brain. Mathematically, a multilayer neural network is essentially a composite function.

Since a neural network is by nature a composite function, this leads us to ponder a question: how powerful is this function's modeling capability? That is, what kinds of target functions can it model? It has been shown that a neural network with three layers, i.e., containing one hidden layer, can approximate any continuous mapping from an input vector to an output vector, provided the activation function is chosen properly and the number of neurons is sufficiently large [8][9][10]. This conclusion is known as the universal approximation theorem.

The literature [10] demonstrates the case of the sigmoid activation function. The literature [8] shows that the universal approximation property does not depend on the specific activation function of the neural network, but is guaranteed by the structure of the network.

The universal approximation theorem can be stated as follows. Let $\varphi$ be a non-constant, bounded, monotonically increasing continuous function, let $I_m$ denote the $m$-dimensional unit cube $[0,1]^m$, and let $C(I_m)$ be the space of continuous functions on $I_m$. Then for any function $f \in C(I_m)$ and any $\varepsilon > 0$, there exist an integer $N$, real numbers $v_i$, $b_i$, and real vectors $w_i \in \mathbb{R}^m$, with which we construct the function

$$F(x) = \sum_{i=1}^{N} v_i \, \varphi(w_i^{T} x + b_i)$$

as an approximation to the function $f$, such that for arbitrary $x \in I_m$ it fulfills

$$|F(x) - f(x)| < \varepsilon$$

An intuitive interpretation of the universal approximation theorem is that a function such as the one above can be constructed to approximate any continuous function defined in the unit cube space to any specified accuracy. This conclusion is similar to polynomial approximation, which uses polynomial functions to approximate any continuous function to any accuracy. The significance of this theorem is that it theoretically guarantees the fitting ability of the neural network.
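The theorem is non-constructive, but its flavor can be demonstrated numerically. The sketch below is an illustration under simplifying assumptions, not a training procedure: it fixes the hidden-layer weights of a one-hidden-layer sigmoid network at random and solves only for the output weights $v_i$ by least squares, approximating $\sin(2\pi x)$ on $[0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer network F(x) = sum_i v_i * sigmoid(w_i * x + b_i).
# Hidden weights w_i, b_i are fixed at random; output weights v_i are
# obtained by least squares on sample points of the target function.
N = 200                      # number of hidden neurons
w = rng.normal(0, 20, N)
b = rng.normal(0, 20, N)

x = np.linspace(0, 1, 500)
f = np.sin(2 * np.pi * x)    # target continuous function on [0, 1]

hidden = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))  # sigmoid features
v, *_ = np.linalg.lstsq(hidden, f, rcond=None)
approx = hidden @ v

err = np.max(np.abs(approx - f))
print(err)  # small maximum approximation error on the sample points
```

With enough hidden neurons the error can be driven as small as desired, which is exactly the claim of the theorem (the theorem guarantees existence of suitable weights; it says nothing about how to find them).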

But this is only a theoretical result. In a concrete implementation, how many layers does a neural network need, and how many neurons per layer? These questions can only be settled through experimentation and experience. Another problem is training data: fitting a complex function requires a large number of training samples and faces the risk of overfitting. These engineering details are also crucial. Convolutional networks have been around since 1989, so why did they not succeed until 2012? Several factors explain this.

1. Limited training data. Early training sets were very small, not collected at a scale sufficient to train a complex convolutional network.

2. Limited computing power. Computers in the 1990s were too weak, with no high-performance computing technology like GPUs, making it unrealistic to train a complex neural network.

3. Problems with the algorithm itself. Neural networks long suffered from the vanishing gradient problem: during backpropagation, each layer multiplies the gradient by the value of the derivative of the activation function, and if the absolute value of this derivative is less than 1, the gradient quickly decays to 0 after many such multiplications, making it impossible to update the earlier layers.
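Point 3 is easy to demonstrate numerically. The derivative of the sigmoid peaks at 0.25, so even in the best case the gradient shrinks geometrically with depth:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximum value 0.25, attained at x = 0

# Backpropagation multiplies the gradient by the activation derivative
# at every layer; with sigmoid the product shrinks geometrically.
grad = 1.0
for layer in range(20):
    grad *= sigmoid_grad(0.0)  # 0.25 per layer, the sigmoid's best case
print(grad)  # 0.25**20, about 9.1e-13: effectively zero after 20 layers
```

ReLU, whose derivative is exactly 1 on the active region, avoids this geometric decay, which is one reason AlexNet adopted it.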

AlexNet was larger, and in particular deeper, than previous networks. It used ReLU as the activation function, abandoning the sigmoid and tanh functions, which somewhat mitigates the vanishing gradient problem. The addition of the Dropout mechanism also mitigates overfitting. These technical improvements, together with a large sample set like ImageNet and the computing power of GPUs, guaranteed its success. Later studies showed that increasing the number of layers and parameters of a network can significantly increase its accuracy. These questions will be covered in detail in later articles.

Convolutional neural networks are essentially weight-sharing fully connected neural networks, so the universal approximation theorem applies to them. But the convolutional and pooling layers of a convolutional network have their own properties. The literature [11] explains deep convolutional networks from a mathematical point of view: the authors view the convolutional network as scattering the data through a cascade of linear weighting filters and nonlinear functions, and explain the modeling capability of deep convolutional networks by analyzing the contraction and separation properties of this set of functions, along with properties such as translation invariance. The convolution operation of a convolutional neural network divides into two steps: a linear transformation followed by an activation function. The former can be seen as a linear projection of the data into a lower-dimensional space; the latter is a contracting nonlinear transformation of the data. The authors analyze the separation and contraction properties of each of these transformations.

**Relationship to the visual nervous system**

The relationship between convolutional networks and the human brain's visual system has important implications for the interpretation and design of convolutional networks, and it divides into two aspects. The first question is whether deep convolutional neural networks can achieve performance similar to that of the brain's visual system, which is a comparison of the capabilities of the two. The second question is whether the two are structurally consistent, which is an analysis of the relationship between the two in terms of system structure.

On a deeper level, this issue is one that AI cannot avoid. A question many people ask is: must we understand the mechanics of how the brain works in order to achieve an AI equivalent to it? There are two views. The first is that we must figure out how the brain works before we can develop an AI that is functionally equivalent to it. The second is that even without figuring out how the brain works, we can develop an AI with comparable capabilities. The invention of the airplane is an example of the latter. For a long time, people tried to build airplanes by imitating the way birds fly, i.e., by flapping wings, and all failed. The propeller let us get planes flying by a different method, and the jet engines that came later even allowed us to break the speed of sound, far surpassing birds. The brain, in fact, may not be the only route to intelligence with the same capabilities.

The first question is analyzed in the literature [12]. The authors demonstrated that deep neural networks can perform on par with the visual IT cortex of primates. The visual system of the primate brain achieves high recognition performance despite sample variation, geometric transformation, and background change, thanks mainly to the representation in the inferior temporal cortex, or IT cortex. Models built as trained deep convolutional neural networks also achieve very high object recognition performance, although precisely comparing the two presents many difficulties.

The authors compared deep neural networks with the IT cortex using an extended kernel analysis technique, which measures the generalization error of a model as a function of the complexity of its representation. The analysis shows that the performance of deep neural networks on visual object recognition tasks rivals the representational capability of the brain's IT cortex.

Neural networks versus visual cortex capabilities

The correspondence between deep neural networks and the visual system has also been analyzed. The literature [13] used goal-driven deep learning models to understand the brain's sensory cortex: goal-driven hierarchical convolutional neural networks (HCNNs) model the output responses of individual cells and cell populations in higher visual cortex areas. This approach establishes a correspondence between deep neural networks and the brain's sensory cortex, which can help us understand the mechanisms of the visual cortex; from another point of view, it also finds the counterpart of deep neural networks in neuroscience. The following image shows the structure and function of neural networks and the visual cortex:

Structural comparison of neural networks and visual cortex

Research on the mechanisms and theory of how deep neural networks work is still incomplete, and brain science is still at a relatively early stage. We believe that with continued effort, humans will come to understand the working mechanisms of the brain more clearly and will be able to design more powerful neural networks.

**References**

[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.

[2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In David Touretzky, editor, Advances in Neural Information Processing Systems 2 (NIPS*89), Denver, CO, 1990, Morgan Kaufman.

[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.

[4] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), 2012.

[5] Hubel D. H, T. N. Wiesel. Receptive Fields Of Single Neurones In The Cat's Striate Cortex. Journal of Physiology, (1959) 148, 574-591.

[6] Zeiler M. D., Fergus R. Visualizing and Understanding Convolutional Networks. European Conference on Computer Vision, 2014.

[7] Aravindh Mahendran, Andrea Vedaldi. Understanding Deep Image Representations by Inverting Them. CVPR 2015.

[8] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991.

[9] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366, 1989.

[10] Cybenko, G. Approximation by superpositions of a sigmoid function. Mathematics of Control, Signals, and Systems, 2, 303-314, 1989.

[11] Stephane Mallat. Understanding deep convolutional networks. 2016, Philosophical Transactions of the Royal Society A.

[12] Charles F Cadieu, Ha Hong, Daniel Yamins, Nicolas Pinto, Diego Ardila, Ethan A Solomon, Najib J. Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. 2014, PLOS Computational Biology.

[13] Daniel Yamins, James J Dicarlo. Using goal-driven deep learning models to understand sensory cortex. 2016, Nature Neuroscience.