Five very powerful CNN architectures

This article is a technical blog post compiled by AI Research. Original title: Five Powerful CNN Architectures

Author | Faisal Shahbaz

Translation | Little Brother, Jaruce, zackary, Disillusion

Proofreading | By Soyabanashi pineapple girl

Link to original article: https://medium.com/@faisalshahbaz/five-powerful-cnn-architectures-b939c9ddd57b

Let's take a look at some of the powerful convolutional neural networks that laid the foundation for today's deep learning achievements in computer vision.

LeNet-5, a 7-layer convolutional neural network, was deployed by many banks to recognize handwritten digits on checks.

*Gradient-based learning applied to document recognition*

The handwritten digits are digitized into images of size 32*32. Given the computing power available at the time, the technique could not be applied to large images.

Let's look at the structure of this model. Not counting the input layer, it has seven layers. Since the network is so small, we can examine it layer by layer.

- Layer 1: Convolutional layer with 6 kernels of size 5*5 and stride 1*1. Given a 32*32*1 input, the output size is 28*28*6. The number of parameters in this layer is 5*5*6+6 = 156 (the +6 is the bias terms).
- Layer 2: Pooling layer with 6 kernels of size 2*2 and stride 2*2. This pooling layer differs slightly from the ones we have seen before: it sums the input values in each window, multiplies the sum by a trained coefficient (one per kernel), and adds a trained bias term (again, one per kernel). The result is then passed through a sigmoid activation function to obtain the output. The 28*28*6 input from the previous layer is thus subsampled to 14*14*6. The number of parameters in this layer is [1 (trained coefficient) + 1 (trained bias term)] × 6 = 12.
- Layer 3: A convolutional layer configured like the first one, except that it has 16 kernels instead of 6, so the 14*14*6 input from the previous layer produces a 10*10*16 output. If every kernel were connected to all 6 input maps, this layer would have 5*5*6*16+16 = 2416 parameters; the original paper connects each kernel to only a subset of the input maps, which gives 1516.
- Layer 4: Again similar to layer 2, this time with 16 kernels in the pooling layer. Keep in mind that the output also goes through the sigmoid activation function. The 10*10*16 input from the previous layer is subsampled to 5*5*16. The number of parameters is (1+1)*16 = 32.
- Layer 5: This convolutional layer uses 120 kernels of size 5*5. Since the input size happens to be 5*5*16, each kernel covers the whole input, so the output size is 1*1*120 regardless of the stride. Each kernel spans all 16 input maps, so there are 5*5*16*120+120 = 48120 parameters in this layer.
- Layer 6: This is a fully connected layer with 84 units, so the 120 input units are mapped to 84 units. There are 84*120+84 = 10164 parameters. An activation function is applied here as well; you can substitute any activation function you like, as long as it suits the problem.
- Output layer: The final layer is a fully connected layer with 10 units and 84*10+10 = 850 parameters.
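The layer-by-layer arithmetic above can be checked with a short script. This is only a sketch: the helper names are made up for illustration, and `conv_params` assumes every kernel is connected to every input channel, which is not how the original LeNet-5 wires its third layer.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per kernel, assuming full connectivity."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


# Walk a 32x32 input through the five conv/pool layers (kernel, stride).
size = 32
sizes = [size]
for kernel, stride in [(5, 1), (2, 2), (5, 1), (2, 2), (5, 1)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)                  # [32, 28, 14, 10, 5, 1]
print(conv_params(1, 6, 5))   # 156
print(dense_params(120, 84))  # 10164
print(dense_params(84, 10))   # 850
```

The same helpers work for any stack of unpadded convolutions and pooling layers, so they are handy for sanity-checking a layer table before building the model.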

I recommend using the softmax activation function together with the cross-entropy loss in the final layer; I will not go into the details of the loss function or the reasons for using it here. Experiment with different training schedules and learning rates when you train the model.
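For readers who have not met the softmax and cross-entropy pair before, here is a minimal pure-Python sketch of both. The scores are made-up values for three hypothetical classes, and a real framework computes the same quantities in batches.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


scores = [2.0, 1.0, 0.1]  # hypothetical network outputs for 3 classes
probs = softmax(scores)
loss = cross_entropy(probs, true_class=0)
print(probs, loss)
```

The loss shrinks towards zero as the probability of the correct class approaches 1, which is why the pair works well for classification.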

```python
from keras import layers
from keras.models import Model


def lenet_5(in_shape=(32, 32, 1), n_classes=10, opt='sgd'):
    # A modernized LeNet-style network: it uses more filters, ReLU and
    # max pooling rather than the exact layer sizes described above.
    in_layer = layers.Input(in_shape)
    conv1 = layers.Conv2D(filters=20, kernel_size=5,
                          padding='same', activation='relu')(in_layer)
    pool1 = layers.MaxPool2D()(conv1)
    conv2 = layers.Conv2D(filters=50, kernel_size=5,
                          padding='same', activation='relu')(pool1)
    pool2 = layers.MaxPool2D()(conv2)
    flatten = layers.Flatten()(pool2)
    dense1 = layers.Dense(500, activation='relu')(flatten)
    preds = layers.Dense(n_classes, activation='softmax')(dense1)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model


if __name__ == '__main__':
    model = lenet_5()
    model.summary()  # summary() prints the table itself
```

In 2012, AlexNet, the deep neural network from Hinton's group, competed in ImageNet, the world's most important computer vision challenge, and reduced the top-5 error rate from 26% to 15.3%, a result that wowed the world.

This network closely resembles LeNet but is much deeper, with about sixty million parameters.

*ImageNet Classification with Deep Convolutional Neural Networks*

The diagram does look a bit scary. This is because the network is split into two halves, each trained on a separate GPU. Let's make it easier to follow with a condensed version of the diagram.

The architecture consists of 5 convolutional layers and 3 fully connected layers. These eight layers were combined with two concepts that were new at the time, max pooling and ReLU activation, which gave the model an edge.
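To make the two building blocks concrete, here is a small pure-Python sketch of ReLU and max pooling on a single 4×4 feature map. The feature map values are made up, and the functions are illustrative only; in practice the framework's layers do this on whole tensors.

```python
def relu(x):
    """ReLU keeps positive values and clips negatives to zero."""
    return max(0.0, x)


def max_pool2d(image, window=2, stride=2):
    """Max pooling over a 2D list of numbers (no padding)."""
    rows = (len(image) - window) // stride + 1
    cols = (len(image[0]) - window) // stride + 1
    return [
        [
            max(
                image[r * stride + i][c * stride + j]
                for i in range(window)
                for j in range(window)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, 1.0, 0.0],
    [0.0, 1.5, -1.0, 2.5],
    [-0.5, 3.0, 2.0, -4.0],
]
activated = [[relu(v) for v in row] for row in feature_map]
print(max_pool2d(activated))  # [[4.0, 1.0], [3.0, 2.5]]
```

Calling `max_pool2d(activated, window=3, stride=2)` gives the overlapping variant (window larger than stride) that the AlexNet listing uses via `MaxPool2D(3, 2)`.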

You can find the different layers and their corresponding configurations in the diagram above. Each layer is described in the following table.

Note: The ReLU activation function is applied to the output of every convolutional and fully connected layer except the final softmax layer.

The authors also use many other techniques (not all of which will be discussed in this post), such as dropout, data augmentation, and stochastic gradient descent with momentum.
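As one example of these techniques, stochastic gradient descent with momentum can be sketched in a few lines. This is a toy version under stated assumptions: the function name, the learning rate and the quadratic objective are chosen purely for illustration and are not the settings used to train AlexNet.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v - lr * g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Minimise f(w) = w^2 (gradient 2w) from a hypothetical start of w = 5.
weights, velocity = [5.0], [0.0]
for _ in range(200):
    grads = [2.0 * w for w in weights]
    weights, velocity = sgd_momentum_step(weights, grads, velocity)
print(weights)  # close to the minimum at 0
```

The velocity term accumulates past gradients, so consistent directions speed up while oscillating ones partly cancel.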

```python
from keras import layers
from keras.models import Model


def alexnet(in_shape=(227, 227, 3), n_classes=1000, opt='sgd'):
    in_layer = layers.Input(in_shape)

    # Five convolutional layers; max pooling follows the 1st, 2nd and 5th.
    conv1 = layers.Conv2D(96, 11, strides=4, activation='relu')(in_layer)
    pool1 = layers.MaxPool2D(3, 2)(conv1)
    conv2 = layers.Conv2D(256, 5, strides=1, padding='same',
                          activation='relu')(pool1)
    pool2 = layers.MaxPool2D(3, 2)(conv2)
    conv3 = layers.Conv2D(384, 3, strides=1, padding='same',
                          activation='relu')(pool2)
    conv4 = layers.Conv2D(384, 3, strides=1, padding='same',
                          activation='relu')(conv3)
    conv5 = layers.Conv2D(256, 3, strides=1, padding='same',
                          activation='relu')(conv4)
    pool3 = layers.MaxPool2D(3, 2)(conv5)

    # Three fully connected layers, with dropout after the first two.
    flattened = layers.Flatten()(pool3)
    dense1 = layers.Dense(4096, activation='relu')(flattened)
    drop1 = layers.Dropout(0.5)(dense1)
    dense2 = layers.Dense(4096, activation='relu')(drop1)
    drop2 = layers.Dropout(0.5)(dense2)
    preds = layers.Dense(n_classes, activation='softmax')(drop2)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model


if __name__ == '__main__':
    model = alexnet()
    model.summary()
```

VGGNet was the runner-up in the 2014 ImageNet challenge. Because its uniform architecture is so simple, many newcomers use it as a straightforward form of deep convolutional neural network.

In a later article, we will learn how this widely used network serves as a feature extractor (turning an image into a low-dimensional array that contains the important information about the image).

VGGNet follows two simple rules of thumb.

- Each convolutional layer is configured with kernel size = 3×3, stride = 1×1, padding = same. The only difference between layers is the number of filters.
- Each max pooling layer is configured with window size = 2×2 and stride = 2×2. Therefore, each pooling layer halves the image size.

The input is a 224*224 RGB image, so the input size is 224*224*3.
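The effect of the two rules on the feature map size can be traced with a few lines of Python. This is a sketch: the helper names are illustrative, and the filter counts per block are the ones used in the VGG listing in this article.

```python
def same_conv_size(size):
    """A 3x3, stride-1 convolution with 'same' padding keeps the size."""
    kernel, stride, padding = 3, 1, 1
    return (size + 2 * padding - kernel) // stride + 1


def pool_size(size):
    """A 2x2 max pool with stride 2 halves the size."""
    return (size - 2) // 2 + 1


size = 224
trace = []
for filters in [64, 128, 256, 512, 512]:  # filters per block
    size = pool_size(same_conv_size(size))
    trace.append((size, size, filters))
print(trace)
# [(112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512), (7, 7, 512)]
```

After five blocks the 224×224 input has shrunk to a 7×7×512 volume, which is what the first fully connected layer receives.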

The total number of parameters is about 138 million. Most of these parameters come from the fully connected layers.

- The first fully connected layer contains 4096 * (7 * 7 * 512) + 4096 = 102,764,544 parameters
- The second fully connected layer contains 4096 * 4096 + 4096 = 16,781,312 parameters
- The third fully connected layer contains 4096 * 1000 + 1000 = 4,097,000 parameters

The fully connected layers contain a total of 123,642,856 parameters.
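The fully connected counts can be reproduced with a one-line formula, counting one weight per input-output pair plus one bias per output unit. The helper name is illustrative.

```python
def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


fc1 = dense_params(7 * 7 * 512, 4096)
fc2 = dense_params(4096, 4096)
fc3 = dense_params(4096, 1000)
print(fc1, fc2, fc3, fc1 + fc2 + fc3)
# 102764544 16781312 4097000 123642856
```

Note that the bias count always equals the number of output units, so the last layer adds 1000 biases rather than 4096.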

```python
from functools import partial

from keras import layers
from keras.models import Model

# Every VGG convolution shares the same configuration.
conv3 = partial(layers.Conv2D, kernel_size=3, strides=1,
                padding='same', activation='relu')


def block(in_tensor, filters, n_convs):
    conv_block = in_tensor
    for _ in range(n_convs):
        conv_block = conv3(filters=filters)(conv_block)
    return conv_block


def _vgg(in_shape=(224, 224, 3), n_classes=1000, opt='sgd',
         n_stages_per_blocks=(2, 2, 3, 3, 3)):
    in_layer = layers.Input(in_shape)

    block1 = block(in_layer, 64, n_stages_per_blocks[0])
    pool1 = layers.MaxPool2D()(block1)
    block2 = block(pool1, 128, n_stages_per_blocks[1])
    pool2 = layers.MaxPool2D()(block2)
    block3 = block(pool2, 256, n_stages_per_blocks[2])
    pool3 = layers.MaxPool2D()(block3)
    block4 = block(pool3, 512, n_stages_per_blocks[3])
    pool4 = layers.MaxPool2D()(block4)
    block5 = block(pool4, 512, n_stages_per_blocks[4])
    pool5 = layers.MaxPool2D()(block5)

    # Flatten the 7x7x512 volume so the first dense layer matches
    # the parameter count given above.
    flattened = layers.Flatten()(pool5)
    dense1 = layers.Dense(4096, activation='relu')(flattened)
    dense2 = layers.Dense(4096, activation='relu')(dense1)
    preds = layers.Dense(n_classes, activation='softmax')(dense2)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model


def vgg16(in_shape=(224, 224, 3), n_classes=1000, opt='sgd'):
    return _vgg(in_shape, n_classes, opt)


def vgg19(in_shape=(224, 224, 3), n_classes=1000, opt='sgd'):
    return _vgg(in_shape, n_classes, opt, [2, 2, 4, 4, 4])


if __name__ == '__main__':
    model = vgg19()
    model.summary()
```

GoogLeNet uses the Inception module, a novel concept built from smaller convolutions that reduces the number of parameters to just 4 million.

*Inception module*

The reasons for using Inception modules:

- Each type of layer extracts different information from the input. A 3×3 layer gathers different information than a 5×5 layer. How do we know which transformation is best at a given layer? We use them all!
- Dimensionality reduction using 1×1 convolutions! Consider an input of 128×128×256. If we pass it through 20 filters of size 1×1, we get an output of 128×128×20. Inception blocks apply such 1×1 convolutions before the 3×3 and 5×5 convolutions to reduce the number of input channels those larger convolutions have to process.
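A rough multiplication count shows why the 1×1 reduction pays off. The sketch below reuses the 128×128×256 input and the 20 reduction filters from the example above; the 64 output filters of the 5×5 convolution are a hypothetical choice, and 'same' padding is assumed so the spatial size stays fixed.

```python
def conv_multiplications(height, width, in_channels, out_channels, kernel):
    """Multiplications for a stride-1, 'same'-padded convolution."""
    return height * width * kernel * kernel * in_channels * out_channels


H, W, C_IN = 128, 128, 256
REDUCED = 20   # 1x1 filters, as in the example above
C_OUT = 64     # hypothetical number of 5x5 filters

direct = conv_multiplications(H, W, C_IN, C_OUT, kernel=5)
reduced = (conv_multiplications(H, W, C_IN, REDUCED, kernel=1)
           + conv_multiplications(H, W, REDUCED, C_OUT, kernel=5))

print(direct)            # 6710886400
print(reduced)           # 608174080
print(direct / reduced)  # roughly 11x fewer multiplications
```

The saving grows with the number of input channels, which is why the reduction is applied before the more expensive 3×3 and 5×5 convolutions.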

GoogLeNet/Inception - Architecture

The complete Inception architecture.

*Going Deeper with Convolutions*

You may notice some "auxiliary classifiers" with softmax outputs in this architecture. According to the paper, adding auxiliary classifiers connected to these intermediate layers is expected to encourage discrimination in the lower stages of the classifier, increase the gradient signal that is propagated back, and provide additional regularization.

But what does that mean? In short:

- Discrimination in the lower stages: the lower layers of the network are trained with gradients that come from an earlier-stage output. This ensures that the network can already distinguish between different objects at an early stage.
- Increasing the back-propagated gradient signal: in deep neural networks, the back-propagated gradient often becomes so small that the first few layers of the network barely learn. The early classification layers help by propagating a strong gradient signal to those layers.
- Additional regularization: deep neural networks tend to overfit the data (high variance), while small neural networks tend to underfit the data (high bias). The early classifiers regularize the overfitting effect of the deeper layers.
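During training, the auxiliary losses are simply added to the main loss with a small weight. The sketch below assumes a weight of 0.3, the value reported in the GoogLeNet paper; the function name and the loss values are illustrative placeholders.

```python
def total_training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main classifier loss plus down-weighted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)


# Hypothetical loss values for one training batch.
main = 1.0
auxiliary = [2.0, 4.0]  # two auxiliary classifiers
print(total_training_loss(main, auxiliary))  # about 2.8
```

At inference time the auxiliary classifiers are discarded, so only the main output is used for predictions.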

Auxiliary classifier structure.

Note: Here, each label gives the number of filters in one part of the Inception module:

- #1×1: the 1×1 convolution.
- #3×3 reduce: the 1×1 convolution placed before the 3×3 convolution.
- #5×5 reduce: the 1×1 convolution placed before the 5×5 convolution.
- #3×3: the 3×3 convolution.
- #5×5: the 5×5 convolution.
- pool proj: the 1×1 convolution placed after the max pooling.

*GoogLeNet is a typical Inception architecture*

It uses batch normalization, image distortion, and RMSprop, which we will discuss in a future article.

The winning top-5 error rate in the 2015 ImageNet challenge was about 3.57%, which is lower than the human top-5 error rate. This was achieved by the ResNet (Residual Network) that Microsoft entered in the competition. The network introduced a new idea: "skip connections".

*Residual learning: a building block*

Residual networks address the following phenomenon: as we keep making a neural network deeper, its performance eventually gets worse. Intuitively, this should not happen. If a network of depth K achieves performance y, then a network of depth K+1 should perform at least as well as y, since the extra layer could simply learn the identity mapping.

This observation leads to the hypothesis that direct mappings are hard to learn. So, instead of learning the mapping from a block's input to its output, the network learns the difference between them: the residual.

For example, let x be the input and H(x) be the desired output. Instead of H(x), we learn F(x) = H(x) - x. We first use a few layers to learn F(x) and then add x to F(x) to obtain H(x). We then pass H(x) on to the next layer, as before. This is the residual block shown above.
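The idea fits in a few lines of Python. In this sketch the "layers" that compute F(x) are stand-in functions on plain lists, chosen only for illustration; a real block would use convolutions.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn computes F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """F(x) = 0: the block reduces to the identity mapping."""
    return [0.0 for _ in x]


def small_residual(x):
    """A hypothetical learned residual: F(x) = 0.1 * x."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # [1.0, -2.0, 3.0]
print(residual_block(x, small_residual))  # each value scaled by about 1.1
```

With F(x) = 0 the block passes its input through unchanged, so adding such a block cannot make the network worse than the shallower one.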

The results are stunning, because this alleviates the vanishing gradient problem that kept very deep networks from learning. A skip connection, or "shortcut", gives the gradient a direct path to the earlier layers of the network, bypassing the layers in between.
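A toy scalar model illustrates the effect on gradients. Assume each layer has a local derivative of 0.1 (a made-up value): in a plain chain the derivatives multiply, while in a residual chain each block contributes 1 + F'(x) because H(x) = F(x) + x.

```python
def plain_gradient(local_grads):
    """Gradient through stacked plain layers: product of derivatives."""
    grad = 1.0
    for d in local_grads:
        grad *= d
    return grad


def residual_gradient(local_grads):
    """Gradient through stacked residual blocks: product of (1 + F')."""
    grad = 1.0
    for d in local_grads:
        grad *= 1.0 + d
    return grad


local = [0.1] * 20  # 20 layers, hypothetical local derivative of 0.1
print(plain_gradient(local))     # around 1e-20: the signal vanishes
print(residual_gradient(local))  # around 6.7: the signal survives
```

This is a deliberate simplification of real networks, but it shows why the identity term keeps the gradient from shrinking layer after layer.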

Let's use it here.

The paper proposes bottleneck blocks for the deeper ResNets (50/101/152). Instead of the residual block described above, these networks use 1×1 convolutions to reduce and then restore the number of channels around the 3×3 convolution.
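A quick weight count shows what the bottleneck saves. The sketch compares two 3×3 convolutions at 256 channels with a 1×1, 3×3, 1×1 bottleneck that squeezes to 64 channels (a quarter of 256, the same ratio the listing below uses); biases are ignored for simplicity.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a convolution, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


channels = 256
squeezed = channels // 4  # 64 channels inside the bottleneck

basic_block = 2 * conv_weights(3, channels, channels)
bottleneck_block = (conv_weights(1, channels, squeezed)    # reduce
                    + conv_weights(3, squeezed, squeezed)  # 3x3 at low width
                    + conv_weights(1, squeezed, channels)) # restore

print(basic_block)       # 1179648
print(bottleneck_block)  # 69632
```

At this width the bottleneck needs roughly 17 times fewer weights per block, which is what makes networks with 50 or more layers affordable.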

```python
from keras import layers
from keras.models import Model


def _after_conv(in_tensor):
    # Batch normalization followed by ReLU, applied after each convolution.
    norm = layers.BatchNormalization()(in_tensor)
    return layers.Activation('relu')(norm)


def conv1(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=1, strides=1)(in_tensor)
    return _after_conv(conv)


def conv1_downsample(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=1, strides=2)(in_tensor)
    return _after_conv(conv)


def conv3(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=3, strides=1,
                         padding='same')(in_tensor)
    return _after_conv(conv)


def conv3_downsample(in_tensor, filters):
    conv = layers.Conv2D(filters, kernel_size=3, strides=2,
                         padding='same')(in_tensor)
    return _after_conv(conv)


def resnet_block_wo_bottlneck(in_tensor, filters, downsample=False):
    if downsample:
        conv1_rb = conv3_downsample(in_tensor, filters)
    else:
        conv1_rb = conv3(in_tensor, filters)
    conv2_rb = conv3(conv1_rb, filters)

    # Project the shortcut when the block changes the tensor shape.
    if downsample:
        in_tensor = conv1_downsample(in_tensor, filters)
    result = layers.Add()([conv2_rb, in_tensor])
    return layers.Activation('relu')(result)


def resnet_block_w_bottlneck(in_tensor, filters, downsample=False,
                             change_channels=False):
    # 1x1 reduce -> 3x3 -> 1x1 restore.
    if downsample:
        conv1_rb = conv1_downsample(in_tensor, int(filters/4))
    else:
        conv1_rb = conv1(in_tensor, int(filters/4))
    conv2_rb = conv3(conv1_rb, int(filters/4))
    conv3_rb = conv1(conv2_rb, filters)

    # Project the shortcut so both inputs to Add have the same shape.
    if downsample:
        in_tensor = conv1_downsample(in_tensor, filters)
    elif change_channels:
        in_tensor = conv1(in_tensor, filters)
    result = layers.Add()([conv3_rb, in_tensor])
    return result


def _pre_res_blocks(in_tensor):
    conv = layers.Conv2D(64, 7, strides=2, padding='same')(in_tensor)
    conv = _after_conv(conv)
    pool = layers.MaxPool2D(3, 2, padding='same')(conv)
    return pool


def _post_res_blocks(in_tensor, n_classes):
    pool = layers.GlobalAvgPool2D()(in_tensor)
    preds = layers.Dense(n_classes, activation='softmax')(pool)
    return preds


def convx_wo_bottleneck(in_tensor, filters, n_times, downsample_1=False):
    res = in_tensor
    for i in range(n_times):
        if i == 0:
            res = resnet_block_wo_bottlneck(res, filters, downsample_1)
        else:
            res = resnet_block_wo_bottlneck(res, filters)
    return res


def convx_w_bottleneck(in_tensor, filters, n_times, downsample_1=False):
    res = in_tensor
    for i in range(n_times):
        if i == 0:
            res = resnet_block_w_bottlneck(res, filters, downsample_1,
                                           not downsample_1)
        else:
            res = resnet_block_w_bottlneck(res, filters)
    return res


def _resnet(in_shape=(224, 224, 3), n_classes=1000, opt='sgd',
            convx=[64, 128, 256, 512], n_convx=[2, 2, 2, 2],
            convx_fn=convx_wo_bottleneck):
    in_layer = layers.Input(in_shape)

    downsampled = _pre_res_blocks(in_layer)

    conv2x = convx_fn(downsampled, convx[0], n_convx[0])
    conv3x = convx_fn(conv2x, convx[1], n_convx[1], True)
    conv4x = convx_fn(conv3x, convx[2], n_convx[2], True)
    conv5x = convx_fn(conv4x, convx[3], n_convx[3], True)

    preds = _post_res_blocks(conv5x, n_classes)

    model = Model(in_layer, preds)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
    return model


def resnet18(in_shape=(224, 224, 3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt)


def resnet34(in_shape=(224, 224, 3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt, n_convx=[3, 4, 6, 3])


def resnet50(in_shape=(224, 224, 3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt,
                   [256, 512, 1024, 2048], [3, 4, 6, 3], convx_w_bottleneck)


def resnet101(in_shape=(224, 224, 3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt,
                   [256, 512, 1024, 2048], [3, 4, 23, 3], convx_w_bottleneck)


def resnet152(in_shape=(224, 224, 3), n_classes=1000, opt='sgd'):
    return _resnet(in_shape, n_classes, opt,
                   [256, 512, 1024, 2048], [3, 8, 36, 3], convx_w_bottleneck)


if __name__ == '__main__':
    model = resnet50()
    model.summary()
```

- Gradient-Based Learning Applied to Document Recognition
- Object Recognition with Gradient-Based Learning
- ImageNet Classification with Deep Convolutional Neural Networks
- Very Deep Convolutional Networks for Large-Scale Image Recognition
- Going Deeper with Convolutions
- Deep Residual Learning for Image Recognition

Links and references for this article: https://ai.yanxishe.com/page/TextTranslation/1249