Simulated quantization training for convolutional neural networks: notes from practice

Deep learning is being used more and more widely on mobile devices, but mobile computing power still lags well behind servers. The difficulty of deploying deep learning models on mobile therefore lies in keeping the model effective while guaranteeing runtime efficiency.

During the experimental phase of model design, a large model can be used, because that phase mainly verifies that the method itself is effective. Once that is done and deployment to mobile begins, it is time to slim down the model structure: generally, either prune the trained large model, or redesign your own network modules by referring to existing lightweight networks such as MobileNetV2 and ShuffleNetV2. Besides pruning, the other algorithm-level optimization is quantization: approximating the floating-point (high-precision) weights and activation values with lower-precision integers. The advantages of low precision are that, compared with high-precision arithmetic, more data can be processed per unit of time, and quantizing the weights further reduces the model's storage footprint [1].

To quantize an already-trained network, TensorRT's post-training quantization algorithm [5][8] has been tried in practice and works quite well.

But if the quantization process can be simulated during training, letting the network learn to correct the errors caused by quantization, then the quantization parameters obtained should be more accurate, and the model's performance loss during actual quantized inference should be smaller.

The content of this post is to present papers [3][4] and reproduce some details of their procedure.

As usual, the code for the experiments in this post is given first: TrainQuantization.

Let's first look at the concrete definition of quantization. For quantizing activation values to signed 8-bit integers, the definition given in the paper is: x_q = clamp(round(x / Δ), -128, 127), where Δ is the quantization scaling factor and x is the floating-point activation value. First divide by the scaling factor, then round to the nearest integer, and finally clamp the result to an interval; for signed 8-bit quantization that interval is [-128, 127].

One small extra trick for the weights is to quantize them to [-127, 127] instead.

The paper states that this is done as an optimization for the implementation; a detailed explanation can be found in the ARM NEON subsection of Appendix B of paper [3].
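As a minimal NumPy sketch of the definition above (the function names are my own), quantization and dequantization might look like:

```python
import numpy as np

def quantize(x, scale, qmin=-128, qmax=127):
    """Quantize floating-point values to signed 8-bit integers:
    divide by the scaling factor, round to nearest, then clamp."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q.astype(np.int8)

def dequantize(q, scale):
    """Map the integers back to (approximate) floating point."""
    return q.astype(np.float32) * scale

# Activations use the full signed range [-128, 127]; weights would use
# the symmetric range [-127, 127] as the NEON trick requires.
acts = np.array([0.5, -1.2, 3.7], dtype=np.float32)
scale = np.abs(acts).max() / 128.0
q = quantize(acts, scale)        # 3.7 / scale = 128, clamped to 127
recovered = dequantize(q, scale)
```

Note how the value at the top of the range saturates at 127 and comes back with a small error after dequantization; that error is exactly what the training procedure below lets the network adapt to.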

Training-time quantization, frankly, simulates the quantization process in the forward pass: the weights and activation values are quantized to 8-bit and then dequantized back to 32-bit floating point, carrying the quantization error, so training still runs in floating point. In the backward pass, the gradient is computed with respect to the simulated-quantized weights, and that gradient is used to update the original pre-quantization weights. The process then continues with the next batch, and in this way the network can learn to correct the errors introduced by quantization.

The schematic above gives a good visual picture of simulated quantization. The upper line shows the pre-quantization floating-point range [rmin, rmax], and the lower line shows the quantized range [-128, 127]. To simulate the quantized forward pass, look at the 4th dot from the left on the upper line: dividing by the scaling factor maps it to a floating-point number between 124 and 125, nearest-neighbor rounding takes that to 125, and multiplying by the scaling factor maps it back to the 5th dot on the upper line. Finally, this value with quantization error simply replaces the original value in the forward pass. As a formula, the simulated-quantization forward pass is: x̂ = Δ · clamp(round(x / Δ), -128, 127).

The gradient in the backward pass is given by the straight-through estimator: the rounding is treated as the identity function, so the upstream gradient simply passes through the simulated quantization to the pre-quantization value.
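As a minimal NumPy sketch (the helper names are my own, and zeroing the gradient for values outside the representable range follows my reading of the papers), the forward fake quantization and its straight-through backward rule might look like:

```python
import numpy as np

def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Forward: quantize then dequantize, producing a float tensor
    that carries the quantization error."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

def fake_quantize_grad(x, upstream_grad, scale, qmin=-128, qmax=127):
    """Backward: straight-through estimator. round() is treated as the
    identity, so the upstream gradient passes through unchanged for
    values inside the representable range, and is zeroed outside it."""
    inside = (x / scale >= qmin) & (x / scale <= qmax)
    return upstream_grad * inside

x = np.array([0.1, -2.0, 50.0], dtype=np.float32)
scale = 0.05
y = fake_quantize(x, scale)                  # 50.0 saturates at 127 * 0.05
g = fake_quantize_grad(x, np.ones_like(x), scale)
```

In an autograd framework the same effect is usually obtained with the "detach" trick, y = x + stop_gradient(fake_quantize(x) - x), so the forward uses the quantized value while the gradient flows to x.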

For computing the scaling factor, the weights and the activation values are treated differently. For the weights, the scale is computed directly from the weights on every forward pass by taking the maximum of their absolute values: weight scale = max(abs(weight)) / 127. For the activation values it is slightly different: the quantization range of the activations is not simply the maximum of the current batch, but is instead tracked during training by an EMA (exponential moving average), with the update formula:

moving_max = moving_max * momentum + max(abs(activation)) * (1 - momentum)

Here activation is the activation value of each batch, and the paper says momentum should be a number close to 1; in my experiments I use 0.95. The activation scaling factor is then activation scale = moving_max / 128.
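A small sketch of the EMA tracking described above (the class and method names, and initializing moving_max to zero, are my own choices):

```python
import numpy as np

class ActivationRangeTracker:
    """Tracks the activation range with an exponential moving average
    and derives the signed-8-bit scaling factor from it."""

    def __init__(self, momentum=0.95):
        self.momentum = momentum
        self.moving_max = 0.0  # assumption: start the moving max at zero

    def update(self, activation):
        # Blend the current batch's max |activation| into the moving max.
        batch_max = float(np.abs(activation).max())
        self.moving_max = (self.moving_max * self.momentum
                           + batch_max * (1.0 - self.momentum))

    @property
    def scale(self):
        # activation scale = moving_max / 128 for signed 8-bit.
        return self.moving_max / 128.0

tracker = ActivationRangeTracker(momentum=0.95)
tracker.update(np.array([1.0, -4.0, 2.0]))   # batch max |a| = 4.0
tracker.update(np.array([3.0, -2.0, 0.5]))   # batch max |a| = 3.0
```

With momentum close to 1 the range reacts slowly to outlier batches, which is exactly why the EMA is preferred over the raw per-batch maximum.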

In my implementation I did not quantize to unsigned 8-bit as in the paper, but to signed 8-bit: first because unsigned 8-bit quantization requires introducing an extra zero point, which adds complexity, and second because in practice quantization is usually done to signed 8-bit anyway. The paper also mentions that the best results come from computing a scaling factor per channel for the weights while using a single scaling factor for the activation tensor as a whole. In practice I found that for some tasks the weights do fine without per-channel quantization; it still depends on the specific task. The experiment code given with this post does not quantize per channel.
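To illustrate the per-tensor versus per-channel difference (a sketch; the function names and the OIHW weight layout are my assumptions), the two kinds of weight scale can be computed like this:

```python
import numpy as np

def per_tensor_scale(weight):
    """One scaling factor for the whole weight tensor."""
    return np.abs(weight).max() / 127.0

def per_channel_scale(weight):
    """One scaling factor per output channel (axis 0 for OIHW
    convolution weights), computed over that channel's slice."""
    return np.abs(weight).reshape(weight.shape[0], -1).max(axis=1) / 127.0

w = np.array([[[[0.5]], [[-1.27]]],   # channel 0: max |w| = 1.27
              [[[0.01]], [[0.02]]]],  # channel 1: max |w| = 0.02
             dtype=np.float32)
pt = per_tensor_scale(w)
pc = per_channel_scale(w)
```

The point of per-channel scales is visible in this toy example: under the per-tensor scale, the small-magnitude channel 1 collapses onto very few integer levels, while its own per-channel scale preserves much finer resolution.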

Then, for networks with batchnorm after the convolutional layer: in the practical deployment phase, the batchnorm parameters are generally fused into the convolutional layer's parameters in advance as a speed optimization, so training with simulated quantization follows the same process: first fuse the batchnorm parameters into the convolution parameters, then quantize the fused parameters. The two figures below show the difference between how the batchnorm layer is handled during training and during practical deployment.

As for how to fuse the batchnorm parameters into the convolution parameters, start from the convolution y = W x + b, where W and b are the weight and bias of the convolutional layer and x and y are its input and output. Substituting y into the batchnorm formula y_bn = γ · (y − μ) / sqrt(σ² + ε) + β, one can derive the fused weight and bias W_merge and b_merge:

W_merge = γ · W / sqrt(σ² + ε)
b_merge = γ · (b − μ) / sqrt(σ² + ε) + β

In my experiments I actually simplified the batchnorm-fusion process; implementing it exactly as in the paper would have been considerably more complicated. The simulated-quantization experiments are also based on an already-trained network; not starting from a pre-trained model could run into pitfalls. Moreover, the batchnorm parameters are kept fixed during simulated-quantization training, and the fusion uses the already-trained moving mean and variance rather than the mean and variance of each batch.
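Under these simplifications, folding the frozen batchnorm statistics into the preceding convolution can be sketched as follows (the function name and the OIHW weight layout are my own choices):

```python
import numpy as np

def fold_batchnorm(conv_w, conv_b, gamma, beta,
                   moving_mean, moving_var, eps=1e-5):
    """Fold frozen batchnorm parameters into the preceding convolution,
    using the trained moving statistics rather than per-batch ones.
    conv_w has shape (out_channels, in_channels, kh, kw); the other
    arguments are per-output-channel vectors."""
    std = np.sqrt(moving_var + eps)
    # W_merge = gamma * W / std  (broadcast the per-channel factor)
    w_merge = conv_w * (gamma / std)[:, None, None, None]
    # b_merge = gamma * (b - mu) / std + beta
    b_merge = gamma * (conv_b - moving_mean) / std + beta
    return w_merge, b_merge

w_m, b_m = fold_batchnorm(conv_w=np.ones((1, 1, 1, 1)),
                          conv_b=np.array([1.0]),
                          gamma=np.array([2.0]),
                          beta=np.array([0.5]),
                          moving_mean=np.array([1.0]),
                          moving_var=np.array([4.0]))
```

The fused layer then computes W_merge x + b_merge directly, and it is this fused weight that gets fake-quantized during training, matching what inference will see after deployment-time fusion.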

The concrete training network structure is written following the paper's example diagram of a simulated-quantized convolutional layer.

I experimented with VGG on Cifar10 and the results were OK. Since the goal was only to verify the effectiveness of quantization training, I did not do much hyperparameter tuning or data augmentation when training on Cifar10, and the highest accuracy of the trained model was only 0.877, quite a bit below the best published result of 0.93. Simulated quantization based on this 0.877 model then produced a model with basically the same accuracy as normal training, possibly because this classification task is relatively simple. After obtaining the trained model with quantization factors for each layer, the real quantized inference process can be simulated; however, since MXNet's convolutional layer does not support integer operations, the simulation is likewise done in floating point. The implementation details can be seen in the sample code.
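A rough sketch of how quantized inference can be simulated in floating point (a dense layer stands in for the convolution here, and all names are my own):

```python
import numpy as np

def simulated_quantized_dense(x, w, act_scale, w_scale):
    """Simulate integer inference in floating point: quantize inputs
    and weights to int8 values (stored as floats), do the multiply-
    accumulate, then rescale the int32-like result back to floats."""
    xq = np.clip(np.round(x / act_scale), -128, 127)
    wq = np.clip(np.round(w / w_scale), -127, 127)
    acc = xq @ wq                       # int32 accumulation on a real device
    return acc * (act_scale * w_scale)  # dequantize the output

x = np.array([[1.0, -2.0]], dtype=np.float32)
w = np.array([[0.5], [0.25]], dtype=np.float32)
act_scale = np.abs(x).max() / 128.0
w_scale = np.abs(w).max() / 127.0
y = simulated_quantized_dense(x, w, act_scale, w_scale)
```

Because every intermediate value is an exact integer stored in a float, the result matches what true int8 hardware would produce, up to the final dequantization, which is the point of simulating inference this way.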

This post is a summary of some recent work practice, and many places in the implementation reflect my personal understanding of the papers. If any readers find something wrong or questionable, please point it out, so we can share and learn from each other :).

[1] 8-Bit Quantization and TensorFlow Lite: Speeding up mobile inference with low precision

[2] Building a quantization paradigm from first principles

[3] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

[4] Quantizing deep convolutional networks for efficient inference: A whitepaper

[5] 8-bit Inference with TensorRT

[6] TensorRT(5)-INT8 calibration principle

[7] caffe-int8-convert-tool.py