Machine Learning in Action (7): Logistic Regression

Hello, everyone! I'm the happy old boy from MPIG, **Yangpu district of Shanghai**. Today I will introduce you to logistic regression.

This will be an exciting chapter, as we will be exposed to optimization algorithms for the first time. If you think about it, we meet optimization problems all the time in daily life. How do you get from point A to point B in the shortest possible time? How do you put in the least amount of work but get the most out of it? How do you design an engine that consumes the least fuel while delivering the most power? As you can see, optimization is very powerful. Next, we present several optimization algorithms and use them to train a nonlinear function for classification.

Suppose we have some data points and we fit them with a straight line (called the best-fit line). This fitting process is called regression. The main idea of classification with logistic regression is to build a regression formula for the classification boundary from the available data, and to classify with it. The term "regression" here comes from best-fit, which means finding the best-fitting set of parameters; the mathematical analysis behind it will be described in the next section. Training the classifier means finding these best-fit parameters with an optimization algorithm.

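The function we want should accept all the available inputs and predict the category. For example, in the case of two classes, the function should output 0 or 1. We therefore introduce the Sigmoid function, σ(z) = 1 / (1 + e^(-z)), which maps any real number z to a value between 0 and 1. Values above 0.5 are assigned to class 1, and the rest to class 0.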

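The input z of the Sigmoid function is the weighted sum z = w0x0 + w1x1 + ... + wnxn, where X = (x0, ..., xn) is the classifier input and W = (w0, ..., wn) are the optimal coefficients we are looking for.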
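To make the classification model concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `classify` and the 0.5 threshold are my own illustrative choices, not taken from the presentation's script.

```python
import math


def sigmoid(z):
    """Map any real number z into the interval [0, 1]."""
    # Two branches so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(x, w):
    """Return class 1 if sigmoid(w . x) exceeds 0.5, otherwise class 0."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum of the inputs
    return 1 if sigmoid(z) > 0.5 else 0
```

Calling `classify([1.0, 2.0], [0.5, 0.5])` gives a weighted sum of 1.5, which the Sigmoid maps above 0.5, so the sample is assigned to class 1.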

With such a classification model in place, we need an optimization algorithm to find W. Here we present the gradient ascent algorithm. It is based on the idea that the best way to find the maximum of a function is to move along the direction of that function's gradient. If the gradient is denoted as ∇, the gradient of the function f(x, y) is given by ∇f(x, y) = (∂f/∂x, ∂f/∂y). Each iteration then moves the coefficients one step of size α along the gradient: W := W + α∇f(W).
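As a rough sketch of how gradient ascent can train the coefficients, the code below maximizes the log-likelihood of logistic regression, whose gradient works out to the sum of (label - prediction) times input over all samples. The function name `grad_ascent`, the step size `alpha=0.01`, the iteration count, and the starting weights of 1.0 are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable Sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_ascent(X, y, alpha=0.01, n_iter=500):
    """Batch gradient ascent: every update uses the whole dataset."""
    n = len(X[0])
    w = [1.0] * n  # assumed starting point
    for _ in range(n_iter):
        # Prediction error of every sample under the current weights.
        errors = [
            yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for xi, yi in zip(X, y)
        ]
        # Gradient of the log-likelihood with respect to each weight.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) for j in range(n)]
        # Step uphill along the gradient.
        w = [wj + alpha * gj for wj, gj in zip(w, grad)]
    return w


# Tiny made-up dataset: first column is a constant 1.0 for the intercept.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
w = grad_ascent(X, y)
```

Note that the list comprehension over `X` runs once per iteration, so each single update of `w` costs a full pass over the data.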

The gradient ascent algorithm has to traverse the entire dataset every time the regression coefficients are updated. That is fine for a dataset of about 100 samples, but with billions of samples and thousands of features the computational cost becomes too high. One improvement is to update the regression coefficients using one sample point at a time; this method is called the stochastic gradient ascent algorithm. It is an online learning algorithm, since the classifier can be updated incrementally as new samples arrive. In contrast to "online learning", processing all the data at once is called "batch processing".
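The one-sample-at-a-time update can be sketched as follows. This is a minimal illustration under my own assumptions (the name `stoc_grad_ascent`, a fixed step size `alpha=0.01`, weights starting at 1.0), not necessarily the code used in the presentation.

```python
import math


def sigmoid(z):
    """Numerically stable Sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def stoc_grad_ascent(X, y, alpha=0.01, n_passes=1):
    """Stochastic gradient ascent: one weight update per sample."""
    w = [1.0] * len(X[0])  # assumed starting point
    for _ in range(n_passes):
        for xi, yi in zip(X, y):
            h = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            error = yi - h  # a single number, not a vector over the dataset
            w = [wj + alpha * error * xj for wj, xj in zip(w, xi)]
    return w


# Same made-up toy data as a batch method would use.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
w = stoc_grad_ascent(X, y, n_passes=200)
```

Because the body of the inner loop needs only one `(xi, yi)` pair, the same update could be applied to a new sample whenever it arrives, which is what makes the method suitable for online learning.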

With this stochastic gradient ascent algorithm, the weight coefficients tend to converge as the number of iterations increases.

But each parameter still fluctuates considerably. The fluctuations arise because some sample points cannot be classified correctly (the dataset is not linearly separable), and these points trigger dramatic changes in the coefficients at each iteration. We would like the algorithm to stop oscillating back and forth and settle on some value. The rate of convergence also needs to be improved. We therefore introduce the improved stochastic gradient ascent algorithm.
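The text above does not spell out what the improvements are. A common variant, assumed here, combines two changes: the step size shrinks as training proceeds (which damps the oscillations), and the samples are visited in random order on each pass (which avoids periodic fluctuations). The function name and the constants 4.0 and 0.01 in the step-size schedule are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable Sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def stoc_grad_ascent_improved(X, y, n_passes=150, seed=0):
    """Stochastic gradient ascent with decaying step size and random order."""
    rng = random.Random(seed)  # fixed seed only to make runs repeatable
    m, n = len(X), len(X[0])
    w = [1.0] * n  # assumed starting point
    for j in range(n_passes):
        order = list(range(m))
        rng.shuffle(order)  # visit each sample once per pass, in random order
        for i, idx in enumerate(order):
            # Step size decays with progress but never reaches zero.
            alpha = 4.0 / (1.0 + j + i) + 0.01
            xi, yi = X[idx], y[idx]
            error = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            w = [wj + alpha * error * xj for wj, xj in zip(w, xi)]
    return w


# Made-up toy data: first column is a constant 1.0 for the intercept.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
w = stoc_grad_ascent_improved(X, y)
```

The constant term in `alpha` keeps later samples from being ignored entirely, so new data still has some influence on the coefficients.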

It can be seen that the number of iterations needed is significantly reduced and the coefficients fluctuate much less. At the end, this chapter also works through an example that predicts whether sick horses suffering from colic will die, so if you want to study it in detail, open the video below!

**To access the corresponding script and code for this presentation, you can download them by clicking on the following link.**