Introductory Machine Learning Series 02, Regression: a case study

Course cited: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html

Please read this first: the platform you are viewing this on may not render many of the inline formulas, so it is best to read the post on my CSDN blog (helpless face).

CSDN blog post: http://blog.csdn.net/zyq522376829/article/details/66577532

Approaching deep learning directly without a solid foundation in mathematics can feel very abstract, so we start here by opening the door with a case study: predicting the Combat Power (CP) value of a Pokémon in Pokemon Go.

The task is to estimate the CP (Combat Power) of a Pokémon after it evolves. For example, here is a Bulbasaur, which can evolve into an Ivysaur. Its CP is currently 14, and we want to estimate what the CP will be after evolution. Evolving costs candy, so the benefit of estimating in advance is that if the predicted CP is unsatisfactory, you need not waste candy on this Pokémon and can pick a more cost-effective one instead.

The input uses several different $x$ to represent different attributes: for example, the CP is represented by $x_{cp}$, the species by $x_{s}$, and so on. The output is the CP value after evolution.

The previous post mentioned the three steps of machine learning. Step 1: identify a set of functions (the model). Step 2: use the training set to evaluate the functions in the set. Step 3: pick the "best" function $f^{*}$, which can then be used on a new test set.

What should this model look like? Let's write a simple one: we assume that the CP value after evolution, $y$, equals the CP value before evolution, $x_{cp}$, multiplied by a parameter $w$, plus a parameter $b$: $y = b + w \cdot x_{cp}$ (equation 1-1).

$w$ and $b$ are parameters and can take any value.

Different values of $w$ and $b$ give different functions $f_{1}$, $f_{2}$, $f_{3}$, and so on. The function set contains infinitely many functions, so we use $y = b + w \cdot x_{cp}$ to represent the set they form. Some of them are obviously wrong, for example an $f_{3}$ with $w = -1.2$: CP values are always positive, and multiplying a CP value by $-1.2$ makes it negative. So we then have to use the training set to find which function inside this function set is reasonable.

We call equation 1-1 a linear model. In general, a linear model takes the form $y = b + \sum w_{i} x_{i}$.

$x_{i}$ are the various attributes of the Pokémon (height, weight, and so on), which we call "features"; $w_{i}$ is called the weight and $b$ is called the bias.
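A linear model of this kind can be sketched in a few lines of Python. The function name `linear_model` and the parameter values below are illustrative assumptions, not taken from the course; only the input CP of 612 appears in the post.

```python
def linear_model(features, weights, bias):
    """Return y = b + sum_i w_i * x_i for one Pokémon."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical parameters, chosen only to show the call.
y = linear_model(features=[612.0], weights=[1.5], bias=10.0)
print(y)  # prints 928.0, i.e. 10 + 1.5 * 612
```

With a single feature this reduces to $y = b + w \cdot x_{cp}$; adding more attributes only lengthens the two lists.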

Now it is time to collect the training set. The learning here is supervised, so both the input and the output values of the function are needed. For example, suppose we catch a Squirtle whose CP before evolution is 612. We use $x^{1}$ to represent the pre-evolution CP of this Squirtle; a superscript denotes the index of a complete object. Its CP after evolution is 979, which we write as $\hat{y}^{1}$. The hat (the symbol placed over the letter) indicates that this is the correct value, the output that the function should actually produce.

Here we look at a real dataset (source: https://www.openintro.org/stat/data/?data=pokemon).

Let's look at the real stats of the 10 Pokémon, with the $x$-axis representing the pre-evolution CP and the $y$-axis representing the post-evolution CP.

With the training set in hand, in order to evaluate how good a function is, we need to define a new function called the **loss function**, defined as follows.

Loss function $L$ :

input: a function, output: how bad it is

The loss function is a rather special function, a function of a function: its input is a function, and its output indicates how bad that input function is. It can be written as $L(f) = L(w, b)$.

The input function is determined by a set of parameters $w$ and $b$, so the loss function can be said to measure how good or bad a set of parameters is.

The more common form of the definition is used here: $L(f) = L(w, b) = \sum_{n=1}^{10} \left( \hat{y}^{n} - (b + w \cdot x_{cp}^{n}) \right)^{2}$ (equation 1-2).

The estimated value $b + w \cdot x_{cp}^{n}$ is subtracted from the actual value $\hat{y}^{n}$ and the difference is squared; this is the estimation error. Summing the estimation errors over the training examples gives the loss function we defined (the total deviation).
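The squared-error loss just described can be written as a small Python function. This is a minimal sketch: the name `loss` and the data are assumptions for illustration, and only the first pair (612, 979) comes from the post; the other two pairs are made up.

```python
def loss(w, b, xs, y_hats):
    """Sum over the training set of (y_hat - (b + w * x)) squared."""
    return sum((y_hat - (b + w * x)) ** 2 for x, y_hat in zip(xs, y_hats))


# (612, 979) is the Pokémon from the post; the other two pairs are invented.
xs = [612.0, 100.0, 300.0]
y_hats = [979.0, 170.0, 480.0]

print(loss(1.6, 0.0, xs, y_hats))   # a plausible function: small loss
print(loss(-1.2, 0.0, xs, y_hats))  # a negative weight: much larger loss
```

Each call evaluates one candidate function, identified by its pair $(w, b)$, which is exactly the sense in which the loss takes a function as input.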

The reason for not using the algebraic sum of the individual deviations, $\sum_{n=1}^{10} \left( \hat{y}^{n} - (b + w \cdot x_{cp}^{n}) \right)$, as the total deviation is that the deviations $\hat{y}^{n} - (b + w \cdot x_{cp}^{n})$ can be positive or negative and may cancel each other out when simply added. A small algebraic sum therefore does not guarantee that the individual deviations are small. Following equation 1-2, it is only when the sum of the squares of the deviations is small that every individual deviation is guaranteed to be small.
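A two-point worked example, with made-up deviations, makes the cancellation concrete. Suppose the deviations on two training examples are $+10$ and $-10$:

$$
(+10) + (-10) = 0, \qquad (+10)^{2} + (-10)^{2} = 200
$$

The algebraic sum is zero even though both predictions are off by 10, while the sum of squares still reports the error.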

For more intuition, we plot the loss function.

Each point on the graph represents a function; for example, the red point represents $y = -180 - 2 \cdot x_{cp}$. The color shows how bad the loss is for the function at that point: the redder the color, the larger the loss; the bluer the color, the smaller the loss and the better the function. The best function is the point marked with a cross in the graph.

Having defined the loss function, we can measure how good each function is; next we need to pick the best one from the function set. Expressed mathematically: $f^{*} = \arg\min_{f} L(f)$, that is, $w^{*}, b^{*} = \arg\min_{w, b} L(w, b)$ (equation 1-3).

Because of the special form of this example, the optimal $w$ and $b$ in equation 1-3 can be solved for directly using least squares, i.e. by minimizing the total deviation.

A quick note on least squares: for a function of two variables $f(x, y)$, an extreme point must be a point where $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ are both zero, or where at least one partial derivative does not exist. This is a necessary condition for an extremum, and using it one can solve for $w$ and $b$. (For more information, see Mathematical Analysis, third edition, volume 2, edited by Ouyang Guangzhong et al., Chapter 15, Section 1.)
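As a sketch of how this condition is used (the symbols $\bar{x}$ and $\bar{y}$ for the sample means are notation introduced here, not from the course), take the squared-error loss with one feature, $L(w, b) = \sum_{n} \left( \hat{y}^{n} - (b + w x^{n}) \right)^{2}$, and set both partial derivatives to zero:

$$
\frac{\partial L}{\partial b} = -2 \sum_{n} \left( \hat{y}^{n} - b - w x^{n} \right) = 0, \qquad
\frac{\partial L}{\partial w} = -2 \sum_{n} \left( \hat{y}^{n} - b - w x^{n} \right) x^{n} = 0
$$

Solving the two equations gives

$$
w = \frac{\sum_{n} (x^{n} - \bar{x})(\hat{y}^{n} - \bar{y})}{\sum_{n} (x^{n} - \bar{x})^{2}}, \qquad b = \bar{y} - w \bar{x}
$$

where $\bar{x}$ and $\bar{y}$ are the means of the $x^{n}$ and of the $\hat{y}^{n}$. The solution assumes the $x^{n}$ are not all equal.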

But a different approach is used here: **gradient descent**. Gradient descent (the method of steepest descent) is not limited to solving equation 1-3; in fact, as long as $L$ is differentiable, gradient descent can handle it.

A brief look at the gradient descent approach.

Consider a loss function with only one parameter $w$. Pick a random initial point $w^{0}$, compute the derivative of $L$ with respect to $w$ at $w = w^{0}$, and then move $w$ in the direction in which the tangent line descends (since we are looking for a minimum): a negative slope means increasing $w$, and a positive slope means decreasing $w$.

How much to change $w$ at each step is given by $\eta \frac{\mathrm{d}L}{\mathrm{d}w} |_{w=w^{0}}$, where $\eta$ is called the "learning rate".

Since the slope is negative here, subtracting this quantity increases $w$: $w^{1} = w^{0} - \eta \frac{\mathrm{d}L}{\mathrm{d}w} |_{w=w^{0}}$. Then it is just a matter of repeating the steps above.

The steps repeat until a point with slope 0 is found. The curve in this illustration raises a concern: such a method may find only a local minimum rather than the global minimum. That concern comes from the illustration, however; for linear regression the loss has no local minima other than the global one, so the method can still be used.
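The one-parameter procedure can be sketched as follows. The loss $L(w) = (w - 3)^{2}$, the starting point, the learning rate, and the function name `gradient_descent_1d` are all assumptions chosen so that the answer is known in advance; a fixed number of steps stands in for "until the slope is 0".

```python
def gradient_descent_1d(grad, w0, eta=0.1, steps=100):
    """Repeat w <- w - eta * dL/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Hypothetical convex loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3).
w_star = gradient_descent_1d(lambda w: 2.0 * (w - 3.0), w0=-5.0)
print(w_star)  # very close to the minimizer w = 3
```

Starting at $w^{0} = -5$ the slope is negative, so the first update increases $w$, matching the rule described above.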

Now consider the case of two parameters.

The only difference with two parameters is that at each step you compute the partial derivative with respect to each parameter and then update both parameters in the same way.
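For the squared-error loss $L(w, b) = \sum_{n} \left( \hat{y}^{n} - (b + w x^{n}) \right)^{2}$, the two-parameter update can be sketched as below. The data, learning rate, step count, and the name `fit_linear` are assumptions for illustration; the data lie exactly on $y = 1 + 2x$ so the result can be checked.

```python
def fit_linear(xs, y_hats, eta=0.01, steps=5000):
    """Fit y = b + w * x by gradient descent on the squared-error loss."""
    w, b = 0.0, 0.0  # arbitrary starting point
    for _ in range(steps):
        # Partial derivatives of L(w, b) = sum_n (y_hat_n - (b + w * x_n))^2
        grad_w = sum(-2.0 * (y - (b + w * x)) * x for x, y in zip(xs, y_hats))
        grad_b = sum(-2.0 * (y - (b + w * x)) for x, y in zip(xs, y_hats))
        # Update both parameters using gradients computed at the same point.
        w, b = w - eta * grad_w, b - eta * grad_b
    return w, b


# Made-up data lying exactly on y = 1 + 2x, so the answer is known.
xs = [1.0, 2.0, 3.0, 4.0]
y_hats = [3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, y_hats)
print(round(w, 3), round(b, 3))
```

With raw CP values in the hundreds, a much smaller `eta` (or rescaled inputs) would likely be needed for the updates to stay stable.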

For the gradient, see Mathematical Analysis, third edition, volume 2, edited by Ouyang Guangzhong et al., Chapter 14, Section 6, or simply look it up on Baidu Baike or Wikipedia.

To visualize the procedure above:

Likewise, the shortcoming of gradient descent is shown in the figure below.

It may find only a local minimum. For linear regression, however, the chosen loss function (equation 1-2) is guaranteed to be convex, i.e. it has no local minima other than the global minimum. The contour plot of the loss function is on the right side of the figure above, and you can see the loss decreasing ring by ring toward the center.

The result is plotted as follows

On the training set, the sum of the absolute values of the deviations comes to 31.9.

But what we really care about is not the error on the training set but generalization, which means computing the error on a new data set (the test set), as shown below.

Using ten new Pokémon as the test set, the sum of the absolute values of the deviations comes to 35.

Next, consider whether we could do better: perhaps the relationship is not simply a straight line, so consider other models.

For example, redesign the model with an additional quadratic term. Finding the parameters gives an Average Error of 15.4 on the training set, which looks better. The Average Error on the test set is 18.4, so this is indeed a better model.

Next, add a cubic term.

The results look slightly better than those obtained with the quadratic term. You can also see that $w_{3}$ is already very small, indicating that the cubic term has little effect.

Now add a fourth-degree term.

At this point the model does better on the training set, but the results on the test set get worse.

Then add a fifth-degree term.

You can see that the results of the test set are very poor.

Plotting the change in Average Error on the training set.

You can see that the Average Error on the training set gets progressively smaller.

Among the models above, each higher-degree model contains the lower-degree models as special cases. So in theory a more complex, higher-degree model can indeed achieve a lower error on the training set. But now add the results on the test set.

The observation is that a more complex model can give better results on the training set, but it does not necessarily give better results on the test set. This phenomenon is called **overfitting**.
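The experiment can be imitated on synthetic data. This is only a sketch under stated assumptions: the data are randomly generated rather than the Pokémon data, inputs are scaled to $[0, 1]$, and the error measure is mean squared error (the quantity least squares actually minimizes) rather than the Average Error quoted in the post.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    # Synthetic stand-in for (CP before, CP after): a mildly curved trend plus noise.
    x = rng.uniform(0.0, 1.0, size=n)
    y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.1, size=n)
    return x, y


x_train, y_train = make_data(10)
x_test, y_test = make_data(10)

train_mse, test_mse = [], []
for degree in range(1, 6):
    # Least-squares fit of a polynomial of the given degree.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse.append(float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)))
    test_mse.append(float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)))

for degree, (tr, te) in enumerate(zip(train_mse, test_mse), start=1):
    print(f"degree {degree}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Because each higher-degree model contains the lower-degree ones, the training error cannot go up as the degree increases; the test error carries no such guarantee.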

If you have to choose a model at this point, the best choice is the one with the cubic term.

A typical real-life example is learning to drive: people can do fine on the "training set" at driving school, but on the real "test set" of the open road they may be completely unable to cope. This is just one example where the training set goes well and the test set turns out poorly ^_^

Now consider the data of 60 Pokémon.

You can see that species is also a key factor; considering only the pre-evolution CP value is too limiting, so the earlier model was not well designed.

The new model is as follows

Write this model in the form of a linear model.
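One common way to express a per-species model as a single linear model is with indicator features: for each species, one feature that is 1 when the Pokémon belongs to it and 0 otherwise, and one feature that equals $x_{cp}$ for that species and 0 otherwise. The sketch below assumes this construction; the species labels, weight layout, and values are hypothetical.

```python
SPECIES = ["A", "B", "C"]  # hypothetical species labels


def features(x_cp, x_s):
    """Build [delta_A, delta_A * x_cp, delta_B, delta_B * x_cp, ...]."""
    feats = []
    for s in SPECIES:
        delta = 1.0 if x_s == s else 0.0  # indicator for species s
        feats.extend([delta, delta * x_cp])
    return feats


def predict(weights, x_cp, x_s):
    """Plain linear model: a weighted sum of the indicator features."""
    return sum(w * f for w, f in zip(weights, features(x_cp, x_s)))


# Weights laid out as [b_A, w_A, b_B, w_B, b_C, w_C]; hypothetical values.
weights = [1.0, 2.0, 10.0, 1.5, -3.0, 3.0]
print(predict(weights, 100.0, "B"))  # prints 160.0, i.e. 10 + 1.5 * 100
```

Only the two features of the matching species are nonzero, so each species effectively gets its own $b$ and $w$ while the model stays linear in its parameters.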

Let's look at the results of doing so.

Different species of Pokémon use different parameters, distinguished by color. This model does better on the training set, and its result on the test set is also better than the previous 18.1.

For example, we can plot height, weight, and HP.

Redesign the model.

Take HP ($x_{hp}$), height ($x_{h}$), and weight ($x_{w}$) into account.

With such a complex model, better results should in theory be obtained on the training set; the actual training error is 1.9, which is indeed lower. But the results on the test set show overfitting.

Because the results with so many parameters are not ideal, we apply regularization here, modifying the previous loss function to $L = \sum_{n} \left( \hat{y}^{n} - (b + \sum w_{i} x_{i}^{n}) \right)^{2} + \lambda \sum (w_{i})^{2}$ (equation 1-5).

Equation 1-5 adds an extra term, $\lambda \sum (w_{i})^{2}$, which expresses the preference that the smaller the $w_{i}$, the better the function (equation 1-4). One can also say that the smaller the $w_{i}$, the smoother the function.

Smooth means that the output is not sensitive to changes in the input. For example, if the input $x_{i}$ of the model $y = b + \sum w_{i} x_{i}$ is increased by $\Delta x_{i}$, the output changes by $w_{i} \Delta x_{i}$; the smaller $w_{i}$ is, the less the output changes. In addition, if the test set contains some noisy inputs, a smoother function is less affected by them.
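Written out for a linear model of the form $y = b + \sum w_{i} x_{i}$ (here $y'$ is notation introduced for the output after the change):

$$
y' = b + \sum_{j \neq i} w_{j} x_{j} + w_{i} (x_{i} + \Delta x_{i}) = y + w_{i} \Delta x_{i}
$$

So with $w_{i} = 0.1$ a change of $\Delta x_{i} = 10$ moves the output by 1, while with $w_{i} = 5$ the same change moves it by 50. These numbers are purely illustrative.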

The figure above shows the result of adjusting $\lambda$. As $\lambda$ gets larger, the term $\lambda \sum (w_{i})^{2}$ has more influence, so the function found becomes smoother.

On the training set, the error grows as $\lambda$ gets larger. This is reasonable: as $\lambda$ gets larger, the optimization pays more attention to the values of $w$ themselves and less to the error. On the test set, however, the error first decreases and then increases. Smoother functions are preferred because of the robustness to noisy data mentioned above, so performance improves when $\lambda$ first starts to increase; but the smoothest possible function is a horizontal line, which amounts to doing nothing, so functions that are too smooth give bad results again.
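The training-set side of this behavior can be reproduced on synthetic data. The sketch below is an assumption-laden illustration: it uses randomly generated data, and instead of gradient descent it solves the regularized least-squares problem in closed form, centering the data so that the bias stays out of the penalty. The name `fit_ridge` and the list of $\lambda$ values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
X = rng.normal(size=(n, d))                      # synthetic features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 0.0])    # made-up ground truth
y = 3.0 + X @ true_w + rng.normal(0.0, 0.5, size=n)


def fit_ridge(X, y, lam):
    """Minimize sum (y - (b + X w))^2 + lam * sum(w^2), bias unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # centering removes the bias
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


train_errors, weight_norms = [], []
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w, b = fit_ridge(X, y, lam)
    train_errors.append(float(np.sum((y - (X @ w + b)) ** 2)))
    weight_norms.append(float(np.sum(w ** 2)))
    print(f"lambda={lam}: train error {train_errors[-1]:.3f}, sum w^2 {weight_norms[-1]:.3f}")
```

As $\lambda$ grows, the weights shrink and the training error rises; choosing $\lambda$ would additionally require evaluating each fit on held-out data, which this sketch leaves out.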

So in the end the task is to find the most suitable $\lambda$; substituting it into equation 1-5 and solving for $b$ and $w_{i}$ gives the best function.

The extra regularization term $\lambda \sum (w_{i})^{2}$ does not include $b$. We want a smooth function, and the bias does not affect smoothness: it only shifts the function up or down.
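A regularized loss that leaves the bias out of the penalty can be sketched as follows. The function name `regularized_loss` and the tiny data set are assumptions for illustration only.

```python
def regularized_loss(weights, bias, xs, y_hats, lam):
    """Squared error plus lam * sum(w_i^2); the bias is not penalized."""
    error = 0.0
    for x, y_hat in zip(xs, y_hats):
        prediction = bias + sum(w * xi for w, xi in zip(weights, x))
        error += (y_hat - prediction) ** 2
    penalty = lam * sum(w ** 2 for w in weights)  # no bias term here
    return error + penalty


# Made-up data that the weights [1, 2] with bias 0 fit exactly.
xs = [[1.0, 2.0], [3.0, 4.0]]
y_hats = [5.0, 11.0]
print(regularized_loss([1.0, 2.0], 0.0, xs, y_hats, lam=0.1))  # penalty only
print(regularized_loss([1.0, 2.0], 1.0, xs, y_hats, lam=0.1))  # bias changes the error, not the penalty
```

Changing `bias` alters only the squared-error part, which reflects the point above: shifting the function up or down has no bearing on how smooth it is.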

**Pokemon**: The original CP value largely determines the CP value after evolution, but other factors may play a role as well.

**Gradient descent**: The rationale and key points of gradient descent will be covered later.

**Overfitting** and **Regularization**: This post mainly describes the observed behavior of overfitting and regularization; more of the theory will be covered later.

New blog address: http://yoferzhang.com/post/20170326ML02Regression