An organized overview of supervised learning techniques

preface

At the end of last year, I studied the algorithmic thinking of many high scorers in Tianchi competitions and roughly summarized some of the core processes and important details of supervised learning.

feature processing tricks

This is a cliché, but there are still some good points worth noting, such as eliminating cases with heavy missingness based on high-importance features, and so on.

single feature + Crossing feature

Crossing the original features into combined features can significantly improve AUC and increase hit accuracy. Besides FM, we can also apply this trick in regular algorithms.

Supervised learning architecture ideas

Here is a look at exactly how each of these points is achieved, and what we need to pay attention to along the way.

feature processing tricks

case and feature selection

Before model training we usually do some pruning of the features, such as a covariance test to remove continuous features that are too similar to one another, or a variability test to remove features whose values vary too little. In conventional sample processing, however, we usually look only at the initial data distribution, for example eliminating a user once their fraction of missing features exceeds some threshold. Thinking more deeply about this, you will find it is more reasonable to eliminate the users whose missingness is concentrated on high-importance features. If that is not yet clear, look at the following feature flow.

For these uids, by a plain count uid3 has 5 nulls and uid5 has 4, so we would eliminate uid3 first and only then consider uid5, because users with too many nulls provide relatively little information, which increases the generalization error.

But suppose we know in advance that the features' power to discriminate the label is ordered feature3 > feature5 > feature6 > feature8 > the rest, and that uid5's missingness on the high-importance features is far more severe than uid3's. Then we should prioritize eliminating uid5. In contrast to the scenario above, knowing the importance order of the features becomes essential. Here are a few simple approaches to determining it:
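The difference between the two elimination rules can be sketched in a few lines. This is a minimal illustration with hypothetical numbers (the feature importances are assumed to be known in advance, e.g. from a first modeling pass):

```python
import numpy as np

# Two users with the same plain null count, but missing different features.
X = np.array([
    [1.0, np.nan, 3.0, np.nan],   # a uid3-like user: misses low-importance features
    [np.nan, 2.0, np.nan, 1.0],   # a uid5-like user: misses high-importance features
])
importance = np.array([0.5, 0.1, 0.3, 0.1])   # hypothetical importance order

null_count = np.isnan(X).sum(axis=1)                  # plain count: [2 2], a tie
weighted_null = (np.isnan(X) * importance).sum(axis=1)  # importance-weighted count

# The second user is missing the high-importance features, so its weighted
# score is larger and it should be eliminated first despite the tied raw counts.
```

Ranking users by `weighted_null` instead of `null_count` implements exactly the "eliminate uid5 first" decision described above.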

Variance inflation coefficient: we believe that, after the data is normalized, features that fluctuate more can provide relatively more information. To take an obvious example, if feature1 is all 1s, it tells us nothing about whether a user will place an order.

Mutual information: I have always thought mutual information is a rather elegant way to judge a feature's value. The variance criterion considers only the feature itself, while mutual information also takes into account the relation between the feature and the label: I(X;Y) = H(X) − H(X|Y). This formula for the amount of information explains it nicely.

xgb's importance: if mutual information is a progression of the variance criterion, then xgb's importance is a progression of mutual information. It considers the relationship between label and feature while also considering the relationships among the features themselves, so the resulting importance ranking is somewhat more comprehensive.
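The three criteria can be compared side by side on toy data. A minimal sketch, using sklearn's `mutual_info_classif` and, as a stand-in for xgboost's importance, `GradientBoostingClassifier.feature_importances_` (the data and the 0.1 noise scale are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)    # drives the label
constant = np.ones(n)               # the "all 1s" feature from the example above
noise = rng.normal(size=n)          # fluctuates, but is independent of the label
X = np.column_stack([informative, constant, noise])
y = (informative + 0.1 * rng.normal(size=n) > 0).astype(int)

# 1. Variance after normalization: a constant feature carries no information.
Xn = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
variances = Xn.var(axis=0)

# 2. Mutual information: also looks at the feature-label relation, so it can
#    tell the informative column apart from the merely fluctuating one.
mi = mutual_info_classif(X, y, random_state=0)

# 3. Tree-ensemble importance: accounts for feature interactions as well.
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
imp = gbm.feature_importances_
```

Note how the variance criterion kills the constant column but cannot distinguish `informative` from `noise`; the label-aware criteria can.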

In addition to these, there are:

Logistic regression coefficients (the params of the fitted model)

Recursive feature elimination (recursive parameter selection)

Pearson Correlation

Distance correlation

Mean decrease impurity/Mean decrease accuracy

…

There are many such methods, and the choice depends on the form of the data, the form of the target variable, time cost, efficiency, and so on. This is just a quick survey of the conventional options; as for which to use in practice, you need to accumulate project experience.

Handling null features

On handling nulls and outliers there are basically two schools of thought: either eliminate the feature or case, or fill it in. The disadvantages of each are obvious. Blind elimination reduces the information available for judgment, and if the data is small it weakens the model. Filling causes confusion: fill with the mode? The mean? The median? The maximum? The minimum? Many people nowadays decide by looking at the data distribution, considering quantile filling if it is skewed and mean or mode filling if it is roughly normal, which is relatively costly in time, and the justifications are often not very convincing.

I read the high-scoring solutions to JD's order-volume prediction from March 2017, the 2017 Tianchi industrial competition, and others, and I have to say there is a binning approach that really does improve AUC by 0.5 to 1.5. I have thought about the possible reasons:

The original information is preserved: unlike filling or deleting, binning does not change the real data distribution

It makes the form of the features more meaningful. Take the field AGE: what we really care about is not the difference between 27 and 28 but the difference between the post-80s and post-90s generations, and without binning we somewhat exaggerate the difference between 27 and 26

In computation, binning not only speeds things up but also removes random deviations in how the data was actually recorded and smooths out noise that may arise during storage

Here I will share the flow I put together directly:

This raises a question: must continuous numeric features be cut into discrete ones? I recommend considering the following: a. whether the algorithm used is a distance-based one such as knn or svm; b. whether the practical decision depends on discrete judgments; c. whether the real-world meaning of the continuous feature supports discretization. If none of these is a problem, I suggest discretizing continuous features by default; to a certain extent, discrete features have better explainability.
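The AGE example above, with nulls given their own bin instead of being imputed, can be sketched with plain numpy (the ages and decade edges are illustrative):

```python
import numpy as np

# Hypothetical AGE column with missing values kept as NaN.
age = np.array([23.0, 27.0, 28.0, np.nan, 35.0, 41.0, np.nan, 19.0])

# Bin by decade: what matters is post-80s vs post-90s, not 27 vs 28.
edges = np.array([20, 30, 40])
binned = np.digitize(age, edges)    # 0: <20, 1: 20-29, 2: 30-39, 3: >=40

# Instead of imputing, give missing values their own bin; the distribution
# of the observed data is left untouched.
NULL_BIN = len(edges) + 1
binned[np.isnan(age)] = NULL_BIN
```

After this, 27 and 28 land in the same bin, and "missing" is an explicit category the model can learn from rather than a value we guessed.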

single feature + Crossing feature

In the earlier article on the theoretical analysis and application of FM we mentioned the concept of feature crossing, and right after that article comes a matrix computation trick:

Constructing all C(n,2) feature pairs and adding them to a linear model can improve the accuracy of a classification algorithm to a certain extent. It is a very good way to lift low-dimensional features into a high-dimensional space, so other algorithms can borrow the idea too. But suppose the initial feature set is particularly large. In my daily CTR estimation and feature-organizing work it is easy to end up with more than 500 features, and considering only the C(n,2) pairs already yields well over 100,000 new feature combinations, which is unacceptable. So we come back to the case and feature selection discussed in the section above: we can first run a screen using, for example, the importance from xgboost:

In my own case, when predicting order-placement probability, I took the 417 features that survived initial screening, ran them through a first xgboost pass, and plotted the importance. You can clearly see three inflection points in importance, at roughly the top 37, top 53, and top 94 features. We can pick one of these inflection points so that we keep the vast majority of the information without generating an excessive number of crossed features. I chose 60, for example, so the newly generated crossed features number only C(60,2) = 1770, many times fewer, with little loss of information.
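The screen-then-cross step can be sketched as follows; the shapes mirror the numbers above (417 pre-screened features, top 60 kept for crossing), while the random data and importances are stand-ins for a real importance vector from xgboost:

```python
import numpy as np
from itertools import combinations

def cross_top_features(X, importance, k):
    """Build C(k,2) pairwise product features from the top-k important columns."""
    top = np.argsort(importance)[::-1][:k]
    crossed = [X[:, i] * X[:, j] for i, j in combinations(top, 2)]
    return np.column_stack(crossed)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 417))     # e.g. 417 features after initial screening
importance = rng.random(417)        # stand-in for xgboost importances

X_cross = cross_top_features(X, importance, k=60)
# X_cross has C(60, 2) = 1770 columns, instead of C(417, 2) = 86736.
```

Crossing only the top-k columns keeps the combinatorial blow-up under control while the high-importance features still get to interact.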

I have mapped out the general flow below; hopefully it gives you a clearer picture:

As you can see, the cases stay fixed after the initial null-value screen and the second null filter based on the first round of high-importance features, while feature filtering runs throughout the whole cross-feature generation flow.

Architecture ideas for bagging and stacking

I am sure every reader, whether a machine learning practitioner, an algorithm engineer, or another kind of R&D engineer, has seen a quick drag-and-drop module flow similar to the following.

It is equivalent to encapsulating each function in a fixed box: when we need a module we operate on it, and when we do not, we simply cut it out of the flow. We can even evaluate each module's variance and bias separately. Within the frameworks of bagging and stacking I very often follow a similar idea: decide how the modules will be combined (stacking, bagging, or blending), then decide what the submodules for this task are, and fine-tune each submodule according to the combination form and its role.

First of all, what can be the submodules?

svm classification/regression

logistic classification/regression

Neural network classification/regression

xgboost classification/regression

gbdt classification/regression

xgboost leaf node index

gbdt leaf node index

randomforest classification/regression

elastic net

Besides these, what else? Certainly, if you like, any single classification or regression model you construct yourself can serve as a submodule for bagging, stacking, or blending.

How do I train the submodules?

To a large extent, whether on Tianchi or Kaggle, how well your features are handled decides whether you land inside or outside the top 10, but whether you finish first or tenth depends largely on your submodule construction and combination. Here I will share three of the more interesting submodule forms I have seen recently.

1. wepon's Large-Scale SVM

Readers of my earlier article on SVM theory and its python implementation should remember that I said SVM, which achieved first place on 10.7% of datasets, is a traditional machine learning method well worth learning. In practical applications, however, overly long training time and excessive memory requirements on large-scale data are a real headache. wepon takes the following approach.

This approach appears to increase computational complexity, but it actually reduces it. Suppose the original training set has size n, so training on it costs O(n^2). Divide the dataset into p parts of n/p samples each; training one sub-SVM per part costs O((n/p)^2), or O(n^2/p) in total, a factor of p less than training on the full data. This neatly solves the problem of using SVM on large collections of data, gaining speed while maintaining quality. For the supporting paper, refer to Ensemble SVM.
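A minimal sketch of the divide-and-train idea (not wepon's exact code): partition the training set into p parts, fit one sub-SVM per part, and average the decision values at prediction time. The dataset and partition count are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

p = 5
idx = np.random.default_rng(0).permutation(len(X))
parts = np.array_split(idx, p)                   # p chunks of roughly n/p samples

# Each sub-SVM trains on n/p samples: O((n/p)^2) each, O(n^2/p) in total.
models = [SVC().fit(X[part], y[part]) for part in parts]

def ensemble_predict(X_new):
    # Average the p decision values and threshold at zero.
    scores = np.mean([m.decision_function(X_new) for m in models], axis=0)
    return (scores > 0).astype(int)

acc = (ensemble_predict(X) == y).mean()
```

The averaging step is what recovers most of the quality lost by training each sub-SVM on only a fraction of the data.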

2. Lovecraft's GBDT leaf nodes

Let's explain the Dense Features on the left and the Sparse Features on the right separately.

First, the left side is easy to understand. In the last article we already covered how to use xgboost or gbdt to get the index of the leaf node that each user's data falls into on every tree.

If anything is unclear, please go back and review that article.
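For reference, the leaf-index trick looks roughly like this; sklearn's `GradientBoostingClassifier` stands in for xgboost here, and the synthetic data and model sizes are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, random_state=0)

# Fit the tree ensemble, then encode each sample by which leaf it lands in
# on each tree.
gbm = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbm.fit(X, y)
leaves = gbm.apply(X)[:, :, 0]        # shape (n_samples, n_trees), leaf indices

# One-hot the leaf indices: this is the tree-derived feature block fed to a
# linear model on top.
enc = OneHotEncoder(handle_unknown="ignore")
leaf_features = enc.fit_transform(leaves)

lr = LogisticRegression(max_iter=1000).fit(leaf_features, y)
```

Each tree contributes one active indicator per sample, so the linear model effectively learns a weight per leaf, i.e. per region of feature space the trees carved out.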

The right side represents user preference and video content separately; that, of course, is because this is a video company. In my own usage I use user preference and item content, where the preference and content here are vectorized forms of your personal information and behavior.

The simplest representation is to one-hot encode your basic information and the item information, then concatenate everything end to end into one extra-long vector; this is a sparse version of the Sparse Features.

Of course, besides this brute-force approach, there is also, for example, the word2vec technique from the piece on deep learning for e-commerce product recommendation we discussed some days ago: all users are first assigned random one-to-one vectors of the required length in N dimensions; via Huffman coding, each item's unique path in a Huffman tree is found; then a logistic classification at each node is fit so that the correct path holds the highest probability, thereby correcting the initially random N-dimensional vectors. The final N-dimensional vector can be viewed as a Sparse Feature.

Anything else? Of course. I privately asked former classmates who worked at that company; they also have an idea of dividing the dataset into M subsets, fitting an xgboost on each subset, and taking the leaf nodes of each, which is equivalent to copying the left side's Dense Features and stacking M copies next to the Sparse Features on the right, ending up with M+1 blocks of Dense Features. In practice it is not at all worse than the word2vec result.

3. GRU-based hidden layer

Domonkos Tikk and Alexandros Karatzoglou, in the article Session-based Recommendations with Recurrent Neural Networks, mention that recurrent neural networks (RNNs) can be used to predict user behavior, as follows.

We can clearly see that for each user, e.g. Session1, the actions i1,1 through i1,4 form an ordered process, so we can design a recurrence i1,1 -> i1,2, i1,2 -> i1,3, i1,3 -> i1,4. The article also explains two problems with this design:

the length of sessions can be very different

breaking down into fragments

First, concatenating sessions end to end solves the problem that different users' sessions have different lengths; second, the embedding layer makes prediction possible under incomplete sessions. The specific network design is as follows.

The update flow of the model can be found below.

We just need to obtain the state corresponding to each step of the user's item stream. It contains latent information about the current operation and the previous M operations, and we can set the length of this state vector freely. This user-state vector can be taken as the submodule's output.
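The recurrence that produces this user-state vector can be sketched with a bare numpy GRU cell. This is a toy illustration, not the paper's implementation; the dimensions, weight scales, and item embeddings are all made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step: W, U, b hold update-gate, reset-gate and candidate params."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(x @ Wz + h @ Uz + bz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)  # candidate state
    return (1 - z) * h + z * h_tilde               # blend old state and candidate

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                       # item-embedding size, state size (arbitrary)
W = [rng.normal(scale=0.1, size=(d_in, d_h)) for _ in range(3)]
U = [rng.normal(scale=0.1, size=(d_h, d_h)) for _ in range(3)]
b = [np.zeros(d_h) for _ in range(3)]

# Run one user's session i1 -> i2 -> i3 through the recurrence; the final h
# is the user-state vector used as the submodule's output.
h = np.zeros(d_h)
session = rng.normal(size=(3, d_in))    # hypothetical item embeddings
for x in session:
    h = gru_step(x, h, W, U, b)
```

Each step mixes the new item into the running state through the gates, which is how the vector ends up summarizing the whole ordered session.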

The drawback of this idea is that it is extremely ineffective when the underlying data is not really time-series. Take the ordering flow of a ride-hailing app: the time from login to hailing a car is at least around 20s and at most around a minute, otherwise the user exits the app. The temporal structure there is extraordinarily weak, and user attributes obtained by forcing an RNN onto it are very unrepresentative.

How to combine submodules?

bagging

This is the simplest form of bagging, which we emphasized in the pure algorithm-theory analysis of classification problems on Kaggle & TianChi. In designing and selecting each submodule we should ensure:

low bias

high variance

That is, the submodules may be allowed to overfit somewhat, increasing the accuracy of each submodel's fit; the weighted average then reduces the generalization error.
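A minimal bagging sketch on toy data: each submodule is a deliberately overfit (low-bias, high-variance) fully grown decision tree on a bootstrap sample, and the averaged probabilities cancel the variance. The dataset and the count of 25 trees are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
models = []
for _ in range(25):
    boot = rng.integers(0, len(X_tr), size=len(X_tr))   # bootstrap resample
    # Unpruned tree: fits its sample almost perfectly (overfits on purpose).
    models.append(DecisionTreeClassifier(random_state=0).fit(X_tr[boot], y_tr[boot]))

# Average the submodules' probabilities, then threshold.
avg_prob = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
bagged_acc = ((avg_prob > 0.5) == y_te).mean()
```

Replacing the equal-weight mean with a weighted average gives the weighted form mentioned above.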

stacking

This is the simplest form of stacking, also discussed in the pure algorithm-theory analysis of classification problems on Kaggle & TianChi. In designing and selecting submodule 1 and submodule 2 we should ensure:

high bias

low variance

In submodule 3, ensure that:

low bias

high variance

That is, when selecting submodules 1 and 2 we should allow a slight underfit, while the fit of submodule 3 should then guarantee accuracy and strength.
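A sketch of this two-level setup: submodules 1 and 2 are kept deliberately simple (strong regularization, slightly underfit), and submodule 3 fits their out-of-fold predictions with full strength. The models and C values are illustrative choices, not a prescription:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

sub1 = LogisticRegression(C=0.1, max_iter=1000)        # submodule 1: underfit a bit
sub2 = SVC(C=0.1, probability=True, random_state=0)    # submodule 2: underfit a bit

# Out-of-fold predictions avoid leaking training labels to the second level.
p1 = cross_val_predict(sub1, X, y, cv=5, method="predict_proba")[:, 1]
p2 = cross_val_predict(sub2, X, y, cv=5, method="predict_proba")[:, 1]
meta_X = np.column_stack([p1, p2])

# Submodule 3: a strong fitter on top of the submodule outputs.
sub3 = GradientBoostingClassifier(random_state=0).fit(meta_X, y)
```

The `cross_val_predict` step is the important detail: feeding in-sample predictions to submodule 3 instead would let it simply memorize the base models' training-set quirks.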

blending

We know that the result of any single submodule is not good enough; if we want better results, we need to fuse the results of many individual submodules together.

This approach, too, can improve the effectiveness of our final prediction.
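One simple way to blend: hold out a validation set, then search over weight vectors for the mix of submodule predictions that scores best on the holdout. A toy sketch with fabricated holdout probabilities standing in for three already-trained submodules:

```python
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)      # holdout labels

# Hypothetical holdout probabilities from three submodules of varying quality.
preds = np.column_stack([
    np.clip(y_val + rng.normal(0, 0.4, size=200), 0, 1),  # strong submodule
    np.clip(y_val + rng.normal(0, 0.8, size=200), 0, 1),  # weaker submodule
    rng.random(200),                                       # near-random submodule
])

# Random search over the probability simplex for the best blending weights.
best_w, best_acc = None, -1.0
for w in rng.dirichlet(np.ones(3), size=500):
    acc = ((preds @ w > 0.5) == y_val).mean()
    if acc > best_acc:
        best_w, best_acc = w, acc
```

The search naturally pushes weight toward the stronger submodule while still letting the others contribute where they help.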