At the end of last year, I looked at the algorithmic thinking of many of the high scorers in the Tianchi competitions and roughly summarized some of the core processes and important details in supervised learning.
feature processing tricks
This is a well-worn topic, but there are still some good points to be found, such as eliminating cases with heavy missingness based on high-importance features, and so on.
single feature + crossing features
Crossing the original features to form combined features can significantly improve AUC and hit accuracy. Besides FM, we can also apply this trick with conventional algorithms.
Supervised learning architecture ideas
Below is a look at exactly how each of these points is achieved and what we need to pay attention to along the way.
feature processing tricks
case and feature selection
We usually do some pruning of features before model training, for example a covariance (correlation) check to remove continuous features that are too similar to one another, or a variance check to remove features whose values barely change. For samples, however, the conventional treatment only looks at the raw data distribution, e.g. eliminating a user once the proportion of missing features exceeds some threshold. Thinking about this more deeply, it is more reasonable to eliminate users whose missing values fall on the high-importance features. If that is not yet clear, look at the following feature table.
Looking at each uid with plain counting, uid3 has 5 nulls and uid5 has 4. We would eliminate uid3 first and only then consider eliminating uid5, because users with too many nulls provide relatively little information and therefore increase generalization error.
But suppose we know in advance that, in terms of ability to predict the label, feature3 > feature5 > feature6 > feature8 > the rest, and that uid5's missing values on the high-importance features are far more serious than uid3's. Then we should prioritize eliminating uid5. Compared with the scenario above, knowing the importance order of the features in advance becomes essential (a small sketch of this importance-weighted elimination appears after the importance methods below). A few simple approaches for judging importance:
Variance inflation factor: we believe that, after the data has been normalized, features with larger fluctuations can provide relatively more information. To give an obvious example, if feature1 is all 1s, it tells us nothing about whether a user will place an order.
Mutual information: I have always thought mutual information is a rare, valuable way to judge a feature's worth. The variance inflation factor only considers the feature itself, while mutual information also takes the relationship between the feature and the label into account: I(X;Y) = H(X) - H(X|Y), a formula that explains this amount of information nicely.
xgb's importance: if mutual information is a step up from the variance inflation factor, then xgb's importance is a step up from mutual information. It considers the relationship between the label and each feature while also accounting for the relationships among the features themselves, so the resulting importance ranking is more comprehensive.
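As a rough illustration (not from the original post), here is a minimal Python sketch of the three checks on a made-up DataFrame; it assumes scikit-learn and xgboost are installed, and all column names and data are invented:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif

# made-up data: three numeric features and a binary label (all names are invented)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature1": np.ones(200),                        # constant column -> carries no information
    "feature2": rng.normal(size=200),
    "feature3": rng.integers(0, 5, size=200).astype(float),
})
y = (df["feature3"] + rng.normal(scale=0.5, size=200) > 2).astype(int)

# 1) variance after normalization: near-zero variance means near-zero information
scaled = MinMaxScaler().fit_transform(df)
variance = pd.Series(scaled.var(axis=0), index=df.columns)

# 2) mutual information with the label, I(X;Y) = H(X) - H(X|Y)
mi = pd.Series(mutual_info_classif(df, y, random_state=0), index=df.columns)

# 3) xgboost importance, which also reflects interactions among the features
model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(df, y)
xgb_imp = pd.Series(model.feature_importances_, index=df.columns)

print(pd.DataFrame({"variance": variance, "mutual_info": mi, "xgb_importance": xgb_imp}))
```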
There are many methods like these; which to use depends on the form of the data, the form of the target variable, time cost, efficiency, and so on. This is just a quick survey of the conventional approaches; how they perform in practice is something you learn through project experience.
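To make the uid3/uid5 example concrete, here is a hedged sketch (not the author's code) of importance-weighted missingness: each user's nulls are weighted by feature importance before deciding whom to drop. The importance vector could come from any of the methods above; all data, weights, and the threshold here are assumptions.

```python
import numpy as np
import pandas as pd

# made-up user-by-feature table with nulls; uids and feature names are illustrative only
df = pd.DataFrame({
    "feature3": [1.0, 2.0, 3.0, 1.0, np.nan],    # most important feature
    "feature5": [np.nan, 1.0, np.nan, 2.0, 2.0],
    "feature6": [1.0, np.nan, np.nan, 1.0, 4.0], # least important feature
}, index=["uid1", "uid2", "uid3", "uid4", "uid5"])

# assumed importance weights (e.g. normalized xgb importances); higher = more important
importance = pd.Series({"feature3": 0.6, "feature5": 0.3, "feature6": 0.1})

# plain null count vs. importance-weighted null score per user:
# uid3 has more nulls by plain count, but uid5 scores worse once importance is applied
null_count = df.isna().sum(axis=1)
weighted_null = df.isna().mul(importance, axis=1).sum(axis=1)

# drop the users whose weighted score exceeds a chosen threshold (0.5 is arbitrary)
threshold = 0.5
cleaned = df.drop(index=weighted_null[weighted_null > threshold].index)

print(pd.concat({"null_count": null_count, "weighted_null": weighted_null}, axis=1))
```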
null-feature treatment method
For handling nulls or outliers there are basically two schools of thought: eliminate the feature or case, or fill it in. The drawbacks of both are obvious. Eliminating at random reduces the information available for judgment, and with a small dataset it weakens the model; filling creates confusion: do you fill with the mode? The mean? The median? The maximum? The minimum? Nowadays many people handle this by looking at the data distribution and using quantile filling if it is skewed, or mean/mode filling if it is roughly normal. This costs relatively more time, and the explanations are often not very convincing.
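As a hedged sketch of that distribution-aware filling (not from the original post): check the skewness of each numeric column and fill with the median (a quantile) when it is clearly skewed, the mean when it is roughly symmetric, and the mode for non-numeric columns. The 1.0 skewness cutoff is an arbitrary assumption.

```python
import pandas as pd

def fill_by_distribution(df: pd.DataFrame, skew_cutoff: float = 1.0) -> pd.DataFrame:
    """Fill nulls column by column based on a rough look at each column's distribution."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            if abs(out[col].skew()) > skew_cutoff:   # clearly skewed -> median (a quantile)
                out[col] = out[col].fillna(out[col].median())
            else:                                    # roughly symmetric -> mean
                out[col] = out[col].fillna(out[col].mean())
        else:                                        # non-numeric -> mode
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# usage: filled = fill_by_distribution(raw_df)
```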
I read the high-scoring solutions for JD's order forecasting competition in March 2017, the 2017 Tianchi industrial competition, and others, and I have to say that the binning approach really does improve the AUC by 0.5-1.5. I have thought about the possible reasons:
The original information is preserved: the true data distribution is not changed by filling or deleting.
It makes the form of the feature more meaningful. Take the field age: what we actually care about is not the difference between 27 and 28, but differences such as post-90s versus post-80s. Without binning, the difference between 26 and 27 is somewhat exaggerated.
In computation, it not only speeds things up but also removes random deviation introduced when the data was recorded and smooths out noise that may have arisen during storage.
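To illustrate the age point, here is a minimal pandas binning sketch (not the author's code that follows; the cut points and labels are assumptions):

```python
import numpy as np
import pandas as pd

age = pd.Series([23, 26, 27, 35, 41, np.nan], name="age")

# bin age into coarse buckets instead of filling or deleting the null;
# the cut points are illustrative, and NaN simply falls outside every bucket
# (it could also be given its own "unknown" bin)
bins = [0, 18, 25, 35, 45, 120]
labels = ["<=18", "19-25", "26-35", "36-45", "45+"]
age_binned = pd.cut(age, bins=bins, labels=labels)
print(age_binned)
```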
Here I will share the binning approach with you directly: