Handling sample proportion imbalance in machine learning


Recommended reading time: 5 to 12 minutes

Main content: handling sample proportion imbalance in machine learning

In machine learning one often encounters the problem of unbalanced sample proportions: in a binary classification problem, for example, the ratio of positive to negative samples may be 10:1.

This imbalance is usually dictated by the source of the data itself: in credit scoring problems, for instance, positive samples tend to predominate. An unbalanced sample proportion causes a number of problems, and since the data actually obtained in practice is often unbalanced, this article focuses on solutions for it.

Sample imbalance tends to make the model overfit the majority class, i.e., it assigns almost every sample to the class with more examples. A closely related problem is the Accuracy Paradox: a model whose predictions look highly accurate, yet which generalizes poorly.

The reason is that the model assigns most samples to the majority class. Suppose, as a hypothetical example, that negative samples make up 90% of the data and the model labels most samples negative.

The accuracy rate is then accuracy = (TP + TN) / (TP + TN + FP + FN), which is already close to 0.9 simply because the true negatives dominate the numerator.

And if all the samples were classified as negative, the accuracy would rise further (to exactly 90% in this example), but such a model is clearly useless: in effect, it has been overfitted to the unbalanced sample.
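The paradox above can be reproduced in a few lines. This is a minimal sketch using a hypothetical 1:9 split of positive to negative samples; the always-negative "classifier" is an illustration, not a real model.

```python
# Illustration of the Accuracy Paradox on a hypothetical 1:9 imbalanced dataset.
# A "classifier" that always predicts the majority (negative) class reaches
# 90% accuracy while never identifying a single positive sample.

y_true = [1] * 10 + [0] * 90   # 10 positives, 90 negatives
y_pred = [0] * 100             # always predict negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.9 -- looks good
print(recall)    # 0.0 -- but every positive sample is missed
```

The high accuracy comes entirely from the majority class; recall exposes the failure immediately.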

There are several common ideas for solving the sample imbalance problem:

Collecting more data

Changing the indicators for judging

Sampling the data

Synthetic samples

Changing the sample weights

1. Collecting more data

Collecting more data, so that the proportions of positive and negative samples balance out, is the most often overlooked approach. In practice, when the cost of collecting data is low, it is also the most effective one.

Note, however, that when the scenario the data comes from inherently produces data in unbalanced proportions, collecting more data will not change those proportions.

2. Changing the evaluation metric

Changing the evaluation metric means no longer using accuracy to judge and select models, for the Accuracy Paradox reasons discussed above. In fact there are metrics designed precisely for evaluating models on unbalanced samples, such as precision, recall, the F1 score, ROC (AUC), and Kappa.

According to this article, the ROC curve has the useful property of being insensitive to the sample proportion, and therefore gives a better indication of a classifier's strengths and weaknesses when the classes are unbalanced.

More details on the rubric can be found in the article: Classification Accuracy is Not Enough: More Performance Measures You Can Use
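To see why these metrics catch what accuracy misses, here is a hand-rolled sketch of precision, recall, and F1 applied to the always-negative classifier from before (scikit-learn's `precision_score`, `recall_score`, and `f1_score` compute the same quantities).

```python
# Precision / recall / F1 for a binary problem, written out by hand to show
# how these metrics expose a majority-class classifier that accuracy hides.

def prf1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100                 # always-negative classifier, 90% accurate
print(prf1(y_true, y_pred))        # (0.0, 0.0, 0.0)
```

All three scores are zero for the trivial classifier, even though its accuracy is 90%.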

3. Sampling the data

Sampling the data changes the sample proportions directly. There are two general approaches: over-sampling, which enlarges the smaller class by directly replicating its existing samples, and under-sampling, which shrinks the larger class by discarding some of its samples.

In general, under-sampling is considered when the total number of samples is large, while over-sampling is considered when the number of samples is small.

More details on data sampling can be found in Oversampling and undersampling in data analysis
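Both strategies can be sketched with the standard library alone. The toy minority/majority lists below are hypothetical; a real pipeline would resample feature rows, and libraries such as imbalanced-learn provide ready-made samplers.

```python
import random

# Naive over-sampling (duplicate minority samples) and under-sampling
# (discard majority samples) on toy data with a 1:9 class split.
random.seed(0)
minority = [("pos", i) for i in range(10)]
majority = [("neg", i) for i in range(90)]

# Over-sampling: draw minority samples with replacement up to majority size.
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))

# Under-sampling: keep a random subset of the majority, same size as minority.
undersampled = random.sample(majority, k=len(minority))

print(len(oversampled), len(undersampled))  # 90 10
```

After resampling, training on `oversampled + majority` or `minority + undersampled` presents the classifier with balanced classes.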

4. Synthetic samples

Synthetic samples are another way to enlarge the smaller class; "synthetic" means that new samples are created by combining the features of existing samples.

One of the simplest methods is to randomly pick an existing value for each feature and splice the picks together into a new sample. Like the over-sampling method above, this increases the number of minority-class samples; the difference is that over-sampling merely copies samples, whereas here new samples are assembled.

A representative method in this category is SMOTE (Synthetic Minority Over-sampling Technique), which creates new minority-class samples by interpolating between an existing sample and its nearest neighbors of the same class.

More detailed information on SMOTE can be found in the paper SMOTE: Synthetic Minority Over-sampling Technique
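The core interpolation step of SMOTE can be sketched as follows. This is a deliberate simplification: the helper `smote_one` is a hypothetical name, and a full implementation would also find the k nearest minority neighbors (imbalanced-learn's `SMOTE` class does all of this).

```python
import random

# SMOTE-style interpolation sketch: a synthetic minority sample is placed at
# a random point on the segment between a sample and one of its neighbors.

def smote_one(sample, neighbor):
    """Interpolate a synthetic point between two minority-class samples."""
    gap = random.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

random.seed(1)
a, b = [1.0, 5.0], [2.0, 7.0]
synthetic = smote_one(a, b)
print(synthetic)  # each feature lies between the corresponding values of a and b
```

Because the new point lies between real minority samples, it stays inside the region the minority class already occupies, unlike purely random feature splicing.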

5. Changing the sample weights

Changing the sample weights means increasing the weight of samples from the smaller class: when such a sample is misclassified, its loss value is multiplied by the corresponding weight, so the classifier pays more attention to the minority class.
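A minimal sketch of a class-weighted loss is shown below; the weight of 9.0 for the positive class is a hypothetical choice matching a 1:9 imbalance. Many libraries expose the same idea directly, e.g. the `class_weight` parameter of scikit-learn estimators.

```python
import math

# Class-weighted log loss: misclassifying a minority (positive) sample costs
# `w_pos` times more than misclassifying a majority (negative) sample.

def weighted_log_loss(y_true, p_pred, w_pos=9.0, w_neg=1.0):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        w = w_pos if y == 1 else w_neg
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A positive sample predicted at 0.1 incurs 9x the loss of a negative sample
# predicted at 0.9, even though the raw probability error is the same.
print(weighted_log_loss([1], [0.1]) > weighted_log_loss([0], [0.9]))  # True
```

During training, this asymmetric penalty pushes the decision boundary away from the minority class, counteracting the pull of the majority class.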

References:

8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

In classification, how do you handle an unbalanced training set?

For more articles, please visit: http://wulc.me/