The Problems with Imbalanced Datasets and How to Solve Them

January 28, 2020

In one of our latest posts my colleague David shared some valuable insights regarding predicting customer churn with deep learning for one of Wovenware’s healthcare clients. He mentioned that a key part of the business problem was working with an imbalanced dataset. This blog post is the second in a three-part series where I will share insights about the problems that frequently arise when learning from an imbalanced dataset as well as some potential solutions that can help improve a deep learning algorithm’s overall performance.

Imbalanced datasets are often encountered when solving real-world classification tasks such as churn prediction. In this context, an imbalanced dataset is one in which samples from one or more classes significantly outnumber the samples from the remaining classes. For example, consider a dataset with classes A and B, where 90% of the data belongs to class A and 10% belongs to class B. This dataset would be considered imbalanced. A common, unwanted consequence of such a dataset is that a naive classifier can score 90% accuracy by always predicting class A.
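The accuracy trap above is easy to demonstrate. The following sketch (with hypothetical, randomly generated labels) shows a "classifier" that always predicts the majority class and still scores roughly 90% accuracy:

```python
import numpy as np

# Hypothetical imbalanced labels: 90% class A (0), 10% class B (1)
rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=1000, p=[0.9, 0.1])

# A naive "classifier" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy lands near 0.90 even though the model has learned nothing
accuracy = (y_true == y_pred).mean()
print(f"Naive accuracy: {accuracy:.2f}")
```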

In order to solve the problem of customer churn with imbalanced data, the available feature set should be inspected for signs of naive behavior that could jeopardize the model. Tell-tale signs of a naive classifier can also be uncovered by inspecting its confusion matrix: you may notice that the classifier predicts majority-class labels for most, if not all, of the test samples. Additional signs can be uncovered by visualizing a histogram or a probability density plot of the same feature for each class separately – the majority-class distribution may largely contain the minority-class distribution.
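A minimal sketch of the confusion-matrix check, using hypothetical labels and predictions: when nearly all counts land in the majority-class column, the classifier is behaving naively.

```python
import numpy as np

# Hypothetical test labels and predictions from a suspected naive classifier
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 98 + [1] * 2)  # predicts the majority class almost always

# 2x2 confusion matrix: rows = true class, columns = predicted class
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# Most counts fall in column 0 (the majority class) -- a tell-tale sign
```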

As discussed above, accuracy as a standalone performance metric can be misleading when learning from an imbalanced dataset. Accuracy should be complemented with a combination of precision, recall, and F-measure, as well as by visualizing receiver operating characteristic (ROC) curves and comparing the area under each model's ROC curve (AUROC). For this specific deep learning model we favored recall over precision for the simple reason that managing a false positive churn prediction costs far less than losing a customer to a false negative.
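Precision, recall, and F-measure can be computed directly from the confusion-matrix counts. A small sketch with hypothetical churn labels (1 = churned):

```python
import numpy as np

# Hypothetical churn labels and binary model predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # true positives
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false positives
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # false negatives

precision = tp / (tp + fp)  # of predicted churners, how many actually churned
recall = tp / (tp + fn)     # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Favoring recall here means tolerating a few extra false positives (cheap retention offers) in exchange for missing fewer real churners.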

Potential solutions for learning from an imbalanced dataset to gain optimal classification results could involve one or more of the following:

  • Random under-sampling of the majority class or over-sampling of the minority class
  • Generation of synthetic minority class data points. One approach is the popular synthetic minority oversampling technique (SMOTE), where the minority class is up-sampled by choosing a data point, selecting one of its nearest minority-class neighbors, and adding a new synthetic point interpolated at a random fraction of the distance between the two. This is done repeatedly until the desired sampling rate is reached
  • Stratified batch sampling to ensure each batch is balanced at the desired sampling rate
  • Engineering new features from existing features
  • Collection of additional data if the real-world phenomenon being modeled could naturally yield quasi balanced data
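The SMOTE-style interpolation described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation (the widely used one lives in the imbalanced-learn library):

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style sketch: pick a minority sample, pick one of
    its k nearest minority neighbors, and interpolate a synthetic point
    at a random fraction of the distance between them."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)      # distances to minority samples
        neighbors = np.argsort(d)[1:k + 1]         # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                         # random weight in [0, 1)
        synthetic.append(x + gap * (X_min[j] - x))
    return np.array(synthetic)

# Hypothetical 2-D minority-class samples
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6)
print(X_new.shape)  # (6, 2)
```

Each synthetic point lies on the line segment between two real minority samples, so it stays inside the minority class's convex hull.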

In our particular use case, we noticed we were dealing with a set of features whose distributions overlapped. Most of the existing features were clearly not separable, so we decided to discard those and engineer new ones. We experimented with the engineered features and with varying ratios of minority to majority samples by randomly under-sampling the majority class or up-sampling the minority class with SMOTE. Figure 1 below shows the ROC and AUROC for our best performing model.

Figure 1: ROC and AUROC of our customer’s current best performing churn model.


We are aware that balancing the dataset with under-/over-sampling changes the real-life phenomenon it once described. Our justification for this was based on the type of model we trained and the trends discovered in our customer's data. We trained a deep neural network, and without balancing or a batching strategy such as the one mentioned above, most of the weight updates would favor majority-class samples. As for the data, our customer's churn trends are fairly periodic year over year, with features behaving similarly during corresponding seasons even after under-sampling. Taking this feature engineering approach for this churn task was supported by our knowledge of the client's data.
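The stratified batching strategy mentioned above can be sketched as a generator that draws a fixed minority fraction per batch. The function name and 50/50 ratio are illustrative assumptions, not the exact setup used in the project:

```python
import numpy as np

def stratified_batches(y, batch_size, minority_frac=0.5, seed=0):
    """Sketch of stratified batch sampling: each batch contains a fixed
    fraction of minority-class indices (drawn with replacement), so
    weight updates no longer favor the majority class."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    n_min = int(batch_size * minority_frac)
    n_maj = batch_size - n_min
    while True:
        batch = np.concatenate([
            rng.choice(min_idx, n_min, replace=True),
            rng.choice(maj_idx, n_maj, replace=True),
        ])
        rng.shuffle(batch)
        yield batch

# Hypothetical 90/10 imbalanced labels
y = np.array([0] * 90 + [1] * 10)
batch = next(stratified_batches(y, batch_size=32))
print((y[batch] == 1).mean())  # 0.5 by construction
```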

The above example shows the importance of business acumen in making the decisions that can determine the final outcome of an algorithm. Without it, models can produce misleading results.

For additional details regarding model validation and the strategy used to avoid statistical bias in our results, stay tuned for part three of this three-part blog post series on predicting customer churn through AI models.

 
