P-Hacking: A Statistical Pitfall of Machine Learning

February 10, 2020

A factor that can significantly distort a machine learning model’s measured performance is P-hacking. In broad terms, it refers to the misuse of data in ways that produce misleading statistical results. The P in P-hacking refers to the p-value, a measure of the statistical significance of an experiment’s results: the probability of seeing results at least as extreme as those observed if there were no real effect. In other words, P-hacking is the artificial improvement of the p-value through statistical malpractice.

Malicious P-hacking is a big problem in academic research, but another equally large problem is unintentional P-hacking, which can lead businesses to waste valuable resources by unknowingly deploying ineffective machine learning models.

An effective way to avoid unintentional P-hacking in machine learning is to separate your data into training and holdout sets so that, ideally, your model receives no information about the data in the holdout set during the experimentation and training phases of the model production cycle. It is only after we are confident that we have trained a promising model that we should test it against the holdout set. This approach provides the following key benefits:

  1. Ensures the reliability of performance results on the holdout set
  2. Saves costly data resources
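
To make the split concrete, here is a minimal Python sketch of setting aside a holdout set before any experimentation begins. The function name `split_train_holdout`, the 20% holdout fraction, and the fixed seed are our own illustrative assumptions, not a prescribed recipe; libraries such as scikit-learn offer comparable utilities.

```python
import random


def split_train_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, holdout).

    `holdout_fraction` and `seed` are illustrative defaults (assumptions).
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Toy dataset: 100 example ids standing in for labeled rows.
data = list(range(100))
train, holdout = split_train_holdout(data)

# Experiment and tune on `train` only; leave `holdout` untouched until the end.
print(len(train), len(holdout))
```

Fixing the seed keeps the same examples in the holdout set across runs, which makes it easier to guarantee that none of them slip into training by accident.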

Suppose we have a dataset that we partition into a training set and a holdout set. Suppose further that every time we train our model on the training set, we test it on the holdout set, evaluate the results, and tune the model’s hyperparameters according to what we believe will improve its results on the holdout set. If we repeat this process dozens of times, we unintentionally leak information about the holdout set to our model every time we modify its hyperparameters in response to those results. In practice, we are violating the fundamental statistical assumption that our model is making predictions on previously unobserved data. In other words, we have fitted our model to the holdout set, and its performance there is no longer a reliable indicator of performance in the wild.
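
The leakage effect described above can be simulated with pure noise. In the sketch below, the labels are coin flips and each "hyperparameter setting" is a hypothetical stand-in that simply guesses at random, so no setting can genuinely beat 50% accuracy. The sizes (`N`, `N_TRIALS`) and the seed are arbitrary assumptions chosen for illustration.

```python
import random

rng = random.Random(42)  # fixed seed (assumption) so the run is reproducible
N = 100                  # examples in each evaluation set
N_TRIALS = 50            # rounds of "tune, then check the holdout set"

# Labels are coin flips: there is nothing real for a model to learn.
holdout_labels = [rng.randint(0, 1) for _ in range(N)]
fresh_labels = [rng.randint(0, 1) for _ in range(N)]


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


best_holdout_acc = 0.0
best_fresh_acc = 0.0
for _ in range(N_TRIALS):
    # Each trial stands in for one hyperparameter setting; its outputs are noise.
    holdout_preds = [rng.randint(0, 1) for _ in range(N)]
    fresh_preds = [rng.randint(0, 1) for _ in range(N)]
    acc = accuracy(holdout_preds, holdout_labels)
    if acc > best_holdout_acc:
        # Keep whichever setting looks best on the holdout set ...
        best_holdout_acc = acc
        # ... and record how that same setting does on data it never influenced.
        best_fresh_acc = accuracy(fresh_preds, fresh_labels)

print(f"best accuracy on the reused holdout set: {best_holdout_acc:.2f}")
print(f"same setting on fresh data:              {best_fresh_acc:.2f}")
```

Because we keep the best of many noisy scores, the winning holdout accuracy will typically sit noticeably above 50%, while the same setting scored on fresh data should hover around chance. The gap between the two is the selection effect, not real skill.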

Collecting and labeling data is a time-consuming and costly effort. If we ignore the pitfall described above and indiscriminately use all our data to train and test our model, we exhaust all the data that could give us reliable performance results. The only way to correct this is to collect and label more data, spending more time and money, so that we can test our model on previously unseen data and get reliable feedback on its performance in the wild.

The best way to avoid unintentional P-hacking is to plan ahead and set aside enough data for the holdout set. Use the holdout set to test the machine learning model only after you are confident that it’s promising. This may mean splitting your holdout set into several subsets, each reserved for a separate round of testing, to avoid fitting your model to this data.
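
One way to enforce this discipline in code is to partition the holdout set up front and hand out each subset only once. The sketch below is a minimal illustration; `split_into_subsets` and `OneShotHoldout` are hypothetical helpers we made up for this post, not part of any library.

```python
def split_into_subsets(holdout, n_subsets):
    """Partition `holdout` into `n_subsets` disjoint, near-equal chunks."""
    if n_subsets < 1:
        raise ValueError("n_subsets must be at least 1")
    # Striding keeps the chunk sizes within one element of each other.
    return [holdout[i::n_subsets] for i in range(n_subsets)]


class OneShotHoldout:
    """Hands out each holdout subset at most once (hypothetical helper)."""

    def __init__(self, subsets):
        self._unused = list(subsets)

    @property
    def remaining(self):
        """Number of subsets that have not been used yet."""
        return len(self._unused)

    def next_subset(self):
        """Return the next unused subset, or fail loudly if none are left."""
        if not self._unused:
            raise RuntimeError("No unused holdout subsets left; collect more data.")
        return self._unused.pop(0)


# Toy holdout set of 10 example ids, split into 3 single-use subsets.
subsets = split_into_subsets(list(range(10)), 3)
tracker = OneShotHoldout(subsets)

first = tracker.next_subset()  # spend one subset on the first promising model
print(first, tracker.remaining)
```

Raising an error once the subsets run out is a deliberate choice: it turns "we quietly reused the holdout set" into a visible decision to go and collect more data.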

By following this strategy, you can be assured of your model’s reliability on previously unseen data, saving time and money in the short term and gaining a reliable and effective machine learning algorithm in the long term.

