P-Hacking: A Statistical Pitfall of Machine Learning

A factor that can significantly affect a machine learning model’s performance is P-hacking. In broad terms, it refers to the incorrect usage of data that produces misleading statistical results. The P in P-hacking refers to the p-value, a measure of the statistical significance of an experiment’s results. In other words, P-hacking is the improvement of the p-value via statistical malpractice.

Malicious P-hacking is a big problem in academic research, but another equally large problem is unintentional P-hacking, which can lead businesses to waste valuable resources by unknowingly deploying ineffective machine learning models.

An effective way to avoid unintentional P-hacking in machine learning is to separate your data into training and holdout sets so that, ideally, your model receives no information about the data in the holdout set during the experimentation and training phases of the model production cycle. It is only after we are confident that we have trained a promising model that we should test it against the holdout set. This approach provides the following key benefits:

  1. Ensures the reliability of performance results on the holdout set
  2. Saves costly data resources

Suppose we have a dataset which we partition into a training set and a holdout set. Taking the example a step further, suppose that every time we train our model on the training set, we proceed to test it on the holdout set, evaluate the results, and tune the model’s hyperparameters according to what we believe will improve its results on the holdout set. If we repeat this process dozens of times, the consequence is that we unintentionally leak information about the holdout data to our model every time we modify its hyperparameters in response to it. In practice, we are violating the fundamental statistical assumption that our model makes predictions on previously unobserved data. In other words, we have conformed our model to the holdout set, and its performance is no longer a reliable indicator of performance in the wild.

Collecting and labeling data is a time-consuming and costly effort. If we neglect the scenario described above and indiscriminately use all our data to train and test our model, we will exhaust all the data that can give us reliable performance results. The only way to correct this would be to collect and label more data, spending more time and money, in order to test our model on previously unseen data that can give us reliable feedback on our model’s performance in the wild.

The best way to avoid unintentional P-hacking is to plan ahead and set aside enough data in the holdout set. Use the holdout set to test the machine learning model only after you are confident that it’s promising. This may mean splitting your holdout set into several subsets to avoid fitting your model to this data.
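As a minimal sketch of this idea (plain Python with stand-in data; the function name and numbers are illustrative), the split might look like:

```python
import random

def train_holdout_split(data, holdout_fraction=0.2, seed=42):
    """Shuffle the data once and set aside a holdout set that the
    model never sees during experimentation and training."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cut = int(len(data) * (1 - holdout_fraction))
    train = [data[i] for i in indices[:cut]]
    holdout = [data[i] for i in indices[cut:]]
    return train, holdout

samples = list(range(100))            # stand-in for labeled examples
train, holdout = train_holdout_split(samples)
assert len(train) == 80 and len(holdout) == 20
assert not set(train) & set(holdout)  # no overlap, hence no leakage
```

With a scheme like this, the holdout portion stays untouched until the final evaluation; splitting the holdout set into several subsets simply means applying the same idea again to the holdout portion.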

By following this strategy, you can be assured of your model’s reliability on previously unseen data, saving time and money in the short term and yielding a reliable and effective machine learning algorithm in the long term.

The Problems with Imbalanced Datasets and How to Solve Them

In one of our latest posts my colleague David shared some valuable insights regarding predicting customer churn with deep learning for one of Wovenware’s healthcare clients. He mentioned that a key part of the business problem was working with an imbalanced dataset. This blog post is the second in a three-part series where I will share insights about the problems that frequently arise when learning from an imbalanced dataset as well as some potential solutions that can help improve a deep learning algorithm’s overall performance.

Imbalanced datasets are often encountered when solving real-world classification tasks such as churn prediction. In this context an imbalanced dataset refers to data samples from one or more classes that significantly outnumber the samples from the rest of the classes in the dataset. For example, consider a dataset with classes A and B, where 90% of the data belongs to class A and 10% belongs to class B. This dataset would be considered imbalanced. A common, unwanted situation that results from such a dataset is that a naive classifier could score 90% accuracy by always predicting class A.
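To make the pitfall concrete, a quick sketch (hypothetical labels) shows how a naive majority-class baseline earns its misleading accuracy:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# 90 samples of class A, 10 of class B
labels = ["A"] * 90 + ["B"] * 10
print(majority_baseline_accuracy(labels))  # 0.9
```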

In order to solve the problem of customer churn with imbalanced data, the available feature set should be inspected for signs of naive behavior that could jeopardize the model. Tell-tale signs of a naive classifier can also be uncovered by inspecting its resulting confusion matrix: you might notice that the classifier predicted majority-class labels for most, if not all, of the test samples. Additional signs can be uncovered by visualizing a histogram or a probability density plot of the same feature for separate classes – you might see a majority-class distribution include most of the minority-class distribution.

As discussed above, accuracy as a standalone performance metric can be misleading when learning from an imbalanced dataset. Accuracy should be complemented with a combination of precision, recall and F-measure, as well as by visualizing receiver operating characteristic (ROC) curves and comparing the area under each model’s ROC curve (AUROC). For this specific deep learning model we favored recall over precision for the simple reason that managing a false positive churn prediction would cost a lot less than losing a customer due to a false negative.
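A minimal sketch of these complementary metrics, using hypothetical confusion-matrix counts for a churn model, might look like:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical churn results: 8 churners caught, 12 false alarms,
# 2 churners missed, 78 correct "no churn" calls.
tp, fp, fn, tn = 8, 12, 2, 78
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this hypothetical case accuracy looks respectable at 0.86, but the recall of 0.80 versus the precision of 0.40 makes the trade-off explicit – exactly the kind of information a single accuracy number hides.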

Potential solutions for learning from an imbalanced dataset to gain optimal classification results could involve one or more of the following:

  • Random under-sampling of the majority class or over-sampling of the minority class
  • Generation of synthetic minority-class data points. One approach is the popular synthetic minority oversampling technique (SMOTE), in which the minority class is up-sampled by choosing a data point, selecting one of its nearest neighbors, and adding a randomly weighted interpolation between the two as a new synthetic sample. This is repeated until the desired sampling rate is reached
  • Stratified batch sampling to ensure each batch is balanced at the desired sampling rate
  • Engineering new features from existing features
  • Collection of additional data, if the real-world phenomenon being modeled could naturally yield quasi-balanced data
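The core SMOTE interpolation step can be sketched in plain Python (illustrative data; a production implementation such as the one in imbalanced-learn also handles sampling rates and edge cases):

```python
import random

def smote_sample(minority, k=2, seed=0):
    """Create one synthetic minority sample: pick a point, pick one of its
    k nearest neighbors, and interpolate a random fraction of the way
    toward that neighbor (a sketch of the core SMOTE step)."""
    rng = random.Random(seed)
    point = rng.choice(minority)
    # rank the other points by squared Euclidean distance to `point`
    others = [p for p in minority if p is not point]
    others.sort(key=lambda q: sum((a - b) ** 2 for a, b in zip(point, q)))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # random interpolation weight in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(point, neighbor))

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
synthetic = smote_sample(minority)
print(synthetic)  # lies on the segment between a point and a near neighbor
```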

In our particular use case, we noticed we were dealing with a set of features whose distributions overlapped. Most of the existing features were clearly not separable, so we decided to discard those and engineer new ones. We experimented with those engineered features and with varying ratios of minority to majority samples, randomly under-sampling the majority class or up-sampling the minority class with SMOTE. Figure 1 below shows the ROC and AUROC for our best performing model.

Figure 1: ROC and AUROC of our customer’s current best performing churn model.


We are aware that balancing the dataset with under-/over-sampling changes the real-life phenomenon it once described. Our justification for this was based on the type of model we trained and the trends discovered in our customer’s data. We trained a deep neural network, and without balancing or a batching strategy such as the one mentioned above, most of the weight updates would favor majority-class samples. As for the data, our customer’s churn trends are fairly periodic year over year, with features behaving in a similar fashion during corresponding seasons even after under-sampling. Taking this feature engineering approach for this churn task was supported by our knowledge of the client’s data.

The above example shows the importance of having business acumen in order to make the decisions that can determine the final outcome of an algorithm. Without it, models could produce inaccurate results.

For additional details regarding model validation and the strategy used to avoid statistical bias in our results, stay tuned for part three of this three-part blog post series on predicting customer churn through AI models.

Helping a Healthcare Insurance Provider Predict Customer Churn

Wovenware’s data science team recently began working with a major healthcare provider to help it better predict customer churn and more proactively prevent it. Customer churn is an issue that impacts service providers everywhere. It represents the percentage of customers that stop using a service for one reason or another. Companies are committed to keeping customer churn as low as possible because the cost of acquiring new customers is higher than the cost of retaining existing ones. They realize that any improvement in customer churn has a big impact on revenue.

Challenges to Addressing Customer Churn

Our healthcare client has a few peculiarities that make it a challenge to keep customer churn in check. Its customers can choose to change their service provider at any time, but the client is only notified at the end of the month, when it’s too late for any remedial action. This limits its ability to identify the customer’s reason for leaving. In addition, the nature of the business limits the value of the data related to customer behavior. Consider this: if a customer is using his health insurance, does it mean that he’s happy with the service, or that he is just sick? Conversely, if a customer hardly ever uses his health insurance, does it mean that he is unhappy with the service, or that he is just healthy?

Our strategy to address our client’s customer churn, given the limitations mentioned above, was to build a predictive deep learning model to help it know which customers were at a higher risk of canceling their subscription in the upcoming month. Data that helped build the model included existing customer demographic data and health insurance claims, such as dollar amounts and type of claims. The resulting live predictions would give the provider enough time to contact the high-risk clients and address any need they have before they cancel their membership.

How Did We Address It?

So how did we accomplish this? First, we processed the claims data, which consisted of millions of data points with multiple entries per day for each customer. We consolidated the claims data of each customer by month (since this is the timeframe the client uses to measure customer churn). Then, we analyzed the consolidated data to find patterns that could help us identify valuable features to train the deep learning model. We compared the data points of customers that stopped using the service to those of customers that continued using the service, and found that all demographic and claims data followed the same distribution – a roundabout way of discovering that we had no meaningful features to train a deep learning model.
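A simplified sketch of the monthly consolidation step (hypothetical customer IDs, dates and claim amounts; real claims have many more fields) might look like:

```python
from collections import defaultdict

def consolidate_claims_by_month(claims):
    """Aggregate per-customer daily claim amounts into monthly totals and
    counts, the timeframe the client uses to measure churn."""
    monthly = defaultdict(lambda: {"total": 0.0, "count": 0})
    for customer_id, date, amount in claims:
        key = (customer_id, date[:7])  # "YYYY-MM"
        monthly[key]["total"] += amount
        monthly[key]["count"] += 1
    return dict(monthly)

claims = [
    ("c1", "2019-01-03", 120.0),
    ("c1", "2019-01-17", 80.0),
    ("c1", "2019-02-02", 40.0),
    ("c2", "2019-01-05", 300.0),
]
summary = consolidate_claims_by_month(claims)
print(summary[("c1", "2019-01")])  # {'total': 200.0, 'count': 2}
```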

Given this setback, we decided to engineer new features by performing arithmetic operations on other claims features, which turned out to be valuable. We also used Pearson Correlation Coefficients to determine the strength of the relationships between features and kept the features with the strongest relationships as the indicators of customer churn.
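For reference, the Pearson correlation coefficient behind this selection step can be computed as follows (illustrative values; in practice a library routine such as scipy.stats.pearsonr would typically be used):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

feature = [1.0, 2.0, 3.0, 4.0]
target  = [2.1, 3.9, 6.2, 7.8]   # increases strongly with the feature
print(round(pearson(feature, target), 3))  # 0.998
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 suggest the feature carries little linear signal about the target.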

What we found is that the occurrence of a customer leaving is actually rare, which leads to an imbalanced dataset – a problem when training a deep learning model. A model trained on an imbalanced dataset could learn to correctly predict the prevalent case and perform poorly when presented with a rare case, which for us is the case of interest.

Our Approach

The architecture we employed used three fully connected layers, with a single neuron and a sigmoid activation function at the output, and we optimized a binary cross-entropy loss. A portion of the dataset was used for training and another portion was used to test the trained model. The portion of the dataset used for testing is called the holdout set. It was especially important to handle the holdout set with care because we wanted to avoid statistical bias in our results.
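A minimal sketch of the output stage described here – a single sigmoid neuron scored with binary cross-entropy – using hypothetical logits and labels:

```python
import math

def sigmoid(z):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

logits = [2.0, -1.5, 0.3]             # hypothetical raw network outputs
probs = [sigmoid(z) for z in logits]  # the single sigmoid output neuron
labels = [1, 0, 1]                    # 1 = churned, 0 = stayed
print(round(binary_cross_entropy(labels, probs), 4))  # 0.2942
```

The loss is small when confident predictions match the labels (the first two samples) and grows for uncertain or wrong ones (the third), which is what drives the weight updates during training.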

In the second blog post of this three-part series, read about our approach to model validation with the holdout set.

It’s Time to Act on the AI Talent Shortage

There’s been a lot of talk about the shortage of data scientists and engineers, and unfortunately, the problem is going to get worse before it gets better. When you consider the increasing demand for Artificial Intelligence (AI) expertise in all types of businesses and the role that AI is playing in making companies more competitive, there’s no question that it’s a serious issue.

We’re seeing AI applications across industries, in situations as diverse as saving the environment, predicting who will be re-admitted to hospitals or which medical device might fail, and it seems like use cases keep on coming. As Andrew Ng, a noted computer scientist, was quoted as saying, “I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.”

And industry statistics bear that out. According to a Stanford University AI Index report, there are 4.5 times more jobs in the field than in 2013. Glassdoor found that data scientists lead the pack when it comes to salary, job satisfaction and available positions. And an Ernst & Young survey found that the biggest obstacle to implementing AI projects throughout their organizations was the shortage of skilled AI professionals, according to 56% of respondents.

Part of the problem is that we’re just not graduating enough data experts. Campuses, such as Stanford University and Boston University, are offering degree programs, and new training programs are cropping up everywhere, but even with these programs, we just can’t keep up with the burgeoning demand.

This is concerning on many levels – from an individual company’s market outlook on a micro level to our nation’s ability to compete on the international stage. I believe we must act decisively, proactively and swiftly to address this problem, and I’ve identified four areas where we can have positive impact:

  • Growing the government role. The government needs to do more to bridge the AI talent gap. For example, it can allocate more money to R&D to develop advanced AI tools that enable data scientists to build algorithms faster, better and more accurately. It also should provide more grants and other economic incentives to encourage people to learn data science through training programs, college courses or advanced degrees.
  • Educate early and often. Schools should be teaching technology classes to everyone in order to prepare the next generation of students for the mid-21st century workplace. Cultivating an interest in and curiosity about tech should begin at the earliest levels of education, even in kindergarten. And that’s just the beginning. Throughout primary and secondary school, classes in technology should be required and taught alongside science, math, literature, history and language to provide the must-have knowledge in today’s world. Everyone should learn how to code before they graduate.
    Colleges should build upon this knowledge with more advanced courses, and more university and graduate programs should be offered, providing critical, in-depth expertise, particularly in data engineering and data science.
  • Retraining is a key part of the solution. Companies should seize opportunities to retrain their workforce from roles that are shrinking to ones that will continue to be in demand, such as data science. For example, workers trained in traditional coding and legacy systems would be ideal candidates to learn data science. Similarly, the government should promote retraining programs for skilled, educated workers from other fields.
  • Strategic partnerships are the cure. Companies looking to fast-track key AI initiatives are turning to solution providers, nearshorers and other strategic business partners. These firms provide highly educated, trained data scientists and engineers, and the advanced GPU processors and infrastructure to manage huge amounts of data. Leveraging these types of partners not only helps companies address a talent shortage, but it can be a more cost-effective long-term solution, since it can be costly to build these types of AI resources in-house.

In addition to these measures, we must fire on all cylinders to make a difference, and that requires collaboration. Just like the government-private sector industry initiatives in STEM, we can bring multiple stakeholders together to form consortiums to address the AI talent shortage and build a brighter future for AI innovation. Surely, if we put our heads together and collaborate, we can achieve a groundswell of interest in AI careers, but it begins with setting our sights on the goal and then making it happen.

Deepfakes: A Serious Cause for Concern

Recently I wrote about the roots of deepfake technology and its consequences, in a Forbes Technology Council article.

Deepfakes are based on AI and alter real video content or images to create fake ones. Some examples include super-imposing someone’s face on a celebrity’s body. This distortion of media content, however, is already causing quite a stir. Consider the CNN reporter who was the victim of a doctored video that was made to look like he was shoving a press conference facilitator, or a video of House Speaker Nancy Pelosi who was made to appear drunk and slurring her words.

These types of deepfakes can harm reputations, but they also can have even more catastrophic consequences. Imagine an edited video made to sound like a world leader declaring war; or verbally attacking a foreign country.

As I shared in the Forbes article, the same AI technology that has enabled deepfakes can be the antidote to eliminate them. Already, large tech firms such as Facebook and Microsoft have taken initiatives to create technology that can detect and remove them, amassing giant databases to train algorithms on how to sniff them out.

How to Spot a Deepfake

As amateur “deepfakers” jump on the bandwagon, creating what is coming to be known as cheap fakes, there are tell-tale signs that what you are seeing or hearing is not real. Take, for example, people who do not blink in videos, or shadows that do not fall in the right places. Other sure signs of a fake are faces that don’t fit the body or that appear to blur. It is useful to have both good and poor deepfakes for training detection algorithms in order to make them as comprehensive as possible.

It’s Time for Government to Step In

Despite these efforts, AI alone won’t be able to eliminate the deepfake problem – it also requires government intervention. Last June, Congress held its first-ever hearing on deepfakes, during which the idea of making social media firms liable for the damages caused by deepfakes was discussed. This may be a step in the right direction, since currently there are no laws in place to govern the spread of deepfakes, and they can be created and shared with impunity. There needs to be some liability for those that create and share them.


There’s no doubt that advanced technologies are doing great things, but they also have the power to cause harm. To change the course of deepfakes, applying AI technology to sniff them out, while establishing more regulatory control of social media platforms, just may help put a stop to the deceit, lies and danger they pose.

Putting a New Face on Traditional Outsourcing

An article I recently wrote for the Future of Sourcing addressed the key trends driving a new generation of IT outsourcing. The IT industry has changed dramatically and in order for outsourcing to succeed it must follow suit.

Perhaps what stands out the most when it comes to outsourcing transformation is the relationship between the service provider and the buyer. Today’s service buyers are not looking only for arms and legs. In most cases, they’re looking for true partners who can bring more to the table in terms of expertise and how it can bear on very specific business problems.

As I illustrated in the article, following are some very real ways in which the nature of outsourcing has changed and will continue to change:

  • The switch to a provider ecosystem. A key change is that a company can no longer outsource its IT project in a vacuum. Today businesses require a whole team of providers, who can work in partnership with each other to bring key capabilities to the project. This is vastly different from the past, when companies turned to big consulting firms or ERP systems to deliver the capabilities that they needed in a one-stop-shop.
  • The need for industry-specific expertise. It’s not enough for today’s outsourcer to be a technology expert; it must also be an expert in specific market sectors, such as retail or banking. This is particularly critical in AI development, where knowledge of specific business problems is paramount. In this area, it’s essential for an outsourcer to have a firm grasp of key issues to ask the right questions, gather the relevant data and develop an effective algorithm.
  • Growing compliance demands. With the complexity of advanced AI and other IT solutions, as well as the need to protect sensitive customer data, ensuring compliance to federal, industry-specific and corporate rules and regulations is critical. Today’s outsourcers need to be proficient in these requirements and be able to easily demonstrate compliance, and help companies establish the processes, protocols and software they need to achieve compliance with their industry regulations.

Driven by increasing complexity, advanced AI innovation and a growing understanding that business drives IT demands, IT outsourcing is not what it was just a few short years ago – but it’s even better. The future holds great promise for outsourcing and many companies understand that it simply makes good business sense.