Machine learning has raised many questions about bias. People are understandably concerned with whether models make decisions ethically and fairly. However, bias is intrinsic to machine learning, and it will surface many times during the development process. Developing a basic understanding of the types of bias in machine learning models is critical for understanding how bias may positively or negatively impact a model's results.
Bias in Healthcare Insurance
To better understand how the most common types of bias will come into play throughout the machine learning lifecycle, we will examine a real use case in the healthcare industry, using hypothetical and simplified data to better illustrate the concepts.
Seeking to promote patient health, a private health insurance company wanted to leverage AI to provide members with product recommendations that optimize coverage and care for patients' current health conditions. The goal of the model was to examine patients' demographic and claims data and recommend products based on predictions about their future use.
The bias term is a parameter that allows models to represent patterns that do not pass through the origin. In this example, a data scientist studying the relationship between age and medical spending during exploratory data analysis observes that elderly patients generally incur more expensive medical treatments than other patients. The data contains no extreme cases where both age and medical spending are 0; in fact, the minimum yearly medical spending in this dataset is $100.
The natural tendency for medical spending to sit well above $0 is represented in a mathematical equation with a bias term. The bias term is intrinsic to the data and needs to be incorporated into the descriptive model in order to get the expected results. In this scenario, bias is intentionally inserted into the model to optimize how well it represents what is observed in the data.
Bias Term in Linear Regression
For any given phenomenon, the bias term we include in our equations represents the tendency of the data to be distributed about a value that is offset from the origin; in a way, the data is biased toward that offset. For example, in a linear regression problem, if we observe from the distribution of the data that most values are centered around a number 'b', our resulting model needs to account for this 'b'. In linear regression, this idea is represented by the traditional line equation 'y = mx + b', where 'b' is called the bias term or offset and represents the tendency of the regression result to land consistently offset from the origin by about b units. It is a very common intentional bias in machine learning models.
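As a minimal sketch of how the bias term appears in practice, the snippet below fits a line to hypothetical age-versus-spending data (all numbers are made up for illustration); the fitted intercept is the bias term 'b', and it lands near the $100 spending floor described earlier:

```python
import numpy as np

# Hypothetical ages and yearly medical spending (illustrative data only)
ages = np.array([25.0, 35.0, 45.0, 55.0, 65.0, 75.0])
spending = np.array([600.0, 820.0, 980.0, 1210.0, 1390.0, 1620.0])

# Fit y = m*x + b; polyfit returns [slope m, intercept b]
m, b = np.polyfit(ages, spending, 1)
print(f"slope m = {m:.2f}, bias term b = {b:.2f}")
```

Without the intercept term the fitted line would be forced through the origin, underestimating spending for every age group in this data.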
Bias Term in Neural Networks
A bias term is also commonly represented as a bias neuron in artificial neural networks. The bias helps determine when a node fires. Within an activation function, the purpose of the bias term is to shift the curve left or right, accelerating or delaying the activation of a node. Data scientists often tune bias values during training so the model better fits the data.
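The shifting effect can be sketched with a single sigmoid node (the input, weight, and bias values below are arbitrary, chosen only to illustrate the behavior):

```python
import numpy as np

def sigmoid(z):
    """Standard logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

x = 0.5   # input value
w = 2.0   # weight

# A negative bias shifts the curve right (delays activation);
# a positive bias shifts it left (accelerates activation).
for b in (-3.0, 0.0, 3.0):
    activation = sigmoid(w * x + b)
    print(f"bias={b:+.1f} -> activation={activation:.3f}")
```

The same input produces a much weaker activation with a negative bias and a much stronger one with a positive bias, which is exactly the left/right shift described above.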
Prediction bias, another common form of bias in machine learning models, is "a value indicating how far apart the average of predictions is from the average of labels in the dataset." In this context, we are often interested in observing the bias/variance trade-off within our models as a way of measuring the model's performance.
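Under that definition, prediction bias reduces to a one-line calculation. The sketch below uses made-up predicted and actual spending values for a small validation set:

```python
import numpy as np

# Hypothetical predicted vs. actual yearly spending (illustrative data only)
predictions = np.array([1200.0, 900.0, 1500.0, 1100.0])
labels = np.array([1000.0, 950.0, 1400.0, 1050.0])

# Prediction bias: average prediction minus average label
prediction_bias = predictions.mean() - labels.mean()
print(f"prediction bias = {prediction_bias:+.1f}")  # positive -> over-predicting
```

A value far from zero in either direction signals that the model systematically over- or under-predicts and needs further tuning.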
In the majority of applications, prediction bias is not deliberately included as part of a model's design; rather, it is used as a measure to evaluate and tune the model. For example, suppose that in the insurance plan recommender scenario, after the model is trained on existing demographic and claims data, testing shows that all members of a particular age group always receive the same plan recommendation, regardless of their claims and conditions. This means the model is generalizing by age rather than personalizing for each patient's particular healthcare needs. Such a model shows high bias and low variance, so the recommendations will not have the desired accuracy and the model must be tuned. In contrast, a different model with low bias and high variance might hyper-personalize to the point that it can only provide accurate recommendations for patients in the training dataset, but cannot identify the general underlying patterns needed to serve new patients. Data scientists tune and optimize models toward low bias and low variance in order to achieve expected results, but the bias/variance trade-off is intrinsic to the process: at some point, one can't be reduced without increasing the other.
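The under- and over-personalizing behaviors described above can be sketched with polynomial fits of different complexity on synthetic data (the quadratic relationship and noise level are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy quadratic relationship (illustrative data only)
x_train = np.linspace(0.0, 1.0, 20)
y_train = x_train**2 + rng.normal(0.0, 0.05, x_train.size)
x_test = np.linspace(0.025, 0.975, 20)
y_test = x_test**2 + rng.normal(0.0, 0.05, x_test.size)

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 0 underfits (high bias, low variance); degree 9 can chase noise
# (low bias, high variance); degree 2 matches the underlying pattern.
for degree in (0, 2, 9):
    train_mse, test_mse = train_test_mse(degree)
    print(f"degree {degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

The high-bias model has poor error everywhere, while the high-variance model achieves low training error without generalizing as well to new points, mirroring the recommender example.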
There is a lot of buzz around ethical AI, and most of the issues concern trust, privacy, fairness, and accountability. Bias can creep into a model at many stages of the machine learning lifecycle, from incorrectly labeling and sampling data to optimizing models for inadequate variables. Bias, ethics, and fairness should be reviewed at each stage of the data science process in order to build ethical algorithms. The impact of ethical bias can be devastating to society, as it can unintentionally disfavor vulnerable populations and perpetuate inequality.
In our insurance plan recommender example, the insurance company wants to ensure that economically disadvantaged groups and ethnic minorities are recommended the same plans as other groups with otherwise similar claims patterns and demographic data. The model has been tuned and is providing optimal plan recommendations based on claims and demographic data, but after analyzing the results, the data scientists find that bias has indeed crept into the algorithm: low-income patients are being recommended plans with less coverage. Their analysis points to two variables that may be influencing the model: residence zip code and medical spending. People in disadvantaged communities with specific zip codes, whose yearly spending is significantly lower than average, were being recommended plans that were not adequate for their healthcare needs. The data science team needs to further tune the model and ensure that the results are not just mathematically accurate, but also ethically unbiased and fair.
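One simple way such a disparity can surface is through a group-level audit of the model's output. The sketch below uses entirely made-up recommendation records (the group labels, plan names, and counts are hypothetical) to compare plan distributions across income groups with otherwise similar claims profiles:

```python
from collections import Counter, defaultdict

# Hypothetical (income_group, recommended_plan) pairs for patients
# with similar claims profiles (illustrative data only)
recommendations = [
    ("low", "basic"), ("low", "basic"), ("low", "basic"),
    ("high", "full"), ("high", "full"), ("high", "basic"),
]

# Tally plan recommendations per income group
by_group = defaultdict(Counter)
for group, plan in recommendations:
    by_group[group][plan] += 1

# Report each group's recommendation shares
for group, counts in sorted(by_group.items()):
    total = sum(counts.values())
    shares = {plan: round(n / total, 2) for plan, n in counts.items()}
    print(group, shares)
```

If the shares diverge sharply between groups that should look alike to the model, that is a signal to investigate proxy variables such as zip code before deploying.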
Bias, Bias, Bias. What Am I Going to Do with You?
Designing to account for bias in machine learning models is an intrinsic part of the ML process. Some types of bias are intentionally inserted into mathematical equations, while others need to be deliberately taken out. Regardless of which side of the equation the bias is on, machine learning models should be designed, trained, and tested to promote trust, fairness, transparency, and accountability for businesses and users alike. We must all take responsibility for safeguarding the ethical use of artificial intelligence algorithms in our society by putting the right processes and checks in place.