What Is Data Bias?

Artificial intelligence (AI) technology is entirely dependent on the data sets used to train its underlying machine learning (ML) model. Developers build machine learning models from the training data they collect and annotate, and that data is then used to train the ML model to make predictions about the world.

The better the annotated data, the better the predictions. Problems arise when that data is wrong or distorted: the end results become faulty. Wrong or distorted data can have many causes. Often it means the data has been labeled inaccurately, contains errors or is simply of poor quality. When faulty data is used to train a machine learning system, the outcome will not be as expected and the predictive model will fail.

Categorization decisions made by humans can also cause distortion. It's a "garbage in/garbage out" situation. This condition is called data bias. (Read also: Can AI Have Biases?)

Bias in AI and Machine Learning

Machine learning is the part of artificial intelligence (AI) that enables systems to learn and improve from experience without continuous traditional programming. Data bias is the gap between a data set and the most accurate representation of what it is supposed to describe.

Bad data inserts incorrect “facts” into otherwise useful information. Bias in AI describes situations where machine learning-based data analytics systems discriminate against specific groups of people. This discrimination often mirrors our social biases around race, gender, assigned sex, nationality, age and other categories.

Bias occurs when an algorithm produces skewed results because of erroneous assumptions made during the machine learning process. In other words, machine learning bias generally stems from errors made by the people responsible for designing and training the machine learning system.

How Bad Data Damages Machine Learning

Wrong data can have disastrous effects on ML systems. Incomplete or missing data, incorrect data and data bias are the key factors that can ruin a machine learning system. (Read also: The Promises and Pitfalls of Machine Learning.)

Real-life Examples

Machine learning bias has been a known risk for a long time, and it has already surfaced in real-world cases with negative consequences. COMPAS and IBM Watson are two such examples:

  • COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) uses machine learning to predict how likely a defendant is to commit another crime in the future. It is an algorithm used by judges to help determine appropriate sentences in several U.S. states and jurisdictions. Later research found that COMPAS was highly inaccurate at predicting violent recidivism and that its predictions differed depending on whether defendants were Black or white, findings that have been disputed by the company that owns COMPAS. The research raises questions about how machine learning algorithms are used and how human flaws like racial discrimination can become machine-learned flaws.

  • IBM Watson: Many criticisms have been leveled against IBM Watson's foray into medicine. The Jeopardy-winning supercomputer parses hundreds of thousands of medical studies to deliver research-based suggestions to doctors. But determining which studies to weight more heavily (versus studies that were flawed or biased) was not a strong point of the algorithm, resulting in output that was unreliable. Some also complained that Watson was biased toward American methods of diagnosis and treatment, and that it had trouble interpreting doctors' handwritten prescriptions.

Machine Learning Bias Types

Machine learning bias can take several forms. Below are some of the major situations that create bias in machine learning models.

Sample Bias

Sample bias happens when the data used to train the algorithm does not accurately represent the problem space the model will operate in. In other words, this type of bias occurs when a data set does not reflect the realities of the environment in which the model will run. One example is facial recognition systems trained mainly on images of white men but used to identify people of all genders and skin colors. Another example: if an autonomous car is expected to function in the daytime and at night but is trained only on nighttime data, its training data contains sample bias.
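
As a rough illustration of how sample bias can be surfaced, the sketch below compares the distribution of one attribute in a training set against the mix the model is expected to see in production, using the autonomous-car example. The field names, records and tolerance threshold are hypothetical assumptions for this sketch, not part of any real system.

```python
from collections import Counter

def attribute_distribution(records, attribute):
    """Return the share of each value of `attribute` in a list of dicts."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical training set for a model expected to work in day and night conditions.
training_records = [
    {"image_id": 1, "lighting": "night"},
    {"image_id": 2, "lighting": "night"},
    {"image_id": 3, "lighting": "night"},
    {"image_id": 4, "lighting": "day"},
]

# Rough mix the model is expected to encounter in production (assumed 50/50 here).
expected_production_share = {"day": 0.5, "night": 0.5}

observed = attribute_distribution(training_records, "lighting")
for value, expected in expected_production_share.items():
    gap = abs(observed.get(value, 0.0) - expected)
    if gap > 0.1:  # illustrative tolerance, not a standard value
        print(f"Possible sample bias: '{value}' makes up {observed.get(value, 0.0):.0%} "
              f"of training data but roughly {expected:.0%} of expected production traffic")
```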

Algorithm Bias

Algorithm bias happens when there is an issue within the algorithm that performs the calculations powering the machine learning computations. This type of bias has nothing to do with the data; it is a reminder that the term “bias” is overloaded.

Prejudicial Bias

Prejudicial bias (sometimes still referred to as racial bias) tends to dominate headlines about AI failures because it often touches cultural and political matters. This bias happens when the content of the training data is influenced by stereotypes or prejudices held by the humans who produce it. Data scientists and companies must make sure the algorithm does not produce outputs that reflect those stereotypes or prejudices.

Measurement Bias

Measurement bias is a systematic distortion of values that occurs when there is a problem with the device used to observe or measure. This kind of bias skews the data in a particular direction, and incorrect measurements result in malformed data. For example, this type of bias occurs in image recognition data sets where the training data is collected with one type of camera but the production data comes from a different camera. Measurement bias can also arise from imperfect annotation during the data labeling phase of a project.
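
One simple way to look for measurement bias is to group a basic measurement by the device that produced it and compare the groups. The sketch below does this for the camera example with hypothetical brightness values; the device names, numbers and threshold are assumptions made purely for illustration.

```python
from statistics import mean

# Hypothetical image metadata: which camera captured each image and its mean brightness.
samples = [
    {"camera": "camera_A", "brightness": 0.62},
    {"camera": "camera_A", "brightness": 0.58},
    {"camera": "camera_A", "brightness": 0.65},
    {"camera": "camera_B", "brightness": 0.41},
    {"camera": "camera_B", "brightness": 0.39},
]

# Group the measurement by the device that produced it.
by_camera = {}
for s in samples:
    by_camera.setdefault(s["camera"], []).append(s["brightness"])

averages = {camera: mean(values) for camera, values in by_camera.items()}
print(averages)

# A large, systematic gap between devices suggests measurement bias: the model
# may end up learning the camera's signature instead of the real signal.
gap = max(averages.values()) - min(averages.values())
if gap > 0.1:  # illustrative threshold
    print(f"Warning: values differ systematically across cameras (gap = {gap:.2f})")
```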

Exclusion Bias

Exclusion bias happens when an important data point is missing or overlooked in the data being used. It is very common in the data preprocessing stage, most often because valuable data is erroneously considered unimportant and removed.
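
Before discarding records during preprocessing, it can help to check whether the dropped rows are concentrated in one group. The sketch below illustrates the idea on hypothetical customer records; the fields and values are invented for this example.

```python
from collections import Counter

# Hypothetical records; some rows are missing the "income" field.
rows = [
    {"id": 1, "region": "urban", "income": 54000},
    {"id": 2, "region": "rural", "income": None},
    {"id": 3, "region": "rural", "income": None},
    {"id": 4, "region": "urban", "income": 61000},
    {"id": 5, "region": "rural", "income": 38000},
]

# A common preprocessing step: drop rows with missing values.
kept = [r for r in rows if r["income"] is not None]
dropped = [r for r in rows if r["income"] is None]

# If the dropped rows cluster in one group, the "cleanup" step itself
# is introducing exclusion bias into the training data.
print("kept regions:   ", Counter(r["region"] for r in kept))
print("dropped regions:", Counter(r["region"] for r in dropped))
```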

Observer Bias

This type of bias is also known as confirmation bias. Observer bias happens when the observer, consciously or unconsciously, finds the results they expect to see, regardless of what the data actually says. It can occur when researchers enter a project with preconceived, subjective ideas drawn from their own prior work. It also happens when labelers let their subjective judgment steer their labeling decisions, producing imperfect data.

Recall Bias

Recall bias is a type of measurement bias that is also common in the data labeling phase. It takes place when similar types of data are labeled inconsistently, which reduces the accuracy of the end result.
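
One common way to catch both observer bias and recall bias is to have more than one person label the same items and measure how often they agree. The sketch below computes raw agreement between two hypothetical annotators and flags items to re-review; more rigorous metrics such as Cohen's kappa also correct for chance agreement. All labels here are invented for illustration.

```python
# Hypothetical labels from two annotators for the same items.
labels_a = {"img_001": "cat", "img_002": "dog", "img_003": "cat", "img_004": "dog"}
labels_b = {"img_001": "cat", "img_002": "cat", "img_003": "cat", "img_004": "dog"}

items = sorted(labels_a)
disagreements = [item for item in items if labels_a[item] != labels_b[item]]
agreement = 1 - len(disagreements) / len(items)

# Low agreement signals inconsistent labeling guidelines or subjective
# judgment creeping into the annotation process.
print(f"Raw agreement: {agreement:.0%}")
print("Items to re-review:", disagreements)
```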

All of this means that AI systems typically contain some amount of human error.

Fairness in Machine Learning

Fairness in machine learning means designing and building algorithms that are not influenced by external prejudices and that produce the desired results accurately. The training data sets used in machine learning models play a key role in helping the system function properly and flawlessly. (Read also: Why Diversity is Essential for Quality Data to Train AI.)

How to Eliminate Bias in Machine Learning

Removing data bias from machine learning is a continuous process. It requires careful, accurate data collection practices and near-constant review of the data and the model for bias.

Awareness and good governance can help prevent machine learning bias. Resolving data bias requires first determining where the bias occurs; once it is located, it can be removed from the system. However, it is often difficult to tell when the data or the model is biased. Still, there are a number of steps that can be taken to keep the situation under control:

  • Testing and validating to ensure that machine learning systems do not produce biased results due to the algorithms or the data sets (a minimal sketch of one such check appears after this list).

  • Ensuring that the group of data scientists and data labelers is diverse.

  • Establishing strict guidelines for data labeling expectations so data labelers have clear steps to follow while annotating data.

  • Combining inputs from multiple sources to ensure data variety.

  • Analyzing the data on a regular basis and keeping a record of errors so they can be resolved as soon as possible.

  • Having a domain expert review the collected and annotated data. Someone from outside the team may notice biases that have gone unchecked.

  • Utilizing additional resources, such as Google's What-If Tool or IBM's AI Fairness 360 Open Source Toolkit, to examine and inspect ML models.

  • Implementing multi-pass annotation (having multiple annotators label the same items) for any project where label quality is prone to bias.
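
As a starting point for the testing-and-validation step above, the sketch below compares favorable-outcome rates across two groups and computes a disparate impact ratio. The group names, outcomes and the 0.8 cutoff (the so-called four-fifths rule) are used here purely for illustration; dedicated toolkits such as the ones mentioned above provide more complete metrics.

```python
def selection_rate(outcomes, group):
    """Share of favorable model outcomes (prediction == 1) for one group."""
    preds = [p for g, p in outcomes if g == group]
    return sum(preds) / len(preds)

# Hypothetical model outputs: (group, predicted approval).
outcomes = [
    ("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
    ("group_b", 0), ("group_b", 1), ("group_b", 0), ("group_b", 0),
]

rate_a = selection_rate(outcomes, "group_a")
rate_b = selection_rate(outcomes, "group_b")

# Disparate impact ratio: values well below 1.0 mean one group receives
# favorable outcomes much less often than the other.
ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
print(f"selection rates: group_a={rate_a:.2f}, group_b={rate_b:.2f}, ratio={ratio:.2f}")
if ratio < 0.8:  # four-fifths rule, used here only as an illustrative cutoff
    print("Potential bias: favorable outcomes are unevenly distributed across groups")
```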

Final Thoughts

Machines require a high volume of data to learn, and accurately annotating training data is as important as the learning algorithm itself. A common reason that ML models fail to perform well is that they were built on imperfect, biased training data.

  • Training data must be accurate and of high quality to remove bias from ML.

  • Organizations need tech teams with diverse members, both for building models and for creating training data.

  • If training data is produced from internal systems, teams need to find the most comprehensive data available and experiment with different data sets and metrics.

  • If training data is gathered by external partners, it is essential to recruit distributed crowd resources for data annotation.

  • Once training data has been created, it is essential to verify whether it contains any implicit bias.