Why Diversity is Essential for Quality Data to Train AI
Machine learning is a tool that many organizations use to make predictions. The problem is that some of these predictions reinforce biases. The solution is diversity.
Artificial Intelligence (AI) is no longer just a projection of future uses but a part of everyday business practice. Businesses use machine learning (ML) for predictive modeling in an array of industries, from healthcare to finance to security.
The question that businesses have to address is: Are we being careful to not misuse AI by having it reinforce human biases in the training data?
To get insight into the various factors that play into that assurance, Martine Bertrand, Lead AI at Samasource in Montreal, shared her thoughts. Bertrand holds a Ph.D. in physics and has applied her scientific rigor to ML and AI.
The Source of Bias
Bertrand concurs with what other experts have pointed out: “The model doesn’t choose to have a bias,” she said. Rather, it “learns from the data it is exposed to.” Consequently, a data set that is biased toward a certain category, class, gender, or color of skin will likely produce an inaccurate model.
We saw several examples of such biased models in Can AI Have Biases? Bertrand referred to one of the instances, that of Amazon’s Rekognition. It came under fire over a year ago when Joy Buolamwini focused her research on its effects.
Buolamwini found that while Rekognition did have 100% accuracy in recognizing light-skinned males and 98.7% accuracy even for darker males, the accuracy dropped to 92.9% for women with light skin and just 68.6% for darker-skinned women.
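Gaps like these only become visible when accuracy is broken down by subgroup instead of being reported as a single number. The following is a minimal sketch of that kind of breakdown, not the method used in the study; the function name `accuracy_by_group`, the group names, and the records are all made up for illustration.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return {group: accuracy} for (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        if truth == predicted:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

# Toy, made-up records for illustration only (not Rekognition data).
records = [
    ("group_a", "f", "f"), ("group_a", "m", "m"),
    ("group_a", "f", "f"), ("group_a", "m", "m"),
    ("group_b", "f", "m"), ("group_b", "f", "f"),
    ("group_b", "m", "m"), ("group_b", "f", "m"),
]

# A single overall score hides the difference between the two groups.
overall = sum(truth == pred for _, truth, pred in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
```

In this toy data the overall accuracy looks moderate, while the per-group breakdown shows that all of the errors fall on one group.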
Despite calls for law enforcement agencies to stop using it, the software remained in use. Bertrand finds that outrageous because of the potential danger inherent in relying on biased outcomes in that context.
How to Counter the Bias at the Source
Combating bias at the source begins with awareness of the potential problem and setting up a team that features diversity. That means not only gender diversity but also ethnic diversity, skin color diversity, and living experience diversity.
Bertrand explained: "The more diversity of those involved in the data, curating it, and assessing its quality, the better the chance of identifying the bias in data at the source. You have to catch it at the point of data training to be sure to add in the necessary data to avert biased results from data that lacks a diverse perspective."
Can Explainable AI Help?
Bertrand agrees that model interpretability or explainability is a desirable aim that is a very active field of research, though it’s not a simple thing to achieve.
"Deep learning models, which many people refer to as AI, have millions and billions of parameters that make it really hard to understand or visualize why a given set of parameters will lead to a given prediction. It is possible to peek inside the machinery and see how a prediction has come to be by tracing through intermediate predictions that lead to the final one."
Bertrand added: "However, the attempts to assess intermediate places inside the model to see what’s happening are never 100% accurate, and solutions are still in development."
Assessing the Data Used for Model Training
Bertrand explains that there is greater visibility and control over the data that determines the outcomes, “as we’re acting right at the source of model training, acting where the data sets are being annotated.” That enables them to accurately assess the distribution of various classes in the data set and remove noise and incorrect labels.
That’s the service that Samasource provides to its clients and partners, she explained: “advising them what to do to alleviate bad quality data or bias inside their data sets.” The approach is to assess the quality of their clients’ data “using a mixture of statistics and a previously trained model to see if the data” is up to standard. If it is not, they can suggest additional data for the customer and provide a diversity of viewpoints if that was missing.
To illustrate incomplete data, she offered the example of setting up a model for an autonomously driven car. If it is only trained on highway data and is missing data on the conditions in a city or densely populated areas, then the model has insufficient data for the desired outcome.
The assessment of the quality of data fed into the model is the antidote to garbage-in-garbage-out (GIGO). She calls it “quality-in-quality-out.”
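Assessing the class distribution of an annotated data set can be as simple as counting labels and flagging the classes that fall below some share of the total. The sketch below is only an illustration of that idea, not Samasource's tooling; the function names, the `min_share` threshold of 10%, and the scene labels are assumptions chosen for the example.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: share of the data set} for a list of annotation labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def underrepresented(labels, min_share=0.10):
    """Return the labels whose share falls below min_share (an assumed threshold)."""
    shares = class_distribution(labels)
    return sorted(label for label, share in shares.items() if share < min_share)

# Hypothetical annotations for a driving data set, for illustration only.
labels = ["highway"] * 18 + ["city"] * 1 + ["suburban"] * 1

shares = class_distribution(labels)
flagged = underrepresented(labels)

for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
print("under-represented:", flagged)
```

A report like this does not say what the right balance is; it only points reviewers to the classes where more data may need to be collected.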
The good news is that companies are aware of the problem and want tools and help to evaluate possible biases, and Bertrand believes that diversity will deliver the solution to that problem.
“We believe with our core value of diversity we’ll be able to achieve that.”
She is also optimistic that diversity serves as a foundation for building a better future in which people will be more receptive to others and their ideas: “While humans do have their biases, what would be interesting is to expose our kids to a diversity of viewpoints early on so they build an acceptance of diversity and be open to that and see the benefits of accepting people who don’t look like them, think like them, etc., and see the benefits they can offer.”