Data Quality: Why Diversity is Essential to Train AI

As artificial intelligence (AI) grows more robust and capable, people are looking more closely at why diverse representation matters, both in the data fed into these algorithms and in the teams of people who build them.

We intuitively know this kind of diversity has to be there. We see the effort to bring a variety of voices to projects; we know how hard it is to course-correct from older models that were often pretty homogeneous. So what does diversity in AI look like?

For one thing, companies have to work harder to make the team environment more of a melting pot in natural ways, and in ways that don’t single people out or lead to other issues with discrimination. If they can thread this needle well, the firms with the best strategies and policies might gain a real advantage in a quickly evolving field.

A resource from BCG at Davos contends that the best AI programs can deliver a lot to individual workers, including “increased competence … increased autonomy … and stronger relationships,” but only if harmful biases do not impair the system.

Going forward, AI should be trained on data drawn from multiple sources, representing different backgrounds and cultures, so that it makes fair and equitable decisions for all.

The Cost of Poor Data Quality

A piece from the World Economic Forum (WEF) talks about how biases in algorithms are a “costly human oversight,” describing AI as a new frontier for civil rights, partly because of how much privilege these systems can confer.

As models are used to dole out everything from personal loans to scholarships, experts argue, that diversity will be essential. They point to numerous case studies in which AI ends up conferring concrete benefits on individuals based, again, on algorithms and their inputs. For instance, the WEF also maintains resources discussing how the automation of economic analysis can be dangerous to our sense of objectivity.

“Currently, AI machines are susceptible to bias against or toward theories,” experts write.

“The two main sources of AI bias are similar to those behind human cognitive bias: bias in the inputs … and bias in the methodology of looking at the data … As economists might obtain different results depending on their methodological preferences when looking at the same or different data, robots will also obtain different results depending on the literature or information they are fed and the models based on which they are supposed to look at … This bias can be managed or limited, but it cannot be avoided completely.”

And that’s just in economics, a subject that seems heady and abstract to most people. Wait until AI is responsible for figuring out who is next in line for something, or who deserves leniency in a criminal justice context.

Robots are being tested in courtroom situations to see how well they can predict future behavior — another area where a diverse set of inputs could help. A piece from MIT Technology Review describes work on “predictive policing” by researchers at the MIT Media Lab.

As part of an effort to address racial disparities in the criminal justice system, the MIT researchers trained an AI algorithm on more than 3 million criminal records to predict future offenses in two Boston neighborhoods. The algorithm was trained on an algorithmically determined set of variables from the arrest records, such as age, race, sex, arrest history and time since last arrest.
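One common way to audit a predictive model of this kind is to compare how often it flags members of different groups. The following is a minimal sketch of that idea, not the MIT researchers' method: the function names, the group labels "a" and "b", and the sample predictions are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return the share of positive (flagged) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred)
    return {group: positives[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Ratio of the lowest to the highest group rate; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Hypothetical model outputs: 1 = flagged, 0 = not flagged.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rate_by_group(predictions, groups)
print(rates)
print(disparity_ratio(rates))
```

A ratio well below 1.0 does not prove the model is unfair on its own, but it is a signal that the inputs and the methodology deserve a closer look.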

The trust that we have in these machines will, to some extent, match the progress that they make toward diversity.

Diversity By The Numbers

A report by the AI Now Institute of New York University (profiled in Forbes by Maria Klawe) found that 80% of AI professors are men, and that women make up only 15% of Facebook researchers and 10% of Google researchers. The same research found that fewer than 25% of PhDs in 2018 were awarded to women or minorities.

Jim Boerkoel, a grant applicant interviewed in the piece, talks about a lack of diverse thought and how that impacts AI:

“One of the challenges is that when there’s a lack of diversity, there’s a lack of diverse thought,” Boerkoel says. “If the population that is creating the technology is homogeneous, we’re going to get technology that is designed by and works well for that specific population. Even if they have good intentions of serving everyone, their innate biases drive them to design toward what they are most familiar with. As we write algorithms, our biases inherently show up in the decisions we make about how to design the algorithm or what and how data sets are used, and then these biases can get reified in the technology that we produce.”

Diverse people, he says, have to be in the room to think about the gaps and challenges in the AI platform as a whole.

Maintaining Diverse Data Sets

For others, the focus is more on the actual data. Steve Nouri, a Forbes Technology Council member and Head of Data Science and AI at the Australian Computer Society, suggests that data drives the results an AI system produces. Nouri also cites centralization as a key problem.

“Today, the prowess of mass consumer software lies in the hands of the few giant software companies,” Nouri writes. “These companies control and drive the development of products that will impact a majority of the world’s population. Technology is altering lifestyle and human behavior that is beyond our control. AI has the ability to develop self-driving cars, robots that talk to us and much more. Therefore, diversity in AI is necessary to avoid bias in the roots of these systems. The actual problem is polarizing and systemic, and it demands inclusion and true ownership.”
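In practice, maintaining a diverse data set often starts with a simple check of who is actually represented in it. The sketch below is one hedged illustration of that check, not a method described by Nouri: the function name, the "region" attribute, the reference shares and the 5-point tolerance are all assumptions chosen for the example.

```python
from collections import Counter


def representation_gaps(records, attribute, reference_shares, tolerance=0.05):
    """Return groups whose share of the data falls short of a reference share.

    reference_shares maps each group to the share we would expect to see,
    for example from census figures. Both it and tolerance are assumptions.
    """
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if expected - observed > tolerance:
            gaps[group] = {"observed": round(observed, 3), "expected": expected}
    return gaps


# Hypothetical training records, heavily skewed toward one region.
records = (
    [{"region": "north"}] * 70
    + [{"region": "south"}] * 20
    + [{"region": "east"}] * 10
)
reference = {"north": 0.40, "south": 0.30, "east": 0.30}

print(representation_gaps(records, "region", reference))
```

Here the check would report the "south" and "east" groups as underrepresented, which tells a team where to collect more data before training.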

Seeing Results in Applications

As new AI tools start to do more in our society, we see the outputs of technologies that may not have been built with diversity in mind. For example, a widely syndicated piece by Hannah Getahun details what many see as ChatGPT bias: detractors of OpenAI’s platform fear the amplification of ethnocentric views (witness a code module the AI built to determine who would make a good scientist). Then, in MIT Sloan Management Review, Ayanna Howard and Charles Isbell analyze past efforts to roll out AI, one of which resulted in lightening a photo subject’s skin.

“A heated, multi-day online debate ensued, dividing the field into two distinct camps:” the duo write. “Some argued that the bias shown in the results came from bad (that is, incomplete) data being fed into the algorithm, while others argued that it came from bad (that is, short-sighted) decisions about the algorithm itself, including what data to consider.”

How are these debates resolved, and will companies be able to put the needed guardrails in place?

Conclusion: Diversity is Key to Data Quality

To resolve these debates and give companies the tools they need to put guardrails in place, data quality will be critically important. Diversity will be crucial on several fronts: to bring diverse perspectives to ethical and transparent AI, to make sure the data used to train AI is itself diverse, and to ensure we have a handle on how these tools will change our lives.

In the future, we can expect to see more research into AI ethics and greater efforts to ensure that AI technology serves everyone equally. Meanwhile, AI will continue to revolutionize many aspects of our lives, from transportation to healthcare and beyond.