When you think of machine learning, you probably picture skilled data scientists at keyboards in server rooms. There is an extreme emphasis on quantitative analysis and algorithms. There's not a whole lot of immediate real-world context to many of these programs – at least, that's what many would think.
However, some of today's most groundbreaking machine learning programs are making use of veritable armies of human actors out on the street, in stores and anywhere that they can model basic human activities like walking, working or shopping.
A Wired article by Tom Simonite illustrates this very well with the apt title "To Make AI Smarter, Humans Perform Oddball Low-Paid Tasks."
Using the example of short videos taken in a Whole Foods grocery store, Simonite highlights the kinds of work that will help build out some of the next phase of machine learning.
This raises the question: why are all of these people filming themselves in short, simple videos documenting actions as rudimentary as moving an arm or leg?
The answer sheds some light on where machine learning is and where it is going.
“Researchers and entrepreneurs want to see AI understand and act in the physical world,” Simonite writes, explaining why these workers are roving with cameras. “Hence the need for workers to act out scenes in supermarkets and homes. They are generating the instructional material to teach algorithms about the world and the people in it.”
As many experts will point out, some of the biggest frontiers of machine learning involve image processing and natural language processing. These are extremely quantitative procedures – the inputs are narrow and well-defined, not the wide spectrum of signals found in messy real-world environments. Machine learning programs use visual and audio data in very specific ways to build models: image processing picks out features from a finite field of vision, while speech recognition assembles phonemes into words.
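To make "picking features from a finite field of vision" concrete, here is a minimal sketch of the kind of feature extraction an image-processing model builds on – a small edge-detecting kernel slid across a toy image. The image, the kernel choice (a Sobel-style vertical-edge filter) and the helper function are illustrative assumptions, not anything from a specific production system.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over the image (valid mode) to produce a feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Sobel-style vertical-edge kernel: it responds where brightness
# changes from left to right within its 3x3 window.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy 6x6 "image": dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

feature_map = convolve2d(image, sobel_x)
# The strongest responses in feature_map line up with the
# dark-to-bright boundary between columns 2 and 3.
```

Stacks of learned kernels like this one are, roughly, what a convolutional network's early layers compute – which is why image classification works so well when the inputs are bounded pixel grids.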
Going beyond these input categories means crossing something you might call the "image and speech gap" – moving past image processing and speech recognition into areas where computers have to be analytical in different ways. The training sets will be fundamentally different.
Enter the army of videographers. In some of these new machine learning projects, the smallest units of human activity become the training data. Instead of being trained on the pixels, edges and features that compose image classification tasks, computers use training videos to learn what different types of action look like.
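The idea of learning "what different types of action look like" from labeled clips can be sketched very simply: summarize each clip as a few motion statistics, average those per action label, and classify a new clip by its nearest centroid. Everything here – the 1-D "arm height" traces, the two-number feature summary and the labels – is a toy assumption chosen to illustrate the principle, not the method of any real project described in the article.

```python
import numpy as np

def clip_features(frames):
    """Summarize a clip (a sequence of per-frame measurements, e.g. arm
    height) as two numbers: average position and average frame-to-frame
    movement. Motion is what separates actions from still poses."""
    frames = np.asarray(frames, dtype=float)
    motion = np.abs(np.diff(frames)).mean()
    return np.array([frames.mean(), motion])

def train_centroids(labeled_clips):
    """Average the feature vectors for each action label (nearest-centroid model)."""
    return {label: np.mean([clip_features(c) for c in clips], axis=0)
            for label, clips in labeled_clips.items()}

def classify(centroids, clip):
    """Assign the label whose centroid is closest to the clip's features."""
    feats = clip_features(clip)
    return min(centroids, key=lambda lbl: np.linalg.norm(centroids[lbl] - feats))

# Toy training set: waving oscillates up and down; standing stays flat.
training = {
    "waving":   [[0, 1, 0, 1, 0, 1], [0.1, 0.9, 0.2, 1.0, 0.1, 0.8]],
    "standing": [[0.5, 0.5, 0.5, 0.5, 0.5, 0.5], [0.4, 0.5, 0.4, 0.5, 0.4, 0.5]],
}

model = train_centroids(training)
print(classify(model, [0, 0.9, 0.1, 1.0, 0, 0.9]))    # → waving
print(classify(model, [0.5, 0.5, 0.6, 0.5, 0.5, 0.5]))  # → standing
```

Real systems replace the hand-built statistics with features learned by deep networks over thousands of crowd-sourced clips, but the pipeline – label human actions, summarize the motion, match new footage against it – is the same shape.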
The key thing is what engineers can do with this data when it is aggregated and loaded, and when the computer is trained on it. You'll soon see the results in various fields – for instance, this will make surveillance extremely effective. Computers will be able to "see" in the visual field what people are doing, and apply that to fields like marketing and sales, or perhaps, in some cases, government agency work or criminal justice.
The ramifications also shine a light on the debate between maximum benefit and privacy. Much of the use of these videos will build machine learning models that work for surveillance – but what about people who don't want to be surveilled? When these new machine learning programs are deployed in public spaces, what are the rights of the individual, and where is that line drawn?
In any case, companies are using these human and video resources to dig into the next rounds of machine learning progress – work that will enable computers to recognize what's happening around them, rather than just classifying images or parsing the phonemes of speech. It's an extremely interesting and controversial development in artificial intelligence, and one that deserves its share of attention in the tech media and beyond.