Q:

How has data scraping for machine learning become the most labor-intensive bottleneck since manual data entry in legacy migration?

A:

One of the practical problems companies encounter when starting a machine learning (ML) project is acquiring the initial training data sets. That acquisition often involves labor-intensive processes such as web scraping or other forms of data scraping.

The terms web scraping and data scraping usually refer to automated activity by computer software, but in many ML projects there are cases where software isn't sophisticated enough to collect the right targeted data, so the collection has to be done "by hand." You might call this "human web/data scraping," and it's a thankless job. It generally involves going out and hunting for the data or images that will "feed" the ML program through training sets. The process is often highly iterative, which makes it tedious, sluggish, demanding work.
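To make the automated side of this concrete, here is a minimal sketch of the kind of scraping step a training-set pipeline might automate: pulling image URLs out of a page's HTML. It uses only Python's standard library; the page content and file names are made up for illustration, and a real pipeline would fetch the HTML with something like `urllib.request` and then download and label each image.

```python
from html.parser import HTMLParser

class ImageLinkExtractor(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

# Stand-in for a page fetched over the network (hypothetical content).
sample_html = """
<html><body>
  <img src="/photos/cat_001.jpg" alt="cat">
  <p>No image here.</p>
  <img src="/photos/cat_002.jpg" alt="cat">
</body></html>
"""

parser = ImageLinkExtractor()
parser.feed(sample_html)
print(parser.image_urls)  # the candidate items for a training set
```

The point of the sketch is the gap the article describes: code like this can harvest every `<img>` tag, but it cannot judge whether an image is actually a usable, correctly labeled training example. That judgment is the "by hand" part.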

Data scraping for ML training sets is a uniquely problematic bottleneck, partly because so much of the surrounding work is conceptual rather than repetitive. Many people can come up with a great idea for a new app that performs machine learning tasks; the nuts-and-bolts practical work is much harder. In particular, delegating the assembly of the training sets can be one of the hardest parts of an ML project, a point explored at length in Mike Judge's "Silicon Valley" TV show. In a season four episode, a startup entrepreneur first bullies a partner into doing the labor-intensive work, then tries to pass it off on college students by disguising it as a homework assignment.

This example is instructive because it shows how disliked and seemingly unimportant manual data scraping is. It also shows that the process is necessary for a wide range of machine learning products: however much people hate data entry, the training sets have to be assembled somehow. Experts often recommend using a web scraping service, essentially outsourcing this labor-intensive work to external parties, but outsourcing can raise security concerns and other problems. Companies that keep the data collection in-house, meanwhile, have to make provision for what is often a very manual and time-consuming process.
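For teams keeping the work in-house, the "provision" is often nothing more exotic than a human-in-the-loop labeling pass. Below is a minimal, stdlib-only sketch of such a pass: a helper that records a label for each collected item into CSV rows. The item names and the hard-coded answer key are hypothetical; in a real workflow `get_label` would prompt a person (for example via `input()`) rather than look answers up.

```python
import csv
import io

def record_labels(items, get_label, out_file):
    """Write (item, label) rows as a minimal labeled training set.

    get_label stands in for the human in the loop: it is called once
    per item and its return value is stored as that item's label.
    """
    writer = csv.writer(out_file)
    writer.writerow(["item", "label"])
    for item in items:
        writer.writerow([item, get_label(item)])

# Simulate the manual pass with a hard-coded answer key (illustrative only).
answers = {"img_001.jpg": "cat", "img_002.jpg": "dog"}
buffer = io.StringIO()
record_labels(answers.keys(), answers.get, buffer)
print(buffer.getvalue())
```

Trivial as it looks, the loop body is where the tedium the article describes lives: it runs once per example, and training sets routinely need thousands of examples.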

In some ways, "human data scraping" for machine learning resembles the manual data entry that sometimes had to be done during legacy migration. As the cloud grew more popular and companies moved their processes and workflows into it, some discovered they had not worked through the practical question of how to get their corporate data out of an isolated legacy system and into cloud-native applications. As a result, people who were otherwise data scientists, or creative professionals with essential IT skills, found themselves doing unpleasant data entry tasks.

The same is likely to happen with machine learning. You might hear a data scientist complaining that “I’m a creative person” or “I’m on the development side” – but somebody has to do the dirty work.

Again, if the creative flow isn't matched by a practical assessment of how the workflow is delegated, task handling will be misdirected. A company that has no one to do the data scraping work of collecting data sets is missing a key link in the chain of a successful project. That's worth keeping in mind any time a company tries to make good on an idea built around developing new machine learning applications.

Written by Justin Stoltzfus
Justin Stoltzfus is a freelance writer for various Web and print publications. His work has appeared in online magazines including Preservation Online, a project of the National Trust for Historic Preservation, and many other venues.