How can engineers evaluate training sets and test sets to spot possible overfitting in machine learning?

To understand how this is done, it helps to have a basic grasp of the roles the different data sets play in a typical machine learning project. The training set gives the model its frame of reference – the baseline data from which it learns to make predictive and probabilistic decisions. The test set is held back so that the model can be evaluated on data it has not seen.

Overfitting is a problem in machine learning where a model fits its training data too closely – capturing noise and quirks of that particular set rather than the underlying pattern – so that it performs poorly on new data.

One of the overarching commandments of machine learning is that training data and test data should be kept as separate data sets. There is broad consensus on this, at least in most applications, because of some specific problems that arise from testing a machine learning program on the same set that was used to train it.
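The split itself is straightforward. In practice, libraries such as scikit-learn provide a ready-made `train_test_split` helper, but a minimal pure-Python sketch (the function name and 80/20 ratio here are illustrative choices, not a prescribed standard) shows the idea:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle the data and split it into disjoint train and test sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = data[:]                 # copy so the original list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))
train, test = train_test_split(examples)
# 80 examples land in the training set, 20 in the test set,
# and no example appears in both.
```

Shuffling before splitting matters: if the data is ordered (say, by date or by class), a naive head/tail split would give the model a training set that is not representative of the test set.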

When a machine learning program uses a training set – essentially a set of example inputs – it works off that set to learn how to make predictions. One very basic way to think about it is that the training set is the "food" for the learning process.

Now, when that same set is used for testing, the model will often return excellent results – because it has already seen that data. But the whole goal of machine learning, in many cases, is to make predictions about data that hasn't been seen before. General-purpose machine learning programs are built to operate on diverse sets of data. In other words, the principle of machine learning is generalization, and you don't measure that by reusing the training set for test purposes.
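A toy example makes this concrete. The "classifier" below (a hypothetical construction for illustration, not a real algorithm) simply memorizes its training examples in a lookup table. Scored on the data it has already seen, it looks perfect; scored on unseen inputs, its weakness shows immediately:

```python
def memorizing_classifier(train_x, train_y):
    """A toy 'model' that simply memorizes its training examples."""
    table = dict(zip(train_x, train_y))
    def predict(x):
        # Return the memorized label, or a default guess for unseen inputs.
        return table.get(x, 0)
    return predict

train_x, train_y = [1, 2, 3, 4], [1, 0, 1, 0]
predict = memorizing_classifier(train_x, train_y)

# Perfect score on the data it has already seen.
train_acc = sum(predict(x) == y for x, y in zip(train_x, train_y)) / len(train_x)

# Unseen inputs expose how little the model has actually generalized.
test_x, test_y = [5, 6], [1, 0]
test_acc = sum(predict(x) == y for x, y in zip(test_x, test_y)) / len(test_x)
```

Testing on the training set would have reported 100% accuracy and hidden the problem entirely, which is exactly why the two sets must stay separate.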

In evaluating training sets and test sets for possible overfitting, engineers compare results across the two sets and investigate why a program performs so differently on one than the other – or, in some cases, why it performs suspiciously well on the training data itself.
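That comparison can be reduced to a simple number: the gap between training accuracy and test accuracy. A minimal sketch, using made-up prediction results rather than a real model's output:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical results from the same model on both sets.
train_preds, train_labels = [1, 0, 1, 1, 0], [1, 0, 1, 1, 0]
test_preds,  test_labels  = [1, 0, 0, 1, 1], [1, 1, 1, 0, 0]

gap = accuracy(train_preds, train_labels) - accuracy(test_preds, test_labels)
# A large positive gap is a common warning sign of overfitting.
```

There is no universal threshold for an acceptable gap – it depends on the task and the data – but a model that is near-perfect on training data and much worse on test data is the classic overfitting signature.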

Describing some of these problems in a 2014 piece, Jason Brownlee of Machine Learning Mastery puts overfitting this way:

"A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset," Brownlee writes. "The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting, and it's more insidious than you think."

In lay terms, you could say that in specializing itself to the training data, the program becomes too rigid. That's another metaphorical way to see why a machine learning program isn't well served by reusing the training set as the test set. It's also a useful lens for evaluating the two sets, because the comparison tells engineers a lot about how the program is working. You want a small gap between training accuracy and test accuracy. You want a system that is not "precision-fused" to one particular data set, but one that stays general enough to handle new data as it arrives.

Written by Justin Stoltzfus
Justin Stoltzfus is a freelance writer for various Web and print publications. His work has appeared in online magazines including Preservation Online, a project of the National Historic Trust, and many other venues.