How can engineers evaluate training sets and test sets to spot possible overfitting in machine learning?


To understand how this is done, it helps to know the roles that different data sets play in a typical machine learning project. The training set gives the program a frame of reference: a baseline of examples from which it learns to make predictive and probabilistic decisions. The test set is held-back data used to check how well the trained program performs on examples it did not learn from.

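Overfitting is a problem in machine learning where the model fits its training data too closely, picking up quirks and noise along with the real patterns, and as a result serves its actual purpose poorly when it meets new data.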


One of the overarching commandments of machine learning is that training data and test data should be separate data sets. There is a fairly broad consensus on this, at least in many applications, because of some specific problems with using the same set that you used for training to test a machine learning program.
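As a rough illustration of keeping the two sets separate, here is a minimal sketch of a holdout split using only the Python standard library. The function name `holdout_split`, the 25% test fraction, and the seed are assumptions made for this example, not part of any particular toolkit.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into disjoint train and test lists.

    test_fraction and seed are illustrative defaults, not recommendations.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for a repeatable split
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))                         # stand-in for 20 labeled examples
train_rows, test_rows = holdout_split(rows)

print(len(train_rows), len(test_rows))         # 15 5
print(set(train_rows) & set(test_rows))        # empty set: no example is in both
```

The point of the last line is that no example appears in both sets, so the test set stays unseen during training.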

When a machine learning program uses a training set, which is essentially a set of example inputs (often paired with known outputs), it works from that training set to make decisions about predictive results. One very basic way to think about it is that the training set is the "food" for the intellectual computing process.

Now when that same set is used for testing, the machine can often return excellent results. That's because it has already seen that data. But the whole goal of machine learning in many cases is to produce results for data that hasn't been seen before. General-purpose machine learning programs are made to operate on diverse sets of data. In other words, the principle of machine learning is discovery, and you don't usually get as much of that by reusing the initial training set for test purposes.
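The effect can be sketched with a deliberately extreme model: a 1-nearest-neighbor classifier, which simply memorizes its training points. The synthetic data, the 25% label-noise rate, and every function name below are assumptions chosen for illustration only.

```python
import random

def make_noisy_points(n, rng, flip_prob=0.25):
    """Points in the unit square; label is 1 when x > 0.5, then flipped with flip_prob."""
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < flip_prob:           # inject label noise
            label = 1 - label
        data.append(((x, y), label))
    return data

def predict_1nn(train, point):
    """Return the label of the training point closest to `point`."""
    nearest = min(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    return nearest[1]

def accuracy(train, data):
    """Fraction of `data` whose label matches the 1-NN prediction from `train`."""
    correct = sum(1 for point, label in data if predict_1nn(train, point) == label)
    return correct / len(data)

rng = random.Random(42)
train = make_noisy_points(200, rng)
test = make_noisy_points(200, rng)             # fresh points the model never saw

train_acc = accuracy(train, train)             # each point is its own nearest neighbor
test_acc = accuracy(train, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Scored on its own training set, the memorizer looks perfect because every point is its own nearest neighbor. On fresh points drawn the same way, the noise it memorized stops helping and the score drops.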

In evaluating training sets and test sets for possible overfitting, engineers compare the program's results on the two sets and work out why it performs so differently on one than on the other, or in some cases why it does suspiciously well on the training data itself.

In a 2014 piece on some of these problems in machine learning, Jason Brownlee at Machine Learning Mastery describes overfitting this way:

"A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely have lower accuracy on an unseen test dataset," Brownlee writes. "The reason is that the model is not as generalized. It has specalized to the structure in the training dataset (italics added). This is called overfitting, and it’s more insidious than you think."

In lay terms, you could say that in specializing itself to the training data set, the program is becoming too rigid. That's another metaphorical way to look at why a machine learning program isn't well served by using the training set as the test set. It's also a good way to approach evaluating these two different sets, because the results will show engineers a lot about how the program is working. You want a small gap between the model's accuracy on the two sets. You want to make sure that the system is not overfed or "precision-fused" to a particular data set, but that it is more general and able to handle new data.
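One simple way to put the "small gap" idea into practice is to compute accuracy on each set and compare the two numbers. The sketch below assumes predictions and true labels are already available as plain lists. The helper names and the 0.10 threshold are hypothetical choices for illustration, since an acceptable gap depends on the application.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)

def generalization_gap(train_accuracy, test_accuracy):
    """Training accuracy minus test accuracy; larger values suggest overfitting."""
    return train_accuracy - test_accuracy

def looks_overfit(train_accuracy, test_accuracy, max_gap=0.10):
    """Flag a model whose gap exceeds max_gap (0.10 is an arbitrary example value)."""
    return generalization_gap(train_accuracy, test_accuracy) > max_gap

# Made-up predictions: perfect on the training set, 7 of 10 right on the test set.
train_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
train_preds  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
test_labels  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
test_preds   = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

train_acc = accuracy(train_preds, train_labels)
test_acc = accuracy(test_preds, test_labels)
gap = generalization_gap(train_acc, test_acc)

print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
print("possible overfitting" if looks_overfit(train_acc, test_acc) else "gap looks acceptable")
```

A large gap is a prompt to investigate rather than a verdict. It can also point to a test set that differs from the training data in some systematic way.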


Justin Stoltzfus

Justin Stoltzfus is an independent blogger and business consultant assisting a range of businesses in developing media solutions for new campaigns and ongoing operations. He is a graduate of James Madison University. Stoltzfus spent several years as a staffer at the Intelligencer Journal in Lancaster, Penn., before the merger of the city’s two daily newspapers in 2007. He also reported for the twin weekly newspapers in the area, the Ephrata Review and the Lititz Record. More recently, he has cultivated connections with various companies as an independent consultant, writer and trainer, collecting bylines in print and Web publications, and establishing a reputation…