Why does "bagging" in machine learning decrease variance?


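Bootstrap aggregation, or "bagging," decreases variance in machine learning by training many models on different random resamples of the same data set and then combining their predictions. Specifically, the bagging approach draws subsets with replacement, so they often overlap, fits a separate model to each one, and averages the results (or takes a majority vote for classification). The quirks that any single model picks up from its particular sample tend to cancel out in the average.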

One straightforward way to picture bagging is to take a set of random samples and compute the simple mean. Then, from the same set of samples, draw dozens of bootstrap subsets, fit a decision tree to each one, and average the trees' predictions. This second, aggregated estimate is usually more stable than the prediction of any single tree, because it depends less on which particular samples a given tree happened to see. The same idea can be applied to estimating almost any property of a set of data points.
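The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the helper names (`fit_stump`, `bagged_fit`, `bagged_predict`), the toy data, and the choice of a one-split "stump" as the decision tree are all assumptions made for the example.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = mean(left), mean(right)
        # Sum of squared errors if we predict each side's mean.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x was identical: fall back to a constant
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagged_fit(xs, ys, n_models=50, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Aggregate by averaging the individual predictions."""
    return mean(m(x) for m in models)


# Hypothetical toy data: a noisy step from about 1 to about 3.
xs = list(range(10))
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 3.1, 2.9, 3.2, 3.0]
models = bagged_fit(xs, ys)
print(bagged_predict(models, 1), bagged_predict(models, 8))
```

Each bootstrap sample has the same size as the original data, so some points appear more than once and others are left out. That is what makes the individual trees differ from one another.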

Since averaging smooths out the differences between the individual models, this approach decreases variance and helps with overfitting. Think of a scatterplot with widely scattered data points; a single flexible model may chase every point, whereas a bagged model "shrinks" that complexity and produces a smoother fitted line.
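The variance reduction can be made concrete with a standard result about averages. As a simplifying assumption for illustration, suppose each of the B models has the same prediction variance σ² at a point x, and any two models' predictions have correlation ρ. These symbols are not from the original discussion; they are introduced only for this sketch.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

With ρ < 1, the second term shrinks as more models are added, so the ensemble's variance falls below σ². Because bootstrap subsets overlap, the models are correlated (ρ > 0), which is why the variance does not drop all the way to zero.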

Some talk about the value of bagging as "divide and conquer" or a type of "assisted heuristics." The idea is that through ensemble modeling, such as random forests, which are built from bagged decision trees, those using bagging as a technique can get results that are lower in variance. By smoothing out complexity, bagging can also help with overfitting.

Think of a model that tries to fit every data point exactly: say, a connect-the-dots with 100 unaligned dots. The resulting line will be jagged and volatile, and it will change shape whenever the dots change. Bagging "irons out" that variance by combining many such fits, each made on a slightly different resample of the dots. In ensemble learning, this is often thought of as joining several "weak learners" to produce a "strong learner." The result is a smoother, more contoured line and less wild variance in the model.
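The connect-the-dots picture can be checked with a small simulation. The sketch below uses a 1-nearest-neighbor predictor as the jagged model, since it simply returns the value of the closest dot. It then measures how much the prediction at one point changes across many freshly drawn training sets, with and without bagging. The sine curve, noise level, sample sizes, and function names are all assumptions chosen for the example.

```python
import math
import random
from statistics import mean, pvariance


def nearest_neighbor_predict(xs, ys, x0):
    """1-nearest-neighbor: return the y of the training point closest to x0."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bagged_predict(xs, ys, x0, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        preds.append(nearest_neighbor_predict(
            [xs[i] for i in idx], [ys[i] for i in idx], x0))
    return mean(preds)


def compare_variance(n_trials=200, n_points=30, n_models=25,
                     noise=0.5, x0=1.5, seed=0):
    """Variance of the prediction at x0 across independently drawn training sets."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_trials):
        # Draw a fresh noisy training set from an assumed true curve sin(x).
        xs = [rng.uniform(0, 3) for _ in range(n_points)]
        ys = [math.sin(x) + rng.gauss(0, noise) for x in xs]
        single.append(nearest_neighbor_predict(xs, ys, x0))
        bagged.append(bagged_predict(xs, ys, x0, n_models, rng))
    return pvariance(single), pvariance(bagged)


var_single, var_bagged = compare_variance()
print(f"single model variance: {var_single:.3f}")
print(f"bagged model variance: {var_bagged:.3f}")
```

The single model's prediction inherits nearly all of the noise in whichever dot happens to be closest. The bagged prediction blends several nearby dots, so it should vary noticeably less from one training set to the next.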

It's easy to see how the idea of bagging can be applied to enterprise IT systems. Business leaders often want a "bird's eye view" of what's going on with products, customers, etc. An overfitted model can return less digestible data and more "scattered" results, whereas bagging can "stabilize" a model and make it more useful to end users.

Written by Justin Stoltzfus
Justin Stoltzfus is a freelance writer for various Web and print publications. His work has appeared in online magazines including Preservation Online, a project of the National Historic Trust, and many other venues.