What Does Outlier Detection Mean?
Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data.
An outlier may be defined as a piece of data or observation that deviates drastically from the given norm or average of the data set. An outlier may be caused simply by chance, but it may also indicate measurement error or that the given data set has a heavy-tailed distribution.
Here is a simple scenario in outlier detection, a measurement process consistently produces readouts between 1 and 10, but in some rare cases we get measurements of greater than 20.
These rare measurements beyond the norm are called outliers since they “lie outside” the normal distribution curve.
Techopedia Explains Outlier Detection
There is really no standardized and rigid mathematical method for determining an outlier because it really varies depending on the set or data population, so its determination and detection ultimately becomes subjective. Through continuous sampling in a given data field, characteristics of an outlier may be established to make detection easier.
There are model-based methods for detecting outliers and they assume that the data are all taken from a normal distribution and will identify observations or points, which are deemed to be unlikely based on mean or standard deviation, as outliers. There are several methods for outlier detection:
- Grubb’s Test for Outliers – This is based upon the assumption that the data are of a normal distribution and removes one outlier at a time with the test being iterated until no more outliers can be found.
- Dixon’s Q Test – Also based upon normality of the data set, this method tests for bad data. It has been noted that this should be used sparingly and never more than once in a data set.
- Chauvenet’s Criterion – This is used to analyze if the outlier is spurious or is still within the boundaries and be considered as part of the set. The mean and standard deviation are taken and the probability that the outlier occurs is calculated. The results will determine if it is should be included or not.
- Pierce’s Criterion – An error limit is set for a series of observations, beyond which all observations will be discarded as they already involve such great error.