Outlier Analysis:
An outlier is an observation that appears to deviate markedly from other
observations in the sample. Identification of potential outliers is
important for the following reasons.
- An outlier may indicate bad
data. For example, the data may have been coded incorrectly or an
experiment may not have been run correctly. If it can be determined that
an outlying point is in fact erroneous, then the outlying value should be
deleted from the analysis (or corrected if possible).
- In some cases, it may not be possible to determine if an
outlying point is bad data. Outliers may be due to random variation or may
indicate something scientifically interesting. In any event, we typically do
not want to simply delete the outlying observation. However, if the data
contain significant outliers, we may need to consider the use of robust
statistical techniques.
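To illustrate why robust techniques matter, here is a minimal sketch comparing the mean and standard deviation with the median and the median absolute deviation (MAD) on a small sample, with and without one extreme value. The sample values and the choice of median/MAD as the robust estimators are assumptions made for illustration; the text above does not prescribe a particular technique.

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median([abs(x - center) for x in values])

# Hypothetical measurements; the last value of `contaminated` is an outlier.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]
contaminated = clean + [100.0]

for name, sample in (("clean", clean), ("contaminated", contaminated)):
    print(
        name,
        "mean=%.2f" % statistics.mean(sample),
        "stdev=%.2f" % statistics.stdev(sample),
        "median=%.2f" % statistics.median(sample),
        "mad=%.2f" % mad(sample),
    )
```

In this sample the single extreme value shifts the mean and inflates the standard deviation, while the median is unchanged and the MAD stays small.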
Issues:
- Outlier labeling - flag
potential outliers for further investigation (i.e., are the potential
outliers erroneous data, indicative of an inappropriate distributional
model, and so on).
- Robust statistical techniques - use techniques that
will not be unduly affected by outliers. That is, if we cannot determine that
potential outliers are erroneous observations, do we need to modify our
statistical analysis to more appropriately account for these observations?
- Outlier identification -
formally test whether observations are outliers.
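As a sketch of the labeling step, the snippet below flags values that fall outside the box-plot fences (first quartile minus 1.5 times the interquartile range, third quartile plus 1.5 times the interquartile range). This rule, the multiplier 1.5, the function name and the data are assumptions chosen for illustration; the text does not name a specific labeling rule. Flagged values are candidates for investigation, not confirmed outliers.

```python
import statistics

def label_potential_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical readings with one suspicious value.
readings = [2.1, 2.4, 2.2, 2.5, 2.3, 2.6, 2.4, 9.7]
flagged = label_potential_outliers(readings)
print(flagged)
```

A labeling rule like this makes no formal claim about significance, which is what separates it from outlier identification.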
Single vs. multiple outliers:
Some outlier tests are designed to detect the presence of a
single outlier while other tests are designed to detect the presence of
multiple outliers. It is not appropriate to apply a test for a single outlier
sequentially in order to detect multiple outliers.
In addition, some tests that detect multiple outliers may
require that you specify the number of suspected outliers exactly.
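One commonly cited reason for not repeating a single-outlier test is masking: a second extreme value inflates the sample standard deviation, so the most extreme point looks less extreme. The sketch below shows this with the statistic used by Grubbs' test, G = max|x - mean| / s. The choice of Grubbs' statistic and the data are assumptions for illustration; the text does not name a test. No critical value is computed here, so this is not a complete test.

```python
import statistics

def grubbs_statistic(values):
    """G = largest absolute deviation from the mean, in units of the sample stdev."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return max(abs(x - mean) for x in values) / s

# Hypothetical data: nine ordinary values plus one, then two, extreme values.
base = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1]
one_outlier = base + [15.0]
two_outliers = base + [15.0, 15.2]

g_one = grubbs_statistic(one_outlier)
g_two = grubbs_statistic(two_outliers)
print("G with one outlier:  %.2f" % g_one)
print("G with two outliers: %.2f" % g_two)
```

With these values the statistic is smaller when two extreme points are present than when only one is, even though the data are more contaminated.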
Outlier detection methods can be divided into univariate methods, proposed
in earlier work in this field, and multivariate methods, which make up most
of the current body of research. Another fundamental taxonomy of outlier
detection methods is between parametric (statistical) methods and
nonparametric methods that are model-free. Statistical parametric methods
either assume a known underlying distribution of the observations or, at
least, are based on statistical estimates of unknown distribution
parameters. These methods flag as outliers those observations that deviate
from the model assumptions. They are often unsuitable for high-dimensional
data sets and for arbitrary data sets without prior knowledge of the
underlying data distribution. Within the class of nonparametric outlier
detection methods one can set apart the data-mining methods, also called
distance-based methods. These methods are usually based on local distance
measures and are capable of handling large databases.
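A minimal sketch of one distance-based idea: score each point by its distance to its k-th nearest neighbor, so that isolated points receive large scores. The scoring rule, the function name, the value k=2 and the points are assumptions for illustration. The brute-force loop compares every pair of points and is meant only to show the idea; methods aimed at large databases would need a more efficient neighbor search.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)
most_isolated = max(range(len(points)), key=scores.__getitem__)
print(points[most_isolated], round(scores[most_isolated], 2))
```

No distribution is assumed here; the score depends only on distances between observations, which is what makes the approach nonparametric.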