Clustering high-dimensional data
Clustering high-dimensional data is the cluster analysis
of data with anywhere from a few dozen to many thousands of dimensions.
Such high-dimensional data spaces are often encountered in areas such as
medicine, where DNA microarray technology can produce
a large number of measurements at once, and the clustering of text documents,
where, if a word-frequency vector is used, the number of dimensions equals the
size of the dictionary.
Four problems need to be overcome for clustering
in high-dimensional data:
- Multiple dimensions are hard to
think in, impossible to visualize, and, due to the exponential growth of
the number of possible values with each dimension, complete enumeration of
all subspaces becomes intractable with increasing dimensionality. This
problem is known as the curse of
dimensionality.
- The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. In particular, the discrimination between the nearest and the farthest point becomes meaningless: for many common data distributions, the relative contrast (dist_max - dist_min) / dist_min tends towards zero as the dimensionality grows (see the sketch after this list).
- A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of them will usually not be meaningful for a given cluster. For example, in newborn screening, a cluster of samples might identify newborns that
share similar blood values, which might lead to insights about the
relevance of certain blood values for a disease. But for different
diseases, different blood values might form a cluster, and other values
might be uncorrelated. This is known as the local feature relevance problem: different clusters might
be found in different subspaces, so a global filtering of attributes is
not sufficient.
- Given a large number of
attributes, it is likely that some attributes are correlated. Hence, clusters might exist
in arbitrarily oriented affine subspaces.
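As a rough numerical illustration of the distance-concentration issue described above, the following sketch (using NumPy; the sample sizes and the uniform distribution are arbitrary assumptions, not taken from the article) measures how the relative contrast between the farthest and nearest point shrinks as the dimensionality grows:

```python
import numpy as np

# Illustrative only: uniform random data and a random query point.
rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))          # 1000 random points in [0, 1]^d
    query = rng.random(d)                   # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}   (dist_max - dist_min) / dist_min = {contrast:.3f}")
```

With this setup the printed contrast typically drops from well above 1 in two dimensions to a small fraction in a thousand dimensions, which is exactly the effect that makes nearest-neighbour distinctions unreliable.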
Approaches:
1. Subspace clustering
2. Projected clustering
3. Correlation clustering
Subspace clustering is the task of detecting all clusters in all subspaces. This means that a
point might be a member of multiple clusters, each existing in a different
subspace. Subspaces can either be axis-parallel or affine. The term is often
used synonymously with general clustering in high-dimensional data.
Consider, for example, a two-dimensional space in which several clusters can be identified. Some of them show up only in a one-dimensional subspace (a projection onto a single axis): a group of points may be densely packed along the y axis yet too sparsely distributed along the x axis to be considered a cluster in the full two-dimensional space, while other groups form clusters in both dimensions at once. More generally, a space with d dimensions contains 2^d different axis-parallel subspaces, so the number of candidate subspaces grows exponentially with the dimensionality.
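To make this example concrete, here is a small hypothetical sketch (the synthetic data, the use of scikit-learn's DBSCAN, and all parameter values are my own assumptions) in which one group of points forms a cluster in the one-dimensional subspace spanned by the y axis but is too sparse along the x axis to be a cluster in the full two-dimensional space:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# One group that is compact in both dimensions ...
compact = rng.normal(loc=[2.0, 2.0], scale=0.1, size=(60, 2))
# ... and one that is compact in y but spread out over x.
stretched = np.column_stack([rng.uniform(0.0, 10.0, 40),
                             rng.normal(7.0, 0.1, 40)])
X = np.vstack([compact, stretched])

def labels(data):
    return DBSCAN(eps=0.3, min_samples=8).fit(data).labels_

# In the 1-D subspace {y} both groups are dense enough to form clusters;
# in the full 2-D space the stretched group is mostly labelled as noise (-1).
print("subspace {y}:", sorted(set(labels(X[:, [1]]))))
print("full space  :", sorted(set(labels(X))))
```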
If the subspaces are not restricted to be axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithms utilize some kind of heuristic to remain computationally feasible, at the risk of producing inferior results.
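One widely used heuristic (as in grid-based, bottom-up methods such as CLIQUE) restricts the search to axis-parallel subspaces and exploits monotonicity: a subspace can only contain a dense region if all of its lower-dimensional projections do. The following is a rough sketch of that pruning idea; the grid resolution, density threshold, and helper names are illustrative assumptions, not a faithful reproduction of any published algorithm:

```python
import numpy as np
from itertools import combinations

def dense_subspaces(X, bins=10, min_points=20, max_dim=3):
    """Bottom-up search for axis-parallel subspaces containing a dense grid cell."""
    n, d = X.shape

    def has_dense_cell(dims):
        # Grid the data restricted to the given dimensions and look for a dense cell.
        hist, _ = np.histogramdd(X[:, list(dims)], bins=bins)
        return hist.max() >= min_points

    # Level 1: single dimensions with at least one dense bin.
    dense = {(j,) for j in range(d) if has_dense_cell((j,))}
    found = set(dense)
    for k in range(2, max_dim + 1):
        candidates = set()
        dims_in_use = sorted({j for s in dense for j in s})
        for combo in combinations(dims_in_use, k):
            # Monotonicity-based pruning: every (k-1)-dimensional projection
            # of a candidate subspace must itself already be dense.
            if all(sub in dense for sub in combinations(combo, k - 1)):
                candidates.add(combo)
        dense = {c for c in candidates if has_dense_cell(c)}
        found |= dense
        if not dense:
            break
    return found
```

The pruning keeps the search far below the 2^d possible subspaces in practice, at the cost of possibly missing clusters that violate the density assumptions of the grid.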