Translate

Wednesday, September 26, 2012

DATA WAREHOUSING AND MINIG LECTURE NOTES-- Clustering high dimensional data

Clustering high dimensional data


Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary.

Four problems need to be overcome for clustering in high-dimensional data:

  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality. This problem is known as the curse of dimensionality.
  • The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless:


  • A cluster is intended to group objects that are related, based on observations of their attribute's values. However, given a large number of attributes some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.

Approaches:

1.      Subspace clustering

2.      Projected clustering

3.      Correlation clustering

Subspace clustering is the task of detecting all clusters in all subspaces. This means that a point might be a member of multiple clusters, each existing in a different subspace. Subspaces can either be axis-parallel or affine. The term is often used synonymous with general clustering in high-dimensional data.

 
The image on the right shows a mere two-dimensional space where a number of clusters can be identified. In the one-dimensional subspaces, the clusters
 (in subspace
 
) and
 
,
,
 (in subspace

 

) can be found. 
 
 
 cannot be considered a cluster in a two-dimensional (sub) space, since it is too sparsely distributed in the x axis. In two dimensions, the two clusters
 
and              
 
can be identified.

The problem of subspace clustering is given by the fact that there are
 
different subspaces of a space with d dimensions. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithm utilizes some kind of heuristic to remain computationally feasible, at the risk of producing inferior results.

 

No comments:

Post a Comment