Clustering methods
The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. Such grouping is pervasive in the way humans process information, and one of the motivations for using clustering algorithms is to provide automated tools to help in constructing categories or taxonomies. The methods may also be used to minimize the effects of human factors in the process.
Clustering methods can be divided into two basic types: hierarchical and partition-based clustering. Within each type there exists a wealth of subtypes and different algorithms for finding the clusters.
Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones or by splitting larger clusters. The methods differ in the rule used to decide which two small clusters are merged or which large cluster is split. The end result of the algorithm is a tree of clusters called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.
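As a concrete sketch of the bottom-up (agglomerative) variant and of cutting the dendrogram, the following uses SciPy; the toy data and the Ward merging rule are illustrative assumptions of this sketch, not choices made by the text.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.5, (20, 2)),   # toy data: two loose groups
                  rng.normal(3.0, 0.5, (20, 2))])

# Agglomerative clustering: successively merge the two closest clusters.
# The merging rule ("ward" here; "single", "complete", ... also exist) is
# what distinguishes one hierarchical method from another.
Z = linkage(data, method="ward")

# "Cutting the dendrogram" at the level that leaves two disjoint groups.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Calling scipy.cluster.hierarchy.dendrogram(Z) would draw the full merge tree, from which cuts at other levels can be read off.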
Partition-based clustering, on the other hand, attempts to directly decompose the data set into a set of disjoint clusters. The criterion function that the clustering algorithm tries to minimize may emphasize the local structure of the data, for instance by assigning clusters to peaks in the probability density function, or the global structure. Typically the global criteria involve minimizing some measure of dissimilarity among the samples within each cluster while maximizing the dissimilarity between different clusters.
A commonly used partition-based clustering method, K-means clustering, will be discussed in some detail since it is closely related to the SOM algorithm. In K-means clustering the criterion function is the average squared distance of the data items $x_i$ from their nearest cluster centroids,

$$E = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - m_{c(x_i)} \rVert^2,$$

where $c(x_i)$ is the index of the centroid that is closest to $x_i$. One possible algorithm for minimizing the cost function begins by initializing a set of $K$ cluster centroids denoted by $m_k$, $k = 1, \dots, K$. The positions of the $m_k$ are then adjusted iteratively by first assigning the data samples to the nearest centroids and then recomputing the centroids as the means of their assigned samples. The iteration is stopped when $E$ no longer changes markedly. In an alternative algorithm each randomly chosen sample is considered in succession, and only the nearest centroid is updated.
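The batch procedure just described can be sketched in a few lines of NumPy. Initializing the centroids as randomly chosen data samples, the stopping tolerance, and the handling of empty clusters are assumptions of this sketch, not details fixed by the text.

```python
import numpy as np

def kmeans(data, k, tol=1e-6, seed=0):
    """Batch K-means: alternate assignment and update steps until the
    cost E no longer changes markedly."""
    rng = np.random.default_rng(seed)
    # Initialize the K centroids, here as randomly chosen data samples.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    prev_cost = np.inf
    while True:
        # Assignment: c(x_i) = index of the centroid nearest to sample x_i.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        # E: average squared distance of the samples to their centroids.
        cost = (dists[np.arange(len(data)), nearest] ** 2).mean()
        if prev_cost - cost < tol:  # E did not change markedly: stop
            return centroids, nearest, cost
        prev_cost = cost
        # Update: recompute each centroid as the mean of its assigned samples.
        for j in range(k):
            members = data[nearest == j]
            if len(members):        # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
```

On the toy data from the earlier sketch, kmeans(data, 2) returns two centroids close to the two group means, together with the assignments and the final cost $E$.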
The same criterion is also used to describe the objective of a related method, vector quantization. In vector quantization the goal is to minimize the average (squared) quantization error, the distance between a sample $x$ and its representation $m_{c(x)}$.
The algorithm for minimizing this criterion that was described above is actually a straightforward generalization of the algorithm proposed by Lloyd (1957) for minimizing the average quantization error in a one-dimensional setting.
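The sample-at-a-time variant mentioned above, which also reads as a vector-quantization update rule, can be sketched as follows; the 1/count step-size schedule is one common choice and an assumption of this sketch.

```python
import numpy as np

def online_kmeans(data, k, steps=10_000, seed=0):
    """Sample-at-a-time K-means / vector quantization: each randomly
    chosen sample moves only its nearest centroid toward itself."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # how many samples each centroid has "won" so far
    for _ in range(steps):
        x = data[rng.integers(len(data))]
        c = np.linalg.norm(centroids - x, axis=1).argmin()  # nearest centroid
        counts[c] += 1
        # Move the winner toward x; the 1/count step size keeps each
        # centroid equal to the running mean of the samples it has won.
        centroids[c] += (x - centroids[c]) / counts[c]
    return centroids
```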
A problem with clustering methods is that the interpretation of the clusters may be difficult. Most clustering algorithms prefer certain cluster shapes, and they will assign the data to clusters of such shapes even if the data contain no clusters at all.
Therefore, if the goal is not just to compress the data set but also to make
inferences about its cluster structure, it is essential to analyze whether the
data set exhibits a clustering tendency. The results of the cluster analysis
need to be validated, as well. Another potential problem is that the choice of
the number of clusters may be critical: quite different kinds of clusters may
emerge when K is changed. Good initialization of the cluster centroids
may also be crucial; some clusters may even be left empty if their centroids
lie initially far from the distribution of data.
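One simple way to probe the sensitivity to $K$ is to track the within-cluster cost while $K$ is varied. The sketch below uses scikit-learn's KMeans for convenience; reading an "elbow" from the resulting curve is only a heuristic, not a validation method prescribed by the text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.5, (30, 2)),   # toy data with two groups
                  rng.normal(3.0, 0.5, (30, 2))])

# The within-cluster cost always decreases as K grows; a pronounced
# "elbow" in the curve is a heuristic hint about a natural number of
# clusters, while a smooth decrease suggests weak clustering tendency.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(km.inertia_, 3))
```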
Clustering can be used to reduce the amount of
data and to induce a categorization. In exploratory data analysis, however, the
categories have only limited value as such. The clusters should be illustrated
somehow to aid in understanding what they are like. For example, in the case of the K-means algorithm the centroids that represent the clusters are still high-dimensional, and additional illustration methods are needed to visualize them.
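As a minimal sketch of one such illustration method, the high-dimensional centroids can be projected onto the two leading principal components. PCA is an illustrative choice here, not one named by the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 10))   # illustrative 10-dimensional data

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)

# Project the 10-dimensional centroids onto the two leading principal
# components of the data so that the clusters can be inspected in a plot.
pca = PCA(n_components=2).fit(data)
centroids_2d = pca.transform(km.cluster_centers_)
print(centroids_2d)
```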