Hierarchical and partition-based clustering
Hierarchical clustering:
In data mining, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
- Agglomerative: This is a bottom-up approach. Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a top-down approach. All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.
Cutting the tree at a given height will give a partitioning clustering at a selected precision. In this example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
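To make the cutting step concrete, here is a minimal sketch using SciPy (an assumption; the post does not name a library). The six 2-D coordinates are made-up stand-ins for the elements a to f, and the cut heights are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

labels = ["a", "b", "c", "d", "e", "f"]
# Hypothetical coordinates; any six points would do for illustration.
points = np.array([[0.0, 0.0], [1.0, 0.1], [1.1, 0.2],
                   [4.0, 4.0], [4.1, 4.2], [8.0, 8.0]])

# Build the hierarchy with Euclidean distance and single linkage.
Z = linkage(pdist(points, metric="euclidean"), method="single")

# Cutting the tree at increasing heights gives coarser partitions.
for height in (0.5, 2.0, 6.0):
    assignment = fcluster(Z, t=height, criterion="distance")
    print(height, dict(zip(labels, assignment)))
```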
Raw Data:
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage.
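As a rough illustration of the distance-matrix bookkeeping just described, here is a small single-linkage sketch in Python (the function name, cluster labels, and toy distances are all made up for this example): whenever the two closest clusters merge, their entries in the matrix are replaced by a single entry holding the smaller of the two distances.

```python
def agglomerate(dist):
    """dist maps frozenset({i, j}) -> distance between clusters i and j."""
    clusters = {c for pair in dist for c in pair}
    merges = []
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)           # closest pair of clusters
        merges.append((a, b))
        clusters -= {a, b}
        merged = (a, b)                          # label for the new cluster
        # Merge the rows/columns of a and b: the distance from the new
        # cluster to any other cluster c is min(d(a, c), d(b, c)).
        for c in clusters:
            dist[frozenset({merged, c})] = min(dist.pop(frozenset({a, c})),
                                               dist.pop(frozenset({b, c})))
        del dist[frozenset({a, b})]
        clusters.add(merged)
    return merges

# Toy distance matrix over three elements: a and b merge first, then c joins them.
toy = {frozenset({"a", "b"}): 1.0, frozenset({"a", "c"}): 4.0,
       frozenset({"b", "c"}): 2.5}
print(agglomerate(toy))   # e.g. [('a', 'b'), (('a', 'b'), 'c')]; order within pairs may vary
```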
Partition-based clustering:
The partitioning methods generally result in a set of M clusters, with each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of summary description of all the objects contained in the cluster. The precise form of this description will depend on the type of object being clustered. Where real-valued data is available, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases, e.g., a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the dataset.
E.g., K-means clustering
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results. So, the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycentres of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are made; in other words, the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\lVert x_i^{(j)} - c_j \right\rVert^2,$$

where $\left\lVert x_i^{(j)} - c_j \right\rVert^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, is an indicator of the distance of the n data points from their respective cluster centres.
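The objective is straightforward to evaluate in code; the small helper below assumes NumPy arrays and uses names chosen only for illustration.

```python
import numpy as np

def squared_error(points, centroids, assignment):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[assignment]   # x_i - c_j for each point's cluster j
    return float(np.sum(diffs ** 2))
```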
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
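A compact NumPy sketch of these four steps is given below. The initialisation (picking K distinct data points), the convergence test, and the handling of empty clusters are just one reasonable set of choices, not part of the algorithm as stated above.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: place K initial centroids by picking K distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = np.argmin(dists, axis=1)
        # Step 3: recalculate each centroid as the barycentre of its group
        # (an empty cluster simply keeps its previous centroid here).
        new_centroids = np.array([
            points[assignment == j].mean(axis=0) if np.any(assignment == j)
            else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignment

# Toy usage: two well-separated groups of 2-D points.
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(k_means(data, k=2))
```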
Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial, randomly selected cluster centres. The k-means algorithm can be run multiple times to reduce this effect.
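One convenient way to perform such multiple runs is to let a library handle them; for instance, scikit-learn's KMeans (assuming scikit-learn is acceptable here) tries n_init random initialisations and keeps the fit with the lowest squared-error objective (its inertia_ attribute).

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
# n_init=10: ten random initialisations; the best run (lowest inertia) is kept.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.labels_, km.inertia_)
```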
K-means is a simple algorithm that has been adapted to many problem domains. As we are going to see, it is a good candidate for extension to work with fuzzy feature vectors.