Constraint-Based
Clustering:
For
domains in which one has knowledge as to which instances should and should not
be clustered together, one can apply constraint based clustering methods. Constraint-based
clustering was originally introduced by Wagstaff, using a modification of
K-means that takes into account must-link constraints, where two points must be
in the same cluster, and cannot-link constraints, where two points cannot be in
the same cluster. These are hard constraints, and thus any clustering of the
data points must satisfy the constraints. In essence, the algorithm reassigns
examples to their nearest cluster at each iteration, but only if it does not
violate any constraints. Because the EM algorithm often provides better results
than K-means, due to its ability to model specific distributions, solutions
have been introduced that address the use of hard constraints in the EM
algorithm. Both must-link and cannot-link constraints are added by forcing the
expectation to zero if the constraints are not met. Hard constraints are not
always available or desirable.
In
this approach, a soft clustering algorithm is applied and constraints are
expressed as a prior probability that pairs of instances should (or should not)
be assigned to the same cluster. These probabilistic constraints are combined
with the probability of generating the data by the mixture model, to get a
combined likelihood function. Clustering with pair wise constraints in this
setting can be formulated as inference in a corresponding Bayesian network. Constraint
based clustering is related but distinct from semi supervised clustering where
the labels (cluster membership) of some of the examples are known in advance,
and therefore the induced constraints are more explicit. Another closely related
approach includes spectral clustering and other graph based clustering
algorithms where pair wise similarities between examples provide the only
information available to induce the clustering.
No comments:
Post a Comment