Translate

Wednesday, September 26, 2012

DATA WAREHOUSING AND MINIG LECTURE NOTES-- Constraint based cluster analysis


Constraint-Based Clustering:

 

For domains in which one has knowledge as to which instances should and should not be clustered together, one can apply constraint based clustering methods. Constraint-based clustering was originally introduced by Wagstaff, using a modification of K-means that takes into account must-link constraints, where two points must be in the same cluster, and cannot-link constraints, where two points cannot be in the same cluster. These are hard constraints, and thus any clustering of the data points must satisfy the constraints. In essence, the algorithm reassigns examples to their nearest cluster at each iteration, but only if it does not violate any constraints. Because the EM algorithm often provides better results than K-means, due to its ability to model specific distributions, solutions have been introduced that address the use of hard constraints in the EM algorithm. Both must-link and cannot-link constraints are added by forcing the expectation to zero if the constraints are not met. Hard constraints are not always available or desirable.

In this approach, a soft clustering algorithm is applied and constraints are expressed as a prior probability that pairs of instances should (or should not) be assigned to the same cluster. These probabilistic constraints are combined with the probability of generating the data by the mixture model, to get a combined likelihood function. Clustering with pair wise constraints in this setting can be formulated as inference in a corresponding Bayesian network. Constraint based clustering is related but distinct from semi supervised clustering where the labels (cluster membership) of some of the examples are known in advance, and therefore the induced constraints are more explicit. Another closely related approach includes spectral clustering and other graph based clustering algorithms where pair wise similarities between examples provide the only information available to induce the clustering.

No comments:

Post a Comment