
Wednesday, September 26, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES -- Model based Clustering:


Model based Clustering:

Cluster analysis techniques such as hierarchical clustering, k-means clustering and self-organizing maps have played a major role in analyzing DNA microarray gene expression data; all of these fall under the category of discriminative approaches. In discriminative approaches, such as clustering via graph partitioning, one defines a distance or similarity function between pairs of data objects and then groups similar objects together into clusters. Model-based approaches, also called generative approaches, instead attempt to learn generative models from the data, with each model representing one particular cluster. In discriminative approaches, the most commonly used distance measures for data that can be represented in a vector space are the Euclidean and Manhattan distances. Studies of different similarity measures for high-dimensional data clustering have shown that Euclidean distances are not appropriate for this domain. For complex data types (e.g., variable-length sequences), defining a good similarity measure is very much data dependent and often requires expert domain knowledge. A further disadvantage of similarity-based approaches is that calculating the similarities between all pairs of data objects is computationally expensive, requiring O(N²) operations.
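As a minimal sketch of this similarity-based (discriminative) route, assuming a small hypothetical expression matrix X with one row per gene, the full set of pairwise Euclidean distances (the source of the O(N²) cost mentioned above) can be fed into an agglomerative hierarchical clustering:

```python
# Similarity-based (discriminative) clustering sketch on hypothetical data;
# pdist computes all N*(N-1)/2 pairwise distances, which is the O(N^2) step.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))          # hypothetical: 100 genes, 12 arrays

dists = pdist(X, metric="euclidean")    # condensed pairwise distance vector
Z = linkage(dists, method="average")    # agglomerative hierarchical clustering
labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree into 4 clusters
print(squareform(dists).shape, np.bincount(labels))
```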

Even though the above clustering techniques contribute significantly to our understanding of various biological phenomena, they have some restrictions, one of which is their inability to determine the number of clusters. The difficulty may be related to the fact that many of these methods give no clear definition of what a cluster is in the first place. Furthermore, their clustering results may not be stable. For model-based clustering approaches, the model type is often specified a priori, such as Gaussian mixture models or hidden Markov models (HMMs). The model structure (e.g., the number of hidden states in an HMM) can then be determined by model selection techniques, and the parameters estimated by maximum likelihood, e.g., using the EM algorithm.
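One common model selection recipe (not specific to these notes) is to fit mixtures with different numbers of components and keep the one with the lowest Bayesian information criterion (BIC); a sketch using scikit-learn's GaussianMixture on hypothetical data:

```python
# Choosing the number of mixture components by BIC -- a sketch, assuming
# scikit-learn is available and X is a (samples x features) array.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

bics = {}
for g in range(1, 7):                       # candidate numbers of clusters
    gm = GaussianMixture(n_components=g, n_init=5, random_state=0).fit(X)
    bics[g] = gm.bic(X)                     # lower BIC = better fit/complexity trade-off

best_g = min(bics, key=bics.get)
print(bics, "-> chosen number of clusters:", best_g)
```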

Probabilistic model-based clustering techniques have shown promising results in a wide range of applications. Gaussian mixture models are the most popular models for vector data, and multinomial models have been shown to be effective for high-dimensional data clustering; together they provide an alternative solution to the issues above. In model-based clustering, a cluster is a sub-population with a certain distribution, and several statistical methods can be applied to estimate the number of clusters. Model-based clustering also plays a role in detecting differentially expressed genes, i.e., identifying all genes with altered expression under two experimental conditions, and it provides a statistical framework for modeling the cluster structure of gene expression data.
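To make the idea of "a cluster is a sub-population with its own distribution" concrete, here is a small sketch (assuming scikit-learn and made-up data) in which each fitted component carries its own mean vector, covariance matrix and mixing proportion:

```python
# Each mixture component is an estimated sub-population distribution.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (60, 3)),      # hypothetical sub-population 1
               rng.normal(3, 1, (60, 3))])      # hypothetical sub-population 2

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gm.weights_.round(3))      # mixing proportions (pi_i)
print(gm.means_.round(3))        # per-component means (mu_i)
print(gm.covariances_.shape)     # per-component covariance matrices (V_i)
```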

In model-based clustering, the data are assumed to be generated by a finite mixture of underlying probability distributions, with each component representing a different group (or cluster). The goal is to estimate the parameters that maximize the likelihood of the observed data. For continuous data such as gene expression data, the data to be clustered are assumed to come from several sub-populations with distinct normal distributions; that is, each data point is drawn from a normal mixture distribution with probability density function:

f(y; Фg) = Σ_{i=1}^{g} πi φ(y; μi, Vi)                                                                                  2.5

where φ(y; μi, Vi) denotes the normal density function with mean μi and covariance matrix Vi, the πi are the mixing proportions, Фg represents all unknown parameters (πi, μi, Vi), and g is the number of components. In model-based clustering the mixture model is first fitted to the data to obtain the maximum likelihood estimate. Then the posterior probability of each data point belonging to each of the components is calculated. Finally, each data point is assigned to the component with the largest posterior probability. The mixture model is typically fitted by maximum likelihood using the expectation-maximization (EM) algorithm. Given n observations (samples), the log-likelihood is

log L(Фg) = Σ_{j=1}^{n} log f(yj; Фg)                                                           2.6

which is maximized to obtain the maximum likelihood estimate of Фg.
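Equations 2.5 and 2.6 can be evaluated directly: the mixture density is a weighted sum of normal densities, and the log-likelihood is the sum of its logarithm over the observations. A small numpy/scipy sketch with hand-picked (hypothetical) parameters:

```python
# Direct evaluation of the mixture density (2.5) and log-likelihood (2.6)
# for hypothetical 2-D data and made-up parameter values.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
Y = rng.normal(size=(30, 2))                         # n = 30 observations

pi = np.array([0.6, 0.4])                            # mixing proportions pi_i
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]    # component means mu_i
V = [np.eye(2), 0.5 * np.eye(2)]                     # covariance matrices V_i

# f(y; Phi_g) = sum_i pi_i * phi(y; mu_i, V_i)        (equation 2.5)
f = sum(pi[i] * multivariate_normal.pdf(Y, mean=mu[i], cov=V[i]) for i in range(2))

# log L(Phi_g) = sum_j log f(y_j; Phi_g)              (equation 2.6)
log_likelihood = np.log(f).sum()
print(log_likelihood)
```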

The EM algorithm computes Фg by iterating the following steps. Suppose that at the kth iteration the parameter estimates are πi(k), μi(k) and Vi(k); then in the (k+1)th iteration these estimates are updated: the E-step computes the posterior membership probabilities under the current estimates, and the M-step re-estimates πi, μi and Vi from those probabilities. When the EM algorithm converges, each observation is assigned to the group with the maximum conditional probability. One interesting but difficult problem in cluster analysis is to determine the number of components; in model-based clustering, the number of clusters is chosen as the one for which the fitted model achieves the best log-likelihood value (in practice a penalized criterion such as BIC, since the raw log-likelihood always increases with more components).
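A compact, self-contained sketch of that iteration for a univariate Gaussian mixture (variable names and data are illustrative, not from the notes):

```python
# Minimal EM iteration for a univariate Gaussian mixture: the E-step computes
# posterior membership probabilities, the M-step re-estimates pi_i, mu_i, V_i.
import numpy as np

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
g = 2
pi = np.full(g, 1.0 / g)                 # initial mixing proportions
mu = np.array([-1.0, 1.0])               # initial means
var = np.ones(g)                         # initial variances

def normal_pdf(y, mean, var):
    return np.exp(-(y - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

for _ in range(100):
    # E-step: tau[i, j] = posterior probability that observation j is from component i
    dens = np.stack([pi[i] * normal_pdf(y, mu[i], var[i]) for i in range(g)])
    tau = dens / dens.sum(axis=0)
    # M-step: update mixing proportions, means and variances from the posteriors
    n_i = tau.sum(axis=1)
    pi = n_i / len(y)
    mu = (tau @ y) / n_i
    var = (tau * (y[None, :] - mu[:, None]) ** 2).sum(axis=1) / n_i

labels = tau.argmax(axis=0)              # assign each point to its most probable component
print(pi.round(3), mu.round(3), var.round(3))
```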

An important advantage of model-based clustering is that it provides an estimated probability that data object i belongs to cluster k. Because gene expression data are typically "highly connected", there may be instances in which a single gene has a high correlation with two different clusters, so this probabilistic feature of model-based clustering is particularly suitable for gene expression data. Since the model-based approach assumes that the data are distributed according to a mixture of normal distributions, this assumption should be assessed before applying the approach. Moreover, because raw gene expression data do not satisfy the normal mixture assumption, the data have to be normalized using the mean and standard deviation.
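A common way to carry out that normalization is a gene-wise z-score, subtracting each gene's mean and dividing by its standard deviation; a sketch with a hypothetical raw expression matrix:

```python
# Gene-wise z-score normalization -- a sketch, assuming `expr` holds raw
# expression values with one row per gene and one column per sample.
import numpy as np

rng = np.random.default_rng(5)
expr = rng.lognormal(mean=2.0, sigma=1.0, size=(500, 10))   # hypothetical raw data

mean = expr.mean(axis=1, keepdims=True)      # per-gene mean
std = expr.std(axis=1, keepdims=True)        # per-gene standard deviation
expr_norm = (expr - mean) / std              # standardized expression profiles

print(expr_norm.mean(axis=1)[:3].round(6), expr_norm.std(axis=1)[:3].round(6))
```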

 

 
