Model-Based Clustering:
In the analysis of DNA microarray gene expression data, a major role has been played by cluster analysis techniques such as hierarchical clustering, k-means clustering and self-organizing maps, all of which fall under the category of discriminative approaches. In discriminative approaches, such as clustering via graph partitioning, one defines a distance or similarity function between pairs of data objects and then groups similar objects together into clusters, whereas model-based (generative) approaches attempt to learn generative models from the data, with each model representing one particular cluster. In discriminative approaches, the most commonly used distance measures for data that can be represented in a vector space are the Euclidean and Manhattan distances. The impact of different similarity measures on high-dimensional data clustering has been studied, and Euclidean distances were shown to be inappropriate for this domain. For complex data types (e.g., variable-length sequences), defining a good similarity measure is very much data dependent and often requires expert domain knowledge. A disadvantage of similarity-based approaches is that calculating the similarities between all pairs of data objects is computationally inefficient, requiring O(N²) complexity.
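As a rough illustration of that quadratic cost, the following Python sketch (with hypothetical toy data, not anything taken from this text) computes all pairwise Euclidean and Manhattan distances for an expression matrix with one row per gene:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # toy data: 500 genes x 20 conditions

# Every distinct pair of rows is compared, i.e. N*(N-1)/2 evaluations for N genes,
# which is where the O(N^2) complexity of similarity-based clustering comes from.
d_euclid = squareform(pdist(X, metric="euclidean"))   # (500, 500) distance matrix
d_manhat = squareform(pdist(X, metric="cityblock"))   # Manhattan ("city block") distances

print(d_euclid.shape, X.shape[0] * (X.shape[0] - 1) // 2)
```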
Even though the above clustering techniques contribute significantly to our understanding of various biological phenomena, they have some restrictions, one of which is their inability to determine the number of clusters. The difficulty may be related to the fact that many methods offer no clear definition of what a cluster is in the first place. Furthermore, their clustering results may not be stable. For model-based clustering approaches, the model type is often specified a priori, such as Gaussian or hidden Markov models (HMMs). The model structure (e.g., the number of hidden states in an HMM) can be determined by model selection techniques, and the parameters can be estimated with maximum likelihood algorithms, e.g., the EM algorithm.
Probabilistic model-based clustering techniques have shown promising results in a wide range of applications. Gaussian mixture models are the most popular models for vector data; multinomial models have been shown to be effective for high-dimensional data clustering and provide an alternative solution to these issues. In model-based clustering, a cluster is a sub-population with a certain distribution, and several statistical methods can be applied to estimate the number of clusters. Model-based clustering also plays a role in detecting differentially expressed genes, that is, in identifying all genes with altered expression under two experimental conditions, by providing a statistical framework to model the cluster structure of gene expression data.
In model-based clustering, the data are assumed to be generated by a mixture of underlying probability distributions, with each component representing a different group or cluster. The goal is to estimate the parameters that maximize the likelihood of the observed data objects. For continuous data such as gene expression data, the data to be clustered are assumed to come from several subpopulations with distinct normal distributions. That is, each data point is taken from a normal mixture distribution with the probability density function:
f(y; \Phi_g) = \sum_{i=1}^{g} \pi_i \, \phi(y; \mu_i, V_i)        (2.5)
where φ(y; μ_i, V_i) denotes the normal density function with mean μ_i and covariance matrix V_i, the π_i are the mixing proportions, Φ_g represents all the unknown parameters (π_i, μ_i, V_i), and g is the number of components. In model-based clustering, the mixture model is first fitted to the data to obtain the maximum likelihood estimate of the parameters. Then the posterior probability of each data point belonging to each of the components can be calculated. Finally, each data point is assigned to the component with the largest posterior probability. The mixture model is typically fitted by maximum likelihood using the expectation-maximization (EM) algorithm. Given n observations (samples), the log-likelihood is
\log L(\Phi_g) = \sum_{j=1}^{n} \log f(y_j; \Phi_g)        (2.6)
which is maximized to obtain the maximum likelihood estimate of Φ_g.
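Both quantities are straightforward to evaluate numerically. The sketch below (illustrative parameter values, not anything from this text) computes the mixture density of Eq. 2.5 and the log-likelihood of Eq. 2.6 with NumPy and SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(Y, pi, mu, V):
    """Eq. 2.5: f(y; Phi_g) = sum_i pi_i * phi(y; mu_i, V_i), for every row of Y."""
    return sum(pi[i] * multivariate_normal.pdf(Y, mean=mu[i], cov=V[i])
               for i in range(len(pi)))

def log_likelihood(Y, pi, mu, V):
    """Eq. 2.6: log L(Phi_g) = sum_j log f(y_j; Phi_g) over the n observations."""
    return float(np.sum(np.log(mixture_density(Y, pi, mu, V))))

# Toy two-component mixture in two dimensions
pi = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
V  = np.array([np.eye(2), np.eye(2)])
Y  = np.array([[0.1, -0.2], [2.9, 3.1], [1.5, 1.4]])
print(log_likelihood(Y, pi, mu, V))
```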
The EM algorithm computes Φ_g by iterating the following steps: suppose that at the k-th iteration the parameter estimates are π_i^(k), μ_i^(k) and V_i^(k); the (k+1)-th iteration then updates these estimates. When the EM algorithm converges, each observation is assigned to the group with the maximum conditional probability.
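A compact version of these iterations is sketched below. It is an illustrative implementation only, not the exact algorithm referenced above; in practice a library implementation such as scikit-learn's GaussianMixture (or mclust in R) would normally be used.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_fit(Y, pi, mu, V, max_iter=200, tol=1e-6):
    """Fit the normal mixture of Eq. 2.5 by EM, starting from initial pi, mu, V."""
    n, g = Y.shape[0], len(pi)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: tau[j, i] = posterior probability that observation j comes from component i
        dens = np.column_stack([pi[i] * multivariate_normal.pdf(Y, mean=mu[i], cov=V[i])
                                for i in range(g)])
        tau = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update mixing proportions, means and covariance matrices
        n_i = tau.sum(axis=0)
        pi = n_i / n
        mu = (tau.T @ Y) / n_i[:, None]
        V = np.array([((tau[:, i, None] * (Y - mu[i])).T @ (Y - mu[i])) / n_i[i]
                      for i in range(g)])

        ll = np.log(dens.sum(axis=1)).sum()      # log-likelihood (Eq. 2.6)
        if abs(ll - prev_ll) < tol:              # stop once log L has stabilized
            break
        prev_ll = ll

    labels = tau.argmax(axis=1)                  # assign each point to its largest posterior
    return pi, mu, V, tau, labels
```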
One interesting but difficult problem in cluster analysis is to determine the number of components. In model-based clustering, the number of components can be chosen by comparing models fitted with different numbers of components on the basis of their maximized log-likelihoods.
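A common concrete recipe, used here as an illustrative assumption rather than something prescribed by this text, is to fit mixtures over a range of g and keep the model with the lowest Bayesian information criterion (BIC), which penalizes the maximized log-likelihood for model complexity:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 1, size=(150, 5)),      # toy data drawn from
               rng.normal(4, 1, size=(150, 5))])     # two well-separated groups

# Fit a Gaussian mixture for each candidate g and compare by BIC (lower is better)
bic = {g: GaussianMixture(n_components=g, random_state=0).fit(Y).bic(Y)
       for g in range(1, 7)}
best_g = min(bic, key=bic.get)
print(bic, best_g)   # best_g should come out as 2 for this toy example
```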
An important advantage of model-based clustering is that it provides an estimated probability that data object i belongs to cluster k. Because gene expression data are typically "highly connected", there may be instances in which a single gene has a high correlation with two different clusters. The probabilistic nature of model-based clustering is therefore particularly suitable for gene expression data. Since the model-based approach assumes that the data are distributed according to a mixture of normal distributions, this assumption should be assessed before the approach is applied. Moreover, because raw gene expression data do not satisfy the normal mixture assumption, the data have to be normalized using the mean and standard deviation.
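A brief sketch of both practical points, gene-wise standardization with the mean and standard deviation followed by a mixture fit and inspection of the posterior membership probabilities, might look as follows; the three-component model and the 0.7 "ambiguity" cut-off are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
expr = rng.lognormal(mean=2.0, sigma=0.8, size=(300, 12))   # toy raw expression matrix

# Row-wise z-score normalization: each gene profile gets mean 0 and standard deviation 1
Z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

gmm = GaussianMixture(n_components=3, random_state=0).fit(Z)
post = gmm.predict_proba(Z)            # estimated probability of each gene per cluster
labels = post.argmax(axis=1)           # hard assignment, if a single label is needed
ambiguous = np.where(post.max(axis=1) < 0.7)[0]   # genes with no clearly dominant cluster
print(labels[:10], len(ambiguous))
```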