Descriptive mining of complex data objects
In modern data mining applications, the data objects are getting more and more complex. Thus,
the extraction of meaningful feature representations yields a variety of
different views on the same set of data objects. Each of these views or
representations might focus on a different aspect and may offer another notion of
similarity. However, in almost any application there is no universal feature
representation that can be used to express similarity between all possible
objects in a meaningful way. Thus, recent data mining approaches employ
multiple representations to achieve more general results that are based on a
variety of aspects.
An example application for multi-represented objects is data mining in protein data. A
protein can be described by multiple feature transformations based upon its
amino acid sequence, its secondary structure or its three-dimensional structure. Another
example is data mining in image data, where each image might be represented by texture features,
color histograms or text annotations. Mining multi-represented objects yields
advantages because more information can be incorporated into the mining
process. On the other hand, the additional information has to be used carefully
since too much information might distort the derived patterns. Basically, we can
distinguish two problems when clustering multi-represented objects:
comparability and semantics. The comparability problem subsumes several effects
when comparing features, distances or statements from different
representations. For example, a distance value of 1,000 might indicate
similarity in some feature space and a distance of 0.5 might indicate
dissimilarity in another space. Thus, directly comparing the distances is not
advisable. Unlike the comparability problem, the semantics problem is caused
by differences in the knowledge that can be derived from each
representation. For example, two images described by very similar text
annotations are very likely to be similar as well.
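The comparability problem described above can be made concrete with a small sketch. It assumes one simple remedy, dividing each distance by the mean pairwise distance of its own representation, which is only one of several possible normalizations and is not taken from the approaches discussed here. The toy data and the names `space_a`, `space_b`, `mean_pairwise_distance` and `normalized_distance` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_pairwise_distance(vectors, dist=euclidean):
    """Average distance over all unordered pairs within one representation."""
    return mean(dist(a, b) for a, b in combinations(vectors, 2))


def normalized_distance(a, b, scale, dist=euclidean):
    """Distance expressed as a multiple of the representation's typical distance."""
    return dist(a, b) / scale


# Hypothetical toy data: the same three objects in two feature spaces
# whose raw distance values live on very different scales.
space_a = [(0.0, 0.0), (1000.0, 0.0), (20000.0, 0.0)]
space_b = [(0.0, 0.0), (0.5, 0.0), (0.6, 0.0)]

scale_a = mean_pairwise_distance(space_a)
scale_b = mean_pairwise_distance(space_b)

# Raw distances between object 0 and object 1 in each space.
raw_a = euclidean(space_a[0], space_a[1])
raw_b = euclidean(space_b[0], space_b[1])

# The same distances relative to what is typical in each space.
rel_a = normalized_distance(space_a[0], space_a[1], scale_a)
rel_b = normalized_distance(space_b[0], space_b[1], scale_b)

print(f"raw:        a={raw_a:.3f}  b={raw_b:.3f}")
print(f"normalized: a={rel_a:.3f}  b={rel_b:.3f}")
```

In this toy setting the raw distance in `space_a` is far larger than in `space_b`, yet after normalization the pair is relatively close in `space_a` and relatively far apart in `space_b`. This is the situation in which comparing raw distances directly would be misleading.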
On the other hand, if the words describing two images are completely disjoint, the
implication that both images are dissimilar is rather weak because it is
possible to describe the same object using a completely different set of words.
Another type of semantics can be found in color histograms. An image of a plane
in blue skies might provide the same color distribution as a sailing boat in
the water. However, if two color images have completely different colors, it is
usually a strong hint that the images are really dissimilar. To cluster
multi-represented objects with respect to both problems, a multi-represented
version of the density-based clustering algorithm DBSCAN has been described. However,
this approach is still very sensitive with respect to its parameterization.
Therefore, the density-based, hierarchical clustering algorithm OPTICS is adapted to the
setting of multi-represented objects. This new version of OPTICS is far
less sensitive to parameter selection. Another problem of density-based
multi-represented clustering is the handling of the semantics problem for
multiple representations.
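To illustrate how the semantics of the representations could enter a density-based core-object test, here is a minimal sketch. It assumes two simple combination rules that are not confirmed by the text above: a union of the local neighborhoods, which suits representations where closeness is the meaningful signal (such as text annotations), and an intersection, which suits representations where only distance is a strong hint (such as color histograms). All function names, the toy data and the parameter values are hypothetical.

```python
import math


def local_neighborhood(idx, vectors, eps, dist=math.dist):
    """Indices of all objects within eps of object idx in one representation.

    The object itself is always included because its distance to itself is 0.
    """
    return {j for j, v in enumerate(vectors) if dist(vectors[idx], v) <= eps}


def combined_neighborhood(idx, representations, eps_values, mode):
    """Combine the local eps-neighborhoods of object idx across representations."""
    local = [
        local_neighborhood(idx, vectors, eps)
        for vectors, eps in zip(representations, eps_values)
    ]
    if mode == "union":
        # Being close in ANY representation is enough.
        return set.union(*local)
    if mode == "intersection":
        # An object must be close in ALL representations.
        return set.intersection(*local)
    raise ValueError(f"unknown mode: {mode!r}")


def is_core_object(idx, representations, eps_values, min_pts, mode):
    """DBSCAN-style core test applied to the combined neighborhood."""
    neighborhood = combined_neighborhood(idx, representations, eps_values, mode)
    return len(neighborhood) >= min_pts


# Hypothetical toy data: four objects, each with two one-dimensional representations.
text_features = [(0.0,), (0.1,), (5.0,), (5.1,)]
color_features = [(0.0,), (9.0,), (0.2,), (9.1,)]
representations = [text_features, color_features]
eps_values = [0.5, 0.5]  # one eps per representation, since scales may differ

union_nb = combined_neighborhood(0, representations, eps_values, "union")
inter_nb = combined_neighborhood(0, representations, eps_values, "intersection")
core_union = is_core_object(0, representations, eps_values, 3, "union")
core_inter = is_core_object(0, representations, eps_values, 3, "intersection")

print("union:", sorted(union_nb), "core:", core_union)
print("intersection:", sorted(inter_nb), "core:", core_inter)
```

With these toy values, object 0 has one neighbor in each representation, but a different one each time. Under the union rule it gathers enough neighbors to count as a core object, while under the intersection rule it does not, which shows how strongly the chosen rule and the per-representation `eps` values shape the result.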