
Wednesday, September 26, 2012

DATA WAREHOUSING AND MINING LECTURE NOTES -- Descriptive mining of complex data objects

Descriptive mining of complex data objects

In modern data mining applications, data objects are becoming more and more complex. As a result, extracting meaningful feature representations yields a variety of different views on the same set of data objects. Each of these views or representations may focus on a different aspect and may offer a different notion of similarity. In almost any application, however, there is no universal feature representation that can express similarity between all possible objects in a meaningful way. Recent data mining approaches therefore employ multiple representations to achieve more general results that are based on a variety of aspects.

 

An example application for multi-represented objects is data mining in protein data. A protein can be described by multiple feature transformations based on its amino acid sequence, its secondary structure or its three-dimensional structure. Another example is data mining in image data, where an image might be represented by texture features, color histograms or text annotations. Mining multi-represented objects is advantageous because more information can be incorporated into the mining process. On the other hand, the additional information has to be used carefully, since too much information might distort the derived patterns. Basically, we can distinguish two problems when clustering multi-represented objects: comparability and semantics. The comparability problem subsumes several effects that arise when comparing features, distances or statements from different representations. For example, a distance value of 1,000 might indicate similarity in one feature space, while a distance of 0.5 might indicate dissimilarity in another. Thus, directly comparing the distances is not advisable. Unlike the comparability problem, the semantics problem is caused by differences in the knowledge that can be derived from each representation. For example, two images described by very similar text annotations are very likely to be similar as well.
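The comparability problem described above can be made concrete with a small sketch. It assumes that rescaling each representation's distances by their mean is an acceptable way to put them on a common scale. This is only one of several possible normalizations and is not taken from the lecture notes. The toy data, `pairwise_distances` and `normalize` are hypothetical names introduced for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of objects, keyed by (i, j)."""
    return {(i, j): dist(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}


def normalize(distances):
    """Rescale distances by their mean, so 1.0 means 'a typical distance'."""
    mean = sum(distances.values()) / len(distances)
    return {pair: d / mean for pair, d in distances.items()}


# Hypothetical toy data: the same three objects in two representations
# whose distance scales differ by several orders of magnitude.
rep_a = [(0.0, 0.0), (1000.0, 0.0), (5000.0, 0.0)]
rep_b = [(0.0,), (0.5,), (0.6,)]

norm_a = normalize(pairwise_distances(rep_a))
norm_b = normalize(pairwise_distances(rep_b))

# Objects 0 and 1 are 1000 apart in rep_a and 0.5 apart in rep_b.
print(norm_a[(0, 1)], norm_b[(0, 1)])
```

After rescaling, the raw distance of 1,000 falls below the typical distance in its space, whereas the raw distance of 0.5 lies above the typical distance in its space, which is why the raw values should not be compared directly.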

 

On the other hand, if the words describing two images are completely disjoint, the implication that the two images are dissimilar is rather weak, because the same object can be described using a completely different set of words. Another type of semantics can be found in color histograms. An image of a plane in a blue sky might have the same color distribution as an image of a sailing boat on the water. However, if two color images have completely different colors, this is usually a strong hint that the images really are dissimilar. To cluster multi-represented objects with respect to both problems, a multi-represented version of the density-based clustering algorithm DBSCAN has been described. However, this approach is still very sensitive to its parameterization. Therefore, the density-based hierarchical clustering algorithm OPTICS is adapted to the setting of multi-represented objects. This new version of OPTICS is far less sensitive to parameter selection. Another problem of density-based multi-represented clustering is how to handle the semantics problem across multiple representations.
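As a rough illustration of how a density-based method might take both problems into account, the sketch below checks the DBSCAN-style core-object condition on multi-represented data. It rests on two assumptions that the notes do not spell out: each representation gets its own radius (so distances from different spaces are never compared directly), and the per-representation neighborhoods are combined either by union or by intersection. The function names, parameters and toy data are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def neighbors(vectors, i, eps):
    """Indices of all objects within eps of object i in one representation (i included)."""
    return {j for j, v in enumerate(vectors) if dist(vectors[i], v) <= eps}


def combined_neighbors(reps, eps_per_rep, i, mode):
    """Combine the per-representation neighborhoods of object i.

    mode="union":        a neighbor in ANY representation counts.
    mode="intersection": a neighbor must be close in ALL representations.
    """
    hoods = [neighbors(vectors, i, eps) for vectors, eps in zip(reps, eps_per_rep)]
    if mode == "union":
        return set.union(*hoods)
    if mode == "intersection":
        return set.intersection(*hoods)
    raise ValueError("mode must be 'union' or 'intersection'")


def is_core(reps, eps_per_rep, i, min_pts, mode):
    """DBSCAN-style core condition evaluated on the combined neighborhood."""
    return len(combined_neighbors(reps, eps_per_rep, i, mode)) >= min_pts


# Hypothetical toy data: four images, each with a text and a color feature.
text = [(0.0,), (0.1,), (5.0,), (5.1,)]
color = [(0.0,), (4.0,), (0.1,), (4.1,)]
reps, eps_per_rep = [text, color], [0.5, 0.5]

print(combined_neighbors(reps, eps_per_rep, 0, "union"))
print(combined_neighbors(reps, eps_per_rep, 0, "intersection"))
print(is_core(reps, eps_per_rep, 0, 2, "union"),
      is_core(reps, eps_per_rep, 0, 2, "intersection"))
```

Under these assumptions, the union variant suits representations in which closeness is a strong signal of similarity (such as very similar text annotations), while the intersection variant is stricter and suits representations in which closeness alone is weak evidence (such as matching color histograms).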
