
Tuesday, September 25, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES--Data Discretization and Concept Hierarchy Generation






Data Discretization and Concept Hierarchy Generation

  1. Unsupervised discretization - the class variable is not used.
    • Equal-interval (equiwidth) binning: split the whole range of values into intervals of equal width.
    • Equal-frequency (equidepth) binning: use intervals that each contain an equal number of values.
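The two unsupervised schemes above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `equal_width_bins` and `equal_frequency_bins`, the sample `data`, and the choice of `k = 3` bins are all assumptions made for the example.

```python
def equal_width_bins(values, k):
    """Return k + 1 edges that split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def equal_frequency_bins(values, k):
    """Return k lists of sorted values, each holding about len(values) / k values."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i covers positions [i*n//k, (i+1)*n//k), so bin sizes differ by at most one.
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Made-up sample data, used only for illustration.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
edges = equal_width_bins(data, 3)       # interval boundaries
groups = equal_frequency_bins(data, 3)  # the values that fall in each bin
print(edges)
print(groups)
```

Note the difference in what stays fixed: equal-width binning fixes the interval size and lets the counts vary, while equal-frequency binning fixes the counts and lets the interval sizes vary.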
  2. Supervised discretization - uses the values of the class variable.
    • Using class boundaries. Three steps:
      • Sort values.
      • Place breakpoints between values belonging to different classes.
      • If there are too many intervals, merge neighboring intervals with equal or similar class distributions.
    • Entropy (information)-based discretization. Example:
      • Information in a class distribution:
        • Denote a set of five values occurring in tuples belonging to two classes (+ and -) as [+,+,+,-,-].
        • That is, the first 3 values belong to "+" tuples and the last 2 belong to "-" tuples.
        • Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5) - (2/5)*log(2/5) ≈ 0.971 bits (logs are base 2).
        • 3/5 and 2/5 are the relative frequencies (probabilities) of the two classes.
        • Ignoring the order of the values, we can use the notation [3,2], meaning 3 values from one class and 2 from the other.
        • Then, Info([3,2]) = -(3/5)*log(3/5) - (2/5)*log(2/5).
      • Information in a split (the weights 2/5 and 3/5 are the fractions of values that fall in each interval):
        • Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
        • Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2]) = (2/5)*0 + (3/5)*0.918 ≈ 0.551
      • Method:
        • Sort the values.
        • Calculate the information in every possible split.
        • Choose the split that minimizes the information.
        • Do not place breakpoints between values belonging to the same class (such a breakpoint never gives less information than the best class-boundary breakpoint).
        • Apply the same procedure to the resulting intervals until some stopping criterion is satisfied.
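One step of the entropy-based method above might look like the following Python sketch. The helper names `info`, `split_info`, and `best_split` are hypothetical, the numeric values 1 to 5 attached to the labels are made up, and the breakpoint is placed halfway between neighboring values, which is an assumption the notes do not specify. Only a single split is shown; the recursive application and the stopping criterion are left out.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy (base 2) of the class distribution in a list of labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def split_info(left, right):
    """Weighted average of the information in the two sides of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * info(left) + (len(right) / n) * info(right)


def best_split(pairs):
    """Return (breakpoint, information) for the best class-boundary split.

    pairs is a list of (value, class_label) tuples.
    Returns None when there is no class boundary to split on.
    """
    pairs = sorted(pairs)
    labels = [label for _, label in pairs]
    best = None
    for i in range(1, len(pairs)):
        # Skip positions between values of the same class, and between equal values.
        if labels[i - 1] == labels[i] or pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # assumed: midpoint breakpoint
        score = split_info(labels[:i], labels[i:])
        if best is None or score < best[1]:
            best = (cut, score)
    return best


# The distribution [+,+,+,-,-] from the notes, with made-up values 1..5.
example = list(zip([1, 2, 3, 4, 5], ["+", "+", "+", "-", "-"]))
print(info(["+", "+", "+", "-", "-"]))            # Info([3,2])
print(split_info(["+", "+"], ["+", "-", "-"]))    # Info([2,0],[1,2])
print(best_split(example))
```

For this example the only class boundary lies between the third and fourth values, and splitting there yields two pure intervals, so the information of the split is zero.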
  3. Generating concept hierarchies: recursively apply partitioning or discretization methods, so that each round of splitting adds one level to the hierarchy.
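As a rough illustration of generating a hierarchy by recursive partitioning, the sketch below applies equal-frequency splitting repeatedly and records the value range covered at each level. The function name `build_hierarchy`, the parameters `fanout` and `min_size`, the nested-dict representation, and the sample data are all assumptions made for this example; any other discretization method could be substituted for the splitting step.

```python
def build_hierarchy(values, fanout=2, min_size=2):
    """Recursively partition values into equal-frequency groups.

    Returns a nested dict of the form {"range": (lo, hi), "children": [...]}.
    """
    ordered = sorted(values)
    node = {"range": (ordered[0], ordered[-1]), "children": []}
    n = len(ordered)
    # Stopping criterion: do not split if the groups would fall below min_size.
    if n < fanout * min_size:
        return node
    for i in range(fanout):
        part = ordered[i * n // fanout:(i + 1) * n // fanout]
        node["children"].append(build_hierarchy(part, fanout, min_size))
    return node


def show(node, depth=0):
    """Print the hierarchy with one indentation step per level."""
    print("  " * depth + str(node["range"]))
    for child in node["children"]:
        show(child, depth + 1)


# Made-up sample data, used only for illustration.
tree = build_hierarchy([4, 8, 9, 15, 21, 21, 24, 25])
show(tree)
```

The root covers the full range of the data, and each deeper level describes the same values with narrower intervals, which is what lets a concept hierarchy be read at different levels of detail.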

 

 
