
Tuesday, September 25, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES--Data Integration and transformation


Data Integration:

 

Data integration combines data from multiple sources. It involves:

  • Schema integration: integrating metadata from different sources.
  • Entity identification problem: identifying the same real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-# (the same attribute under different names).
  • Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations or different scales, e.g., metric vs. British units.

 

Handling Redundant Data in Data Integration

Redundant data often occur when multiple databases are integrated. The same attribute may have different names in different databases, and an attribute in one table may be a “derived” attribute in another table. Redundant data can often be detected by correlation analysis. Careful integration of data from multiple sources can help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
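As a minimal sketch of the correlation analysis mentioned above, the following computes the Pearson correlation coefficient between two numeric attributes. The column names and values are hypothetical, and Pearson's coefficient is only one possible choice of correlation measure for numeric data. A coefficient close to +1 or -1 suggests that one attribute may be redundant given the other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither attribute is constant (otherwise the denominator is zero).
    return cov / math.sqrt(var_x * var_y)

# Hypothetical attributes from two sources: the same lengths in cm and in inches.
length_cm = [10.0, 20.0, 30.0, 40.0]
length_in = [x / 2.54 for x in length_cm]
# A hypothetical attribute with no strong linear relationship to the lengths.
unrelated = [3.0, 1.0, 4.0, 1.0]

r_redundant = pearson(length_cm, length_in)  # very close to 1: likely redundant
r_unrelated = pearson(length_cm, unrelated)  # much weaker correlation
```

A high correlation is only a hint: whether to drop one of the attributes is still a judgment call for the person integrating the data.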

Data transformation:

  1. Normalization:
    • Scaling attribute values to fall within a specified range.
      • Example: to transform V in [min, max] to V' in [0, 1], apply V' = (V - min) / (max - min)
    • Scaling by using the mean and standard deviation (useful when min and max are unknown or when there are outliers): V' = (V - mean) / stdev
  2. Aggregation: moving up in the concept hierarchy on numeric attributes.
  3. Generalization: moving up in the concept hierarchy on nominal attributes.
  4. Attribute construction: replacing attributes with, or adding, new attributes derived from existing attributes.
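The two normalization formulas from item 1 can be sketched as follows. The function names and the sample values are illustrative only, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the notes do not say which one is meant.

```python
import statistics

def min_max(values):
    """Scale values to [0, 1] using V' = (V - min) / (max - min)."""
    lo, hi = min(values), max(values)
    # Assumes the values are not all identical (otherwise max - min is zero).
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Scale values using V' = (V - mean) / stdev."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / stdev for v in values]

# Hypothetical attribute values.
incomes = [20.0, 40.0, 60.0, 100.0]

scaled = min_max(incomes)        # [0.0, 0.25, 0.5, 1.0]
standardized = z_score(incomes)  # mean 0 and standard deviation 1
```

Min-max scaling preserves the relative spacing of values but is sensitive to extreme values, which is why scaling by mean and standard deviation is suggested when outliers are present.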

Data reduction:

  1. Reducing the number of attributes
    • Data cube aggregation: applying roll-up, slice or dice operations.
    • Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space
    • Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data.
  2. Reducing the number of attribute values
    • Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
    • Clustering: grouping values in clusters.
    • Aggregation or generalization
  3. Reducing the number of tuples
    • Sampling
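To make the binning item above concrete, here is a small sketch of two common ways to choose the intervals: bins of equal width and bins holding an equal number of values. The function names and the sample data are hypothetical. Each function returns a bin index per value, so an attribute with many distinct values is reduced to at most k distinct values.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    # Assumes the values are not all identical (otherwise the width is zero).
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Bins are assigned by rank, so tied values can end up in different bins.
        bins[i] = rank * k // len(values)
    return bins

# Hypothetical attribute values.
ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]

by_width = equal_width_bins(ages, 3)      # [0, 0, 1, 1, 1, 2, 2, 2, 2]
by_depth = equal_frequency_bins(ages, 3)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With equal widths the bins can be very unevenly filled when the data are skewed; with equal frequencies the bin sizes are balanced but the interval widths vary.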

 

Discretization and generating concept hierarchies:

  1. Unsupervised discretization - class variable is not used.
    • Equal-interval (equiwidth) binning: split the whole range of numbers in intervals with equal size.
    • Equal-frequency (equidepth) binning: use intervals containing equal number of values.
  2. Supervised discretization - uses the values of the class variable.
    • Using class boundaries. Three steps:
      • Sort values.
      • Place breakpoints between values belonging to different classes.
      • If too many intervals, merge intervals with equal or similar class distributions.
    • Entropy (information)-based discretization. Example:
      • Information in a class distribution:
        • Denote a set of five values occurring in tuples belonging to two classes (+ and -) as [+,+,+,-,-]
        • That is, the first 3 values occur in "+" tuples and the last 2 in "-" tuples
        • Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5)-(2/5)*log(2/5) (logs are base 2)
        • 3/5 and 2/5 are relative frequencies (probabilities)
        • Ignoring the order of the values, we can use the notation [3,2], meaning 3 values from one class and 2 from the other.
        • Then, Info([3,2]) = -(3/5)*log(3/5)-(2/5)*log(2/5) ≈ 0.971 bits
      • Information in a split (2/5 and 3/5 are weight coefficients):
        • Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
        • Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
      • Method:
        • Sort the values;
        • Calculate information in all possible splits;
        • Choose the split that minimizes information;
        • Do not include breakpoints between values belonging to the same class (this will increase information);
        • Apply the same to the resulting intervals until some stopping criterion is satisfied.
  3. Generating concept hierarchies: recursively applying partitioning or discretization methods.
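The entropy-based method in item 2 can be sketched as below for a single binary split. This is an illustration under simplifying assumptions, not a full discretizer: the input is the list of class labels in sorted-value order, each value is assumed to be distinct, and the recursion and stopping criterion are left out. The function names are hypothetical.

```python
import math

def info(counts):
    """Information (entropy in bits) of a class distribution, e.g. [3, 2]."""
    total = sum(counts)
    # log2(total / c) equals -log2(c / total); empty classes contribute nothing.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def split_info(left, right):
    """Weighted information of a split into two class distributions."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * info(left) + (n_right / n) * info(right)

def best_split(labels):
    """Return (position, information) of the binary split with minimal information.

    labels: class labels listed in sorted-value order (one label per value).
    A split at position i separates labels[:i] from labels[i:].
    Returns None if all labels belong to the same class.
    """
    classes = sorted(set(labels))
    best = None
    for i in range(1, len(labels)):
        if labels[i - 1] == labels[i]:
            continue  # no breakpoint between values of the same class
        left = [labels[:i].count(c) for c in classes]
        right = [labels[i:].count(c) for c in classes]
        score = split_info(left, right)
        if best is None or score < best[1]:
            best = (i, score)
    return best

whole = info([3, 2])                    # about 0.971 bits
example = split_info([2, 0], [1, 2])    # about 0.551 bits
chosen = best_split(list("+++--"))      # (3, 0.0): split between the 3rd and 4th value
```

For the five-value example from the notes, the split [+,+],[+,-,-] carries about 0.551 bits, while the split at the class boundary, [+,+,+],[-,-], carries 0 bits and is therefore the one selected.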

 

 

 
