Data integration:
combines data from multiple sources.
- Schema integration: integrating metadata from different sources.
- Entity identification problem: identifying real-world entities from multiple data sources, e.g., do A.cust-id and B.cust-# refer to the same attribute?
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons are different representations or different scales, e.g., metric vs. British units.
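To make the entity identification step concrete, here is a minimal sketch using pandas; the table contents and the column names cust_id and cust_num are made up for illustration. Two sources describe the same customers under different key names, and the naming conflict is resolved in the join:

    import pandas as pd

    # Source A keys customers by "cust_id"; source B uses "cust_num"
    # for the same real-world entity (hypothetical data).
    a = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Oslo", "Bergen", "Oslo"]})
    b = pd.DataFrame({"cust_num": [2, 3, 4], "balance": [100.0, 250.0, 80.0]})

    # Resolve the A.cust-id / B.cust-# naming conflict by joining on both keys.
    merged = pd.merge(a, b, left_on="cust_id", right_on="cust_num", how="inner")
    print(merged)  # customers 2 and 3, with attributes from both sources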
Handling Redundant Data in Data Integration:
Redundant data occur often when integrating multiple databases:
- The same attribute may have different names in different databases.
- One attribute may be a "derived" attribute in another table.
Redundant attributes may be detected by correlation analysis, as in the sketch below. Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
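As a minimal sketch of such a correlation analysis (attribute values made up for illustration), the following computes the Pearson correlation between an attribute and one derived from it; a coefficient near +1 or -1 suggests that one of the two attributes is redundant:

    import numpy as np

    price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    price_with_tax = price * 1.25  # a "derived" attribute from another table

    # Pearson correlation coefficient, read off the 2x2 correlation matrix.
    r = np.corrcoef(price, price_with_tax)[0, 1]
    print(f"correlation = {r:.3f}")  # 1.000 here: perfectly redundant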
Data transformation:
- Normalization:
- Scaling attribute values to
fall within a specified range.
- Example: to transform V in [min, max]
to V' in [0,1], apply V'=(V-Min)/(Max-Min)
- Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers): V'=(V-Mean)/StDev (see the sketch after this list).
- Aggregation: moving up in the
concept hierarchy on numeric attributes.
- Generalization: moving up in
the concept hierarchy on nominal attributes.
- Attribute construction: replacing or adding new attributes inferred from existing attributes.
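A minimal sketch of the two normalizations above, on made-up values:

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max normalization to [0, 1]: V' = (V - Min) / (Max - Min)
    v_minmax = (v - v.min()) / (v.max() - v.min())

    # Z-score normalization: V' = (V - Mean) / StDev
    v_zscore = (v - v.mean()) / v.std()

    print(v_minmax)  # -> 0, 0.125, 0.25, 0.5, 1
    print(v_zscore)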
Data reduction:
- Reducing the number of attributes
- Data cube aggregation: applying roll-up, slice or dice
operations.
- Removing irrelevant attributes: attribute selection (filter and wrapper methods), searching the attribute space.
- Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data (see the sketch after this list).
- Reducing the number of attribute values
- Binning (histograms): reducing the number of values by grouping them into intervals (bins).
- Clustering: grouping values into clusters.
- Aggregation or generalization
- Reducing the number of tuples
- Sampling
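A minimal sketch of principal component analysis with plain NumPy, on random toy data: center the attributes, compute the covariance matrix, and project onto its top-k eigenvectors.

    import numpy as np

    def pca(x, k):
        """Reduce x (n samples, d attributes) to k dimensions."""
        x_centered = x - x.mean(axis=0)
        cov = np.cov(x_centered, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
        return x_centered @ top_k

    x = np.random.default_rng(0).normal(size=(100, 5))  # 100 tuples, 5 attributes
    print(pca(x, 2).shape)  # (100, 2)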
Discretization and generating concept hierarchies:
- Unsupervised discretization: the class variable is not used.
- Equal-interval (equiwidth) binning: split the whole range of numbers into intervals of equal size.
- Equal-frequency (equidepth) binning: use intervals containing an equal number of values.
- Supervised discretization: uses the values of the class variable.
- Using class boundaries. Three
steps:
- Sort values.
- Place breakpoints between
values belonging to different classes.
- If too many intervals, merge
intervals with equal or similar class distributions.
- Entropy (information)-based
discretization. Example:
- Information in a class
distribution:
- Denote a set of five values
occurring in tuples belonging to two classes (+ and -) as [+,+,+,-,-]
- That is, the first three values belong to "+" tuples and the last two to "-" tuples.
- Then, Info([+,+,+,-,-]) =
-(3/5)*log(3/5)-(2/5)*log(2/5) (logs are base 2)
- 3/5 and 2/5 are relative
frequencies (probabilities)
- Ignoring the order of the values, we can use the notation [3,2], meaning 3 values from one class and 2 from the other.
- Then, Info([3,2]) = -(3/5)*log(3/5)-(2/5)*log(2/5)
- Information in a split (2/5
and 3/5 are weight coefficients):
- Info([+,+],[+,-,-]) =
(2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
- Or, Info([2,0],[1,2]) =
(2/5)*Info([2,0]) + (3/5)*Info([1,2])
- Method:
- Sort the values;
- Calculate information in all
possible splits;
- Choose the split that
minimizes information;
- Do not consider breakpoints between values belonging to the same class (such splits never minimize the information);
- Apply the same procedure to the resulting intervals until some stopping criterion is satisfied (see the sketch after this list).
- Generating concept hierarchies:
recursively applying partitioning or discretization methods.
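The entropy-based method above can be sketched directly in Python; the function names and the toy values and labels are made up for illustration. info computes the information of a class distribution, and best_split evaluates only the breakpoints between values of different classes, returning the one that minimizes the weighted information:

    import math

    def info(labels):
        """Information of a class distribution, e.g. info(["+","+","+","-","-"])."""
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    def best_split(values, labels):
        """Return (information, breakpoint) of the best single split."""
        pairs = sorted(zip(values, labels))
        n, best = len(pairs), None
        for i in range(1, n):
            if pairs[i - 1][1] == pairs[i][1]:
                continue  # skip breakpoints inside a single class
            left = [c for _, c in pairs[:i]]
            right = [c for _, c in pairs[i:]]
            split_info = (i / n) * info(left) + ((n - i) / n) * info(right)
            midpoint = (pairs[i - 1][0] + pairs[i][0]) / 2
            if best is None or split_info < best[0]:
                best = (split_info, midpoint)
        return best

    print(best_split([1, 2, 3, 7, 8], ["+", "+", "+", "-", "-"]))  # (0.0, 5.0)

Applying best_split recursively to the resulting intervals, until a stopping criterion is satisfied, yields the full discretization.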