Data Discretization and Concept Hierarchy Generation
- Unsupervised discretization: the class variable is not used.
- Equal-interval (equiwidth) binning: split the whole range of values into intervals of equal size.
- Equal-frequency (equidepth) binning: use intervals containing an equal number of values.
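
The two unsupervised binning schemes can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `equal_width_bins` and `equal_frequency_bins`, the parameter `k` (number of bins), and the sample data are assumptions made for the example.

```python
def equal_width_bins(values, k):
    """Return the k-1 cut points that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_bins(values, k):
    """Return k lists of sorted values, each holding (as nearly as possible) the same count."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i takes the slice ordered[i*n//k : (i+1)*n//k].
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # made-up sample with one outlier
print(equal_width_bins(data, 3))       # [34.0, 67.0]
print(equal_frequency_bins(data, 3))   # [[1, 2, 3], [4, 5, 6], [7, 8, 100]]
```

With this sample, equal-width binning puts eight of the nine values into the first interval because of the outlier, while equal-frequency binning keeps the bins balanced.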
- Supervised discretization: uses the values of the class variable.
- Using class boundaries. Three steps:
- Sort the values.
- Place breakpoints between adjacent values belonging to different classes.
- If there are too many intervals, merge intervals with equal or similar class distributions.
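
The first two steps of the class-boundary method can be sketched as below. Placing each breakpoint at the midpoint between the two neighbouring values is an assumption of this sketch (the text only says "between"), the function name `class_boundary_breakpoints` is hypothetical, and the merging step is left out.

```python
def class_boundary_breakpoints(values, labels):
    """Sort by value, then cut midway between neighbours that differ in class."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        # Equal values cannot be separated by a breakpoint, so they are skipped.
        if c1 != c2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts


values = [4, 1, 6, 2, 5, 3]                 # made-up sample, deliberately unsorted
labels = ["-", "+", "+", "+", "+", "-"]
print(class_boundary_breakpoints(values, labels))   # [2.5, 4.5]
```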
- Entropy (information)-based discretization. Example:
- Information in a class distribution:
- Denote a set of five values occurring in tuples belonging to two classes (+ and -) as [+,+,+,-,-].
- That is, the first 3 values belong to "+" tuples and the last 2 to "-" tuples.
- Then Info([+,+,+,-,-]) = -(3/5)*log(3/5) - (2/5)*log(2/5) (logs are base 2).
- 3/5 and 2/5 are the relative frequencies (probabilities) of the two classes.
- Ignoring the order of the values, we can use the notation [3,2], meaning 3 values from one class and 2 from the other.
- Then Info([3,2]) = -(3/5)*log(3/5) - (2/5)*log(2/5).
- Information in a split (2/5 and 3/5 are weight coefficients: the fraction of the values that falls in each interval):
- Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
- Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
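
To check the arithmetic, the two formulas can be evaluated with a short script. The function names `info` and `split_info` are chosen for this sketch, and a term with a zero count is taken to contribute 0 (the usual convention for 0*log 0).

```python
from math import log2


def info(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [3, 2]."""
    total = sum(counts)
    # Zero counts are skipped: 0 * log(0) is treated as 0.
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def split_info(*parts):
    """Weighted average of info() over the intervals of a split."""
    n = sum(sum(p) for p in parts)
    return sum(sum(p) / n * info(p) for p in parts)


print(round(info([3, 2]), 3))                 # 0.971
print(round(split_info([2, 0], [1, 2]), 3))   # 0.551
```

The split [2,0],[1,2] carries less information (0.551 bits) than the unsplit distribution [3,2] (0.971 bits), because the first interval is pure and contributes nothing.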
- Method:
- Sort the values.
- Calculate the information in all possible splits.
- Choose the split that minimizes information.
- Do not place breakpoints between values belonging to the same class (splitting inside a run of same-class values only increases information).
- Apply the same procedure to the resulting intervals until some stopping criterion is satisfied.
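
The method above can be sketched as a small recursive routine. Several details here are assumptions of the sketch rather than part of the method as stated: the stopping criterion (an interval is pure, or a hypothetical `max_depth` is reached), the use of midpoints as cut values, and the names `best_cut` and `discretize`.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """pairs: sorted (value, label) list. Return (split information, cut) or None."""
    n = len(pairs)
    best = None
    for i in range(1, n):
        (v1, c1), (v2, c2) = pairs[i - 1], pairs[i]
        if c1 == c2 or v1 == v2:
            continue  # only class boundaries are candidate breakpoints
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = len(left) / n * info(left) + len(right) / n * info(right)
        if best is None or score < best[0]:
            best = (score, (v1 + v2) / 2)
    return best


def discretize(pairs, max_depth=2):
    """Return sorted cut points; max_depth is an assumed stopping criterion."""
    pairs = sorted(pairs)
    if max_depth == 0 or len({c for _, c in pairs}) < 2:
        return []  # depth exhausted or interval already pure
    found = best_cut(pairs)
    if found is None:
        return []
    cut = found[1]
    left = [p for p in pairs if p[0] < cut]
    right = [p for p in pairs if p[0] > cut]
    return discretize(left, max_depth - 1) + [cut] + discretize(right, max_depth - 1)


data = [(1, "+"), (2, "+"), (3, "-"), (4, "-"), (5, "+"), (6, "+")]
print(discretize(data))   # [2.5, 4.5]
```

In practice the stopping criterion is often a minimum information gain or a minimum interval size; the depth limit is used here only to keep the example short.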
- Generating concept hierarchies: recursively apply partitioning or discretization methods.
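
As one possible illustration, a numeric concept hierarchy can be built by recursively splitting a range into sub-intervals. Equal-width partitioning, the `fanout` and `depth` parameters, and the nested-dictionary representation are all assumptions of this sketch; any of the discretization methods above could be substituted for the split step.

```python
def build_hierarchy(lo, hi, depth, fanout=2):
    """Return a nested dict in which each interval has `fanout` equal-width children."""
    node = {"interval": (lo, hi), "children": []}
    if depth > 0:
        width = (hi - lo) / fanout
        for i in range(fanout):
            child = build_hierarchy(lo + i * width, lo + (i + 1) * width,
                                    depth - 1, fanout)
            node["children"].append(child)
    return node


def show(node, indent=0):
    """Print the hierarchy, one interval per line, indented by level."""
    print(" " * indent + str(node["interval"]))
    for child in node["children"]:
        show(child, indent + 2)


tree = build_hierarchy(0, 100, depth=2)
show(tree)
```

Each level of the resulting tree is a coarser or finer view of the same attribute: the root covers the whole range, and the leaves are the finest intervals.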