Mining Various Kinds of Association Rules :
We consider additional application requirements by extending our scope to include
- Mining multilevel association rules,
- Multidimensional association rules, and
- Quantitative association rules in transactional and/or relational databases and data warehouses.
Multilevel association rules involve concepts at different levels of abstraction.
Multidimensional association rules involve more than one dimension or predicate (e.g., rules relating what a customer buys as well as the customer’s age).
Quantitative association rules involve numeric attributes that have an implicit ordering among values (e.g., age).
3.2.1. MINING MULTILEVEL ASSOCIATION RULES :
Association rules generated from mining data at
multiple levels of abstraction are called multiple-level
or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework.
It is difficult to find strong associations among
data items at low or primitive levels of abstraction due to the sparsity of
data at those levels. Strong associations discovered at high levels of
abstraction may represent commonsense knowledge.
Data mining systems should provide capabilities for
mining association rules at multiple levels of abstraction, with sufficient
flexibility for easy traversal among different abstraction spaces.
For example, consider the following task-relevant data:
Table. Task-relevant data, D.
A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher level, more general concepts. Data
can be generalized by replacing low-level concepts within the data by their
higher-level concepts, or ancestors, from a concept hierarchy.
Fig. A concept hierarchy for AllElectronics computer items.
The concept hierarchy of above figure has five levels, respectively referred to as
levels 0 to 4, starting with level 0 at the root node for all (the most general
abstraction level). Here, level 1 includes computer, software, printer &
camera, and computer accessory, level 2 includes laptop computer,
desktop computer, office software, antivirus software, . . . , and
level 3 includes IBM desktop computer, . . . , Microsoft office software,
and so on. Level 4 is the most specific abstraction level of this hierarchy. It
consists of the raw data values.
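As a sketch of how data can be generalized under such a hierarchy, the fragment below replaces each item by its ancestor. The parent map is an illustrative subset modeled on the figure above, not the full AllElectronics taxonomy.

```python
# Each item maps to its parent (ancestor) one level up the hierarchy.
parent = {
    "IBM desktop computer": "desktop computer",
    "Dell desktop computer": "desktop computer",
    "desktop computer": "computer",
    "laptop computer": "computer",
    "Microsoft office software": "office software",
    "office software": "software",
}

def generalize(item, levels=1):
    """Replace an item by its ancestor `levels` steps up the hierarchy."""
    for _ in range(levels):
        item = parent.get(item, item)  # stop once the root is reached
    return item

transaction = ["IBM desktop computer", "Microsoft office software"]
print([generalize(i) for i in transaction])
# -> ['desktop computer', 'office software']
```

Generalizing all transactions this way before counting is what lets frequent itemsets be mined level by level.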
In general, a top-down strategy is employed, where
counts are accumulated for the calculation of frequent itemsets at each concept
level, starting at the concept level 1 and working downward in the hierarchy
toward the more specific concept levels, until no more frequent itemsets can be
found. Three common methods are as follows:
- Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction.
For example, in the figure below, a minimum support threshold of 5% is used throughout (e.g., for mining from “computer” down to “laptop computer”). Both “computer” and “laptop computer” are found to be frequent, while “desktop computer” is not.
When a uniform minimum support threshold is used,
the search procedure is simplified. The method is also simple in that users are
required to specify only one minimum support threshold.
Fig. Multilevel mining with uniform support.
Drawbacks:
1. If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels.
2. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels.
- Using reduced minimum support at lower levels (referred to as reduced support): Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold is.
For example, in Figure 5.12, the minimum
support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way,
“computer,” “laptop computer,” and “desktop computer” are all
considered frequent.
- Using item or group-based minimum support (referred to as group-based support): Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-, or group-based minimum support thresholds when mining multilevel rules.
For example, a user could set up the minimum
support thresholds based on product price, or on items of interest, such as by
setting particularly low support thresholds for laptop computers and flash
drives in order to pay particular attention to the association patterns
containing items in these categories.
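The three threshold strategies can be sketched as a single policy function. The threshold values, level numbers, and item group below are illustrative assumptions, not values from any real system.

```python
UNIFORM = 0.05                        # 5% at every level
REDUCED = {1: 0.05, 2: 0.03}          # smaller thresholds at deeper levels
GROUP   = {"laptop computer": 0.01}   # extra-low threshold for a key group

def min_support(strategy, level=1, item=None):
    """Return the minimum support threshold under a given strategy."""
    if strategy == "uniform":
        return UNIFORM
    if strategy == "reduced":
        return REDUCED[level]
    if strategy == "group":
        # fall back to the uniform threshold for items with no
        # user-specified group threshold
        return GROUP.get(item, UNIFORM)

# "laptop computer" with 4% support is frequent only under the
# reduced (level-2) and group-based policies, not the uniform one.
support = 0.04
for s in ("uniform", "reduced", "group"):
    print(s, support >= min_support(s, level=2, item="laptop computer"))
```

This makes the trade-off concrete: uniform support is the simplest to specify, while the other two recover low-level patterns at the cost of more user input.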
A serious side effect of mining
multilevel association rules is its generation of many redundant rules across
multiple levels of abstraction due to the “ancestor” relationships among items.
3.2.2. Mining Multidimensional Association Rules from Relational Databases and Data Warehouses :
A rule such as

buys(X, “computer”) ⇒ buys(X, “antivirus software”)

is called a single-dimensional or intradimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule).
We can also mine association rules containing multiple predicates, such as

age(X, “20…29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”)

Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules. The rule above contains three predicates (age, occupation, and buys), each of which occurs only once in the rule. Hence, we say that it has no repeated predicates.
Multidimensional
association rules with no repeated predicates are called interdimensional association rules.
We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An example is

age(X, “20…29”) ∧ buys(X, “laptop”) ⇒ buys(X, “printer”)

where the predicate buys is repeated.
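The distinction between the two rule kinds comes down to whether any predicate repeats. The sketch below models rules as (antecedent, consequent) lists of (predicate, value) pairs; the sample rules are illustrative.

```python
def classify(rule):
    """Label a multidimensional rule by checking for repeated predicates."""
    antecedent, consequent = rule
    predicates = [p for p, _ in antecedent + consequent]
    if len(predicates) == len(set(predicates)):
        return "interdimensional"
    return "hybrid-dimensional"

r1 = ([("age", "20..29"), ("occupation", "student")], [("buys", "laptop")])
r2 = ([("age", "20..29"), ("buys", "laptop")], [("buys", "printer")])
print(classify(r1))  # each predicate occurs once -> interdimensional
print(classify(r2))  # buys occurs twice -> hybrid-dimensional
```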
Database
attributes can be categorical or
quantitative.
Categorical
attributes have a finite number of possible values, with no ordering among
the values (e.g., occupation, brand, color). Categorical
attributes are also called nominal
attributes, because their values are “names of things.”
Quantitative attributes
are numeric and have an implicit ordering among values (e.g., age, income,
price). Techniques for mining multidimensional association rules can be
categorized into two basic approaches regarding the treatment of quantitative
attributes.
- Quantitative attributes are discretized using predefined concept hierarchies. This discretization occurs before mining (mining multidimensional association rules using static discretization of quantitative attributes).
- Quantitative attributes are discretized or clustered into “bins” based on the distribution of the data. These bins may be further combined during the mining process. The discretization process is dynamic and established so as to satisfy some mining criteria, such as maximizing the confidence of the rules mined (also referred to as (dynamic) quantitative association rules).
3.2.2.1. Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes
The transformed multidimensional data may be used to construct a data cube. Data cubes are well suited for the mining of multidimensional association rules: they store aggregates (such as counts) in multidimensional space, which is essential for computing the support and confidence of multidimensional association rules.
The figure shows the lattice of cuboids
defining a data cube for the dimensions age, income, and buys.
The
base cuboid aggregates the task-relevant data by age, income, and buys;
the 2-D cuboid, (age, income), aggregates by age and income,
and so on; the 0-D (apex) cuboid contains the total number of transactions in
the task-relevant data.
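The cuboid lattice can be sketched in plain Python by aggregating counts over every subset of the dimensions. The tuples below are illustrative, and age/income are assumed to be already statically discretized into intervals.

```python
from collections import Counter
from itertools import combinations

dims = ("age", "income", "buys")
data = [
    ("20..29", "40K..49K", "laptop"),
    ("20..29", "40K..49K", "laptop"),
    ("30..39", "50K..59K", "HDTV"),
]

# One Counter per cuboid: () is the apex, ("age","income","buys") the base.
cube = {}
for k in range(len(dims) + 1):
    for cuboid in combinations(range(len(dims)), k):
        counts = Counter(tuple(row[i] for i in cuboid) for row in data)
        cube[tuple(dims[i] for i in cuboid)] = counts

print(cube[()][()])                                     # apex: total tuples
print(cube[("age", "income")][("20..29", "40K..49K")])  # a 2-D cuboid cell
```

Support of a predicate set is just its cuboid count divided by the apex count, which is why precomputed cubes make multidimensional rule mining efficient.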
Mining Quantitative Association Rules :
Quantitative association rules are multidimensional
association rules in which the numeric attributes are dynamically discretized
during the mining process so as to satisfy some mining criteria, such as
maximizing the confidence or compactness of the rules mined.
In this section, we focus specifically on how to
mine quantitative association rules having two quantitative attributes on the
left-hand side of the rule and one categorical attribute on the right-hand side
of the rule. That is, rules of the form

Aquan1 ∧ Aquan2 ⇒ Acat

where Aquan1 and Aquan2 are tests on quantitative attribute intervals (where the intervals are dynamically determined), and Acat tests a categorical attribute from the task-relevant data. Such rules have been referred to as two-dimensional quantitative association rules, because they contain two quantitative dimensions.
An example of such a 2-D quantitative association rule is

age(X, “30…39”) ∧ income(X, “42K…48K”) ⇒ buys(X, “HDTV”)
ARCS (Association Rule Clustering System) : It is used to find two-dimensional quantitative association rules. This
approach maps pairs of quantitative attributes onto a 2-D grid for tuples
satisfying a given categorical attribute condition. The grid is then searched
for clusters of points from which the association rules are generated.
The following
steps are involved in ARCS:
Binning:
Quantitative attributes can have a very wide range of values defining their domain.
Just think about how big a 2-D grid would be if we plotted age and income
as axes, where each possible value of age was assigned a unique
position on one axis, and similarly, each possible value of income was
assigned a unique position on the other axis! To keep grids down to a
manageable size, we instead partition the ranges of quantitative attributes into
intervals. These intervals are dynamic in that they may later be further
combined during the mining process. The partitioning process is referred to as binning.
Three common binning strategies are as follows:
- Equal-width binning, where the interval size of each bin is the same.
- Equal-frequency binning, where each bin has approximately the same number of tuples assigned to it.
- Clustering-based binning, where clustering is performed on the quantitative attribute to group neighboring points (judged based on various distance measures) into the same bin.
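The first two strategies can be sketched in a few lines of Python. The data and bin count are illustrative, and the equal-frequency version assumes distinct values for simplicity (clustering-based binning would need a distance measure and is omitted here).

```python
values = [23, 25, 27, 31, 35, 44, 52, 60]  # e.g., an age attribute
k = 4                                      # number of bins

# Equal-width: each bin spans the same interval size.
lo, hi = min(values), max(values)
width = (hi - lo) / k
eq_width = [min(int((v - lo) / width), k - 1) for v in values]

# Equal-frequency: each bin gets about the same number of tuples.
n = len(values)
ranked = sorted(values)
eq_freq = [min(ranked.index(v) * k // n, k - 1) for v in values]

print(eq_width)  # bin index assigned to each value
print(eq_freq)
```

Note how the skew in the data pushes most values into the first equal-width bin, while equal-frequency binning spreads them evenly; this is why the choice of strategy affects which intervals ever become frequent.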
Finding
frequent predicate sets: Once the 2-D array containing the
count distribution for each category is set up, it can be scanned to find the
frequent predicate sets (those satisfying minimum support) that also satisfy
minimum confidence. Strong association rules can then be generated from these
predicate sets, using a rule generation algorithm.
Clustering
the association rules: The strong association rules
obtained in the previous step are then mapped to a 2-D grid. Figure 5.14 shows
a 2-D grid for 2-D quantitative association rules predicting the condition buys(X,
“HDTV”) on the rule right-hand side, given the quantitative attributes age
and income. The four Xs correspond to the rules.
Fig. A 2-D grid for tuples representing customers who purchase high-definition TVs.
ARCS
employs a clustering algorithm for this purpose. The algorithm scans the grid,
searching for rectangular clusters of rules.
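A much-simplified sketch of this clustering step is shown below: strong rules are marked as cells on an age × income grid, adjacent cells in each row are merged into runs, and vertically adjacent runs with the same column span are stacked into rectangles. The grid contents are illustrative, and real ARCS clustering is more general than this.

```python
# 1 marks a bin pair whose rule met minimum support and confidence.
grid = [
    [0, 1, 1, 0],   # rows: income bins
    [0, 1, 1, 0],   # cols: age bins
    [0, 0, 0, 0],
]

def row_runs(grid):
    """Merge adjacent strong cells in each row into (row, start, end) runs."""
    runs = []
    for r, row in enumerate(grid):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                runs.append((r, start, c - 1))
            else:
                c += 1
    return runs

def merge_rows(runs):
    """Stack a run onto a rectangle directly above it with the same span."""
    rects = []  # [top_row, bottom_row, left_col, right_col]
    for r, start, end in runs:
        for rect in rects:
            if rect[1] == r - 1 and rect[2] == start and rect[3] == end:
                rect[1] = r  # extend the rectangle downward
                break
        else:
            rects.append([r, r, start, end])
    return [tuple(rect) for rect in rects]

print(merge_rows(row_runs(grid)))  # one rectangle covering the four cells
```

Each resulting rectangle corresponds to a single clustered rule whose age and income intervals span the rectangle's column and row ranges.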
The grid-based technique described here assumes that
the initial association rules can be clustered into rectangular regions. Before
performing the clustering, smoothing techniques can be used to help remove
noise and outliers from the data. Rectangular clusters may oversimplify the
data.