Associative Classification: Classification by Association Rule Analysis
Frequent
patterns and their corresponding association or correlation rules characterize
interesting relationships between attribute conditions and class labels, and
thus have been recently used for effective classification. Association rules
show strong associations between attribute-value pairs (or items) that occur frequently in a
given data set. Association rules are commonly used to analyze the purchasing
patterns of customers in a store. Such analysis is useful in many
decision-making processes, such as product placement, catalog design, and
cross-marketing. The discovery of association rules is based on
frequent itemset mining. Many methods for frequent itemset mining and the
generation of association rules were described in Chapter 5. In this section,
we look at associative classification, where association rules are generated
and analyzed for use in classification. The general idea is that we can search
for strong associations between frequent patterns (conjunctions of
attribute-value pairs) and class labels.
Association
rules are mined in a two-step process consisting of frequent itemset mining, followed by rule generation. The first step searches for patterns of
attribute-value pairs that occur repeatedly in a data set, where each
attribute-value pair is considered an item.
The resulting attribute-value pairs form frequent
itemsets. The second step analyzes the frequent itemsets in order to
generate association rules. All association rules must satisfy certain criteria
regarding their “accuracy” (or confidence)
and the proportion of the data set that they actually represent (referred to as
support). For example, the
following is an association rule mined from a data set, D, shown with its confidence and support.
age = youth ∧ credit = OK ⇒ buys computer = yes [support = 20%, confidence = 93%]
where “∧” represents a logical “AND.” We will say more about confidence and support in a minute.
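To make the two-step process concrete, the sketch below mines frequent itemsets and then derives association rules from a tiny one-hot-encoded table. It assumes the third-party mlxtend library is installed; the toy data and thresholds are illustrative, not taken from the text.

```python
# Minimal sketch of the two-step process, assuming the mlxtend library is available.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded attribute-value pairs ("items") per tuple (toy data)
data = pd.DataFrame({
    "age=youth":          [1, 1, 0, 1, 1],
    "credit=OK":          [1, 1, 1, 0, 1],
    "buys_computer=yes":  [1, 1, 0, 0, 1],
}).astype(bool)

# Step 1: search for itemsets that meet a minimum support threshold
frequent = apriori(data, min_support=0.4, use_colnames=True)

# Step 2: generate association rules that meet a minimum confidence threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```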
More formally, let D be a data set of tuples. Each tuple in D is described by n attributes, A1, A2, ..., An, and a class label attribute, Aclass. All continuous attributes are
discretized and treated as categorical attributes. An item, p, is an attribute-value pair of the
form (Ai, v), where Ai is an attribute taking a value, v. A data tuple X =
(x1, x2, ..., xn)
satisfies an item, p = (Ai, v), if and only if xi =
v, where xi is the value of the ith attribute of X. Association rules can have any
number of items in the rule antecedent (left-hand side) and any number of items
in the rule consequent (right-hand side). However, when mining association
rules for use in classification, we are only interested in association rules of
the form p1 ∧ p2 ∧ ... ∧ pl ⇒ Aclass = C, where the rule antecedent is a
conjunction of items, p1, p2, ..., pl (l ≤ n), associated with a class label, C. For a given rule, R, the percentage of tuples in D satisfying the rule antecedent that
also have the class label C is
called the confidence of R.
From a classification point of view, this is akin to rule accuracy. For
example, a confidence of 93% for the association rule above means that 93% of the
customers in D who are young
and have an OK credit rating belong to the class buys computer = yes. The percentage of tuples in D satisfying the rule antecedent and
having class label C is called
the support of R. A support of
20% for the association rule above means that 20% of the customers in D are young, have an OK credit
rating, and belong to the class buys
computer = yes.
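Following these definitions, a rule's support and confidence can be computed by direct counting over D. The short sketch below does this for a hypothetical class association rule over tuples stored as dictionaries; the attribute names, tuples, and rule are illustrative only.

```python
# Sketch: support and confidence of a class association rule
# p1 AND p2 AND ... => class = C, computed by counting over the data set D.
# Tuples, attribute names, and the example rule are hypothetical.

D = [
    {"age": "youth",  "credit": "OK",   "class": "yes"},
    {"age": "youth",  "credit": "OK",   "class": "yes"},
    {"age": "youth",  "credit": "fair", "class": "no"},
    {"age": "senior", "credit": "OK",   "class": "yes"},
    {"age": "youth",  "credit": "OK",   "class": "no"},
]

antecedent = {"age": "youth", "credit": "OK"}   # conjunction of items (Ai, v)
consequent = "yes"                              # class label C

matches_antecedent = [t for t in D if all(t[a] == v for a, v in antecedent.items())]
matches_both = [t for t in matches_antecedent if t["class"] == consequent]

support = len(matches_both) / len(D)                      # tuples satisfying antecedent AND class
confidence = len(matches_both) / len(matches_antecedent)  # of antecedent matches, fraction with class C
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```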
CMAR
(Classification based on Multiple Association Rules) differs from CBA (Classification Based on Associations) in its
strategy for frequent itemset mining and its construction of the classifier. It
also employs several rule pruning strategies with the help of a tree structure
for efficient storage and retrieval of rules. CMAR adopts a variant of the FP-growth algorithm to find the complete
set of rules satisfying the minimum confidence and minimum support
thresholds. FP-growth uses a tree structure, called an FP-tree, to register all of the frequent itemset information
contained in the given data set, D.
This requires only two scans of D.
The frequent itemsets are then mined from the FP-tree. CMAR uses an enhanced
FP-tree that maintains the distribution of class labels among tuples satisfying
each frequent itemset. In this way, it is able to combine rule generation
together with frequent itemset mining in a single step.
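CMAR's enhanced FP-tree keeps a class-label distribution at each node so that class association rules can be read off during mining. The fragment below is a simplified sketch of that idea (node structure and tuple insertion only, not the full FP-growth mining); the class names and fields are assumptions, not CMAR's actual implementation.

```python
# Simplified sketch of an FP-tree node augmented with per-class counts,
# in the spirit of CMAR's enhanced FP-tree (not the actual implementation).
from collections import defaultdict

class CRNode:
    def __init__(self, item, parent=None):
        self.item = item                       # attribute-value pair, e.g. ("age", "youth")
        self.parent = parent
        self.children = {}                     # item -> CRNode
        self.class_counts = defaultdict(int)   # class label -> count of tuples passing through

def insert(root, sorted_items, class_label):
    """Insert one training tuple's frequent items (already support-ordered) into the tree."""
    node = root
    for item in sorted_items:
        if item not in node.children:
            node.children[item] = CRNode(item, parent=node)
        node = node.children[item]
        node.class_counts[class_label] += 1    # maintain the class distribution along each path
```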
CMAR employs
another tree structure to store and retrieve rules efficiently and to prune
rules based on confidence, correlation, and database coverage. Rule pruning
strategies are triggered whenever a rule is inserted into the tree. For
example, given two rules, R1
and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is
pruned. The rationale is that highly specialized rules with low confidence can
be pruned if a more generalized version with higher confidence exists. CMAR
also prunes
rules for which the rule antecedent and class are not positively correlated,
based on a χ2 test of statistical significance.
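Both pruning tests can be expressed in a few lines: the generality test compares antecedent item sets and confidences, and the correlation test applies a chi-square test to the 2x2 table of antecedent satisfaction versus class label. The sketch below is illustrative only; it uses scipy's chi2_contingency and hypothetical rule dictionaries rather than CMAR's rule tree.

```python
# Sketch of two CMAR-style pruning checks (illustrative, not CMAR's code).
from scipy.stats import chi2_contingency

def prune_by_generality(r1, r2):
    """Prune r2 if r1 is more general (its antecedent is a subset of r2's)
    and r1's confidence is at least as high."""
    return (set(r1["antecedent"]) <= set(r2["antecedent"])
            and r1["confidence"] >= r2["confidence"])

def positively_correlated(n_ante_class, n_ante_other, n_rest_class, n_rest_other, alpha=0.05):
    """Chi-square test on the 2x2 table of
    (antecedent satisfied / not satisfied) x (class label C / other labels)."""
    table = [[n_ante_class, n_ante_other],
             [n_rest_class, n_rest_other]]
    chi2, p_value, _, expected = chi2_contingency(table)
    # significant AND observed co-occurrence above its expected value => positive correlation
    return p_value < alpha and n_ante_class > expected[0][0]

# Usage: a rule whose antecedent and class co-occur far more often than chance is kept;
# rules failing positively_correlated(...) would be pruned.
```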