
Wednesday, September 26, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES--Associative Classification


Associative Classification: Classification by Association Rule Analysis

 

Frequent patterns and their corresponding association or correlation rules characterize interesting relationships between attribute conditions and class labels, and thus have recently been used for effective classification. Association rules show strong associations between attribute-value pairs (or items) that occur frequently in a given data set. Association rules are commonly used to analyze the purchasing patterns of customers in a store. Such analysis is useful in many decision-making processes, such as product placement, catalog design, and cross-marketing. The discovery of association rules is based on frequent itemset mining. Many methods for frequent itemset mining and the generation of association rules were described in Chapter 5. In this section, we look at associative classification, where association rules are generated and analyzed for use in classification. The general idea is that we can search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels.

 

Association rules are mined in a two-step process consisting of frequent itemset mining, followed by rule generation. The first step searches for patterns of attribute-value pairs that occur repeatedly in a data set, where each attribute-value pair is considered an item. The resulting attribute-value pairs form frequent itemsets. The second step analyzes the frequent itemsets in order to generate association rules. All association rules must satisfy certain criteria regarding their “accuracy” (or confidence) and the proportion of the data set that they actually represent (referred to as support). For example, the following is an association rule mined from a data set, D, shown with its confidence and support.
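
To make the two steps concrete, the following is a minimal brute-force sketch in Python. It enumerates candidate itemsets directly rather than using an efficient miner such as Apriori or FP-growth, and the toy data set, attribute names, and thresholds are invented purely for illustration.

from itertools import combinations

# Toy data set D: each tuple is a dictionary of attribute-value pairs plus a class label.
D = [
    {"age": "youth",  "credit": "OK",   "class": "yes"},
    {"age": "youth",  "credit": "OK",   "class": "yes"},
    {"age": "youth",  "credit": "fair", "class": "no"},
    {"age": "senior", "credit": "OK",   "class": "yes"},
    {"age": "senior", "credit": "fair", "class": "no"},
]
MIN_SUP, MIN_CONF = 0.2, 0.6  # illustrative thresholds

def satisfies(t, itemset):
    # A tuple satisfies an itemset if it matches every (attribute, value) item in it.
    return all(t.get(a) == v for a, v in itemset)

# Step 1: find frequent itemsets of attribute-value pairs (brute force).
items = {(a, v) for t in D for a, v in t.items() if a != "class"}
frequent = []
for k in range(1, len(items) + 1):
    for itemset in combinations(sorted(items), k):
        sup = sum(satisfies(t, itemset) for t in D) / len(D)
        if sup >= MIN_SUP:
            frequent.append(itemset)

# Step 2: generate class-association rules "itemset => class = C" from the frequent itemsets.
for itemset in frequent:
    covered = [t for t in D if satisfies(t, itemset)]
    for c in sorted({t["class"] for t in covered}):
        conf = sum(t["class"] == c for t in covered) / len(covered)
        rule_sup = sum(satisfies(t, itemset) and t["class"] == c for t in D) / len(D)
        if conf >= MIN_CONF and rule_sup >= MIN_SUP:
            print(itemset, "=> class =", c, f"[sup = {rule_sup:.0%}, conf = {conf:.0%}]")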

 

age = youth ^ credit = OK ⇒ buys_computer = yes [support = 20%, confidence = 93%]

 

where “^” represents a logical “AND.” We will say more about confidence and support in a minute.

 

More formally, let D be a data set of tuples. Each tuple in D is described by n attributes, A1, A2, ..., An, and a class label attribute, Aclass. All continuous attributes are discretized and treated as categorical attributes. An item, p, is an attribute-value pair of the form (Ai, v), where Ai is an attribute taking a value, v. A data tuple X = (x1, x2, ..., xn) satisfies an item, p = (Ai, v), if and only if xi = v, where xi is the value of the ith attribute of X. Association rules can have any number of items in the rule antecedent (left-hand side) and any number of items in the rule consequent (right-hand side). However, when mining association rules for use in classification, we are only interested in association rules of the form p1 ^ p2 ^ ... ^ pl ⇒ Aclass = C, where the rule antecedent is a conjunction of items, p1, p2, ..., pl (l ≤ n), associated with a class label, C.

For a given rule, R, the percentage of tuples in D satisfying the rule antecedent that also have the class label C is called the confidence of R. From a classification point of view, this is akin to rule accuracy. For example, a confidence of 93% for the association rule above means that 93% of the customers in D who are young and have an OK credit rating belong to the class buys_computer = yes. The percentage of tuples in D satisfying the rule antecedent and having class label C is called the support of R. A support of 20% for that rule means that 20% of the customers in D are young, have an OK credit rating, and belong to the class buys_computer = yes.
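
Restating these definitions as code, the short Python sketch below computes the confidence and support of a single class-association rule over a toy data set; the data, attribute names, and the rule itself are hypothetical and only mirror the example above, so the resulting percentages will not match the 93% and 20% quoted there.

# Toy data set; attribute names and values are illustrative only.
D = [
    {"age": "youth",  "credit": "OK",   "buys_computer": "yes"},
    {"age": "youth",  "credit": "OK",   "buys_computer": "yes"},
    {"age": "youth",  "credit": "fair", "buys_computer": "no"},
    {"age": "senior", "credit": "OK",   "buys_computer": "no"},
    {"age": "senior", "credit": "fair", "buys_computer": "yes"},
]

def rule_stats(D, antecedent, class_attr, class_value):
    # Tuples satisfying every item (Ai, v) in the rule antecedent.
    covered = [t for t in D if all(t.get(a) == v for a, v in antecedent)]
    both = [t for t in covered if t[class_attr] == class_value]
    confidence = len(both) / len(covered) if covered else 0.0  # rule "accuracy"
    support = len(both) / len(D)                               # fraction of D covered by the whole rule
    return confidence, support

# Rule R: age = youth ^ credit = OK  =>  buys_computer = yes
conf, sup = rule_stats(D, [("age", "youth"), ("credit", "OK")], "buys_computer", "yes")
print(f"confidence = {conf:.0%}, support = {sup:.0%}")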

 

CMAR (Classification based on Multiple Association Rules) differs from CBA (Classification Based on Associations) in its strategy for frequent itemset mining and its construction of the classifier. It also employs several rule pruning strategies with the help of a tree structure for efficient storage and retrieval of rules. CMAR adopts a variant of the FP-growth algorithm to find the complete set of rules satisfying the minimum confidence and minimum support thresholds. FP-growth uses a tree structure, called an FP-tree, to register all of the frequent itemset information contained in the given data set, D. This requires only two scans of D. The frequent itemsets are then mined from the FP-tree. CMAR uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset. In this way, it is able to combine rule generation with frequent itemset mining in a single step.
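
As a rough illustration of the enhanced FP-tree idea, the sketch below shows a node structure that keeps, alongside the usual item and count, the distribution of class labels among the tuples whose frequency-ordered items pass through that node. It is not the full FP-growth or CMAR algorithm, and the items and labels are made up.

from collections import defaultdict

class ClassFPNode:
    # CMAR-style FP-tree node: item, count, children, and a per-node class label distribution.
    def __init__(self, item=None, parent=None):
        self.item = item
        self.parent = parent
        self.children = {}                    # item -> ClassFPNode
        self.count = 0
        self.class_counts = defaultdict(int)  # class label -> number of tuples

def insert(root, sorted_items, class_label):
    # Insert one tuple's frequent items (already sorted by global frequency),
    # updating counts and class distributions along the path.
    node = root
    for item in sorted_items:
        node = node.children.setdefault(item, ClassFPNode(item, node))
        node.count += 1
        node.class_counts[class_label] += 1

root = ClassFPNode()
insert(root, [("age", "youth"), ("credit", "OK")], "yes")
insert(root, [("age", "youth")], "no")
child = root.children[("age", "youth")]
print(child.count, dict(child.class_counts))  # 2 {'yes': 1, 'no': 1}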

 

CMAR employs another tree structure to store and retrieve rules efficiently and to prune rules based on confidence, correlation, and database coverage. Rule pruning strategies are triggered whenever a rule is inserted into the tree. For example, given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned. The rationale is that highly specialized rules with low confidence can be pruned if a more generalized version with higher confidence exists. CMAR also prunes rules for which the rule antecedent and class are not positively correlated, based on a χ² test of statistical significance.
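
Both pruning checks can be sketched as follows. The first function applies the generality/confidence rule stated above; the second runs a chi-square test on a 2x2 contingency table of (antecedent satisfied, class label). The rule dictionaries, the counts, and the use of scipy are illustrative assumptions, not CMAR's actual data structures.

from scipy.stats import chi2_contingency

def more_general(r1, r2):
    # R1 is more general than R2 if R1's antecedent is a proper subset of R2's.
    return set(r1["antecedent"]) < set(r2["antecedent"])

def prune_by_generality(rules):
    # Drop a rule R2 whenever a more general rule R1 with conf(R1) >= conf(R2) exists.
    return [r2 for r2 in rules
            if not any(more_general(r1, r2) and r1["conf"] >= r2["conf"] for r1 in rules)]

def positively_correlated(n_ac, n_anc, n_nac, n_nanc, alpha=0.05):
    # Chi-square test on the 2x2 table of (antecedent satisfied, class label);
    # the counts would come from the training data and are hypothetical here.
    table = [[n_ac, n_anc],    # antecedent satisfied:     class C / not class C
             [n_nac, n_nanc]]  # antecedent not satisfied: class C / not class C
    chi2, p, _, expected = chi2_contingency(table)
    return p < alpha and n_ac > expected[0][0]  # significant and observed above expected

# Example with made-up rules and counts.
rules = [
    {"antecedent": [("age", "youth")], "conf": 0.80},
    {"antecedent": [("age", "youth"), ("credit", "OK")], "conf": 0.75},
]
print(len(prune_by_generality(rules)))        # 1: the specialized, lower-confidence rule is pruned
print(positively_correlated(40, 10, 20, 30))  # True for these counts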
