
Tuesday, September 25, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES--Association Rule Mining


 ASSOCIATION RULE MINING:


Association Rule Mining, Single-Dimensional Boolean Association Rules from Transactional Databases, Multi-Level Association Rules from Transaction Databases

3.1. ASSOCIATION RULE MINING:

Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently. Frequent pattern mining searches for recurring relationships in a given data set.

For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.

A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.

A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.

3.1.1. Market Basket Analysis: A Motivating Example

Market basket analysis is just one form of frequent pattern mining. Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets.

The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis.

A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.

The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.

 

For example,

Suppose, as manager of an AllElectronics branch, you would like to learn more about the buying habits of your customers. Market basket analysis can help you plan which items to put on sale at reduced prices: if customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules.

For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented by the association rule below:

buys(X, "computer") ⇒ buys(X, "antivirus software") [support = 2%, confidence = 60%]

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules.

A support of 2% for the rule above means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.

A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts. Additional analysis can be performed to uncover interesting statistical correlations between associated items.
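To make these two measures concrete, here is a minimal Python sketch that computes support and confidence over a small, made-up transaction list (the item names and transactions below are illustrative only, not the AllElectronics data used later):

```python
# Minimal sketch: computing support and confidence of a rule A => B
# over a small, made-up list of market-basket transactions.

transactions = [
    {"computer", "antivirus_software", "printer"},
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"milk", "bread"},
    {"computer", "antivirus_software", "memory_card"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(A, B, transactions):
    """P(B | A) = support(A union B) / support(A)."""
    return support(A | B, transactions) / support(A, transactions)

A, B = {"computer"}, {"antivirus_software"}
print(f"support    = {support(A | B, transactions):.0%}")    # 60%
print(f"confidence = {confidence(A, B, transactions):.0%}")  # 75%
```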

3.1.2. Frequent Itemsets, Closed Itemsets, and Association Rules :

Let I = {I1, I2, ..., Im} be a set of items.

Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I.

Each transaction is associated with an identifier, called a TID.

Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T.

An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, both A and B). This is taken to be the probability P(A ∪ B).

The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A). That is,

support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B|A)

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, we write support and confidence values as percentages between 0% and 100%, rather than as numbers between 0 and 1.0.

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset.

The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the frequency, support count, or count of the itemset.

Note that the itemset support defined in the support equation above is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support.

If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.

From the definition of confidence above, we have:

confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)

This equation shows that the confidence of rule A ⇒ B can be easily derived from the support counts of A and A ∪ B. Thus the problem of mining association rules can be reduced to that of mining frequent itemsets.

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

A major challenge in mining frequent itemsets from a large data set is that such mining often generates a huge number of itemsets satisfying the minimum support (min_sup) threshold, especially when min_sup is set low. This is because if an itemset is frequent, each of its subsets is frequent as well. A long itemset will contain a combinatorial number of shorter, frequent sub-itemsets. The total number of frequent sub-itemsets contained in an itemset of length k is thus

(k choose 1) + (k choose 2) + ... + (k choose k) = 2^k − 1.

For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains 2^100 − 1 ≈ 1.27 × 10^30 frequent sub-itemsets.

Closed Itemset :

An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S.

3.1.3. Frequent Pattern Mining: A Road Map

Market basket analysis is just one form of frequent pattern mining. Frequent pattern mining can be classified in various ways, based on the following criteria:

Ø  Based on the completeness of patterns to be mined:

§  Complete set of frequent itemsets,

§  Closed frequent itemsets,

§  Maximal frequent itemsets,

§  Constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints),

§  Approximate frequent itemsets (i.e., those that derive only approximate support counts for the mined frequent itemsets), and

§  Near-match frequent itemsets (i.e., those that tally the support count of the near or almost matching itemsets).

Ø  Based on the levels of abstraction involved in the rule set:

Some methods for association rule mining can find rules at differing levels of abstraction. For example, the rule buys(X, "computer") ⇒ buys(X, "HP printer") refers to items at a higher level of abstraction than the rule buys(X, "laptop computer") ⇒ buys(X, "HP printer"). If the rules mined span different levels of abstraction, the rule set is a multilevel association rule set; otherwise it is a single-level association rule set.

Ø  Based on the number of data dimensions involved in the rule:

An association rule that references only one dimension is a single-dimensional association rule. For example, buys(X, "computer") ⇒ buys(X, "antivirus software") refers only to the dimension buys.

If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. For example, age(X, "30...39") ∧ income(X, "42K...48K") ⇒ buys(X, "high resolution TV") involves three dimensions.

Ø  Based on the types of values handled in the rule:

If a rule involves associations between the presence or absence of items, it is a Boolean association rule.

If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule.

Ø  Based on the kinds of rules to be mined:

Frequent pattern analysis can generate various kinds of rules and other interesting relationships. Association rules are the most popular kind of rules generated from frequent patterns.

The discovered associations can be further analyzed to uncover statistical correlations, leading to correlation rules.

Ø  Based on the kinds of patterns to be mined:

ü  Frequent itemset mining, that is, the mining of frequent itemsets (sets of items) from transactional or relational data sets.

ü  Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events.

ü  Structured pattern mining searches for frequent substructures in a structured data set. Notice that structure is a general concept that covers many different kinds of structural forms, such as graphs, lattices, trees, sequences, sets, single items, or combinations of such structures.

3.2. Efficient and Scalable Frequent Itemset Mining Methods :

Apriori is the basic algorithm for finding frequent itemsets.

3.2.1. The APRIORI ALGORITHM: Finding Frequent Itemsets Using Candidate Generation :

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property, presented below, is used to reduce the search space.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent; that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either; that is, P(I ∪ A) < min_sup.

This property belongs to a special category of properties called antimonotone in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called antimonotone because the property is monotonic in the context of failing a test .

A two-step process is followed, consisting of join and prune actions.

1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second to the last item in l1). By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k-1)-itemset li, this means that the items are sorted such that li[1] < li[2] < ... < li[k-1]. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}.

2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation.
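As a rough illustration, the join and prune steps can be sketched in Python as follows. Itemsets are represented as sorted tuples, which stands in for the lexicographic-ordering convention, and the L2 used at the end anticipates the AllElectronics walk-through below:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate the candidate k-itemsets Ck from the frequent (k-1)-itemsets
    L_prev. Itemsets are sorted tuples. Join: two (k-1)-itemsets are joined
    when their first k-2 items agree and the last item of the first is smaller
    than the last item of the second. Prune: drop any candidate that has an
    infrequent (k-1)-subset (the Apriori property)."""
    L_prev = sorted(L_prev)
    frequent = set(L_prev)
    candidates = []
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            l1, l2 = L_prev[i], L_prev[j]
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                c = l1 + (l2[k - 2],)
                # Prune step: every (k-1)-subset of c must already be frequent.
                if all(s in frequent for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# L2 from the AllElectronics walk-through below:
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```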

For example,

Transactional data for an AllElectronics branch:

TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

There are nine transactions in this database, that is, |D| = 9. We use this table to illustrate the Apriori algorithm for finding frequent itemsets in D.

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.

2. Suppose that the minimum support count required is 2, that is, min_sup = 2. (Here, we are referring to absolute support because we are using a support count. The corresponding relative support is 2/9 = 22%.) The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. C2 consists of C(|L1|, 2) 2-itemsets. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.

4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

6. The generation of the set of candidate 3-itemsets, C3, proceeds as follows. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3. Note that when given a candidate k-itemset, we only need to check if its (k-1)-subsets are frequent since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is {{I1, I2, I3}, {I1, I2, I5}}.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.

Pseudo Code for Apriori Algorithm :

Step 1 of Apriori finds the frequent 1-itemsets, L1. In steps 2 to 10, Lk-1 is used to generate candidates Ck in order to find Lk for k ≥ 2. The apriori_gen procedure generates the candidates and then uses the Apriori property to eliminate those having a subset that is not frequent (step 3). This procedure is described below. Once all of the candidates have been generated, the database is scanned (step 4). For each transaction, a subset function is used to find all subsets of the transaction that are candidates (step 5), and the count for each of these candidates is accumulated (steps 6 and 7). Finally, all of those candidates satisfying minimum support (step 9) form the set of frequent itemsets, L (step 11). A procedure can then be called to generate association rules from the frequent itemsets.

The apriori_gen procedure performs two kinds of actions, namely, join and prune, as described above. In the join component, Lk-1 is joined with Lk-1 to generate potential candidates (steps 1 to 4). The prune component (steps 5 to 7) employs the Apriori property to remove candidates that have a subset that is not frequent. The test for infrequent subsets is performed by the procedure has_infrequent_subset.
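Since the pseudocode figure is not reproduced here, the following is a compact Python sketch of the same level-wise loop. It assumes the apriori_gen sketch shown earlier is in scope, and it simply tests each candidate against each transaction rather than using a dedicated subset function:

```python
from collections import defaultdict

def apriori(transactions, min_sup_count):
    """Level-wise Apriori over a list of transactions (each a set of items).
    Returns a dict mapping every frequent itemset (a sorted tuple) to its
    support count. Relies on apriori_gen() from the earlier sketch."""
    # Step 1: find the frequent 1-itemsets, L1.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[(item,)] += 1
    L = {c: n for c, n in counts.items() if n >= min_sup_count}
    all_frequent = dict(L)

    # Steps 2-10: use Lk-1 to generate Ck, scan D, and keep the frequent ones.
    k = 2
    while L:
        Ck = apriori_gen(list(L), k)            # join + prune
        counts = defaultdict(int)
        for t in transactions:                  # one full scan per level
            for c in Ck:
                if set(c) <= t:                 # candidate contained in t?
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup_count}
        all_frequent.update(L)
        k += 1
    return all_frequent                         # step 11: union of all Lk

# The nine AllElectronics transactions from the table above:
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
frequent = apriori(D, min_sup_count=2)
print(frequent[("I1", "I2", "I5")])   # 2
```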

3.2.2. Generating Association Rules from Frequent Itemsets :

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them, using the confidence equation

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

The conditional probability is expressed in terms of itemset support counts, where support_count(A ∪ B) is the number of transactions containing the itemsets A ∪ B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows:

For each frequent itemset l, generate all nonempty subsets of l.

For every nonempty proper subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Generating association rules. Let’s try an example based on the transactional data for AllElectronics shown above. Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its confidence:

I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%

I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%

I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%

I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%

I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%

I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong. Note that, unlike conventional classification rules, association rules can contain more than one conjunct in the right-hand side of the rule.
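A minimal Python sketch of this rule-generation step; it assumes a dictionary mapping frequent itemsets (sorted tuples) to their support counts, such as the one returned by the Apriori sketch earlier. Note that every nonempty subset s of a frequent itemset is itself frequent, so its count is always available:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: dict mapping sorted item tuples to support counts.
    Yields (antecedent, consequent, confidence) for every strong rule."""
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):               # every nonempty proper subset s
            for s in combinations(l, r):
                conf = count_l / frequent[s]     # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rest = tuple(i for i in l if i not in s)
                    yield s, rest, conf

# With the AllElectronics counts, l = ("I1", "I2", "I5") has count 2 and
# frequent[("I1", "I5")] = 2, so the rule I1 ^ I5 => I2 has confidence 100%.
for antecedent, consequent, conf in generate_rules(frequent, min_conf=0.70):
    print(antecedent, "=>", consequent, f"{conf:.0%}")
```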

3.2.3. Mining Frequent Itemsets without Candidate Generation :

The Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain. However, it can suffer from two nontrivial costs:

ü  It may need to generate a huge number of candidate sets.

ü  It may need to repeatedly scan the database and check a large set of candidates by pattern matching.

An interesting method in this attempt is called frequent-pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy as follows. First, it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or “pattern fragment,” and mines each such database separately. You’ll see how it works with the following example.

The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts (frequencies). Let the minimum support count be 2. The set of frequent items is sorted in descending order of support count. The resulting set or list is denoted L. Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.

An FP-tree is then constructed as follows. First, create the root of the tree, labeled with “null.” Scan database D a second time. The items in each transaction are processed in L order (i.e., sorted according to descending support count), and a branch is created for each transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩, where I2 is linked as a child of the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing path for T100. Therefore, we instead increment the count of the I2 node by 1 and create a new node, ⟨I4: 1⟩, which is linked as a child of ⟨I2: 2⟩. In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.

To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links. The tree obtained after scanning all of the transactions is shown in the figure below with the associated node-links. In this way, the problem of mining frequent patterns in databases is transformed to that of mining the FP-tree.



Fig. An FP-tree registers compressed, frequent pattern information.
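The construction just described can be sketched in Python roughly as follows (a simplified illustration; conditional-FP-tree mining is omitted, and the class and field names are made up for this sketch):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                       # item -> FPNode

def build_fp_tree(transactions, min_sup_count):
    # First scan: support counts of single items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # Frequent items in descending support-count order (the list L).
    L = [i for i, c in sorted(counts.items(), key=lambda x: -x[1])
         if c >= min_sup_count]
    rank = {item: r for r, item in enumerate(L)}

    root = FPNode(None, None)                    # the "null" root
    header = defaultdict(list)                   # item -> node-links
    # Second scan: insert each transaction with its items in L order.
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        node = root
        for item in items:
            if item not in node.children:        # extend the branch
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1                      # shared prefixes are incremented
    return root, header, L
```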



The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial suffix pattern), construct its conditional pattern base (a “sub database,” which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree, and perform mining recursively on such a tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
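Continuing the sketch above, collecting the conditional pattern base for a suffix item amounts to following that item's node-links and walking each occurrence up toward the root (this assumes the FPNode and header structures from the previous sketch):

```python
def conditional_pattern_base(item, header):
    """Collect (prefix_path, count) pairs for `item` by following its
    node-links and walking each occurrence up toward the root.
    Assumes the FPNode/header structures from the construction sketch."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# Example: the conditional pattern base of I5 would contain prefix paths
# such as (['I2', 'I1'], 1) and (['I2', 'I1', 'I3'], 1).
```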



 

3.2.4. Improving the Efficiency of Apriori :

ü  Hash-based technique (hashing itemsets into corresponding buckets): A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.

For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts.

 



A 2-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and thus should be removed from the candidate set.

Such a hash-based technique may substantially reduce the number of the candidate k-itemsets examined (especially when k = 2).
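A rough Python sketch of this idea; the hash function and the number of buckets are purely illustrative choices, not part of the original technique's specification:

```python
from itertools import combinations

def hash_prune_2itemsets(transactions, min_sup_count, n_buckets=7):
    """While scanning the transactions, hash every 2-itemset into a bucket and
    count it. Any 2-itemset whose bucket count stays below min_sup_count
    cannot be frequent, so it can be dropped from C2."""
    buckets = [0] * n_buckets

    def h(pair):                                 # illustrative hash function
        return hash(frozenset(pair)) % n_buckets

    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[h(pair)] += 1

    def may_be_frequent(pair):
        return buckets[h(pair)] >= min_sup_count

    return may_be_frequent
```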

ü  Transaction reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration because subsequent scans of the database for j-itemsets, where j > k, will not require it.

ü  Partitioning (partitioning the data to find candidate itemsets): A partitioning technique can be used that requires just two database scans to mine the frequent itemsets.

It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition.

·         local frequent itemsets :

All frequent itemsets within the partition are found. These are referred to as local frequent itemsets.

Any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions.

·         Global candidate itemsets :

The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D.

In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.



Fig. Mining by partitioning the data.
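The two-phase procedure might be sketched in Python as follows; the local miner is passed in as a parameter (for instance, the apriori() sketch from earlier), which is an assumption made for this illustration:

```python
import math
from collections import defaultdict

def partition_mine(transactions, min_sup, n_partitions, local_miner):
    """Two-phase partitioning. `min_sup` is the relative minimum support and
    `local_miner(partition, min_sup_count)` returns that partition's local
    frequent itemsets (for example, the apriori() sketch shown earlier)."""
    size = math.ceil(len(transactions) / n_partitions)
    partitions = [transactions[i:i + size]
                  for i in range(0, len(transactions), size)]

    # Phase I: local frequent itemsets per partition form the global candidates.
    candidates = set()
    for part in partitions:
        local_min_count = math.ceil(min_sup * len(part))
        candidates |= set(local_miner(part, local_min_count))

    # Phase II: one more full scan of D to obtain the actual support counts.
    counts = defaultdict(int)
    for t in transactions:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    global_min_count = math.ceil(min_sup * len(transactions))
    return {c: n for c, n in counts.items() if n >= global_min_count}
```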

ü  Sampling (mining on a subset of the given data): The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.

ü  Dynamic itemset counting (adding candidate itemsets at different points during a scan): A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points.

3.2.5. Mining Frequent Itemsets Using Vertical Data Format :

Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in TID-itemset format (that is, {TID : itemset} ), where TID is a transaction-id and itemset is the set of items bought in transaction TID. This data format is known as horizontal data format.

Alternatively, data can also be presented in item-TID set format (that is, {item : TID set}), where item is an item name, and TID set is the set of transaction identifiers containing the item. This format is known as vertical data format.



The vertical data format of the transaction data set D from the table above:

itemset   TID set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}

Mining can be performed on this data set by intersecting the TID sets of every pair of frequent single items. The minimum support count is 2.



Fig. The 2-itemsets in vertical data format.



Fig. The 3-itemsets in vertical data format

To further reduce the cost of registering long TID sets, as well as the subsequent costs of intersections, we can use a technique called diffset, which keeps track of only the differences of the TID sets of a (k+1)-itemset and a corresponding k-itemset. For instance, in the example above we have {I1} = {T100, T400, T500, T700, T800, T900} and {I1, I2} = {T100, T400, T800, T900}.

The diffset between the two is diffset({I1, I2}, {I1}) = {T500, T700}. Thus, rather than recording the four TIDs that make up the intersection of {I1} and {I2}, we can instead use the diffset to record just the two TIDs that indicate the difference between {I1} and {I1, I2}. Experiments show that in certain situations, such as when the data set contains many dense and long patterns, this technique can substantially reduce the total cost of vertical-format mining of frequent itemsets.
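In code, vertical-format mining reduces to set intersections on TID sets, and a diffset is just a set difference; the following minimal Python sketch uses the TID sets quoted above:

```python
# Minimal sketch: vertical-format mining by TID-set intersection, plus a diffset.
tidsets = {
    ("I1",): {"T100", "T400", "T500", "T700", "T800", "T900"},
    ("I2",): {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
}

# Support count of {I1, I2} = size of the intersection of the two TID sets.
tids_I1_I2 = tidsets[("I1",)] & tidsets[("I2",)]
print(sorted(tids_I1_I2))                 # ['T100', 'T400', 'T800', 'T900']

# Diffset: record only what drops out when extending {I1} to {I1, I2}.
diffset = tidsets[("I1",)] - tids_I1_I2
print(sorted(diffset))                    # ['T500', 'T700']
```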

3.2.6. Mining Closed Frequent Itemsets :

Pruning strategies include the following:

·         Item merging: If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y.

·         Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X’s descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.

·         Item skipping: In the depth-first mining of closed itemsets, at each level, there will be a prefix itemset X associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.

When a new frequent itemset is derived, it is necessary to perform two kinds of closure checking: (1) superset checking, which checks if this new frequent itemset is a superset of some already found closed itemsets with the same support, and (2) subset checking, which checks whether the newly found itemset is a subset of an already found closed itemset with the same support.

If we adopt the item merging pruning method under a divide-and-conquer framework, then the superset checking is actually built-in and there is no need to explicitly perform superset checking.

 
