ASSOCIATION RULE MINING:
Association Rule Mining,
Single-Dimensional Boolean Association Rules from Transactional Databases,
Multi-Level Association Rules from Transaction Databases
3.1. ASSOCIATION RULE MINING:
Frequent patterns are
patterns (such as itemsets, subsequences, or substructures) that appear in a
data set frequently. Frequent pattern mining searches for recurring
relationships in a given data set.
For example, a set of items, such as milk and bread,
that appear frequently together in a transaction data set is a frequent itemset.
A subsequence,
such as buying first a PC, then a digital camera, and then a memory card, if it
occurs frequently in a shopping history database, is a (frequent) sequential pattern.
A substructure
can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured
pattern.
3.1.1.
Market Basket Analysis: A Motivating Example
Market basket analysis is just one form of frequent
pattern mining. Frequent itemset mining leads to the discovery of associations
and correlations among items in large transactional or relational data sets.
The discovery of interesting correlation
relationships among huge amounts of business transaction records can help in
many business decision-making processes, such as catalog design,
cross-marketing, and customer shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”.
The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are
frequently purchased together by customers.
For example, suppose that, as manager of an AllElectronics branch, you would like to learn more about the buying habits of your customers. Market basket analysis of this kind can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
If we think of the universe as the set of items
available at the store, then each item has a Boolean variable representing the
presence or absence of that item. Each basket can then be represented by a
Boolean vector of values assigned to these variables. The Boolean vectors can
be analyzed for buying patterns that reflect items that are frequently associated
or purchased together. These patterns can be represented in the form of association rules.
For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the association rule below:

computer => antivirus_software [support = 2%, confidence = 60%]
Rule support
and confidence are two measures of
rule interestingness. They respectively reflect the usefulness and certainty of
discovered rules.
A support of 2% for the above association rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
A confidence
of 60% means that 60% of the customers who purchased a computer also bought the
software. Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum confidence threshold.
Such thresholds can be set by users or domain experts. Additional analysis can
be performed to uncover interesting statistical correlations between associated
items.
3.1.2.
Frequent Itemsets, Closed Itemsets, and Association Rules :
Let I = {I1, I2, ..., Im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T.
An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, say, both A and B). This is taken to be the probability P(A ∪ B). The rule A => B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A). That is,

support(A => B) = P(A ∪ B)
confidence(A => B) = P(B|A)
Rules that satisfy both a minimum support threshold
(min sup) and a minimum confidence threshold (min conf) are
called strong. By convention, we
write support and confidence values so as to occur between 0% and 100%, rather
than 0 to 1.0.
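To make the support and confidence definitions concrete, here is a minimal Python sketch; the transactions and item names are made up purely for illustration, and the percentages in the comments refer only to this toy data, not to the 2%/60% figures above.

```python
# A minimal sketch of the support and confidence definitions above.
# The transactions and item names are purely illustrative.
transactions = [
    {"computer", "antivirus_software", "printer"},
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"printer", "memory_card"},
    {"computer", "antivirus_software", "memory_card"},
]

def support(itemset, transactions):
    """Relative support: the fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    """Confidence of A => B, i.e. P(B | A) = support(A ∪ B) / support(A)."""
    return support(A | B, transactions) / support(A, transactions)

A, B = {"computer"}, {"antivirus_software"}
print(f"support    = {support(A | B, transactions):.0%}")   # 60% on this toy data
print(f"confidence = {confidence(A, B, transactions):.0%}")  # 75% on this toy data
```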
A set of items is referred to as an itemset. An
itemset that contains k items is a k-itemset. The set {computer,
antivirus software} is a 2-itemset.
The occurrence frequency of an itemset is
the number of transactions that contain the itemset. This is also known,
simply, as the frequency, support count,
or count of the itemset.
Note
that the itemset support defined in above support equation is sometimes
referred to as relative support, whereas the occurrence frequency is
called the absolute support.
If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.
From the confidence equation above, the confidence of rule A => B can be easily derived from the support counts of A and A ∪ B: confidence(A => B) = support_count(A ∪ B) / support_count(A). Thus the problem of mining association rules can be reduced to that of mining frequent itemsets.
In general, association rule mining can be viewed as a two-step process:
1. Find
all frequent itemsets: By definition, each of these
itemsets will occur at least as
frequently
as a predetermined minimum support count, min sup.
2. Generate
strong association rules from the frequent itemsets:
By definition, these rules must satisfy minimum support and minimum confidence.
A major challenge in mining frequent itemsets from a
large data set is the fact that such mining often generates a huge number of
itemsets satisfying the minimum support (min sup) threshold, especially
when min sup is set low. This is because if an itemset is frequent, each of its subsets is frequent as well. A long itemset will contain a combinatorial number of shorter, frequent sub-itemsets. For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains C(100, 1) = 100 frequent 1-itemsets, C(100, 2) frequent 2-itemsets, and so on. The total number of frequent itemsets that it contains is thus

C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 - 1 ≈ 1.27 × 10^30.
Closed Itemset:
An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S.
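As an illustration, the following brute-force Python sketch finds which frequent itemsets are closed in a small data set. The nine transactions are patterned after the AllElectronics running example (the table itself is not reproduced in this text), so treat them as illustrative; brute-force enumeration is only feasible here because the item universe is tiny.

```python
# Brute-force sketch: find the closed frequent itemsets of a small data set.
# An itemset X is closed if no proper superset has the same support count.
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2  # absolute (count-based) minimum support

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Enumerate every frequent itemset by brute force (fine for five items).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        c = support_count(frozenset(combo))
        if c >= min_sup:
            frequent[frozenset(combo)] = c

# Keep only the itemsets with no proper superset of equal support count.
closed = {X: c for X, c in frequent.items()
          if not any(X < Y and cY == c for Y, cY in frequent.items())}

print(len(frequent), "frequent itemsets,", len(closed), "closed frequent itemsets")
```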
3.1.3.
Frequent Pattern Mining: A Road Map
Market basket analysis is just one form of frequent
pattern mining. Frequent pattern mining can be classified in various ways,
based on the following criteria:
Ø Based on the completeness of patterns to be mined:
§ Complete set of frequent itemsets,
§ Closed frequent itemsets,
§ Maximal frequent itemsets,
§ Constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints),
§ Approximate frequent itemsets (i.e., those that derive only approximate support counts for the mined frequent itemsets),
§ Near-match frequent itemsets (i.e., those that tally the support count of the near or almost matching itemsets).
Ø Based on the levels of
abstraction involved in the rule set:
Some
methods for association rule mining can find rules at differing levels of
abstraction.
Ø Based on the number of data dimensions involved in the rule:
An association rule that references only one dimension is called a single-dimensional association rule.
If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule.
Ø
Based
on the types of values handled in the rule:
If
a rule involves associations between the presence or absence of items, it is a Boolean association rule.
If
a rule describes associations between quantitative items or attributes, then it
is a quantitative association rule.
Ø Based on the kinds of rules to
be mined:
Frequent
pattern analysis can generate various kinds of rules and other interesting
relationships. Association rules are
the most popular kind of rules generated from frequent patterns.
The
discovered associations can be further analyzed to uncover statistical
correlations, leading to correlation rules.
Ø
Based
on the kinds of patterns to be mined:
ü
Frequent
itemset mining, that is, the mining of frequent
itemsets (sets of items) from transactional or relational data sets.
ü
Sequential
pattern mining searches for frequent subsequences in
a sequence data set, where a sequence records an ordering of events.
ü
Structured
pattern mining searches for frequent substructures in
a structured data set. Notice that structure is a general concept
that covers many different kinds of structural forms, such as graphs, lattices,
trees, sequences, sets, single items, or combinations of such structures.
3.2.
Efficient and Scalable Frequent Itemset Mining Methods:
Apriori is the basic algorithm for finding frequent itemsets.
3.2.1.
The APRIORI ALGORITHM: Finding Frequent Itemsets Using Candidate Generation :
Apriori is a seminal algorithm proposed by R.
Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean
association rules. The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise
generation of frequent itemsets, an important property called the Apriori
property, presented below, is used to reduce the search space.
Apriori
property: All nonempty subsets of a frequent itemset must
also be frequent.
The
Apriori property is based on the following observation. By definition, if an
itemset I does not satisfy the
minimum support threshold, min sup, then I is not frequent; that
is, P(I) < min sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either; that is, P(I ∪ A) < min sup.
This property belongs to a special category of
properties called antimonotone in
the sense that if a set cannot pass a test, all of its supersets will fail
the same test as well. It is called antimonotone because the property is monotonic in the context of failing a test.
To find Lk from Lk-1 (for k >= 2), a two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second to the last item in l1). By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k-1)-itemset li, this means that the items are sorted such that li[1] < li[2] < ... < li[k-1]. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) ^ (l1[2] = l2[2]) ^ ... ^ (l1[k-2] = l2[k-2]) ^ (l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k-2], l1[k-1], l2[k-1]}.
2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck.
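The join and prune steps just described can be sketched in a few lines of Python. This is only an illustrative rendering (itemsets are represented as lexicographically sorted tuples), not the Apriori pseudocode itself; the L2 used below is the set of frequent 2-itemsets from the running example.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets L_prev.

    Itemsets are lexicographically sorted tuples; L_prev is a set of such tuples.
    """
    candidates = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: the first k-2 items agree and l1's last item precedes l2's,
            # so each candidate is generated exactly once (0-based indexing).
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                candidate = l1 + (l2[k - 2],)
                # Prune step: every (k-1)-subset of the candidate must be frequent.
                if all(sub in L_prev for sub in combinations(candidate, k - 1)):
                    candidates.add(candidate)
    return candidates

# Frequent 2-itemsets from the running example (illustrative).
L2 = {("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")}
print(sorted(apriori_gen(L2, 3)))
# -> [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```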
For example:
Table. Transactional data for an AllElectronics branch.
There are nine transactions in this database, that is, |D| = 9. We use this data to illustrate the Apriori algorithm for finding frequent itemsets in D.
1. In the first
iteration of the algorithm, each item is a member of the set of candidate
1-itemsets,
C1. The algorithm simply scans all of the transactions in order to count
the
number of occurrences of each item.
2. Suppose that the
minimum support count required is 2, that is, min sup = 2. (Here,
we
are referring to absolute support because we are using a support count.
The corresponding
relative
support is 2/9 = 22%). The set of frequent 1-itemsets, L1, can then
be
determined. It consists of the candidate 1-itemsets satisfying minimum support.
In
our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. C2 consists of C(|L1|, 2) 2-itemsets. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row of the figure.
5. The set of
frequent 2-itemsets, L2, is then determined, consisting of
those candidate
2-itemsets
in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3, is detailed in the figure. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3. Note that when given a candidate k-itemset, we only need to check whether its (k-1)-subsets are frequent, since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is shown in the first table of the bottom row of the figure.
7. The transactions
in D are scanned in order to determine L3, consisting
of those candidate 3-itemsets in C3 having minimum support .
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
Pseudocode for the Apriori Algorithm:
Step 1 of Apriori finds the frequent 1-itemsets, L1. In steps 2 to 10, Lk-1 is used to generate candidates Ck in order to find Lk for k >= 2. The apriori_gen procedure generates the candidates and then uses the Apriori property to eliminate those having a subset that is not frequent (step 3). This procedure is described below. Once all of the candidates have been generated, the database is scanned (step 4). For each transaction, a subset function is used to find all subsets of the transaction that are candidates (step 5), and the count for each of these candidates is accumulated (steps 6 and 7). Finally, all of those candidates satisfying minimum support (step 9) form the set of frequent itemsets, L (step 11). A procedure can then be called to generate association rules from the frequent itemsets.
The apriori_gen procedure performs two kinds of actions, namely, join and prune, as described above. In the join component, Lk-1 is joined with Lk-1 to generate potential candidates (steps 1 to 4). The prune component (steps 5 to 7) employs the Apriori property to remove candidates that have a subset that is not frequent. The test for infrequent subsets is shown in the procedure has_infrequent_subset.
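Since the pseudocode figure itself is not reproduced here, the following self-contained Python sketch mirrors the steps just described. It is an approximation under the assumptions noted in the comments (itemsets as sorted tuples, an absolute minimum support count), not the textbook pseudocode; the nine transactions are illustrative, patterned after the running example, and the apriori_gen helper sketched earlier is repeated in compact form so the example runs on its own.

```python
from collections import defaultdict
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join L_{k-1} with itself, then prune candidates with an infrequent (k-1)-subset."""
    return {
        l1 + (l2[k - 2],)
        for l1 in L_prev for l2 in L_prev
        if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]
        and all(sub in L_prev for sub in combinations(l1 + (l2[k - 2],), k - 1))
    }

def apriori(transactions, min_sup):
    """Level-wise search: returns every frequent itemset with its support count."""
    # Step 1: find the frequent 1-itemsets, L1.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[(item,)] += 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    all_frequent = dict(L)

    k = 2
    while L:                                   # steps 2-10: repeat while L_{k-1} is non-empty
        C_k = apriori_gen(set(L), k)           # step 3: candidate generation (join + prune)
        counts = defaultdict(int)
        for t in transactions:                 # step 4: scan the database
            for c in C_k:                      # steps 5-7: count candidates contained in t
                if set(c) <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}   # step 9: keep frequent ones
        all_frequent.update(L)
        k += 1
    return all_frequent                        # step 11: L = union of all L_k

# Nine illustrative transactions patterned after the running example; min_sup = 2.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
print(apriori(D, min_sup=2))
```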
3.2.2. Generating Association Rules from Frequent Itemsets :
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong means they satisfy both minimum support and minimum confidence). This can be done using the equation

confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

The conditional probability is expressed in terms of itemset support count, where support_count(A ∪ B) is the number of transactions containing the itemsets A ∪ B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows (a short code sketch appears after the worked example below):
For each frequent itemset l, generate all nonempty proper subsets of l.
For every nonempty proper subset s of l, output the rule "s => (l - s)" if support_count(l) / support_count(s) >= min conf, where min conf is the minimum confidence threshold.
Generating association rules. Let's try an example based on the transactional data for AllElectronics shown earlier. Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its confidence:

I1 ^ I2 => I5, confidence = 2/4 = 50%
I1 ^ I5 => I2, confidence = 2/2 = 100%
I2 ^ I5 => I1, confidence = 2/2 = 100%
I1 => I2 ^ I5, confidence = 2/6 = 33%
I2 => I1 ^ I5, confidence = 2/7 = 29%
I5 => I1 ^ I2, confidence = 2/2 = 100%
If
the minimum confidence threshold is, say, 70%, then only the second, third, and
last rules above are output, because these are the only ones generated that are
strong. Note that, unlike conventional classification rules, association rules
can contain more than one conjunct in the right-hand side of the rule.
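As referenced above, here is a minimal Python sketch of this rule-generation step. It assumes the same nine illustrative transactions used earlier and a minimum confidence of 70% to match the example; the printed rules should then coincide with the three strong rules listed above.

```python
from itertools import combinations

# Illustrative transactions patterned after the running example.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

def support_count(itemset):
    return sum(1 for t in D if itemset <= t)

def rules_from_itemset(l, min_conf):
    """Yield every strong rule s => (l - s) whose confidence meets min_conf."""
    l = frozenset(l)
    for r in range(1, len(l)):                        # all nonempty proper subsets of l
        for s in map(frozenset, combinations(l, r)):
            conf = support_count(l) / support_count(s)
            if conf >= min_conf:
                yield s, l - s, conf

for s, rest, conf in rules_from_itemset({"I1", "I2", "I5"}, min_conf=0.70):
    print(sorted(s), "=>", sorted(rest), f"(confidence = {conf:.0%})")
# Expected to match the example: {I1, I5} => {I2}, {I2, I5} => {I1}, {I5} => {I1, I2}
```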
3.2.3.
Mining Frequent Itemsets without Candidate Generation:
The Apriori candidate generate-and-test method
significantly reduces the size of candidate sets, leading to good performance
gain. However, it can suffer from two nontrivial costs:
ü It
may need to generate a huge number of candidate sets.
ü It
may need to repeatedly scan the database and check a large set of candidates by
pattern matching.
An interesting method that addresses these costs is frequent-pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy as follows. First, it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or "pattern fragment," and mines each such database separately.
You’ll see how it works with the following example.
The first scan of the database is the same as
Apriori, which derives the set of frequent items (1-itemsets) and their support
counts (frequencies). Let the minimum support count be 2. The set of frequent
items is sorted in the order of descending support count. This resulting set or
list is denoted L. Thus, we have L ={{I2: 7}, {I1: 6},
{I3: 6},{I4: 2}, {I5: 2}}.
An FP-tree is then constructed as follows. First,
create the root of the tree, labeled with “null.” Scan database D a
second time. The items in each transaction are processed in L order
(i.e., sorted according to descending support count), and a branch is created
for each transaction. For example, the scan of the first transaction, "T100: I1, I2, I5," which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, <I2: 1>, <I1: 1>, and <I5: 1>, where I2 is linked as a child of the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing path for T100. Therefore, we instead increment the count of the I2 node by 1, and create a new node, <I4: 1>, which is linked as a child of <I2: 2>. In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.
To facilitate tree traversal, an item header table
is built so that each item points to its occurrences in the tree via a chain of
node-links. The tree obtained after scanning all of the transactions is shown in the figure below with the associated node-links. In this way, the problem of mining frequent patterns in databases is transformed to that of mining the FP-tree.
Fig. An FP-tree registers
compressed, frequent pattern information.
The FP-tree is mined as follows. Start from each
frequent length-1 pattern (as an initial suffix pattern), construct its
conditional pattern base (a “sub database,” which consists of the set of prefix
paths in the FP-tree co-occurring with the suffix pattern), then construct
its (conditional) FP-tree, and perform mining recursively on such a
tree. The pattern growth is achieved by the concatenation of the suffix pattern
with the frequent patterns generated from a conditional FP-tree.
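The following Python sketch shows the FP-tree construction step described above (a first scan for frequent-item counts, then a second scan inserting each transaction in descending support-count order, plus the header-table node-links). The recursive mining of conditional FP-trees is omitted for brevity, and the transactions are the same illustrative nine used earlier.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}   # item -> FPNode
        self.next = None     # node-link to the next node carrying the same item

def build_fp_tree(transactions, min_sup):
    """Construct an FP-tree and its item header table (construction step only)."""
    # First scan: frequent items and their support counts.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    root = FPNode(None, None)
    header = {}              # item -> first node in its node-link chain
    # Second scan: insert each transaction with items in descending support-count order.
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1       # shared prefix: just bump the count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                # Append the new node to the item's node-link chain in the header table.
                if item not in header:
                    header[item] = child
                else:
                    link = header[item]
                    while link.next:
                        link = link.next
                    link.next = child
            node = node.children[item]
    return root, header

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
root, header = build_fp_tree(D, min_sup=2)
print({item: node.count for item, node in root.children.items()})  # counts at the root's children
```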
3.2.4.
Improving the Efficiency of Apriori:
ü Hash-based technique
(hashing itemsets into corresponding
buckets): A hash-based technique can be used to reduce the size of the
candidate k-itemsets, Ck, for k > 1.
For example, when scanning each transaction in the
database to generate the frequent 1-itemsets, L1, from the candidate
1-itemsets in C1, we can generate all of the 2-itemsets for each
transaction, hash (i.e., map) them into the different buckets of a hash
table structure, and increase the corresponding bucket counts.
A 2-itemset whose corresponding bucket count in the
hash table is below the support threshold cannot be frequent and thus should be
removed from the candidate set.
Such a hash-based technique may substantially reduce
the number of the candidate k-itemsets examined (especially when k =
2).
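A rough Python sketch of this hash-based filtering idea is shown below; the bucket count and the use of Python's built-in hash are arbitrary illustrative choices, not part of any prescribed scheme, and the transactions are the same illustrative nine as before.

```python
from itertools import combinations

# Illustrative transactions and parameters; the bucket count is arbitrary.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
min_sup = 2
NUM_BUCKETS = 7

# While scanning the transactions (e.g., during the L1 counting pass), hash
# every 2-itemset of each transaction into a bucket and bump that bucket's count.
buckets = [0] * NUM_BUCKETS
for t in D:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % NUM_BUCKETS] += 1

def may_be_frequent(pair):
    """A 2-itemset whose bucket count is below min_sup cannot be frequent."""
    return buckets[hash(pair) % NUM_BUCKETS] >= min_sup

# Candidates failing this test can be dropped from C2 before the counting scan.
print(may_be_frequent(("I1", "I2")))   # True: this pair alone occurs 4 times here
```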
ü Transaction reduction (reducing the number of
transactions scanned in future iterations): A transaction
that does not contain any frequent k-itemsets cannot contain any
frequent (k+1)-itemsets. Therefore, such a transaction can be marked or
removed from further consideration because subsequent scans of the database for
j-itemsets, where j > k, will not require it.
ü Partitioning
(partitioning the data to find candidate
itemsets): A partitioning technique can be used that requires just two
database scans to mine the frequent itemsets.
It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum support threshold for transactions in D is min sup, then the minimum support count for a partition is min sup × the number of transactions in that partition.
· Local frequent itemsets:
For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. Any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions.
· Global candidate itemsets:
The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D.
In Phase II, a second scan of D is
conducted in which the actual support of each candidate is assessed in order to
determine the global frequent itemsets. Partition size and the number of
partitions are set so that each partition can fit into main memory and
therefore be read only once in each phase.
Fig. Mining by partitioning the data.
ü Sampling
(mining on a subset of the given data):
The basic idea of the sampling approach is to pick a random sample S of
the given data D, and then search for frequent itemsets in S instead
of D.
ü Dynamic itemset counting (adding candidate itemsets at
different points during a scan): A dynamic itemset
counting technique was proposed in which the database is partitioned into
blocks marked by start points.
3.2.5. Mining Frequent Itemsets Using Vertical Data Format:
Both the Apriori and FP-growth methods mine frequent
patterns from a set of transactions in TID-itemset format (that is, {TID : itemset} ), where TID is a transaction-id
and itemset is the set of items bought in transaction TID. This
data format is known as horizontal data
format.
Alternatively, data can also be presented in item-TID
set format (that is, {item : TID
set}), where item is an item name, and TID set is
the set of transaction identifiers containing the item. This format is known as
vertical data format.
The transaction data set D of the table above can be transformed into vertical data format by scanning it once. Mining can then be performed on this data set by intersecting the TID sets of every pair of frequent single items. The minimum support count is 2.
Fig.
The 2-itemsets in vertical data format.
Fig.
The 3-itemsets in vertical data format
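To illustrate, the following Python sketch builds the vertical format and derives the frequent 2-itemsets by TID-set intersection; the horizontal data is patterned after the nine-transaction running example, with a minimum support count of 2.

```python
from itertools import combinations

# Horizontal format (TID -> itemset), patterned after the running example.
horizontal = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"}, "T900": {"I1", "I2", "I3"},
}
min_sup = 2

# Transform into vertical data format: item -> TID set of transactions containing it.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# Frequent 2-itemsets: intersect the TID sets of every pair of frequent single items.
frequent_items = {i: tids for i, tids in vertical.items() if len(tids) >= min_sup}
frequent_2 = {}
for (a, ta), (b, tb) in combinations(sorted(frequent_items.items()), 2):
    tids = ta & tb
    if len(tids) >= min_sup:
        frequent_2[(a, b)] = tids

print(sorted(frequent_2.get(("I1", "I2"), set())))   # ['T100', 'T400', 'T800', 'T900']
```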
To further reduce the cost of registering long TID
sets, as well as the subsequent costs of intersections, we can use a technique
called diffset, which keeps track of only the differences of the TID
sets of a (k+1)-itemset and a corresponding k-itemset. For instance, in the example above we have {I1} = {T100, T400, T500, T700, T800, T900} and {I1, I2} = {T100, T400, T800, T900}. The diffset between the two is diffset({I1, I2}, {I1}) = {T500, T700}. Thus, rather than recording the four TIDs that make up the intersection of {I1} and {I2}, we can instead use the diffset to record just the two TIDs indicating the difference between {I1} and {I1, I2}. Experiments show
that in certain situations, such as when the data set contains many dense and
long patterns, this technique can substantially reduce the total cost of vertical
format mining of frequent itemsets.
3.2.6. Mining Closed Frequent Itemsets:
Pruning strategies include the following:
· Item merging: If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y.
· Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.
· Item skipping: In the depth-first mining of closed itemsets, at each level, there will be a prefix itemset X associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.
When a new frequent itemset is derived, it is
necessary to perform two kinds of closure
checking:
(1) superset checking, which checks if this new frequent itemset is a
superset of some already found closed itemsets with the same support, and (2) subset
checking, which checks whether the newly found itemset is a subset of an
already found closed itemset with the same support.
If we adopt the item merging pruning method
under a divide-and-conquer framework, then the superset checking is actually
built-in and there is no need to explicitly perform superset checking.