Efficient and scalable methods for mining frequent patterns:
Frequent
patterns are item sets, subsequences, or substructures that appear in a data
set with frequency no less than a user-specified threshold. For example, a set
of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent
item set. A subsequence, such as buying first a PC, then a digital
camera, and then a memory card, if it occurs frequently in a shopping history
database, is a (frequent) sequential pattern. A substructure can refer to different
structural forms, such as subgraphs, subtrees, or sublattices, which may be
combined with item sets or subsequences. If a substructure occurs frequently in
a graph database, it is called a (frequent)
structural pattern. Finding
frequent patterns plays an essential role in mining associations, correlations,
and many other interesting relationships among data. Moreover, it helps in data
indexing, classification, clustering, and other data mining tasks as well.
Thus, frequent pattern mining has become an important data mining task and a
focused theme in data mining research.
Frequent
pattern mining was first proposed by Agrawal et al. (1993) for market basket analysis in the
form of association rule mining. It analyses customer buying habits by finding
associations between the different items that customers place in their
“shopping baskets”. For instance, if customers are buying milk, how likely
are they also to buy cereal (and what kind of cereal) on the same trip to the
supermarket? Such information can lead to increased sales by helping retailers
do selective marketing and arrange their shelf space. Since the first proposal
of this new data mining task and its associated efficient mining algorithms,
there have been hundreds of follow-up research publications, on various kinds of
extensions and applications, ranging from scalable data mining methodologies,
to handling a wide diversity of data types, various extended mining tasks, and
a variety of new applications. With over a decade of substantial and fruitful
research, it is time to perform an overview of this flourishing field and
examine what more needs to be done in order to turn this technology into a
cornerstone approach in data mining applications.
The concept of
frequent item set was first introduced for mining transaction databases
(Agrawal et al. 1993). Let I = {i1, i2, ..., in} be a set of all items. A
k-item set α, which consists of k items from I, is frequent if α occurs in a
transaction database D no fewer than θ|D| times, where θ is a user-specified
minimum support threshold (called min_sup in our text), and |D| is the total
number of transactions in D.
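This definition reduces to a simple count: an itemset α is frequent exactly when the number of transactions of D containing α is at least θ|D|. A minimal sketch (the function name `is_frequent` is our own, not from the text):

```python
def is_frequent(alpha, D, theta):
    """alpha is frequent iff its occurrence count in the transaction
    database D is no fewer than theta * |D| (theta = min_sup)."""
    alpha = set(alpha)
    support_count = sum(1 for t in D if alpha <= set(t))
    return support_count >= theta * len(D)
```

For example, with D containing four baskets of which three include milk, {milk} is frequent at θ = 0.5 (3 ≥ 0.5 · 4), while an item appearing once is not.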
There are three basic frequent item set mining methodologies: Apriori,
FP-growth and Eclat. In addition, the concepts of closed and maximal frequent
item sets and their related algorithms are examined, along with techniques for
mining closed frequent item sets from very high dimensional data sets and for
mining very long patterns (thus called colossal patterns).
Basic mining methodologies: Apriori, FP-growth and
Eclat:
1.
Apriori principle, Apriori algorithm and its extensions
Since there are
usually a large number of distinct single items in a typical transaction
database, and their combinations may form a very huge number of item sets, it
is challenging to develop scalable methods for mining frequent item sets in a
large transaction database. Agrawal and Srikant observed an interesting
downward closure property, called Apriori, among frequent k-item
sets: A k-item set is frequent only if
all of its sub-item sets are frequent. This implies that frequent item sets
can be mined by first scanning the database to find the frequent 1-itemsets,
then using the frequent 1-itemsets to generate candidate frequent 2-itemsets,
and check against the database to obtain the frequent 2-itemsets. This process
iterates until no more frequent k-item
sets can be generated for some k.
This is the essence of the Apriori algorithm and its variants.
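The level-wise generate-and-test loop described above can be sketched as follows; this is an illustrative in-memory version (the name `apriori` and the use of frozensets are our own choices, not a reference implementation):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (frozensets) with their support counts,
    using the level-wise generate-and-test scheme of Apriori."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: scan the database to find the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: join frequent k-itemsets, then prune any
        # candidate having an infrequent k-subset (the Apriori principle).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in frequent for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Scan the database again to count the surviving candidates.
        frequent = {}
        for cand in candidates:
            sup = sum(1 for t in transactions if cand <= t)
            if sup >= min_sup:
                frequent[cand] = sup
        result.update(frequent)
        k += 1
    return result
```

The loop terminates once some level produces no frequent k-item sets, exactly as in the description above.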
2.
Mining frequent itemsets without candidate generation
In many cases,
the Apriori algorithm significantly reduces the size of candidate sets using
the Apriori principle. However, it can suffer from two nontrivial costs: (1)
generating a huge number of candidate sets, and (2) repeatedly scanning the database
and checking the candidates by pattern matching. Han et al. (2000) proposed the
FP-growth method, which mines the complete set of frequent item sets without
candidate generation. FP-growth works in a divide-and-conquer way.
The first scan of the database derives a list of frequent items in which items
are ordered in frequency-descending order. According to the
frequency-descending list, the database is compressed into a frequent-pattern
tree, or FP-tree, which retains
the item set association information. The FP-tree is mined by starting from
each frequent length-1 pattern (as an initial suffix pattern), constructing its
conditional pattern base (a
“subdatabase”, which consists of the set of prefix paths in the FP-tree co-occurring
with the suffix pattern), then constructing its conditional FP-tree, and
performing mining recursively on such a tree. The pattern growth is achieved by
the concatenation of the suffix pattern with the frequent patterns generated from
a conditional FP-tree. The FP-growth algorithm transforms the problem of
finding long frequent patterns to searching for shorter ones recursively and
then concatenating the suffix. It uses the least frequent items as a suffix,
offering good selectivity. Performance studies demonstrate that the method
substantially reduces search time.
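The divide-and-conquer recursion can be illustrated with a simplified pattern-growth sketch. To stay short, it keeps each conditional pattern base as a plain list of projected transactions instead of a compact FP-tree, and uses lexicographic rather than frequency-descending item order; the name `pattern_growth` is ours. It finds the same frequent itemsets by the same recursive projection idea:

```python
from collections import Counter

def pattern_growth(db, min_sup, suffix=()):
    """Mine frequent itemsets from db (a list of item tuples) by recursively
    projecting conditional pattern bases, the divide-and-conquer strategy
    behind FP-growth (here without the FP-tree compression)."""
    counts = Counter(item for t in db for item in set(t))
    results = {}
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        # Grow the pattern: concatenate the current item onto the suffix.
        pattern = tuple(sorted(suffix + (item,)))
        results[pattern] = sup
        # Conditional pattern base: the parts of each transaction that can
        # still extend this pattern (items ordered after `item`).
        cond_db = [tuple(i for i in t if i > item) for t in db if item in t]
        results.update(pattern_growth(cond_db, min_sup, suffix + (item,)))
    return results
```

Each recursive call searches for shorter patterns in a smaller conditional database, then concatenates the suffix, mirroring the transformation described above.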
3.
Mining frequent itemsets using vertical data format
Both the Apriori and FP-growth methods
mine frequent patterns from a set of transactions in horizontal data format (i.e.,
{TID: item set}), where TID is
a transaction-id and item set is
the set of items bought in transaction TID.
Alternatively, mining can also be performed with data presented in vertical data format (i.e., {item: TID_set}).
Zaki proposed the Equivalence CLASS Transformation (Eclat) algorithm by exploring the vertical data
format. The first scan of the database builds the TID_set of each single item.
Starting with a single item (k = 1),
the frequent (k+1)-item sets grown from a previous k-item set can be generated according
to the Apriori property, with a depth-first computation order similar to
FP-growth (Han et al. 2000). The computation is done by intersection of the
TID_sets of the frequent k-item
sets to compute the TID_sets of the corresponding (k+1)-item
sets. This process repeats until no frequent item sets or no candidate item sets
can be found. Besides taking advantage of the Apriori property in the
generation of candidate (k + 1)-item set from frequent k-item sets, another merit of this
method is that there is no need to scan the database to find the support of (k + 1)-item sets (for k ≥ 1). This is
because the TID_set of each k-item
set carries the complete information required for counting such support.
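The TID_set intersection idea can be sketched as a depth-first search in the vertical format; this is a simplified Eclat-style illustration (the name `eclat` and plain Python sets for TID_sets are our own choices):

```python
def eclat(transactions, min_sup):
    """Frequent-itemset mining in vertical data format: each item maps to
    its TID_set, and (k+1)-itemset supports come from TID_set
    intersections, so no further database scans are needed."""
    # One scan of the database builds the vertical layout {item: TID_set}.
    vertical = {}
    for tid, t in enumerate(transactions):
        for item in t:
            vertical.setdefault(item, set()).add(tid)

    results = {}

    def dfs(prefix, items):
        # items: (item, TID_set) pairs that may extend the current prefix.
        for i, (item, tids) in enumerate(items):
            if len(tids) < min_sup:
                continue
            pattern = prefix + (item,)
            results[pattern] = len(tids)  # support = size of the TID_set
            # Intersect TID_sets to form the (k+1)-itemset extensions.
            ext = [(it2, tids & tids2) for it2, tids2 in items[i + 1:]]
            dfs(pattern, ext)

    dfs((), sorted(vertical.items()))
    return results
```

Note that after the initial scan, every support count is obtained purely by set intersection, which is the merit highlighted in the text.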