Tuesday, September 25, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES--Efficient and scalable methods for mining frequent patterns


Efficient and scalable methods for mining frequent patterns:

 

Frequent patterns are itemsets, subsequences, or substructures that appear in a data set with frequency no less than a user-specified threshold. For example, a set of items, such as milk and bread, that appears frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a (frequent) sequential pattern if it occurs frequently in a shopping history database. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently in a graph database, it is called a (frequent) structural pattern. Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Moreover, it helps in data indexing, classification, clustering, and other data mining tasks as well. Thus, frequent pattern mining has become an important data mining task and a focused theme in data mining research.

 

Frequent pattern mining was first proposed by Agrawal for market basket analysis, in the form of association rule mining. It analyses customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. For instance, if customers are buying milk, how likely are they to also buy cereal (and what kind of cereal) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and arrange their shelf space. Since the first proposal of this data mining task and its associated efficient mining algorithms, there have been hundreds of follow-up research publications on various kinds of extensions and applications, ranging from scalable data mining methodologies to handling a wide diversity of data types, various extended mining tasks, and a variety of new applications. With over a decade of substantial and fruitful research, it is time to take an overview of this flourishing field and examine what more needs to be done in order to turn this technology into a cornerstone approach in data mining applications.

 

The concept of a frequent itemset was first introduced for mining transaction databases (Agrawal et al. 1993). Let I = {i1, i2, ..., in} be the set of all items. A k-itemset α, which consists of k items from I, is frequent if α occurs in a transaction database D no fewer than θ|D| times, where θ is a user-specified minimum support threshold (called min_sup in our text) and |D| is the total number of transactions in D. There are three basic frequent itemset mining methodologies: Apriori, FP-growth, and Eclat. Beyond these, the concepts of closed and maximal frequent itemsets and their related algorithms are also examined, along with techniques for mining closed frequent itemsets from very high-dimensional data sets and mining very long patterns (thus called colossal patterns).

 

Basic mining methodologies: Apriori, FP-growth and Eclat:

 

1.      Apriori principle, Apriori algorithm and its extensions

 

Since there are usually a large number of distinct single items in a typical transaction database, and their combinations may form a very large number of itemsets, it is challenging to develop scalable methods for mining frequent itemsets in a large transaction database. Agrawal and Srikant observed an interesting downward closure property, called Apriori, among frequent k-itemsets: a k-itemset is frequent only if all of its sub-itemsets are frequent. This implies that frequent itemsets can be mined by first scanning the database to find the frequent 1-itemsets, then using the frequent 1-itemsets to generate candidate frequent 2-itemsets and checking against the database to obtain the frequent 2-itemsets. This process iterates until no more frequent k-itemsets can be generated for some k. This is the essence of the Apriori algorithm and its alternatives.
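The level-wise process described above can be sketched in Python. This is a minimal, unoptimized sketch for illustration only (the function name and the set-based representation are our own choices, not from the original papers); it assumes transactions are given as sets of items and min_sup is an absolute count.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Mine all itemsets occurring in at least min_sup transactions."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: count single items to find the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    current = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in current}
    k = 1
    while current:
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # keeping only those whose k-subsets are all frequent (Apriori property).
        candidates = set()
        for a in current:
            for b in current:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in current
                                           for s in combinations(u, k)):
                    candidates.add(u)
        # Scan the database to count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in current})
        k += 1
    return frequent
```

For example, with the four baskets {milk, bread}, {milk, bread, butter}, {bread}, {milk, butter} and min_sup = 2, the candidate {milk, bread, butter} is pruned without a database scan because its subset {bread, butter} appears only once.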

 

2.      Mining frequent itemsets without candidate generation

 

In many cases, the Apriori algorithm significantly reduces the size of candidate sets using the Apriori principle. However, it can suffer from two nontrivial costs: (1) generating a huge number of candidate sets, and (2) repeatedly scanning the database and checking the candidates by pattern matching. The FP-growth method mines the complete set of frequent itemsets without candidate generation. FP-growth works in a divide-and-conquer way. The first scan of the database derives a list of frequent items in which the items are ordered in frequency-descending order. According to this frequency-descending list, the database is compressed into a frequent-pattern tree, or FP-tree, which retains the itemset association information. The FP-tree is mined by starting from each frequent length-1 pattern (as an initial suffix pattern), constructing its conditional pattern base (a “sub-database” consisting of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then constructing its conditional FP-tree, and performing mining recursively on that tree. Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree. The FP-growth algorithm thus transforms the problem of finding long frequent patterns into recursively searching for shorter ones and then concatenating the suffix. It uses the least frequent items as suffixes, offering good selectivity. Performance studies demonstrate that the method substantially reduces search time.
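The FP-tree construction and recursive mining can be sketched as below. This is a compact illustrative version under our own simplifying assumptions (no header-table node links are threaded; each item's nodes are simply collected in a dictionary), so it shows the structure of the algorithm rather than a tuned implementation.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, a parent link, and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def fp_growth(transactions, min_sup):
    """Mine all frequent itemsets without candidate generation."""
    def build_tree(patterns):
        # patterns: list of (item_list, count) pairs forming a (sub-)database.
        freq = defaultdict(int)
        for items, cnt in patterns:
            for it in items:
                freq[it] += cnt
        freq = {it: c for it, c in freq.items() if c >= min_sup}
        root = Node(None, None)
        header = defaultdict(list)  # item -> all nodes holding that item
        for items, cnt in patterns:
            # Keep frequent items only, in frequency-descending order.
            items = sorted((it for it in items if it in freq),
                           key=lambda it: (-freq[it], it))
            node = root
            for it in items:
                if it not in node.children:
                    child = Node(it, node)
                    node.children[it] = child
                    header[it].append(child)
                node = node.children[it]
                node.count += cnt
        return header, freq

    frequent = {}

    def mine(patterns, suffix):
        header, freq = build_tree(patterns)
        # Use the least frequent items first as suffixes.
        for item in sorted(freq, key=lambda it: (freq[it], it)):
            new_suffix = suffix | {item}
            frequent[frozenset(new_suffix)] = freq[item]
            # Conditional pattern base: prefix paths co-occurring with `item`.
            cond = []
            for node in header[item]:
                path, p = [], node.parent
                while p.item is not None:
                    path.append(p.item)
                    p = p.parent
                if path:
                    cond.append((path, node.count))
            if cond:
                mine(cond, new_suffix)  # mine the conditional FP-tree

    mine([(list(t), 1) for t in transactions], frozenset())
    return frequent
```

Note how each recursive call rebuilds a small conditional tree from the prefix paths of one suffix item, which is exactly the divide-and-conquer step described above.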

 

3.      Mining frequent itemsets using vertical data format

 

Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in horizontal data format (i.e., {TID: itemset}), where TID is a transaction-id and itemset is the set of items bought in transaction TID. Alternatively, mining can also be performed with data presented in vertical data format (i.e., {item: TID_set}). The Equivalence CLASS Transformation (Eclat) algorithm explores the vertical data format. The first scan of the database builds the TID_set of each single item. Starting with a single item (k = 1), the frequent (k+1)-itemsets grown from a previous k-itemset can be generated according to the Apriori property, with a depth-first computation order similar to FP-growth (Han et al. 2000). The computation is done by intersecting the TID_sets of the frequent k-itemsets to compute the TID_sets of the corresponding (k+1)-itemsets. This process repeats until no frequent itemsets or no candidate itemsets can be found. Besides taking advantage of the Apriori property in the generation of candidate (k+1)-itemsets from frequent k-itemsets, another merit of this method is that there is no need to scan the database to find the support of (k+1)-itemsets (for k ≥ 1). This is because the TID_set of each k-itemset carries the complete information required for counting such support.
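A sketch of the vertical-format computation follows, again as an illustrative simplification (the depth-first extension order and set representation here are our own choices). After the first scan builds each item's TID_set, the support of every longer itemset comes purely from TID_set intersections, with no further database scans.

```python
def eclat(transactions, min_sup):
    """Mine frequent itemsets from the vertical {item: TID_set} format."""
    # First scan: build the TID_set of each single item.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    frequent = {}

    def grow(prefix, items):
        # items: (item, tidset) pairs that can extend `prefix`, depth-first.
        for i, (item, tids) in enumerate(items):
            new_prefix = prefix + (item,)
            frequent[frozenset(new_prefix)] = len(tids)
            # Support of a (k+1)-itemset = size of the intersection of
            # TID_sets; the database itself is never rescanned.
            ext = []
            for other, other_tids in items[i + 1:]:
                inter = tids & other_tids
                if len(inter) >= min_sup:
                    ext.append((other, inter))
            if ext:
                grow(new_prefix, ext)

    singles = sorted(((it, ts) for it, ts in tidsets.items()
                      if len(ts) >= min_sup), key=lambda p: p[0])
    grow((), singles)
    return frequent
```

On the same four-basket example used earlier, the TID_set of {milk} is {0, 1, 3} and of {butter} is {1, 3}; their intersection {1, 3} immediately gives support 2 for {milk, butter}.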

 
