A Survey on FP-Growth Algorithm in Association Rules Mining

A Survey on FP-Growth Algorithm in Association Rules Mining

Research المؤتمرات العلمية ابحاث المؤتمرات العلمية

اسم الباحث     :    Mokhtar Al-hamadi
سنة النشر     :    2017
ملخص البحث     :   

Abstract

 Association Rule Mining has become an important data mining technique and has been a focused area in research field. Its main work to find out required hidden knowledge or pattern from the large sets of data in order to be able to make intelligent decisions. The Apriori, Eclat and FP-growth algorithms are the most famous algorithms which can be used for association rule mining. In FP-Growth, there is no candidate generation and it requires only 2 passes over the database. However, the generation of FP-tree become very expansive to build. Lots of techniques have been proposed to improve the performance of FP-Growth algorithms. This survey discusses the respective characteristics of the FP-Growth Algorithm Variations. It also provides a comparative study of the FP-Growth Algorithm Variations techniques.

1- INTRODUCTION

W

ith the significant increase in data growth in all areas has become a process of analysis and the search for knowledge of the most important goals Knowledge discovery in databases (KDD) [1].Data mining is one of the stages of knowledge discovery in databases which represent a smart way to discover interesting patterns and knowledge from large amounts of data [2]. Data mining involves many functions like the anomaly detection, association rule learning ,clustering, classification, regression and summarization[3].

 

1.1  Association rules mining (ARM)

Association rules mining is a key function in data mining and the primary goal of association rules mining is to find the interesting associations or correlation relationships between item sets in big data [4]. There are various areas used association rules such as telecommunication networks, web usage mining, intrusion detection, bioinformatics, market, risk management, and inventory control.[5, 6]. Association rule mining has two main tasks, mining frequent itemsets is the primary task which can be generated by using minimum threshold support. And the second task is using these generated frequent itemsets strong association rules which have confidence above a predefined threshold value. The first step requires extensive computation time and storage and thus becomes a challenging problem in ARM [7].

1.2         Notation and concept of Association rules mining

Association mining can he stated as follows: Let I = (i1, i2,…,in) be a set of items. Let D = (T1, T2,…,Tj,…Tm) the task-relevant data, be a set of transactions in a database, where each transaction Tj(j=1,2,… ,m) such that Tj ⊆ I . Each transaction is assigned an identifier, called TID (Transaction id). Let a be a set of items, a transaction T is said to contain A if and only if A ⊆ I. An association rule is an implication of the form A→B where A ⊆ I, B ⊆ I and A∩B=∅. The rule A→B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A∪B (i.e., both A and B). This is taken to be the probability P(A∪B). The rule has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability(B|A). That is, confidence(A→B) =P(B|A) = support(A∪B) /support(A) = c, support(A→B) = P(A∪B) = s. The popular association rules Mining is to mine strong association rules that satisfy the user specified both minimum support threshold and confidence threshold. That is, minconf and minsup. If support(X) ≥ minsup, X is frequent item sets. Frequent k-itemsets is always marked as LK. If support (A→B) ≥ minsup and confidence (A→B) ≥ minconf,A→B is strong correlation [8].

 

1.3    Association Rules Mining Algorithm

There are many algorithms used with the Association Rules Mining to extract frequent patterns. the most basic algorithm for finding the frequent itemsets is the Apriori algorithm put forward by Agrawal[9]. It is a bottom-up and breadth first search approach. Apriori’s principle: If an itemset is frequent, then all of its subset must also be frequent [10] . The main idea of this algorithm is, it generates k-th candidate itemsets from the (k-1)-th frequent itemsets. It also finds the k-th frequent itemset from the k-th candidate itemsets. The algorithm gets terminated when the frequent itemsets cannot be extended further.

 But there are two drawbacks in it. First ,because it repeatedly scans the transaction database, it needs a lot of I/O load (consumes a lot of memory and CPU time); Second, it will cause huge candidate set [7].

To solve the above two problems in 2000 Han et al proposed

FP-Growth algorithm (Frequent Pattern growth)[11]. The biggest advantage of the FP-Growth algorithm is that it only scans database twice and does not generate any candidates[11],[12],[13]. Eclat [14] algorithm is a depth first search based algorithm. It uses a vertical database layout i.e. instead of explicitly listing all transactions; each item is stored together with its cover (also called tidlist) and uses the intersection based approach to compute the support of an itemset. In later years, these algorithms have been changed and provided with new structures.

رجوع