The science of extracting useful information from large data sets or databases is named as data mining. Though data mining concepts have an extensive history, the term “Data Mining“, has been recently introduced in mid 90’s. Data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas.
Association rule mining is the most important technique in the field of data mining. Association rule mining is finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
As the data generated by day-to-day activities, volume of data increases dramatically. Therefore, mining association rules from massive amounts of data in a database has become an interest for many industries which help businesses in the decision making processes, such as cross marketing, Basket data analysis, and promotion assortment. Association rules are created by analyzing data for frequent if/then patterns and using the criteria to support and the confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database. Confidence indicates the number of times the if/then statements have been found to be true.
Market Basket Analysis
A typical and widely-used example of association rule mining is Market Basket Analysis.The problem is to generate all association rules that have ‘support’ and ‘confidence’ greater than the user-specified “support’ and “confidence” minimum.
Support(S) of an association rule is defined as the percentage/fraction of records that contain X∪Y to the total number of records in the database. Suppose the support of an item is 0.1%, it means only 0.1 percent of the transaction contain purchasing of this item.
Support (XY) = Support count of (XY) / Total number of transaction in D
Confidence(C) of an association rule is defined as the percentage/fraction of the number of transactions that contain X∪Y to the total number of records that contain X. Confidence is a measure of strength of the association rules, suppose the confidence of the association rule X⇒Y is 80%, it means that 80% of the transactions that contain X also contain Y together.
Most Commonly Used Algorithms Used With ARM
Apriori was proposed by Agrawal and Srikant in 1994. The algorithm finds the frequent set L in the database D. It makes use of the downward closure property. The algorithm is a bottom search, moving upward level-wise in the lattice (interaction between any combination of objects – such as resources, computers, and applications and subjects – such as individuals, groups or organizations. However, before reading the database at every level, it prunes many of the sets which are unlikely to be frequent sets, thus saving any extra effort.
Candidate Generation: Given the set of all frequent (k-1) item-sets. We want to generate a superset of the set of all frequent k-item-sets. The intuition behind the ‘aprioricandidates’ generation procedure is that if an item-set X has minimum support, so do all subsets of X. After all the (l+1)- candidate sequences have been generated, a new scan of the transactions is started (they are read one-by-one) and the support of these new candidates is determined.
Pruning: The pruning step eliminates the extensions of (k-1) item-sets which are not found to be frequent, from being considered for counting support. For each transaction t, the algorithm checks which candidates are contained in t and after the last transaction are processed; those with support less than the minimum support are discarded.
- The database is not used at all for counting the support of candidate item-sets after the first pass.
- The candidate item-sets are generated the same way as in Apriori algorithm.
- Another set C’ is generated of which each member has the TID of each transaction and the large item-sets present in this transaction. This set is used to count the support of each candidate item-sets.
a) For small problems, AprioriTid did about as well as Apriori, but performance degraded to about twice as slow for large problems.
b) During the initial passes the candidate item sets generated are very large equivalent to the size of the database. Hence the time taken will be equal to that of Apriori. And also it might incur an additional cost if it cannot completely fit into the memory.
This algorithm finds the rule according to the confirmation measures (P. A. Flach, N. Lachiche 2001). It uses first order logic representation. It includes various option like class Index, classification, confirmation Threshold, confirmation Values, frequency Threshold, horn Clauses, missing Values, negation, noise Threshold, number Literals, repeat Literals, roc Analysis, values Output etc.
a) Tertius has a relatively long runtime, which is largely dependent on the number of literals in the rules. Increasing the allowed number of ‘literals’ increases the runtime exponentially, so we want to keep the maximum to three. Even with an allowed maximum of three ‘literals’, the runtime can be still quite long – running Tertius can take up to several hours for some of our larger tests.
APRIORI HYBRID ALGORITHM
Apriori performs better than AprioriTid in the initial passes but in the later passes AprioriTid has better performance than Apriori. Due to this reason we can use another algorithm called Apriori Hybrid algorithm. In which Apriori is used in the initial passes but then switched to AprioriTid in the later passes. This switch takes time, but it is still better for most cases.
a) An extra cost is incurred when shifting from Apriori to AprioriTid.
b) Suppose at the end of K at the pass we decide to switch from Apriori to AprioriTid. Then in the (k+1) pass, after having generated the candidate sets we also have to add the Tids to C’k+1.
Association rule mining is an interesting topic of research in the field of data mining and it’s gaining popularity day by day, many researchers are digging deeper into this as it has a bright future. In this article we have covered almost all the aspects of ARM.