The Apriori algorithm is a basic association rule mining algorithm for large data sets. It was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. The purpose of association rules is to discover relationships between items in a data set; this is also known as market basket analysis, because "market basket analysis" aptly describes a typical scenario in which the algorithm applies.
There is a very famous story about this algorithm: "diapers and beer". The story goes like this: American mothers often ask their husbands to buy diapers for the children on the way home from work, and the husbands then pick up their favorite beer along with the diapers, so beer and diapers end up being bought together very often. Acting on this pattern increased the sales volume of both diapers and beer, to the delight of many retailers.
"1" Some concepts and definitions
Transaction database: the record set that stores the two-dimensional transaction data. Defined as: D.
All items: the set of all items that appear in the database. Defined as: I.
Record (transaction): a record in the database. Defined as: t, t ∈ D.
Itemset: a set of items that appear together. A set with k items is called a k-itemset, and a k-itemset ⊆ t. Unless otherwise noted, k below denotes the number of items in the itemset.
Support: Defined as supp(X) = occur(X) / count(D) = P(X).
1. Explanation one: think of a talent show: out of so many people (the database), how many choose (support) you; that is the degree of support;
2. Explanation two: of 100 people who go shopping at the supermarket, 9 buy apples; the support of apples is then 9, or 9/100;
3. Explanation three: P(X) is the probability that event X occurs;
4. Explanation four: in association rules, support is given either as absolute support (a count) or as relative support (a percentage).
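Below is a minimal sketch of computing absolute and relative support in Python; the transactions and item names are made-up illustration data, not from any real dataset.

```python
# Made-up toy transaction database D: each transaction is a set of items.
transactions = [
    {"apple", "beer"},
    {"apple", "diaper"},
    {"beer", "diaper"},
    {"apple", "beer", "diaper"},
    {"milk"},
]

def absolute_support(itemset, db):
    """occur(X): number of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """Relative support: occur(X) / count(D) = P(X)."""
    return absolute_support(itemset, db) / len(db)

print(absolute_support({"apple"}, transactions))  # 3 (absolute support)
print(support({"apple"}, transactions))           # 0.6 (relative support)
```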
Confidence (confidence/strength): Defined as conf(X → Y) = supp(X ∪ Y) / supp(X) = P(Y | X).
In the historical data, take the support of having bought both items of a mined rule (for example A and B in the rule A ⇒ B) and divide it by the support of having bought A; that is, the proportion of people who bought both A and B among those who bought A. This is the confidence of recommending B to someone who bought A (the confidence of A ⇒ B). A code sketch covering confidence, together with lift, appears after the lift definition below.
Candidate set (candidate itemset): the k-itemsets generated by joining frequent (k-1)-itemsets. Defined as C[k].
Frequent set (frequent itemset): an itemset whose support is greater than or equal to a specified minimum support (minimum support / minsup). Denoted L[k]. Note that every subset of a frequent itemset must itself be frequent.
Lift: lift(X → Y) = lift(Y → X) = conf(X → Y) / supp(Y) = conf(Y → X) / supp(X) = P(X ∪ Y) / (P(X) · P(Y)).
After mining the association rules, the higher the lift, the more effective it is to sell to the people matched by a rule rather than selling blindly to the whole population; a rule like this is called a strong rule. A code sketch of confidence and lift follows.
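Below is a minimal sketch of confidence and lift in Python, repeating the same kind of made-up toy transactions as above so that the snippet runs on its own.

```python
# Made-up toy transaction database (illustration only).
transactions = [
    {"apple", "beer"},
    {"apple", "diaper"},
    {"beer", "diaper"},
    {"apple", "beer", "diaper"},
    {"milk"},
]

def support(itemset, db):
    """Relative support: occur(X) / count(D)."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    """conf(X -> Y) = supp(X ∪ Y) / supp(X) = P(Y | X)."""
    return support(x | y, db) / support(x, db)

def lift(x, y, db):
    """lift(X -> Y) = conf(X -> Y) / supp(Y) = P(X ∪ Y) / (P(X) * P(Y))."""
    return confidence(x, y, db) / support(y, db)

print(confidence({"apple"}, {"beer"}, transactions))  # ≈ 0.67
print(lift({"apple"}, {"beer"}, transactions))        # ≈ 1.11; > 1 suggests a positive association
```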
Pruning step
Only a candidate whose every subset is frequent can itself be a frequent itemset; the process of filtering candidates in this way is the pruning step (a small sketch follows);
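A minimal sketch of the pruning step, assuming candidates and the previous level's frequent itemsets are represented as Python frozensets (illustrative data only):

```python
from itertools import combinations

def prune(candidates, frequent_prev):
    """Keep only candidate k-itemsets whose every (k-1)-subset is frequent."""
    kept = []
    for cand in candidates:
        k = len(cand)
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(cand, k - 1)):
            kept.append(cand)
    return kept

# Toy example: {A, B} is not frequent, so the candidate {A, B, C} is pruned.
frequent_2 = {frozenset({"A", "C"}), frozenset({"B", "C"})}
candidates_3 = [frozenset({"A", "B", "C"})]
print(prune(candidates_3, frequent_2))  # []
```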
"2" Apriori Optimization: Fp-tree algorithm
Key idea: the data is displayed and expressed in the form of a tree; it can be understood as water flowing down the branches of a river system;
Steps for generating an FP-tree (a small code sketch follows the list):
Scan the original transactions;
Sort the items in each transaction (by frequency);
Create the root node;
Let the elements of each sorted transaction flow down the tree;
Increment the count of each node passed (+1);
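A minimal sketch of FP-tree construction following the points above; the node class, function names, and transactions are illustrative assumptions, not a reference implementation:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support_count=1):
    # First scan: count global item frequencies.
    counts = Counter(item for t in transactions for item in t)
    root = FPNode(None)
    for t in transactions:
        # Keep frequent items only, sorted by descending global count.
        items = sorted((i for i in t if counts[i] >= min_support_count),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            # Each item "flows" down an existing branch or opens a new one.
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1          # the "node +1" step
    return root

tree = build_fp_tree([{"beer", "diaper"}, {"beer", "apple"}, {"beer", "diaper", "apple"}])
print({item: child.count for item, child in tree.children.items()})  # {'beer': 3}
```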
"3" Apriori Optimization: Vertical Data distribution
Key idea: equivalent to transposing the original data from rows to columns, recording for each item the transactions in which it appears (and thus its count); a small sketch follows.
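A minimal sketch of converting a horizontal transaction database into the vertical layout (tid-lists); the data and names are made up for illustration:

```python
from collections import defaultdict

# Horizontal layout: one row per transaction (toy data).
transactions = {1: {"apple", "beer"}, 2: {"apple", "diaper"}, 3: {"beer", "diaper"}}

# Vertical layout: one row per item, holding the set of transaction IDs.
vertical = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        vertical[item].add(tid)

print(dict(vertical))                              # e.g. {'apple': {1, 2}, 'beer': {1, 3}, ...}
# Support count of an itemset is the size of the tid-list intersection.
print(len(vertical["apple"] & vertical["beer"]))   # 1
```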
"4" Summary: Research of association Rules algorithm in data mining
The core Apriori algorithm proceeds as follows (a condensed code sketch appears after the steps):
Scan database D once and compute the support of each 1-itemset, obtaining the set of frequent 1-itemsets L[1].
Join step: to generate the candidate set C[k], pairs of frequent (k-1)-itemsets in L[k-1] that differ in only one item (that is, share k-2 items) are joined.
Pruning step: C[k] is a superset of L[k], so it may contain candidates that are not frequent. If any (k-1)-subset of a candidate k-itemset is not a member of L[k-1], then that candidate cannot be frequent and can be removed from C[k].
Scan database D once more to compute the support of each remaining candidate, and remove the candidates that do not meet the minimum support; what remains is L[k].
Iterate, repeating the above steps, until some value of k makes L[k] empty, at which point the algorithm stops. In the pruning step each candidate must be validated against the transaction database to decide whether it is kept, and this validation is a performance bottleneck of the algorithm. The method also requires multiple scans of a potentially very large transaction database. Generating a large number of candidate sets and repeatedly scanning the database are the two major drawbacks of the Apriori algorithm.
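The whole loop can be condensed into a short sketch; this is an illustrative, unoptimized implementation under the assumptions above (set-based transactions, a relative minimum support threshold), not production code:

```python
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    support = lambda itemset: sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets L[1].
    items = {i for t in transactions for i in t}
    frequent = {frozenset({i}) for i in items if support(frozenset({i})) >= min_support}
    all_frequent, k = set(frequent), 2

    while frequent:
        # Join step: merge frequent (k-1)-itemsets that differ in one item.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Scan the database again to count candidate support.
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Toy data: frequent itemsets with relative support >= 0.5.
db = [{"apple", "beer"}, {"apple", "diaper"}, {"beer", "diaper"}, {"apple", "beer", "diaper"}]
print(apriori(db, min_support=0.5))
```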
At present, almost all efficient parallel data mining algorithms for discovering association rules are based on the Apriori algorithm. Agrawal and Shafer proposed three parallel algorithms: the count distribution algorithm, the data distribution algorithm, and the candidate distribution algorithm.
"5" Summary
Worked examples are omitted for now, since there are already plenty of them online;
There are many other interestingness measures, such as lift/interest, all-confidence, cosine, conviction, Jaccard, leverage, collective strength, and so on.
There are many classifications of frequent pattern models, which take time to understand;