Data Mining Algorithms: The Apriori Algorithm (Association Rules)
Apriori is a basic algorithm for mining association rules. It was proposed in 1994 by Dr. Rakesh Agrawal and Dr. Ramakrishnan Srikant. The purpose of association rules is to discover relationships between the items in a data set. This task is also known as market basket analysis, a name that aptly describes one scenario to which the algorithm applies.
There is a famous story about this algorithm, "Diapers and beer". The story goes like this: American women often ask their husbands to buy diapers for their children on the way home from work, and the husbands, after picking up the diapers, also buy their favorite beer, so beer and diapers are frequently bought together. Acting on this finding reportedly raised the sales of both diapers and beer, to the delight of many retailers.
I. Concepts and definitions
1. Definition 1: Items and itemsets
Let I = {i1, i2, ..., im} be a set of m distinct items; each ik (k = 1, 2, ..., m) is called an item.
A collection of items drawn from I is called an itemset. The number of elements in an itemset is called its length, and an itemset of length k is called a k-itemset.
2. Definition 2: Transactions
Each transaction T is a subset of the item set I, that is, T ⊆ I, and usually T ⊂ I.
Each transaction has a unique identifier, the transaction number, written TID.
All the transactions together make up the transaction database D, also called the transaction set D.
The number of transactions in the transaction set D is written |D|.
3. Definition 3: Support of an itemset
For an itemset X with X ⊆ I, let count(X ⊆ T) be the number of transactions T in the transaction set D that contain X. The support of X is then
support(X) = count(X ⊆ T) / |D|
that is, the probability that the itemset X appears in a transaction, which describes the importance of X.
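As a minimal sketch of this definition, the snippet below treats support as the fraction of transactions that contain an itemset. The transaction set `D` and the function name `support` are hypothetical and chosen only for illustration.

```python
# Hypothetical transaction set D: each transaction is a set of items.
D = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers"},
    {"milk", "bread", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    # A transaction "contains" the itemset when the itemset is a subset of it.
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

print(support({"milk"}, D))           # 3 of 4 transactions -> 0.75
print(support({"milk", "bread"}, D))  # 2 of 4 transactions -> 0.5
```

Adding an item to an itemset can never raise its support, which is the property the algorithm later relies on.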
4. Definition 4: Minimum support of itemsets and frequent sets
Discovering association rules requires a minimum support threshold that itemsets must meet, called the minimum support of itemsets and written supmin.
An itemset whose support is greater than or equal to supmin is called a frequent itemset, or frequent set for short; otherwise it is called a non-frequent set.
A k-itemset that satisfies supmin is usually called a frequent k-itemset, and the set of these is written Lk.
5. Definition 5: Association rules
An association rule can be written as an implication:
R: X ⇒ Y
where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
For example: R: Milk ⇒ Bread
6. Definition 6: Support of an association rule
The support of rule R is the ratio of the number of transactions in the transaction set that contain both X and Y to the total number of transactions.
For example: of 5 records, 2 contain both orange juice and cola. The support of this rule is 2/5 = 0.4, that is, support(A ⇒ B) = P(AB).
7. Definition 7: Confidence of an association rule
The confidence of rule R is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
For example: compute the confidence of "if orange juice, then cola". Only 2 of the 4 transactions that contain orange juice also contain cola, so the confidence is 2/4 = 0.5.
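The orange juice and cola example can be checked with a short sketch. The five transactions below are hypothetical; they are only constructed so that 4 contain orange juice and 2 contain both orange juice and cola, as in the example. The helper names are likewise illustrative.

```python
# Hypothetical data: 5 records, 4 with orange juice, 2 with orange juice and cola.
D = [
    {"orange juice", "cola"},
    {"orange juice", "cola", "milk"},
    {"orange juice", "milk"},
    {"orange juice"},
    {"milk", "bread"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_support(x, y, transactions):
    """support(X => Y): share of all transactions containing both X and Y."""
    return count(x | y, transactions) / len(transactions)

def rule_confidence(x, y, transactions):
    """confidence(X => Y): share of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction.
    """
    return count(x | y, transactions) / count(x, transactions)

x, y = {"orange juice"}, {"cola"}
print(rule_support(x, y, D))     # 2 / 5 = 0.4
print(rule_confidence(x, y, D))  # 2 / 4 = 0.5
```

Unlike support, confidence depends on the direction of the rule: here every transaction with cola also contains orange juice, so the reverse rule has a different confidence.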
8. Definition 8: Minimum support and minimum confidence of association rules
The minimum support of association rules is the same minimum support used to identify frequent sets. It is written supmin and expresses the minimum importance that a rule must have.
The minimum confidence of association rules is written confmin and expresses the minimum reliability that a rule must have.
9. Definition 9: Strong association rules
If support(R) ≥ supmin and confidence(R) ≥ confmin, the association rule X ⇒ Y is a strong association rule; otherwise X ⇒ Y is a weak association rule.
When mining association rules, the rules produced are screened against supmin and confmin, and the resulting strong association rules can be used to guide a merchant's decisions.
10. Candidate set (candidate itemset): the itemsets obtained by joining the frequent itemsets of the previous level. Written C[k].
11. Frequent set (frequent itemset): an itemset whose support is greater than or equal to the specified minimum support (minsup). Written L[k]. Note that every subset of a frequent set must itself be a frequent set.
12. Lift: lift(X ⇒ Y) = lift(Y ⇒ X) = conf(X ⇒ Y) / supp(Y) = conf(Y ⇒ X) / supp(X) = P(X and Y) / (P(X) P(Y))
In association rule analysis, lift compares selling to the people selected by a rule with selling blindly to everyone (the whole data set). The higher the ratio, the better, and a rule with a high lift is also referred to as a strong rule.
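A small sketch of the lift formula, again on hypothetical orange juice and cola data. It is written with raw counts, which is algebraically the same as P(X and Y) / (P(X) P(Y)); a lift above 1 means X and Y occur together more often than they would if they were independent.

```python
# Hypothetical data, for illustration only.
D = [
    {"orange juice", "cola"},
    {"orange juice", "cola", "milk"},
    {"orange juice", "milk"},
    {"orange juice"},
    {"milk", "bread"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def lift(x, y, transactions):
    """lift(X => Y) = P(X and Y) / (P(X) * P(Y)).

    With n = |D| this equals n * count(X u Y) / (count(X) * count(Y)).
    Assumes X and Y each occur in at least one transaction.
    """
    n = len(transactions)
    both = count(x | y, transactions)
    return n * both / (count(x, transactions) * count(y, transactions))

x, y = {"orange juice"}, {"cola"}
print(lift(x, y, D))  # 5 * 2 / (4 * 2) = 1.25
print(lift(y, x, D))  # same value: lift is symmetric
```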
13. Pruning step: a candidate set can be a frequent set only if all of its subsets are frequent sets; screening the candidates by this condition is the pruning step.
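The candidate set and the pruning step can be sketched together. The function below joins frequent (k-1)-itemsets into k-itemsets and then drops every candidate that has a non-frequent (k-1)-subset. The name `generate_candidates` and the toy L[2] are assumptions made for this illustration, not part of the original description.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Build C[k] from L[k-1]: join step followed by the pruning step."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            # Join: keep unions that have exactly k items.
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# Hypothetical L[2] over the items A, B, C, D.
L2 = [frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")]
C3 = generate_candidates(L2, 3)
print(C3)  # only {A, B, C} survives
```

Here {A, B, D} is pruned because {A, D} is not frequent, and {B, C, D} is pruned because {C, D} is not frequent, so neither needs to be counted against the database.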
II. Animated demo of the Apriori algorithm (PPT download)
III. Description of the Apriori algorithm
The Apriori algorithm works as follows: first generate the frequent 1-itemsets L1; then obtain the frequent 2-itemsets L2 from L1 through self-join and pruning; then use L2 to generate L3, and so on, level by level, until no new frequent itemsets can be generated. Finally, the association rules are generated from the frequent itemsets according to the given minimum confidence.
1. Generating frequent itemsets. The process consists of the following steps:
1) In the first pass, every individual item is a member of the candidate set C1. Any item whose support is below the given minimum support is removed from C1, which yields the frequent 1-itemsets L1.
2) L1 is joined with itself to form the candidate set C2 of 2-itemsets. The support of each candidate is determined by scanning the database again. Candidates whose support is not below the given minimum support are retained, forming the frequent 2-itemsets L2.
3) The next step forms the candidate set C3 of 3-itemsets, and the steps above are repeated until all frequent itemsets have been found.
The pseudo-code description of the Apriori algorithm is as follows:
Input: data set D, min_sup
Output: the frequent itemsets L in D
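A minimal Python sketch with the same input and output is given here. It is one possible reading of the level-by-level procedure described in the steps above, not the author's pseudo-code; the function name `apriori`, the sample transactions and the threshold 0.6 are all assumptions made for illustration.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support} for every frequent itemset in `transactions`."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # One scan of the data set: count each candidate, keep those >= min_sup.
        result = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= min_sup:
                result[c] = sup
        return result

    # C1 is every individual item; filtering it gives L1.
    c1 = {frozenset([item]) for t in transactions for item in t}
    level = keep_frequent(c1)
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Self-join of L[k-1]: unions that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: drop candidates that have a non-frequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = keep_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical transactions, for illustration only.
D = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
freq = apriori(D, min_sup=0.6)
for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

On this data the loop stops at k = 3: the only candidate that survives pruning is {bread, milk, diapers}, and it appears in just 2 of the 5 transactions, so no frequent 3-itemset exists.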
2. Generating association rules
After all frequent itemsets have been mined from the transaction database D, the corresponding association rules are easy to obtain: the rules produced from the frequent itemsets whose confidence is at least min_conf are the strong association rules. Because the rules are generated from frequent itemsets, each rule automatically satisfies min_sup.
The pseudo-code for generating the association rules from the frequent itemsets is as follows:
Input: yk, lk, min_conf
Output: association rules of the form X ⇒ Y
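Rule generation can be sketched as follows, assuming the frequent itemsets and their supports are already available as a dictionary. The function name `generate_rules`, the support table and the threshold 0.8 are hypothetical; the sketch simply tries every non-empty proper subset of each frequent itemset as the left-hand side X.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Yield strong rules (X, Y, support, confidence) from {itemset: support}.

    Assumes `frequent` holds every frequent itemset, so each subset of a
    frequent itemset is also present as a key.
    """
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for r in range(1, len(itemset)):
            for left in combinations(sorted(itemset), r):
                x = frozenset(left)
                y = itemset - x
                # confidence(X => Y) = support(X u Y) / support(X)
                conf = sup / frequent[x]
                if conf >= min_conf:
                    rules.append((x, y, sup, conf))
    return rules

# Hypothetical supports of frequent itemsets, for illustration only.
frequent = {
    frozenset({"bread"}): 0.8,
    frozenset({"milk"}): 0.8,
    frozenset({"diapers"}): 0.8,
    frozenset({"beer"}): 0.6,
    frozenset({"bread", "milk"}): 0.6,
    frozenset({"bread", "diapers"}): 0.6,
    frozenset({"milk", "diapers"}): 0.6,
    frozenset({"diapers", "beer"}): 0.6,
}
rules = generate_rules(frequent, min_conf=0.8)
for x, y, sup, conf in rules:
    print(sorted(x), "=>", sorted(y), "support:", sup, "confidence:", conf)
```

With these numbers only beer ⇒ diapers passes the confidence threshold; the reverse rule diapers ⇒ beer shares the same support but has a lower confidence because diapers appear in more transactions than beer.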