When data mining comes up, the first thing many people think of is the well-known "beer and diapers" story, a classic example of an association rule. The main difference between shopping basket analysis and traditional linear regression is that association analysis works on discrete data.
A typical association rule:
Association rule: milk ⇒ eggs [support = 2%, confidence = 60%]
Support: 2% of all analyzed transactions contain both milk and eggs; a minimum threshold must be set to limit the rules that are generated.
Confidence: of the customers who purchased milk, 60% also purchased eggs; a minimum threshold must be set to limit the rules that are generated.
Minimum support threshold and minimum confidence threshold: set by the miner or a domain expert.
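The two measures above can be sketched in a few lines of Python; the transaction list below is hypothetical toy data, not the example used later in the article.

```python
# A minimal sketch with hypothetical toy data: computing the support of
# {milk, eggs} and the confidence of the rule {milk} => {eggs}.
transactions = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"milk", "bread"},
    {"eggs", "butter"},
    {"milk", "eggs", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    both = sum(1 for t in transactions if (lhs | rhs) <= t)
    only_lhs = sum(1 for t in transactions if lhs <= t)
    return both / only_lhs

print(support({"milk", "eggs"}, transactions))       # 3/5 = 0.6
print(confidence({"milk"}, {"eggs"}, transactions))  # 3/4 = 0.75
```

Note that confidence is computed from counts (co-occurrences over left-hand-side occurrences), which is equivalent to the ratio of the two supports.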
Technical terms related to association analysis include:
Itemset: a collection of items
k-itemset: an itemset consisting of k items
Frequent itemset: an itemset that meets the minimum support; the set of frequent k-itemsets is usually denoted Lk
Strong association rule: a rule that meets both the minimum support threshold and the minimum confidence threshold
Next, we illustrate how association analysis works with a two-step example.
Suppose there are 9 transactions (T100–T900). Step one finds all the frequent itemsets; step two generates strong association rules from those frequent itemsets.
Algorithm steps:
STEP 1: Scan D and count each item, generating the candidate 1-itemsets C1 together with each item's support count (its frequency of occurrence).
STEP 2: Apply a minimum support count of 2 (i.e., discard items occurring fewer than 2 times); the remaining itemsets form L1.
STEP 3: Join the itemsets of L1 pairwise to generate the candidate 2-itemsets C2.
STEP 4: Scan D and count each itemset in C2; again apply the minimum support count of 2 (discarding itemsets occurring fewer than 2 times); the remaining 2-itemsets form L2.
STEP 5: Join the itemsets of L2 pairwise to generate the candidate 3-itemsets C3; repeat this loop until no larger frequent n-itemset can be generated.
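The loop described in these steps can be sketched in Python. The 9 transactions below are a hypothetical reconstruction of the classic textbook example, chosen to be consistent with the counts quoted later in the article ({i1,i2,i5} appears 2 times, {i1,i2} 4 times); this is an illustrative sketch of the Apriori idea, not the arules implementation.

```python
from itertools import combinations

# Hypothetical reconstruction of the 9 transactions T100..T900.
D = [
    {"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"},
    {"i1", "i2", "i4"}, {"i1", "i3"}, {"i2", "i3"},
    {"i1", "i3"}, {"i1", "i2", "i3", "i5"}, {"i1", "i2", "i3"},
]
MIN_SUP = 2  # minimum support count, as in the steps above

def count(candidates):
    """Scan D and count each candidate itemset's occurrences."""
    return {c: sum(1 for t in D if set(c) <= t) for c in candidates}

# STEP 1-2: candidate 1-itemsets C1 -> frequent 1-itemsets L1
C1 = [(i,) for i in sorted(set().union(*D))]
L = [frozenset(c) for c, n in count(C1).items() if n >= MIN_SUP]

frequent = []
while L:
    frequent.extend(L)
    k = len(next(iter(L))) + 1
    items = sorted(set().union(*L))
    # STEP 3/5: join Lk pairwise; prune candidates with an infrequent subset
    Ck = [c for c in combinations(items, k)
          if all(frozenset(s) in L for s in combinations(c, k - 1))]
    # STEP 4: scan D and keep candidates meeting the support threshold
    L = [frozenset(c) for c, n in count(Ck).items() if n >= MIN_SUP]

print(len(frequent))  # total number of frequent itemsets found
```

On this data the loop terminates after L3, since no candidate 4-itemset has all of its 3-subsets frequent.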
In the example above, the computation proceeds as follows:
For example, from the frequent itemset {i1, i2, i5} we can derive the rule i1 ∧ i2 ⇒ i5: {i1, i2, i5} appears 2 times and {i1, i2} appears 4 times, so the confidence is 2/4 = 50%.
The remaining rules can be calculated similarly.
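The rule-generation step can be sketched by enumerating every non-empty proper subset of a frequent itemset as a left-hand side. The transactions below are the same hypothetical 9-transaction reconstruction used earlier; the i1 ∧ i2 ⇒ i5 rule comes out at 50% as stated above.

```python
from itertools import combinations

# The same hypothetical 9-transaction data set (T100..T900).
D = [
    {"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"},
    {"i1", "i2", "i4"}, {"i1", "i3"}, {"i2", "i3"},
    {"i1", "i3"}, {"i1", "i2", "i3", "i5"}, {"i1", "i2", "i3"},
]

def sup_count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in D if itemset <= t)

itemset = frozenset({"i1", "i2", "i5"})
for r in range(1, len(itemset)):
    for lhs in map(frozenset, combinations(sorted(itemset), r)):
        rhs = itemset - lhs
        conf = sup_count(itemset) / sup_count(lhs)
        print(f"{sorted(lhs)} => {sorted(rhs)}: confidence = {conf:.0%}")
```

A rule is kept as a strong association rule only when its confidence also meets the minimum confidence threshold.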
We now perform the shopping basket analysis in R. The R package for association analysis is arules, and we use its built-in Groceries data set, which can be browsed with:
inspect(Groceries)
The specific R language implementation is as follows:
library(arules)
data(Groceries)
frequentsets <- eclat(Groceries, parameter = list(support = 0.05, maxlen = 10))
inspect(sort(frequentsets, by = "support")[1:10])  # sort the frequent itemsets found by support
The output lists the frequent itemsets ranked by support.
Next, we select the association rules we need by setting thresholds:
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
inspect(rules)
This completes the shopping basket analysis. In the output, lift is the measure of correlation: lift = 1 means the left-hand side (L) and right-hand side (R) of a rule are independent, and the larger the lift, the stronger the evidence that L and R appearing in the same basket is not a coincidence, which in turn supports our merchandising decisions.
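The lift measure can be sketched directly from its definition, lift(L ⇒ R) = confidence(L ⇒ R) / support(R); the baskets below are hypothetical toy data.

```python
# A sketch of the lift calculation on hypothetical toy data:
# lift(L => R) = confidence(L => R) / support(R).
baskets = [
    {"milk", "eggs"},
    {"milk", "eggs"},
    {"milk"},
    {"eggs"},
    {"bread"},
]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def lift(lhs, rhs):
    confidence = support(lhs | rhs) / support(lhs)
    return confidence / support(rhs)

# milk and eggs co-occur slightly more often than independence predicts:
print(lift({"milk"}, {"eggs"}))  # (2/3) / (3/5) = 10/9 ≈ 1.11
```

A lift above 1 means buying L raises the chance of buying R relative to R's baseline frequency; a lift below 1 would mean the two items tend to exclude each other.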
R Language and data analysis: Shopping basket analysis