When data mining is mentioned, the first thing that comes to mind is the story of beer and diapers, which is a typical example of association rules in data mining. The main difference between shopping basket analysis and traditional linear regression is that basket analysis is a correlation analysis of discrete data.
Common Association Rules:
Association rule: milk => eggs [support = 2%, confidence = 60%]
Support: 2% of all transactions in the analysis contain both milk and eggs. A threshold must be set on support to limit the rules generated.
Confidence: 60% of the customers who purchased milk also purchased eggs. A threshold must be set on confidence to limit the rules generated.
Minimum support threshold and minimum confidence threshold: set by the miner or domain expert.
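The two measures above can be sketched in a few lines of base R. This is a minimal illustration, not part of the original analysis: the five baskets, the helper names support and confidence, and the two threshold values are all hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical toy data: each element is one transaction (shopping basket).
transactions <- list(
  c("milk", "eggs", "bread"),
  c("milk", "eggs"),
  c("milk", "bread"),
  c("eggs", "bread"),
  c("milk", "eggs", "butter")
)

# Support of an itemset: fraction of transactions containing every item in it.
support <- function(itemset, transactions) {
  mean(sapply(transactions, function(basket) all(itemset %in% basket)))
}

# Confidence of lhs => rhs: support of lhs and rhs together, divided by support of lhs.
confidence <- function(lhs, rhs, transactions) {
  support(union(lhs, rhs), transactions) / support(lhs, transactions)
}

# Hypothetical thresholds; in practice the miner or a domain expert sets them.
min_support <- 0.5
min_confidence <- 0.7

s <- support(c("milk", "eggs"), transactions)
conf <- confidence("milk", "eggs", transactions)

# A rule is "strong" only if it passes both thresholds.
is_strong <- s >= min_support && conf >= min_confidence
cat("support =", s, " confidence =", conf, " strong rule:", is_strong, "\n")
```

In this toy data, milk and eggs appear together in 3 of 5 baskets and milk appears in 4 of 5, so the rule milk => eggs passes both assumed thresholds.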
The terminology associated with association analysis includes:
Itemset: a collection of items
k-itemset: an itemset composed of k items
Frequent itemsets: itemsets that meet the minimum support threshold. The set of frequent k-itemsets is generally denoted Lk
Strong Association rules: rules that meet minimum support threshold and minimum confidence threshold values
Next, we take the two-step method as an example to show how association analysis is carried out in practice.
Suppose there are 9 shopping baskets (T100-T900). The two-step method first finds all the frequent itemsets, and then generates strong association rules from the frequent itemsets.
Algorithm steps:
STEP1: Scan D to generate the candidate 1-itemsets C1, and compute the support count of each item (that is, how often the item occurs).
STEP2: Define a minimum support count of 2 (that is, discard the items whose frequency is below 2). The remaining itemsets are the frequent 1-itemsets L1.
STEP3: Generate the candidate 2-itemsets C2 by combining the itemsets of L1 pairwise.
STEP4: Scan D to count each itemset in C2. With a minimum support count of 2 (that is, discarding the itemsets whose frequency is below 2), the remaining itemsets are the frequent 2-itemsets L2.
STEP5: Generate the candidate 3-itemsets C3 by combining the itemsets of L2 pairwise.
......
The loop continues until the largest frequent n-itemsets have been found.
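The steps above can be sketched in base R as follows. Note the assumptions: the text of this article does not list the contents of the nine baskets, so the data below is a hypothetical set chosen to be consistent with the counts quoted in the worked example ({I1,I2,I5} appears 2 times, {I1,I2} appears 4 times). The helper names count_support and generate_candidates are illustrative, and the sketch follows the steps exactly as written, so it counts every joined candidate rather than pruning candidates before the scan.

```r
# Hypothetical baskets T100-T900 (an assumption; see the note above).
D <- list(
  T100 = c("I1", "I2", "I5"), T200 = c("I2", "I4"),
  T300 = c("I2", "I3"),       T400 = c("I1", "I2", "I4"),
  T500 = c("I1", "I3"),       T600 = c("I2", "I3"),
  T700 = c("I1", "I3"),       T800 = c("I1", "I2", "I3", "I5"),
  T900 = c("I1", "I2", "I3")
)
min_count <- 2  # minimum support count from STEP2

# Number of baskets that contain every item of the itemset.
count_support <- function(itemset, D) {
  sum(sapply(D, function(basket) all(itemset %in% basket)))
}

# Join step: union pairs of frequent (k-1)-itemsets, keep the unions of size k.
generate_candidates <- function(L_prev, k) {
  if (length(L_prev) < 2) return(list())
  pairs <- combn(length(L_prev), 2, simplify = FALSE)
  cands <- lapply(pairs, function(p) sort(union(L_prev[[p[1]]], L_prev[[p[2]]])))
  cands <- cands[sapply(cands, length) == k]
  unique(cands)
}

candidates <- as.list(sort(unique(unlist(D))))  # C1: every single item
frequent <- list()                              # frequent[[k]] will hold Lk
k <- 1
while (length(candidates) > 0) {
  counts <- sapply(candidates, count_support, D = D)  # scan D
  L_k <- candidates[counts >= min_count]              # keep frequent itemsets
  if (length(L_k) == 0) break
  frequent[[k]] <- L_k
  k <- k + 1
  candidates <- generate_candidates(L_k, k)           # build the next Ck
}

# Print each level of frequent itemsets.
for (i in seq_along(frequent)) {
  sets <- sapply(frequent[[i]], function(s) paste0("{", paste(s, collapse = ","), "}"))
  cat("L", i, ": ", paste(sets, collapse = " "), "\n", sep = "")
}
```

With this assumed data the loop stops after the 3-itemsets, because the only candidate 4-itemset occurs in a single basket.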
For the example above, the steps are illustrated in the following diagram:
[Figure: diagram of the steps for the nine-basket example]
For example, take the frequent itemset {I1,I2,I5}. From it we can derive the rule I1^I2=>I5: since {I1,I2,I5} appears 2 times and {I1,I2} appears 4 times, the confidence is 2/4=50%.
The confidence of the other rules can be computed in the same way:
[Figure: confidence of the remaining rules]
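The confidence calculation for {I1,I2,I5} can be repeated for every way of splitting the itemset into a left-hand and a right-hand side. The base R sketch below does this under one assumption: the contents of the nine baskets are not listed in the text, so the data here is a hypothetical set chosen to agree with the counts quoted above. The helper name count_support is also illustrative.

```r
# Hypothetical baskets T100-T900 (an assumption; consistent with the counts in the text).
D <- list(
  T100 = c("I1", "I2", "I5"), T200 = c("I2", "I4"),
  T300 = c("I2", "I3"),       T400 = c("I1", "I2", "I4"),
  T500 = c("I1", "I3"),       T600 = c("I2", "I3"),
  T700 = c("I1", "I3"),       T800 = c("I1", "I2", "I3", "I5"),
  T900 = c("I1", "I2", "I3")
)

# Number of baskets that contain every item of the itemset.
count_support <- function(itemset, D) {
  sum(sapply(D, function(basket) all(itemset %in% basket)))
}

itemset <- c("I1", "I2", "I5")
itemset_count <- count_support(itemset, D)

# Use every non-empty proper subset of the itemset as a left-hand side.
rules <- list()
for (m in 1:(length(itemset) - 1)) {
  for (lhs in combn(itemset, m, simplify = FALSE)) {
    rhs <- setdiff(itemset, lhs)
    rules[[length(rules) + 1]] <- data.frame(
      lhs = paste(lhs, collapse = "^"),
      rhs = paste(rhs, collapse = "^"),
      # confidence = count(whole itemset) / count(left-hand side)
      confidence = itemset_count / count_support(lhs, D)
    )
  }
}
rules <- do.call(rbind, rules)
print(rules)

# Keep only the rules that reach a hypothetical minimum confidence of 70%.
strong_rules <- rules[rules$confidence >= 0.7, ]
```

The row for I1^I2 => I5 reproduces the 2/4 = 50% computed in the text; the 70% cut-off is only an illustrative threshold.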
Now we use R for shopping basket analysis. The R package for association analysis is arules, and we take its built-in Groceries data set as an example (shown below).
inspect(Groceries)
The detailed R implementation is as follows:
library(arules)
data(Groceries)
frequentsets = eclat(Groceries, parameter = list(support = 0.05, maxlen = 10))
inspect(sort(frequentsets, by = "support")[1:10])  # sort the frequent itemsets found by support
The result lists the ten frequent itemsets with the highest support.
Next, we select the association rules we need by setting thresholds:
rules = apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
inspect(rules)
This completes the shopping basket analysis. In the output, lift is the correlation index: lift=1 means that L and R (the left-hand and right-hand sides of a rule) are independent. The larger the lift is above 1, the less likely it is that L and R appearing in the same shopping basket is a coincidence, and the more strongly the rule supports our shopping basket decisions.
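To make the lift index concrete, the sketch below computes it by hand in base R using the standard definition lift(L => R) = support(L and R together) / (support(L) * support(R)). The nine baskets are the same hypothetical data assumed for the worked example earlier (the text does not list them), and the helper names support and lift are illustrative, not functions from arules.

```r
# Hypothetical baskets T100-T900 (an assumption; not listed in the text).
D <- list(
  T100 = c("I1", "I2", "I5"), T200 = c("I2", "I4"),
  T300 = c("I2", "I3"),       T400 = c("I1", "I2", "I4"),
  T500 = c("I1", "I3"),       T600 = c("I2", "I3"),
  T700 = c("I1", "I3"),       T800 = c("I1", "I2", "I3", "I5"),
  T900 = c("I1", "I2", "I3")
)

# Fraction of baskets that contain every item of the itemset.
support <- function(itemset, D) {
  mean(sapply(D, function(basket) all(itemset %in% basket)))
}

# Lift compares the observed co-occurrence with what independence would predict.
lift <- function(lhs, rhs, D) {
  support(union(lhs, rhs), D) / (support(lhs, D) * support(rhs, D))
}

lift_a <- lift(c("I1", "I2"), "I5", D)  # (2/9) / ((4/9) * (2/9))
lift_b <- lift("I1", "I2", D)           # (4/9) / ((6/9) * (7/9))
cat("lift(I1^I2 => I5) =", lift_a, "\n")
cat("lift(I1 => I2)    =", lift_b, "\n")
```

In this assumed data the first rule has a lift above 1, while the second has a lift below 1 even though I1 and I2 are both common items, which is why lift is worth checking alongside support and confidence.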
Copyright notice: This article is the blogger's original work and may not be reproduced without consent.
R language and data analysis (10): Shopping basket analysis