In the previous chapter we studied clustering, an unsupervised-learning technique that separates samples with different properties into groups. Over the past two days I have been studying the Apriori algorithm for association analysis. It feels like the hardest chapter to understand so far, and the book has a really frustrating spot in it: the author's code contains a serious oversight (more on that below).
Apriori algorithm and association analysis: finding hidden relationships between items in a large-scale data set is called association analysis or association rule learning.
Association analysis, application 1: so far we have used features only for classification or regression, without mining the relationships among the features themselves. Association analysis can reveal which features frequently occur together, or which attribute values are related. For example, if a person's height and weight are each discretized into three bins labeled 0, 1, 2 (roughly low, medium, high), then the pairs (0, 0), (1, 1) and (2, 2) may appear frequently: someone whose height falls in bin 0 is likely to have weight in bin 0, and likewise for bin 2. We will see a small application of this idea at the end.
Association analysis, application 2: this is the most familiar use. In a store, two or more items are often bought together by the same customer: someone who buys tomatoes tends to also buy eggs (to make tomato scrambled eggs), someone who buys bread tends to buy milk, and there are many such relationships (the most famous being the beer-and-diapers story). A store can place or promote frequently co-purchased goods together, which boosts sales. The store's task is to discover these frequently occurring product combinations, and the Apriori algorithm we learn today can find such frequent combinations and, through association analysis, produce association rules such as "buys tomatoes → buys eggs". Note that the reverse does not necessarily hold: buying eggs does not imply buying tomatoes, since eggs can be cooked in many other ways (but I digress...).
The algorithm has many details that need careful scrutiny and quite a few places that are hard to understand; the code is short but dense. Apriori also has one major drawback, which we will summarize at the end. Now let's look at the details.
Suppose we have a grocery store that sells 4 items (items 0, 1, 2 and 3). The figure in the book (not reproduced here) lists all possible combinations of these items:
In the context of association analysis we care only about which items appear together, not about how many units of each item are bought. We set a rule: if the item set {A, B} appears in more than 50% of the transactions in the data set, we call {A, B} a frequent item set. We would like to decide, for every possible item set, whether it is frequent. Doing this by brute force is extremely expensive: the number of possible item sets grows exponentially with the number of products, so computing all frequent sets in a short time is unrealistic. Yet only rules drawn from frequent item sets are meaningful; infrequent combinations can be treated as coincidences and are not worth exploring.
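To make the definition concrete, here is a minimal sketch of the support computation on made-up transactions (the toy data and the helper name count_support are mine, purely for illustration):

# Support of an item set = fraction of transactions that contain every item in it
transactions = [{0, 1, 3}, {0, 1}, {1, 2}, {0, 1, 2}]

def count_support(itemset, transactions):
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / float(len(transactions))

print(count_support({0, 1}, transactions))   # 3 of 4 transactions -> 0.75, so {0, 1} counts as frequent at the 0.5 threshold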
Researchers found the so-called Apriori principle, which can greatly reduce the amount of computation. The Apriori principle states that if an item set is frequent, then all of its subsets are frequent as well. What we actually use is its contrapositive: if an item set is infrequent, then all of its supersets are infrequent too.
If item 1 is not frequent, then every combination containing item 1 cannot be frequent either, so all combinations in the figure that contain 1 can be skipped. Much like pruning a tree, this removes a large amount of unnecessary computation; the pruning shows up in the two functions below.
Step1:
Pruning 1: drop every combination whose support is below the minimum support level. The default minimum support is 0.5.
Pruning 2: the candidate-building function (growing each combination of m elements into combinations of m+1 elements).
Pruning 2 hides several important details:
1: The combination function builds larger sets in a particular way. To generate a combination of k elements, it compares the first k-2 elements of two frequent (k-1)-element sets; only when these prefixes are equal are the two sets merged into a new k-element candidate. If either of the two (k-1)-element sets is infrequent, it never appears in the list in the first place, so no candidate is built from it: a superset of an infrequent set cannot be frequent.
2: The prefix comparison also avoids duplicates. Without it, (0, 1), (0, 2) and (1, 2) would pair up and produce (0, 1, 2) three times; with it, only (0, 1) and (0, 2) merge, so the candidate is generated exactly once, and nothing valid is missed: the only combinations skipped are those built from an infrequent set, as in point 1.
3: Even if a candidate that is actually infrequent does get synthesized, say (0, 2, 3), it is still eliminated in the next step, when its support is computed and falls below the threshold.
# Compute the support of each candidate set and keep the frequent ones
def getSupport(D, Ck, minSupport):
    wordNum = {}
    for line in D:
        for c in Ck:
            if c.issubset(line):
                wordNum[c] = wordNum.get(c, 0) + 1
    dNum = float(len(D))
    res = []
    support = {}
    for key in wordNum:
        tempSupport = wordNum[key] / dNum
        if tempSupport >= minSupport:
            res.insert(0, key)          # keep the frequent sets
        support[key] = tempSupport      # record the support of every candidate
    return res, support

# Build the candidate k-item sets from the frequent (k-1)-item sets
def createCk(Lk, k):
    ck = []
    l = len(Lk)
    for i in range(l):
        for j in range(i + 1, l):
            L1 = list(Lk[i])[:k - 2]
            L2 = list(Lk[j])[:k - 2]
            L1.sort()
            L2.sort()
            if L1 == L2:                # merge only sets whose first k-2 elements agree
                ck.append(Lk[i] | Lk[j])
    return ck
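To see the prefix trick from the notes above in action, here is a small usage sketch (the frequent 2-item sets are made up for illustration; the expected output assumes the usual ascending iteration order of small integers in a frozenset):

# Usage sketch for createCk
L2 = [frozenset([0, 1]), frozenset([0, 2]), frozenset([1, 2])]
print(createCk(L2, 3))   # expected: [frozenset({0, 1, 2})] -- the 3-item candidate is generated exactly once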
Step2:
Auxiliary functions (not tied to the grocery-store example above):
# Load a small sample data set
def loadDataSet():
    data = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
    return data

# Build the candidate 1-item sets
def createC1(dataSet):
    c1 = []
    for line in dataSet:
        for i in line:
            if not [i] in c1:
                c1.append([i])
    c1.sort()
    return list(map(frozenset, c1))     # frozenset, so the item sets can be used as dictionary keys
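A quick sanity check of these helpers on the sample data:

# Usage sketch for loadDataSet / createC1
data = loadDataSet()
print(createC1(data))
# expected: [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}), frozenset({5})]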
Step3:
The Apriori algorithm filters frequent item sets level by level; each level adds one element to the combinations, so the first level holds 1-element sets, the second level 2-element sets, and so on recursively. The default minimum support is 0.5.
I will not explain every line; the details are not hard to follow.
# Apriori algorithm: find the frequent item sets level by level
def getApriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, support = getSupport(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = createCk(L[k - 2], k)                  # grow the candidates by one element
        Lk, sup = getSupport(D, Ck, minSupport)     # keep only those above the support threshold
        L.append(Lk)
        support.update(sup)
        k += 1
    return L, support
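Putting it together on the sample data from Step2, the expected result follows directly from the support counts (item 4 appears in only 1 of the 4 transactions, so it is dropped immediately):

# Usage sketch for getApriori on the sample data
data = loadDataSet()
L, support = getApriori(data, minSupport=0.5)
# Expected frequent item sets (order within a level may vary):
#   L[0]: {1}, {2}, {3}, {5}
#   L[1]: {1, 3}, {2, 3}, {2, 5}, {3, 5}
#   L[2]: {2, 3, 5}
#   L[3]: []   (the loop stops once no larger frequent set can be built)
print(support[frozenset([2, 5])])   # 0.75 -- {2, 5} appears in 3 of the 4 transactions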
Running it on the sample data from Step2 gives the frequent item sets.
After raising the minimum support to 0.7, there are noticeably fewer frequent sets.
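For comparison, the stricter run looks like this (the expected sets again follow from the support counts: only items 2, 3 and 5 reach 0.75, and only the pair {2, 5} survives):

# Usage sketch with a higher support threshold
L, support = getApriori(loadDataSet(), minSupport=0.7)
# Expected: L[0] = {2}, {3}, {5};  L[1] = {2, 5};  nothing larger.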
Step4:
Above we completed the calculation of the frequent item sets; now we move on to the association analysis itself, i.e. generating rules:
The confidence (credibility) of a rule P → H is defined as support(P ∪ H) / support(P), where ∪ denotes the union of P and H. Confidence is therefore computed directly from the item-set supports we already have.
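In code this is just a lookup into the support dictionary returned by getApriori (a minimal sketch; the helper name confidence is mine):

# Confidence of a rule P -> H, computed from the item-set supports
def confidence(P, H, support):
    return support[P | H] / support[P]

# e.g. on the sample data: confidence({5} -> {2}) = 0.75 / 0.75 = 1.0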
The book's figure (not reproduced here) lists all association rules that can be generated from the item set {0,1,2,3}, with the low-confidence rules shaded. It turns out that if {0,1,2} → {3} is a low-confidence rule, then every other rule from this item set with 3 in the consequent (with 3 on the right of the arrow) is low-confidence as well. This point deserves a proper explanation; many write-ups skip it, which makes the rule code look mysterious.
The reason is that the confidence of {0,1,2} → {3} is support(0,1,2,3) / support(0,1,2). Since this fails the threshold, consider any other rule with 3 in the consequent: its numerator is still support(0,1,2,3), but its antecedent has fewer items, so the denominator, e.g. support(0,1), can only be larger or equal, and the confidence can only be smaller or equal; it fails the threshold too. This mirrors how Apriori computes the frequent item sets: a consequent that fails the confidence test is not merged into larger consequents in later rounds, which again uses the combination function createCk(), so two low-confidence consequents are never combined. And even if a superset of a low-confidence consequent does get generated from two other consequents (say the consequent {1,3} is low-confidence but {0,1,3} is still built), it is filtered out anyway when its confidence is computed, for the same reason given at the start.
The functions are:
# Association analysis: generate rules from the frequent item sets
def generateRules(L, support, minRules=0.7):
    rules = []
    for i in range(1, len(L)):      # item sets with at least 2 elements
        for freqSet in L[i]:
            H = [frozenset([item]) for item in freqSet]
            getGenerate(freqSet, H, support, rules, minRules)       # rules with a single item on the right
            if i > 1:
                moreGenerate(freqSet, H, support, rules, minRules)  # rules with larger right-hand sides
    return rules

# Keep the right-hand sides that meet the minimum confidence
def getGenerate(freqSet, H, support, rules, minRules=0.7):
    newH = []
    for h in H:
        believe = support[freqSet] / support[freqSet - h]   # confidence = support(P ∪ H) / support(P)
        if believe > minRules:
            newH.append(h)
            print(freqSet - h, "-->", h, "believe =", believe)
            rules.append((freqSet - h, h, believe))
    return newH

# Enlarge the right-hand side of the rules one element at a time
def moreGenerate(freqSet, H, support, rules, minRules=0.7):
    m = len(H[0])
    if len(freqSet) > (m + 1):
        moreH = createCk(H, m + 1)
        moreH = getGenerate(freqSet, moreH, support, rules, minRules)
        if len(moreH) > 1:
            moreGenerate(freqSet, moreH, support, rules, minRules)
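A usage sketch on the sample data; with a minimum confidence of 0.7, rules such as {3} -> {1} or {2} -> {3} fail at about 0.667, and the expected survivors all have confidence 1.0:

# Usage sketch for generateRules
L, support = getApriori(loadDataSet(), minSupport=0.5)
rules = generateRules(L, support, minRules=0.7)
# Expected rules: {1} --> {3}, {2} --> {5}, {5} --> {2}, {2, 3} --> {5}, {3, 5} --> {2}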
This is where the book's code goes wrong: the author's generateRules() function misses every rule of the form A,B,C,... → X, i.e. rules whose right-hand side is a single element, for item sets with more than two items. I modified it as above and now get the correct result. This point cost me two days of confusion; only after seeing other bloggers flag the same spot did I confirm the error. Trust your own analysis and do not take the author's code on faith.
The last small example, on the mushroom data set, finds feature values that frequently occur together. The value '2' marks a poisonous mushroom, so feature values that frequently co-occur with '2' can be treated as warning signs of a poisonous mushroom. This does not mean, however, that a mushroom whose features never show up in the frequent sets is safe (it may still be a '2').
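A sketch of how that run can be reproduced with the functions above, assuming the mushroom.dat file that ships with the book's source code is available (each line is one mushroom, features encoded as string codes, with '2' marking a poisonous sample):

# Mushroom example (assumes mushroom.dat from the book's accompanying code)
mushData = [line.split() for line in open('mushroom.dat').readlines()]
L, support = getApriori(mushData, minSupport=0.3)
for item in L[1]:                  # frequent 2-item sets
    if item.intersection('2'):     # keep sets containing the poisonous code '2'
        print(item)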
Frequent 2-item sets (output not shown)
Frequent 4-item sets (output not shown)
To summarize Apriori's drawback: every time we compute the support of a new batch of candidate sets we traverse the entire data set again, once per level, which is very inefficient. Tomorrow we will learn FP-growth (the FP-tree), which finds the frequent item sets with only a couple of passes over the data. That's it for today; time to sleep. Keep going.