Apriori algorithm
- Advantages: easy to implement
- Disadvantages: can be slow on large data sets
- Applicable data types: numerical or nominal
Algorithm process:
Association analysis is the task of finding interesting relationships in a large data set. These relationships come in two forms: frequent itemsets and association rules.
Support: the support of an itemset is the proportion of records in the data set that contain that itemset.
Confidence: the confidence of the association rule A -> B is defined as support(A ∪ B) / support(A).
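As a quick check of these definitions, here is a minimal sketch (the helper names `support` and `confidence` are illustrative, not part of the code later in this note), computed on the same toy data set used below:

```python
# Toy data set: each inner list is one transaction.
dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

def confidence(A, B, transactions):
    # confidence(A -> B) = support(A ∪ B) / support(A)
    return support(set(A) | set(B), transactions) / support(A, transactions)

print(support({2, 5}, dataSet))       # 3 of 4 transactions contain {2,5} -> 0.75
print(confidence({5}, {2}, dataSet))  # 0.75 / 0.75 -> 1.0
```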
A brute-force approach over n items would have to examine all 2^n - 1 possible itemsets, which quickly becomes infeasible.
Apriori principle: if an itemset is frequent, then all of its subsets are also frequent.
Taking the contrapositive: if an itemset is infrequent, then every itemset that contains it (every superset) is also infrequent. This is what lets Apriori prune candidates.
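The pruning this enables can be sketched in a few lines (the `can_skip` helper is illustrative, not from this note's code): once {4} is known to be infrequent, any candidate containing it can be skipped without counting its support at all.

```python
# Suppose {4} already failed the minimum-support threshold.
infrequent = [frozenset([4])]

def can_skip(candidate, infrequent_sets):
    # Contrapositive of the Apriori principle: a superset of an
    # infrequent itemset cannot itself be frequent.
    return any(bad <= candidate for bad in infrequent_sets)

print(can_skip(frozenset([1, 4]), infrequent))  # True: contains {4}, prune it
print(can_skip(frozenset([1, 3]), infrequent))  # False: must still be counted
```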
There are two main stages:
1. Generate the frequent itemsets:
This stage alternates between two kinds of sets, C and L: Ck holds the candidate k-itemsets (raw combinations), and Lk holds the candidates that survive the minimum-support filter. The process is roughly:
1. Build the set C1 of candidate single-item itemsets from the raw data set
2. Filter C1 by support to obtain L1
3. Merge itemsets in L1 to form the candidate set C2
4. Repeat: C2 -> L2, C3 -> L3, ... -> Ck, until no new candidates can be produced
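One round of this C -> L alternation can be traced on the toy data set with a `Counter` (a simplified sketch; the note's full implementation appears further down):

```python
from collections import Counter

dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
minSupport = 0.5
n = len(dataSet)

# C1 -> L1: count every single item and keep those meeting the threshold.
counts = Counter(frozenset([item]) for t in dataSet for item in t)
L1 = [c for c in counts if counts[c] / n >= minSupport]  # {4} is dropped

# L1 -> C2: union pairs of frequent 1-itemsets into candidate 2-itemsets.
C2 = [a | b for i, a in enumerate(L1) for b in L1[i + 1:]]
print(L1)  # four frequent single items: {1}, {2}, {3}, {5}
print(C2)  # six candidate pairs, e.g. {2,5}
```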
2. Derive the association rules:
With the frequent itemsets from the previous step, list every rule each frequent itemset can generate, compute each rule's confidence, and keep only the rules whose confidence meets the threshold.
Functions:
loadDataSet()
Loads the data set: a list of transactions, where each transaction is a list of items.
createC1(dataSet)
Builds C1 by extracting every individual item. frozenset is used so the itemsets can later serve as dictionary keys.
scanD(D, Ck, minSupport)
Filters out the candidates in Ck that do not meet the minimum support, returning the surviving itemsets Lk together with their support values.
apprioriGen(Lk, k)
Merges the itemsets in Lk to obtain the candidates for the next level. The number of comparisons is reduced by merging two k-itemsets only when their first k-1 (sorted) elements match. For example, when merging {0,1}, {0,2}, {1,2}, only the pair {0,1} and {0,2} share a first element, so {0,1,2} is produced exactly once.
apriori(dataSet, minSupport=0.5)
Ties the functions above together. The loop ends when no new frequent itemsets can be produced.
generateRules(L, supportData, minConf=0.7)
Main entry point for generating association rules, starting from the frequent itemsets that contain at least two items.
calcConf(freqSet, H, supportData, brl, minConf=0.7)
For a given frequent itemset freqSet and a list H of candidate consequents, computes the confidence of each rule freqSet - conseq -> conseq and returns the consequents that satisfy the threshold.
rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7)
The difference here is that the consequents in H can grow. For example, from {1,2,3} we first test rules with single-item consequents such as {1} and {2}; the surviving consequents are then merged (e.g. into {1,2}) to explore the rules more fully. This recursion ends when the consequents can no longer be merged.
# coding=utf-8

def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    # Collect every distinct single item as a candidate 1-itemset.
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # frozenset so the itemsets can be used as dictionary keys later.
    return list(map(frozenset, C1))

def scanD(D, Ck, minSupport):
    # Count how many transactions contain each candidate in Ck.
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.append(key)
        supportData[key] = support
    return retList, supportData

def apprioriGen(Lk, k):
    # Merge (k-1)-itemsets into candidate k-itemsets: compare the first
    # k-2 sorted elements (all but the last), so each candidate is
    # produced exactly once.
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            L1 = sorted(Lk[i])[:k - 2]
            L2 = sorted(Lk[j])[:k - 2]
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    # Stop once a level produces no frequent itemsets.
    while len(L[k - 2]) > 0:
        Ck = apprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData

def generateRules(L, supportData, minConf=0.7):
    bigRuleList = []
    for i in range(1, len(L)):  # start from itemsets with two items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:  # more than two items: merged consequents as well
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if len(freqSet) > m + 1:
        # Grow the consequents by one item and recurse on the survivors.
        Hmp1 = apprioriGen(H, m + 1)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

def main():
    dataSet = loadDataSet()
    L, supportData = apriori(dataSet, minSupport=0.7)
    print(L)
    rules = generateRules(L, supportData, minConf=0.7)
    print(rules)

if __name__ == '__main__':
    main()