Practical notes for machine learning 9 (Apriori algorithm)

The Apriori algorithm belongs to unsupervised learning, which emphasizes "what can be found from data X". Finding hidden relationships between items in a large-scale data set is called association analysis or association rule learning. The main difficulty is that finding the different combinations of items is very time-consuming and computationally expensive, so the problem cannot be solved by brute-force search. This note therefore introduces how the Apriori algorithm solves it.

1: Basic concepts

(1) A frequent item set is a set of items that frequently appear together in the data. An association rule implies a strong relationship between two items. (A threshold is defined in advance; exceeding this threshold indicates that the relationship between the two is strong.)

(2) The support of an item set is defined as the proportion of records in the data set that contain the item set. A minimum support (minSupport) is defined in advance, and only item sets that meet the minimum support are kept.

(3) Credibility, or confidence, is defined for an association rule such as {diapers} -> {wine}: the confidence of the rule is support({diapers, wine}) / support({diapers}).

(4) The Apriori principle: if an item set is frequent, then all of its subsets are also frequent. Conversely, if an item set is infrequent, then all of its supersets are infrequent. For example, if {1, 2} occurs fewer times than the minimum support (i.e. is infrequent), then any superset such as {0, 1, 2} is certainly infrequent as well. Association analysis as a whole consists of two tasks, both covered below: discovering frequent item sets and mining association rules from them. (A short worked example of these definitions follows this list.)
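To make the definitions concrete, here is a minimal brute-force sketch written for this note (the helper names support and confidence are mine, and the Apriori code in section 2 does not compute things this way; the four-record data set matches the one returned by loadDataSet below):

def support(itemset, transactions):
    # fraction of records that contain every item in itemset
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # confidence(A -> B) = support(A | B) / support(A)
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

transactions = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
print(support({2, 5}, transactions))       # 0.75: frequent at minSupport = 0.5
print(support({4}, transactions))          # 0.25: infrequent, so every superset of {4} is too
print(confidence({5}, {2}, transactions))  # 1.0: every record containing 5 also contains 2

The point of Apriori is to get the same answers without enumerating all combinations, by pruning every superset of an infrequent set.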

2: Discovering frequent item sets

The process starts from C1 = {{0}, {1}, {2}, {3}} and generates L1, the subset of item sets in C1 that meet the minimum support, for example L1 = {{0}, {1}, {3}}. Then C2 = {{0,1}, {0,3}, {1,3}} is obtained by combining the item sets in L1. This continues until Ck is empty.

# Load data
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

# Create C1, the list of candidate 1-item sets
def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    return list(map(frozenset, C1))  # frozenset can be used as a dictionary key

# Scan the data set: generate Lk from Ck
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)  # insert each new item set at the front of the list
        supportData[key] = support
    return retList, supportData

# Candidate generation: build Ck+1 from Lk
def aprioriGen(Lk, k):
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # merge two sets only when their first k-2 items are identical
            L1 = list(Lk[i])[:k - 2]; L2 = list(Lk[j])[:k - 2]
            L1.sort(); L2.sort()
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList

# The Apriori algorithm
def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
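A quick check of the code above (a sketch of an interactive session; the exact ordering of the item sets inside each list may differ from run to run):

dataSet = loadDataSet()
L, supportData = apriori(dataSet, minSupport=0.5)
print(L[0])  # frequent 1-item sets, e.g. [frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5})]
print(L[1])  # frequent 2-item sets, e.g. [frozenset({1, 3}), frozenset({2, 5}), frozenset({3, 5}), frozenset({2, 3})]
print(L[2])  # frequent 3-item sets: [frozenset({2, 3, 5})]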



Note: (1) C1 is the set of all candidate item sets of size 1.

(2) Python's frozenset type is used here. A frozenset is a "frozen" set: it is immutable, so users cannot modify it. frozenset must be used instead of the set type because these sets are later used as dictionary keys; frozenset supports this but set does not, as the two-line demonstration below shows.
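A two-line illustration (not part of the original code):

counts = {}
counts[frozenset([2, 5])] = 1  # fine: frozenset is hashable
counts[{2, 5}] = 1             # TypeError: unhashable type: 'set'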

3: Discovering association rules from frequent item sets

# Generate association rules from the frequent item sets
def generateRules(L, supportData, minConf=0.7):  # supportData is a dict coming from scanD
    bigRuleList = []
    for i in range(1, len(L)):  # only get the sets with two or more items
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []  # create new list to return
    for conseq in H:
        conf = supportData[freqSet] / supportData[freqSet - conseq]  # calc confidence
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if len(freqSet) > (m + 1):  # try further merging
        Hmp1 = aprioriGen(H, m + 1)  # create Hm+1 new candidates
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if len(Hmp1) > 1:  # need at least two sets to merge
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
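Continuing the session from section 2 (a sketch; with the toy data and minConf = 0.7, three rules should survive the confidence cutoff):

L, supportData = apriori(dataSet, minSupport=0.5)
rules = generateRules(L, supportData, minConf=0.7)
# prints, in some order:
#   frozenset({1}) --> frozenset({3}) conf: 1.0
#   frozenset({5}) --> frozenset({2}) conf: 1.0
#   frozenset({2}) --> frozenset({5}) conf: 1.0

Rules such as {3} --> {1} are printed nowhere because their confidence (0.5 / 0.75 ≈ 0.67) falls below minConf.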

4: Using the FP-growth algorithm to efficiently discover frequent item sets

Each time the size of the frequent item sets is increased, the Apriori algorithm rescans the entire dataset. When the dataset is large, this significantly slows down frequent item set discovery. The FP-growth algorithm only needs to traverse the database twice, which can significantly speed up frequent item set discovery. However, this algorithm cannot be used to discover association rules.

The first scan counts the number of occurrences of all element items; it is used only to record frequency. The second scan considers only the frequent elements and builds the FP-tree.

# Data structure of the FP-tree
class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue
        self.count = numOccur
        self.nodeLink = None
        self.parent = parentNode
        self.children = {}

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):
        print('  ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind + 1)

# Load data
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        retDict[frozenset(trans)] = 1
    return retDict

# Build the FP-tree
def createTree(dataSet, minSup=1):
    headerTable = {}
    for trans in dataSet:  # count the frequency of each element
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    for k in list(headerTable.keys()):  # remove element items below minSup
        if headerTable[k] < minSup:
            del(headerTable[k])
    freqItemSet = set(headerTable.keys())
    if len(freqItemSet) == 0:
        return None, None  # exit if no item meets the requirement
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]
    retTree = treeNode('Null Set', 1, None)
    for tranSet, count in dataSet.items():
        # sort the elements in each transaction by global frequency
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            orderedItems = [v[0] for v in sorted(localD.items(),
                                                 key=lambda p: p[1], reverse=True)]
            # fill the tree with the sorted frequent item set
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable

def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:
        inTree.children[items[0]].inc(count)
    else:
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] is None:
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    if len(items) > 1:
        # call updateTree recursively for the remaining elements
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)

def updateHeader(nodeToTest, targetNode):
    while nodeToTest.nodeLink is not None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode
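Building the tree on the sample data (a sketch; with minSup = 3 only z, x, y, s, r and t survive the first scan, and disp indents each node by its depth in the tree):

simpDat = loadSimpDat()
initSet = createInitSet(simpDat)
myFPtree, myHeaderTab = createTree(initSet, minSup=3)
myFPtree.disp()  # e.g. 'Null Set 1' at the root, a child 'z 5', and branches such as z -> x below it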


