Machine Learning in Action, Chapter 11: Association Analysis with the Apriori Algorithm


In this chapter: the Apriori algorithm, generating frequent itemsets, generating association rules, and finding association rules in polling data.

Finding hidden relationships between objects in a large-scale data set is called association analysis or association rule learning. Searching over all the different combinations of items is time consuming and computationally expensive, and brute-force search does not solve the problem; the Apriori algorithm does.

I. Association analysis

Association analysis is the task of looking for interesting relationships in a large-scale data set. These relationships take two forms: frequent itemsets and association rules. Frequent itemsets are collections of items that frequently appear together; association rules suggest that a strong relationship may exist between two items.

The support of an itemset is defined as the proportion of records in the dataset that contain the set. You can define a minimum support and keep only the itemsets that meet it.

The confidence (or credibility) is defined for an association rule. For example, for the rule "diaper → wine": the support of {diaper, wine} is 3/5 and the support of {diaper} is 4/5, so the confidence of "diaper → wine" is (3/5)/(4/5) = 3/4 = 0.75.
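As a quick illustration, here is a minimal sketch (not part of the book's code) that computes these two numbers for a hypothetical five-transaction dataset matching the figures above:

# Hypothetical transactions: three contain both diaper and wine, a fourth
# contains diaper alone, so support({diaper, wine}) = 3/5 and support({diaper}) = 4/5
transactions = [['diaper', 'wine'], ['diaper', 'wine'], ['diaper', 'wine'],
                ['diaper'], ['beer']]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / float(len(transactions))

print support(frozenset(['diaper', 'wine']))   # 0.6
print support(frozenset(['diaper']))           # 0.8
# confidence(diaper -> wine) = support({diaper, wine}) / support({diaper})
print support(frozenset(['diaper', 'wine'])) / support(frozenset(['diaper']))  # 0.75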

Support and confidence are the measures used to quantify the results of association analysis.

II. The Apriori principle

The goal for a store is to find collections of items that are often purchased together, using a collection's support to measure how frequently it appears. The support of a collection is the fraction of transactions that contain it. Computing support requires traversing the data records, and as the number of items grows, the number of itemsets to check increases sharply: a dataset containing N items yields 2^N − 1 possible itemsets, so a store selling only 100 items can generate about 1.26 × 10^30 possible itemsets. Even for modern computers, enumerating them all would take a very long time.
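That figure is easy to verify (a quick check, not from the book's listing):

# Number of possible itemsets over N distinct items, excluding the empty set
N = 100
print 2 ** N - 1    # 1267650600228229401496703205375, about 1.26e30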

Apriori reduces the number of itemsets of interest. The Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. Conversely, if an itemset is infrequent, then all of its supersets are also infrequent. For example, if {1, 2} is infrequent, then {1, 2, 3} must also be infrequent and never needs to be checked.
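A small sketch of how this principle prunes the search (the names here are illustrative, not the book's):

# Once an itemset is known to be infrequent, any candidate containing it
# can be discarded without scanning the data again
infrequent = frozenset([4])
candidates = [frozenset([1, 4]), frozenset([1, 3]), frozenset([3, 4])]
pruned = [c for c in candidates if not infrequent.issubset(c)]
print pruned    # [frozenset([1, 3])]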

III. Finding frequent itemsets with the Apriori algorithm

The goals of association analysis are discovering frequent itemsets and discovering association rules. The frequent itemsets must be found first; only then can the association rules be derived.

Apriori is a method for discovering frequent itemsets; its two input parameters are the dataset and the minimum support. First, the algorithm generates a list of itemsets containing all the individual items. Next, it scans the transactions to see which itemsets meet the minimum support requirement and removes the ones that do not. Then the remaining itemsets are combined to produce itemsets containing two elements. The transactions are scanned again, and the itemsets that do not meet the minimum support are removed. This process repeats until no frequent itemsets remain.

3.1 Generating candidate itemsets

We will create a function to build the initial candidate collection, along with a function that scans the dataset and counts how often each candidate appears in the transaction records. The pseudo-code for the dataset scan is roughly as follows:

For each transaction tran in the dataset:
    For each candidate itemset can:
        Check whether can is a subset of tran
        If so, increment the count of can
For each candidate itemset:
    If its support is not less than the minimum support, retain the itemset
Return the list of all frequent itemsets
# coding=utf-8

# Create a simple test dataset
def loadDataSet():
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

# Build C1, the list of all candidate itemsets of size 1
def createC1(dataSet):
    # C1 is a list that stores every distinct item value. If an item is not
    # yet in C1, it is appended -- not as a bare item but as a one-item list,
    # because Python cannot build a set from a single integer directly.
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # frozenset is an immutable ("frozen") set, so the itemsets can later be
    # used as dictionary keys
    return map(frozenset, C1)

# D: the dataset
# Ck: the list of candidate itemsets
# minSupport: the minimum support for the itemsets of interest
# The function returns the frequent itemsets and a dictionary of support
# values for later use
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        # Compute the support of every itemset
        support = ssCnt[key] / numItems
        if support >= minSupport:
            # Insert the new itemset at the head of the list
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData
>>> import ml.apriori as apriori
# Import the dataset
>>> dataSet = apriori.loadDataSet()
>>> dataSet
[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
# Build the first candidate itemset list C1
>>> C1 = apriori.createC1(dataSet)
>>> C1
[frozenset([1]), frozenset([2]), frozenset([3]), frozenset([4]), frozenset([5])]
# Build D, the dataset represented as a list of sets
>>> D = map(set, dataSet)
>>> D
[set([1, 3, 4]), set([2, 3, 5]), set([1, 2, 3, 5]), set([2, 5])]
# Remove itemsets that do not meet the minimum support of 0.5
>>> L1, suppData0 = apriori.scanD(D, C1, 0.5)
# The following four itemsets make up L1; each single-item set appears in at
# least 50% of the records
>>> L1
[frozenset([1]), frozenset([3]), frozenset([2]), frozenset([5])]
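Note that scanD() stores a support value for every candidate it counts, including those below the threshold, so the returned dictionary can also be queried for the discarded itemsets (a usage sketch; the value follows from the toy data):

>>> suppData0[frozenset([4])]
0.25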
3.2 The complete Apriori algorithm

The pseudo code is as follows:

While the number of itemsets in the current list is greater than 0:
    Construct a list of candidate itemsets of k items
    Scan the data to confirm which itemsets are frequent
    Keep the frequent itemsets and build a list of candidate itemsets of k+1 items

The algorithm code:

# Create the candidate itemsets Ck
# Lk: the list of frequent itemsets
# k: the number of elements in the itemsets to generate
def aprioriGen(Lk, k):
    # Create an empty list for the result
    retList = []
    # Number of itemsets in Lk
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # If the first k-2 items are the same, merge the two sets
            L1 = list(Lk[i])[:k-2]
            L2 = list(Lk[j])[:k-2]
            L1.sort()
            L2.sort()
            if L1 == L2:
                # The union operator for Python sets is |
                retList.append(Lk[i] | Lk[j])
    return retList

# dataSet: the dataset
# minSupport: the minimum support
# The main function: generates the list of frequent itemsets
def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    # map() applies set() to every transaction in the dataset list
    D = map(set, dataSet)
    L1, supportData = scanD(D, C1, minSupport)
    # Put L1 into the list L; the while loop then appends L2, L3, L4, ...
    # until the next largest itemset list is empty
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        # Call aprioriGen() to create the candidate itemsets Ck
        Ck = aprioriGen(L[k-2], k)
        # Scan the dataset to get Lk from Ck
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
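One detail worth noting: aprioriGen() merges two (k−1)-item sets only when their first k−2 items are equal, which guarantees that every k-item union is produced exactly once rather than several times. A quick check (illustrative, not from the book's listing):

>>> apriori.aprioriGen([frozenset([0, 1]), frozenset([0, 2]), frozenset([1, 2])], 3)
[frozenset([0, 1, 2])]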

Example run:

>>> reload(apriori)
<module 'ml.apriori' from 'C:\Python27\ml\apriori.pyc'>
>>> L, supportData = apriori.apriori(dataSet)
>>> L
[[frozenset([1]), frozenset([3]), frozenset([2]), frozenset([5])], [frozenset([1, 3]), frozenset([2, 5]), frozenset([2, 3]), frozenset([3, 5])], [frozenset([2, 3, 5])], []]
>>> L[0]
[frozenset([1]), frozenset([3]), frozenset([2]), frozenset([5])]
>>> L[1]
[frozenset([1, 3]), frozenset([2, 5]), frozenset([2, 3]), frozenset([3, 5])]
>>> L[2]
[frozenset([2, 3, 5])]
>>> L[3]
[]
>>> apriori.aprioriGen(L[0], 2)
[frozenset([1, 3]), frozenset([1, 2]), frozenset([1, 5]), frozenset([2, 3]), frozenset([3, 5]), frozenset([2, 5])]
>>> L, support = apriori.apriori(dataSet, minSupport=0.7)
>>> L
[[frozenset([3]), frozenset([2]), frozenset([5])], [frozenset([2, 5])], []]
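With minSupport raised to 0.7, item 1 drops out because it appears in only two of the four transactions. This can be confirmed from the support dictionary returned by the earlier 0.5 run (a usage sketch, not from the book's listing):

>>> supportData[frozenset([1])]
0.5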
IV. Mining association rules from frequent itemsets

Two important goals of association analysis are discovering frequent itemsets and discovering association rules. To find association rules, start with a frequent itemset. The items in the set appear together, but we also want to know whether other conclusions can be drawn from them: one item, or one set of items, may imply another item. For example, from the frequent itemset {soy milk, lettuce} we might derive the association rule "soy milk → lettuce"; the set on the left of the arrow is called the antecedent, and the set on the right is called the consequent.

Each frequent itemset can produce many association rules, and the computation becomes far more tractable if the number of rules can be reduced while keeping the problem solvable. A useful property: if a rule does not meet the minimum confidence requirement, then any rule whose consequent is a superset of that rule's consequent also fails the requirement. Using this property to reduce the number of rules to test, you can start from a frequent itemset and create a list of rules whose right-hand side contains only one element, then test those rules. Next, merge the surviving consequents to create new rules with two elements on the right, and so on. This is sometimes called a graded approach. A small worked check of the pruning property follows, then the full code.
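Here is that check on the toy data, using support values produced by the apriori() run above (a worked sketch, not part of the book's listing):

# Support values taken from the earlier run on the toy dataset
supportData = {frozenset([2]): 0.75, frozenset([5]): 0.75,
               frozenset([2, 5]): 0.75, frozenset([2, 3, 5]): 0.5}

def conf(antecedent, conseq):
    return supportData[antecedent | conseq] / supportData[antecedent]

print conf(frozenset([2, 5]), frozenset([3]))   # 0.666... -- {2,5} -> {3} fails minConf 0.7
print conf(frozenset([2]), frozenset([3, 5]))   # 0.666... -- superset consequent fails too
print conf(frozenset([5]), frozenset([2, 3]))   # 0.666... -- superset consequent fails too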

The rule-generation code:

# Association rule generation; this function calls rulesFromConseq() and calcConf()
# L: the list of frequent itemsets
# supportData: dictionary of support values for the frequent itemsets
# minConf: minimum confidence threshold, default 0.7
# The function returns a list of rules with confidence values, which can later
# be sorted by confidence; the rules are stored in bigRuleList
def generateRules(L, supportData, minConf=0.7):
    bigRuleList = []
    # Iterate over every frequent itemset in L and create a list H1 of
    # single-element consequents for each one. Association rules cannot be
    # built from single-element itemsets, so rule building starts from the
    # itemsets containing two or more elements.
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:
                # Itemsets of more than two elements are considered for
                # further merging, which rulesFromConseq() performs
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                # For two-element itemsets, compute confidence directly
                # with calcConf()
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

# Rule evaluation: compute the confidence of each rule and find the rules that
# satisfy the minimum confidence requirement. The function returns the list of
# consequents that met the requirement, saved in prunedH.
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []
    # Traverse all consequents in H and compute their confidence values
    for conseq in H:
        # Confidence is computed with the support values in supportData
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            # Print the rules that satisfy the minimum confidence
            print freqSet - conseq, '-->', conseq, 'conf:', conf
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

# Generate candidate rule sets: build more association rules from the itemset
# freqSet: the frequent itemset
# H: the list of items that can appear on the right-hand side of a rule
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    # m is the size of the consequents currently in H
    m = len(H[0])
    # Check whether the frequent itemset is large enough to have a consequent
    # of size m+1 removed from it
    if len(freqSet) > (m + 1):
        # Generate all non-repeating combinations of the elements in H; the
        # result Hmp1 is also the H list for the next iteration
        Hmp1 = aprioriGen(H, m + 1)
        # Hmp1 holds all the possible rules; test them with calcConf() to see
        # which meet the confidence requirement
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        # If more than one rule remains, recurse with Hmp1
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

An actual run:

>>> import ml.apriori as apriori
>>> dataSet = apriori.loadDataSet()
>>> dataSet
[[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
>>> L, supportData = apriori.apriori(dataSet, minSupport=0.5)
>>> rules = apriori.generateRules(L, supportData, minConf=0.7)
frozenset([1]) --> frozenset([3]) conf: 1.0
frozenset([5]) --> frozenset([2]) conf: 1.0
frozenset([2]) --> frozenset([5]) conf: 1.0
>>> rules
[(frozenset([1]), frozenset([3]), 1.0), (frozenset([5]), frozenset([2]), 1.0), (frozenset([2]), frozenset([5]), 1.0)]
# Results with a lower confidence threshold
>>> rules = apriori.generateRules(L, supportData, minConf=0.5)
frozenset([3]) --> frozenset([1]) conf: 0.666666666667
frozenset([1]) --> frozenset([3]) conf: 1.0
frozenset([5]) --> frozenset([2]) conf: 1.0
frozenset([2]) --> frozenset([5]) conf: 1.0
frozenset([3]) --> frozenset([2]) conf: 0.666666666667
frozenset([2]) --> frozenset([3]) conf: 0.666666666667
frozenset([5]) --> frozenset([3]) conf: 0.666666666667
frozenset([3]) --> frozenset([5]) conf: 0.666666666667
frozenset([5]) --> frozenset([2, 3]) conf: 0.666666666667
frozenset([3]) --> frozenset([2, 5]) conf: 0.666666666667
frozenset([2]) --> frozenset([3, 5]) conf: 0.666666666667
>>> rules
[(frozenset([3]), frozenset([1]), 0.6666666666666666), (frozenset([1]), frozenset([3]), 1.0), (frozenset([5]), frozenset([2]), 1.0), (frozenset([2]), frozenset([5]), 1.0), (frozenset([3]), frozenset([2]), 0.6666666666666666), (frozenset([2]), frozenset([3]), 0.6666666666666666), (frozenset([5]), frozenset([3]), 0.6666666666666666), (frozenset([3]), frozenset([5]), 0.6666666666666666), (frozenset([5]), frozenset([2, 3]), 0.6666666666666666), (frozenset([3]), frozenset([2, 5]), 0.6666666666666666), (frozenset([2]), frozenset([3, 5]), 0.6666666666666666)]
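As noted above, the returned list can be sorted by its confidence field (a usage sketch, not from the book's listing):

>>> sorted(rules, key=lambda r: r[2], reverse=True)[0]
(frozenset([1]), frozenset([3]), 1.0)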
VI. Example: finding common features of poisonous mushrooms

Sometimes you do not need to find all frequent itemsets; you are only interested in the itemsets that contain a particular item. This example looks for common features of poisonous mushrooms, which could then be used to help avoid eating them.

>>> import ml.apriori as apriori
# Import the data
>>> mushDataSet = [line.split() for line in open('C:\Python27\mushroom.dat').readlines()]
# Run the Apriori algorithm on the dataset
>>> L, suppData = apriori.apriori(mushDataSet, minSupport=0.3)
# Search the frequent two-item sets for those containing the poisonous
# feature value 2
>>> for item in L[1]:
...     if item.intersection('2'): print item
...
frozenset(['2', '59'])
...
frozenset(['63', '2'])
...
frozenset(['2', '36'])
# Repeat the above for the larger itemsets
>>> for item in L[3]:
...     if item.intersection('2'): print item
...
frozenset(['39', '2', '63', '86'])
...
VII. Chapter summary

Association analysis is a toolset for discovering interesting relationships among items in a large data set; these relationships can be quantified using frequent itemsets and association rules. Checking every combination of items is time consuming, and the Apriori principle reduces the number of sets checked against the database: if an itemset is infrequent, then every superset containing it is also infrequent. The Apriori algorithm starts from single-element itemsets and forms larger sets by combining the itemsets that meet the minimum support requirement. Support measures how often a set appears in the original data.
