Association analysis and Python implementation of the Apriori algorithm

Source: Internet
Author: User

The core property of the algorithm: every non-empty subset of a frequent itemset must also be frequent. The contrapositive also holds: if an itemset is not frequent, then none of its supersets is frequent.
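The property can be checked directly on a small data set. The sketch below is illustrative only: it reuses the four sample transactions that appear in the code later in this article, and the helper names `support` and `nonempty_proper_subsets` are assumptions introduced here, not part of the original code.

```python
from itertools import combinations

# The same four transactions as the article's sample data set.
TRANSACTIONS = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def nonempty_proper_subsets(itemset):
    """All non-empty proper subsets of `itemset`, as frozensets."""
    items = sorted(itemset)
    return [frozenset(c)
            for r in range(1, len(items))
            for c in combinations(items, r)]


frequent = frozenset({2, 3, 5})
# Every subset occurs at least as often as the itemset itself.
subsets_ok = all(support(s) >= support(frequent)
                 for s in nonempty_proper_subsets(frequent))

infrequent = frozenset({4})  # support 0.25, below a 0.5 threshold
# Every superset occurs at most as often as the infrequent itemset.
supersets_ok = all(support(infrequent | {x}) <= support(infrequent)
                   for x in (1, 2, 3, 5))

print(subsets_ok, supersets_ok)
```

Because adding items to an itemset can only shrink the set of transactions that contain it, both checks hold for any data set, not just this one.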

First, an introduction to the Apriori algorithm: Apriori is an algorithm for mining the frequent itemsets that underlie association rules. Its core idea is to mine frequent itemsets in two stages: candidate set generation, and a check based on downward closure. Apriori (from "a priori": prior, presumptive) is widely used. In retail it is applied to the analysis of consumer markets and prices, to infer customers' consumption habits. In network security it is applied to intrusion detection. In the management of colleges and universities, the mined rules can help school departments target financial aid for students in need. It can also be used in mobile communications, to guide operators' business operations and the decision making of ancillary service providers.

Second, the mining steps:

1. Find all frequent itemsets (frequency), based on the minimum support threshold

2. Generate association rules (strength), based on the minimum confidence threshold

Third, basic concepts:

For a rule A -> B:

① Support: P(A ∩ B), the probability that A and B both occur

② Confidence:

P(B|A), the probability that B occurs given that A has occurred, equal to P(A ∩ B) / P(A). E.g. shopping basket analysis: milk -> bread

Example: [support: 3%, confidence: 40%]

Support 3%: 3% of all customers buy milk and bread together

Confidence 40%: 40% of the customers who buy milk also buy bread

③ If an itemset A contains k items, A is called a k-itemset. A k-itemset that meets the minimum support threshold is called a frequent k-itemset.

④ Rules that meet both the minimum support threshold and the minimum confidence threshold are called strong rules
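The definitions above can be sketched in a few lines of Python. The baskets below are hypothetical data invented for illustration (they do not reproduce the 3% / 40% figures of the example), and the function names `support`, `confidence` and `is_strong` are assumptions introduced here.

```python
# Hypothetical toy baskets, invented for illustration only.
BASKETS = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"milk", "eggs"},
]


def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """P(B|A): support of A and B together, divided by support of A."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def is_strong(antecedent, consequent, min_support, min_conf, baskets=BASKETS):
    """A rule is strong if it meets both minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, baskets) >= min_support
            and confidence(antecedent, consequent, baskets) >= min_conf)


print(support({"milk", "bread"}))       # 2 of the 5 baskets
print(confidence({"milk"}, {"bread"}))  # 2 of the 4 baskets with milk
print(is_strong({"milk"}, {"bread"}, min_support=0.3, min_conf=0.5))
```

Note that "A and B both occur" means a basket contains the items of A together with the items of B, which is why the code takes the union of the two item sets before counting.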

Fourth, steps of implementation:

The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Apriori uses an iterative approach called level-wise search, in which frequent (k-1)-itemsets are used to explore frequent k-itemsets.

First, find the set of frequent 1-itemsets, L1. L1 is used to find the set of frequent 2-itemsets, L2, and L2 is used to find L3. This continues until no frequent k-itemsets can be found. Finding each Lk requires one scan of the database.

The core idea consists of a join step and a prune step. The join step is a self-join: two frequent (k-1)-itemsets are joined only if their first k-2 items are the same, with the items of each itemset kept in lexicographic order. The prune step relies on the fact that every non-empty subset of a frequent itemset must also be frequent. Conversely, if any non-empty subset of a candidate is not frequent, then the candidate cannot be frequent either, so it can be removed from Ck.
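The join and prune steps can be sketched as two small functions. This is a minimal illustration under stated assumptions rather than the article's own implementation: the names `join_step` and `prune_step` are introduced here, itemsets are assumed to be frozensets of sortable items, and the second input list is hypothetical.

```python
from itertools import combinations


def join_step(prev_frequent, k):
    """Self-join frequent (k-1)-itemsets that share their first k-2 sorted items."""
    prev = sorted(sorted(s) for s in prev_frequent)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:k - 2] == prev[j][:k - 2]:
                candidates.add(frozenset(prev[i]) | frozenset(prev[j]))
    return candidates


def prune_step(candidates, prev_frequent, k):
    """Keep only candidates whose (k-1)-subsets are all frequent."""
    prev = set(prev_frequent)
    return {
        c for c in candidates
        if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
    }


# Frequent 2-itemsets of the article's sample data at minimum support 0.5.
L2 = [frozenset(s) for s in ({1, 3}, {2, 3}, {2, 5}, {3, 5})]
C3 = prune_step(join_step(L2, 3), L2, 3)
print(sorted(sorted(c) for c in C3))  # [[2, 3, 5]]

# Hypothetical input where pruning matters: {2, 3} is missing,
# so the joined candidate {1, 2, 3} is discarded.
other_L2 = [frozenset(s) for s in ({1, 2}, {1, 3}, {2, 5})]
print(join_step(other_L2, 3))                           # {frozenset({1, 2, 3})}
print(prune_step(join_step(other_L2, 3), other_L2, 3))  # set()
```

Pruning before the database scan is only an optimization: a candidate with an infrequent subset would fail the support count anyway, but removing it early saves the counting work.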

Simply put: 1. Discovering frequent itemsets. The process is: (1) scan, (2) count, (3) compare, (4) generate frequent itemsets, (5) join and prune to generate candidate itemsets. Repeat steps (1) to (5) until no larger frequent itemsets can be found.

2. Generating association rules. Following the definition of confidence given earlier, association rules are generated as follows:

(1) For each frequent itemset L, generate all non-empty subsets of L;

(2) For each non-empty subset S of L, if

P(L) / P(S) ≥ min_conf

then output the rule S -> (L - S).

Note: L - S denotes the itemset obtained by removing the items of S from itemset L
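The rule generation procedure can be sketched as follows. This is an illustrative sketch, not the article's own code: the names `support` and `rules_from_itemset` are assumptions introduced here, P(L) and P(S) are computed as supports over the article's sample transactions, and S is restricted to proper subsets so that L - S is never empty.

```python
from itertools import combinations

# The same four transactions as the article's sample data set.
TRANSACTIONS = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rules_from_itemset(L, min_conf, transactions=TRANSACTIONS):
    """Return (S, L - S, confidence) for each rule S -> (L - S) meeting min_conf."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):  # sizes of the non-empty proper subsets
        for combo in combinations(sorted(L), r):
            S = frozenset(combo)
            conf = support(L, transactions) / support(S, transactions)
            if conf >= min_conf:
                rules.append((S, L - S, conf))
    return rules


for S, rest, conf in rules_from_itemset({2, 3, 5}, min_conf=0.7):
    print(sorted(S), "->", sorted(rest), round(conf, 2))
```

For the frequent itemset {2, 3, 5} and a minimum confidence of 0.7, only the rules {2, 3} -> {5} and {3, 5} -> {2} pass, each with confidence 1.0; the other four candidate rules have confidence 2/3.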


The following code implements the Apriori algorithm on a simple data set:

    def loadDataSet():
        return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]


    def createC1(dataSet):
        # Build C1, the list of candidate 1-itemsets
        C1 = []
        for transaction in dataSet:
            for item in transaction:
                if [item] not in C1:
                    C1.append([item])
        C1.sort()
        # frozenset is immutable, so each itemset can be used as a dict key;
        # list() lets the result be iterated more than once
        return list(map(frozenset, C1))


    def scanD(D, Ck, minSupport):
        # Count how many transactions in D contain each candidate in Ck
        ssCnt = {}
        for tid in D:
            for can in Ck:
                if can.issubset(tid):
                    if can not in ssCnt:
                        ssCnt[can] = 1
                    else:
                        ssCnt[can] += 1
        numItems = float(len(D))
        retList = []
        supportData = {}
        for key in ssCnt:
            support = ssCnt[key] / numItems
            if support >= minSupport:
                retList.insert(0, key)
            supportData[key] = support
        # retList holds the frequent itemsets; supportData maps itemset -> support
        return retList, supportData


    def aprioriGen(Lk, k):
        # Join step: merge two (k-1)-itemsets whose first k-2 sorted items match.
        # No explicit prune step here; scanD filters infrequent candidates.
        retList = []
        lenLk = len(Lk)
        for i in range(lenLk):
            for j in range(i + 1, lenLk):
                # Sort before slicing so the prefix does not depend on set order
                L1 = sorted(Lk[i])[:k - 2]
                L2 = sorted(Lk[j])[:k - 2]
                if L1 == L2:
                    retList.append(Lk[i] | Lk[j])
        return retList


    def apriori(dataSet, minSupport=0.5):
        C1 = createC1(dataSet)
        D = list(map(set, dataSet))
        L1, supportData = scanD(D, C1, minSupport)
        L = [L1]
        k = 2
        while len(L[k - 2]) > 0:
            Ck = aprioriGen(L[k - 2], k)
            Lk, supK = scanD(D, Ck, minSupport)
            supportData.update(supK)
            L.append(Lk)
            k += 1
        return L, supportData


    if __name__ == "__main__":
        # Step-by-step checks, left disabled as in the original:
        # dataSet = loadDataSet()
        # C1 = createC1(dataSet)
        # D = list(map(set, dataSet))
        # L1, supportData = scanD(D, C1, 0.5)
        # print(L1)
        # print(supportData)
        dataSet = loadDataSet()
        L, supportData = apriori(dataSet)
        print(L)

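Results output (shown in Python 2 form; under Python 3 the same itemsets print as frozenset({1}) and so on, and their order within each level may differ): [[frozenset([1]), frozenset([3]), frozenset([2]), frozenset([5])], [frozenset([1, 3]), frozenset([2, 5]), frozenset([2, 3]), frozenset([3, 5])], [frozenset([2, 3, 5])], []]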

