The Apriori Algorithm in Data Mining, with a Python Implementation

Source: Internet
Author: User

Association rule mining is one of the most active research areas in data mining. It can be used to discover connections between things, for example the relationships between different goods in a supermarket transaction database (the classic beer-and-diapers example).

Basic concepts

1. Support: support(X→Y) = |X ∪ Y| / N = (number of records in which the items of X and Y occur together) / (total number of records). For example: support({beer}→{diaper}) = (number of records containing both beer and diapers) / (number of records) = 3/5 = 60%.
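To make the definition concrete, here is a minimal Python sketch; the five toy records are invented for illustration and chosen to reproduce the 3/5 example above:

# Toy transaction database (hypothetical data matching the examples above).
transactions = [
    {'beer', 'diaper', 'milk'},
    {'beer', 'diaper', 'bread'},
    {'beer', 'diaper'},
    {'diaper', 'bread'},
    {'milk', 'bread'},
]

def support(itemset, transactions):
    """Fraction of records that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({'beer', 'diaper'}, transactions))  # 0.6, i.e. 60%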

2. Confidence: confidence(X→Y) = |X ∪ Y| / |X| = (number of records containing the items of both X and Y) / (number of records containing the items of X). For example: confidence({beer}→{diaper}) = (occurrences of beer and diapers together) / (occurrences of beer) = 3/3 = 100%; confidence({diaper}→{beer}) = (occurrences of beer and diapers together) / (occurrences of diapers) = 3/4 = 75%.
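Continuing the sketch, confidence reuses the support helper above: the confidence of X→Y is the support of X ∪ Y divided by the support of X:

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X): how often records with X also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

print(confidence({'beer'}, {'diaper'}, transactions))  # 1.0, i.e. 100%
print(confidence({'diaper'}, {'beer'}, transactions))  # 0.75, i.e. 75%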

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong rules; an itemset that satisfies the minimum support is called a frequent itemset.

"How can association rules be mined from a large database?" Mining association rules is a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets must occur at least as often as the predefined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both the minimum support and the minimum confidence. A brute-force sketch of the two steps follows below.
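Before turning to Apriori proper, here is a brute-force sketch of the two steps, reusing the toy transactions defined earlier; mine_rules and its fractional min_sup/min_conf parameters are illustrative and not part of the implementation shown later:

from itertools import combinations

def mine_rules(transactions, min_sup, min_conf):
    """Brute-force two-step mining: fine for tiny data; Apriori's pruning
    (next section) is what makes step 1 scale to real databases."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    # Step 1: find all frequent itemsets by exhaustive enumeration.
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count / n >= min_sup:
                frequent[frozenset(cand)] = count / n
    # Step 2: generate strong rules from the frequent itemsets.
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Subsets of frequent itemsets are themselves frequent,
                # so their support is already in `frequent`.
                conf = sup / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), sup, conf))
    return rules

for x, y, sup, conf in mine_rules(transactions, 0.4, 0.7):
    print(x, '->', y, 'support:', sup, 'confidence:', conf)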

The Apriori Laws

To reduce the time spent generating frequent itemsets, sets that cannot possibly be frequent should be eliminated as early as possible; the two Apriori laws do exactly that.

Apriori Law 1: If a set is a frequent itemset, then all of its subsets are frequent itemsets. For example, suppose {A, B} is a frequent itemset, i.e. A and B occur together in at least min_support of the records; then any subset such as {A} or {B} must occur at least as often, so every subset is also a frequent itemset.

Apriori Law 2: If a set is not a frequent itemset, then none of its supersets is a frequent itemset. For example, suppose {A} is not a frequent itemset, i.e. A occurs fewer than min_support times; then any superset such as {A, B} can occur at most as often as {A}, so no superset can be a frequent itemset either.
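A minimal sketch of how law 2 is applied in practice: before a level-k candidate is counted against the database, each of its (k-1)-subsets is checked against the previous level. has_infrequent_subset and the sets below are illustrative; the example mirrors the {milk, bread, beer} case discussed next:

from itertools import combinations

def has_infrequent_subset(candidate, prev_frequent):
    """Apriori law 2: a candidate with an infrequent (k-1)-subset
    can never be frequent, so it is pruned without a database scan."""
    k = len(candidate)
    return any(frozenset(sub) not in prev_frequent
               for sub in combinations(candidate, k - 1))

# {bread, beer} is not a level-2 frequent itemset, so the level-3
# candidate {milk, bread, beer} is discarded immediately.
prev_frequent = {frozenset(s) for s in [{'milk', 'bread'}, {'milk', 'beer'}]}
print(has_infrequent_subset({'milk', 'bread', 'beer'}, prev_frequent))  # True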

The diagram (not reproduced here) illustrates the flow of the Apriori algorithm. Note that the level-3 candidate set generated from the level-2 frequent itemsets does not contain {milk, bread, beer}, because {bread, beer} is not a level-2 frequent itemset; this is exactly where Apriori law 2 is applied. Finally, after the level-3 frequent itemsets are generated, no higher-level candidates remain, so the algorithm ends; {milk, bread, diapers} is the maximal frequent itemset.

Python implementation code:

Copy code as follows (source: apriori/apriori.py from the taizilongxu/datamining repository on GitHub):
# -*- coding: utf-8 -*-
# --------------------------------- Imports ---------------------------------
import itertools
# ----------------------------------------------------------------------------


class Apriori(object):

    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support      # minimum support, in percent
        self.min_confidence = 50            # minimum confidence, in percent
        self.line_num = 0                   # number of transaction rows
        self.item_start = item_start        # first item column (1-based)
        self.item_end = item_end            # last item column (1-based)

        # Level-1 candidates: one singleton set per item column.
        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set(j for i in self.location for j in i)))  # items in use

        self.pre_support = []   # support/location/num of the previous level
        self.pre_location = []
        self.pre_num = []

        self.item_name = []     # item (column) names
        self.find_item_name()
        self.loop()
        self.confidence_sup()

    def deal_line(self, line):
        """Extract the item columns from one line of the file."""
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        """Read the item names from the header (first) line."""
        with open(self.filename, 'r') as f:
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    self.item_name = self.deal_line(line)
                    break

    def sut(self, location):
        """
        Input:  candidate itemsets as column-index lists, e.g. [[1,2,3], [2,3,4], ...]
        Output: the support count of each candidate, e.g. [123, 435, ...]
        """
        with open(self.filename, 'r') as f:
            support = [0] * len(location)
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    continue            # skip the header line
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    flag = 0
                    for j in i:
                        if item_line[j] != 'T':   # item j absent from this record
                            flag = 1
                            break
                    if not flag:
                        support[index_num] += 1
            self.line_num = index       # number of records, header line excluded
            return support

    def select(self, c):
        """Generate the level-c candidate itemsets."""
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # Deduplicate the candidate lists.
        s = sorted([sorted(i) for i in stack])
        location = list(s for s, _ in itertools.groupby(s))
        return location

    def del_location(self, support, location):
        """Remove candidates that cannot be frequent."""
        # Mark candidates below the minimum support count.
        for index, i in enumerate(support):
            if i < self.line_num * self.min_support / 100:
                support[index] = 0
        # Apriori law 2: mark candidates with an infrequent (k-1)-subset.
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # Drop everything that was marked.
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        """Level-wise iteration over the frequent itemsets."""
        s = 2
        while True:
            print('-' * 80)
            print('The', s - 1, 'loop')
            print('location', self.location)
            print('support', self.support)
            print('num', self.num)
            print('-' * 80)

            # Generate, count, and prune the next level of candidates.
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set(j for i in location for j in i)))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support

                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        """Compute the confidence of each candidate rule and print the strong ones."""
        if sum(self.pre_support) == 0:
            print('min_support error')  # even the first iteration found nothing
        else:
            for index_location, each_location in enumerate(self.location):
                # All (k-1)-subsets, i.e. the possible rule antecedents.
                del_num = [each_location[:index] + each_location[index + 1:]
                           for index in range(len(each_location))]
                # Keep only antecedents frequent at the previous level.
                del_num = [i for i in del_num if i in self.pre_location]
                # Look up their support counts from the previous level.
                del_support = [self.pre_support[self.pre_location.index(i)]
                               for i in del_num if i in self.pre_location]
                for index, i in enumerate(del_num):
                    # Support and confidence of the rule antecedent ->> remaining item.
                    support = float(self.support[index_location]) / self.line_num * 100
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            print(','.join(s), '->>', self.item_name[each_location[index]],
                                  'min_support:', str(support) + '%',
                                  'min_confidence:', str(confidence) + '%')


def main():
    # Sample data files used by the author; not included with the article.
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)


if __name__ == '__main__':
    main()

Apriori algorithm

Apriori(filename, min_support, item_start, item_end)

Parameter description

filename: path to the data file
min_support: minimum support, in percent
item_start: first item column (1-based)
item_end: last item column (1-based)
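basket.txt itself is not reproduced in the article, so the table below is only a hypothetical illustration of the format the code expects: space-separated columns, a first line holding the column names, and the flag T in an item column when that item is present in the record:

id milk bread beer diaper
1  T    T    F    T
2  F    T    T    T
3  T    F    T    T

With this layout the items occupy columns 2 through 5, so the call would be Apriori('data.txt', min_support, 2, 5).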

Usage example:

Copy code as follows:

import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)

Output:

Copy code as follows:

--------------------------------------------------------------------------------
The 1 loop
location [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
support [299, 183, 177, 303, 204, 302, 293, 287, 184, 292, 276]
num [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 2 loop
location [[0, 9], [3, 5], [3, 6], [5, 6], [7, 10]]
support [145, 173, 167, 170, 144]
num [0, 3, 5, 6, 7, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 3 loop
location [[3, 5, 6]]
support [146]
num [3, 5, 6]
--------------------------------------------------------------------------------
frozenmeal,beer ->> cannedveg min_support: 14.6% min_confidence: 85.8823529412%
cannedveg,beer ->> frozenmeal min_support: 14.6% min_confidence: 87.4251497006%
cannedveg,frozenmeal ->> beer min_support: 14.6% min_confidence: 84.3930635838%
--------------------------------------------------------------------------------
