The Apriori algorithm for data mining, with Python code

Association rule mining is one of the most active research topics in data mining. It can be used to find connections between things; its first application was discovering relationships between different goods in supermarket transaction databases (the famous "beer and diapers" example).

Basic concepts

1. Definition of support: support(X -> Y) = (number of records that contain every item in X and in Y) / (total number of records N). For example: support({beer} -> {diaper}) = (number of records in which beer and diapers appear together) / (number of records) = 3/5 = 60%.

2. Definition of confidence: confidence(X -> Y) = (number of records that contain every item in X and in Y) / (number of records that contain X). For example: confidence({beer} -> {diaper}) = (times beer and diapers appear together) / (times beer appears) = 3/3 = 100%; confidence({diaper} -> {beer}) = (times beer and diapers appear together) / (times diapers appear) = 3/4 = 75%.
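
These numbers are easy to check in a few lines of Python. The snippet below is a minimal sketch: the five-record transaction list is hypothetical, chosen so the counts match the examples above (beer and diapers together in 3 records, beer in 3, diapers in 4), and the helper functions are ours, not part of the implementation shown later.

# toy transaction database: 5 records
transactions = [
    {'beer', 'diaper', 'milk'},
    {'beer', 'diaper', 'bread'},
    {'beer', 'diaper'},
    {'diaper', 'bread'},
    {'milk', 'bread'},
]

def support(x, y):
    """support(X -> Y): fraction of records containing every item of X and Y."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)

def confidence(x, y):
    """confidence(X -> Y): fraction of records containing X that also contain Y."""
    with_x = sum(1 for t in transactions if x <= t)
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / with_x

print(support({'beer'}, {'diaper'}))     # 0.6  -> 60%
print(confidence({'beer'}, {'diaper'}))  # 1.0  -> 100%
print(confidence({'diaper'}, {'beer'}))  # 0.75 -> 75%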

A rule that satisfies both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) is called a strong rule; an itemset whose support meets the minimum support threshold is called a frequent itemset.

"How are association rules mined by large databases?" Mining Association Rules is a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as the predefined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy the minimum support and the minimum confidence. (A brute-force sketch of both steps follows.)
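
Before any pruning, both steps can be written out by brute force. The sketch below reuses the same hypothetical five-record dataset as above; min_sup and min_conf are arbitrary illustrative thresholds, and every name in it is ours:

from itertools import combinations

transactions = [
    {'beer', 'diaper', 'milk'},
    {'beer', 'diaper', 'bread'},
    {'beer', 'diaper'},
    {'diaper', 'bread'},
    {'milk', 'bread'},
]
min_sup, min_conf = 0.4, 0.7

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# step 1: find all frequent itemsets by enumerating every candidate
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        if count(frozenset(c)) / len(transactions) >= min_sup:
            frequent[frozenset(c)] = count(frozenset(c))

# step 2: generate strong rules X -> Y from each frequent itemset
for itemset, cnt in frequent.items():
    for r in range(1, len(itemset)):
        for x in combinations(itemset, r):
            x = frozenset(x)
            conf = cnt / frequent[x]  # x is frequent because itemset is (Apriori law 1)
            if conf >= min_conf:
                print(set(x), '->', set(itemset - x),
                      'support: %.0f%%' % (100.0 * cnt / len(transactions)),
                      'confidence: %.0f%%' % (100 * conf))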

Apriori laws

To reduce the time spent generating frequent itemsets, we want to rule out candidate itemsets as early as possible; the two Apriori laws do exactly that.

Apriori law 1: If a set is a frequent itemset, then all of its subsets are frequent itemsets as well. For example, suppose the set {A, B} is a frequent itemset, i.e. the number of records in which A and B occur together is greater than or equal to the minimum support count min_support. Every record that contains {A, B} also contains {A} and {B}, so each subset must occur at least min_support times as well; that is, each subset is a frequent itemset.

Apriori law 2: If a set is not a frequent itemset, then none of its supersets is a frequent itemset. For example, suppose the set {A} is not a frequent itemset, i.e. A occurs fewer than min_support times. Any superset of {A}, such as {A, B}, can occur at most as often as {A}, so it must also occur fewer than min_support times; hence no superset of {A} can be a frequent itemset.
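
Law 2 is what lets the algorithm reject a size-k candidate just by inspecting its size-(k-1) subsets, before ever scanning the database. A minimal sketch (the function name is ours):

from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_minus_1):
    """Apriori law 2: if any (k-1)-subset of a k-candidate is not
    frequent, the candidate cannot be frequent and can be pruned."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_k_minus_1
               for s in combinations(candidate, k - 1))

# if {A} is not frequent, law 2 prunes its superset {A, B}
frequent_1 = {frozenset({'B'})}  # {A} is missing: it was infrequent
print(has_infrequent_subset(('A', 'B'), frequent_1))  # True -> prune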

The diagram in the original article illustrates a full run of the Apriori algorithm on a small example. Note that when the level-3 candidate set is generated from the level-2 frequent itemsets, {milk, bread, beer} does not appear; that is because {bread, beer} is not a level-2 frequent itemset, which is exactly where the second Apriori law kicks in. After the level-3 frequent itemsets have been generated there is no higher-level candidate set, so the algorithm ends, and {milk, bread, diapers} is the largest frequent itemset. The generate-and-prune step is sketched below.
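
The level-2 frequent itemsets listed here are hypothetical, chosen to be consistent with the example ({bread, beer} did not survive level 2), and the function is ours, not part of the class below:

from itertools import combinations

L2 = [{'milk', 'bread'}, {'milk', 'diaper'}, {'milk', 'beer'},
      {'bread', 'diaper'}, {'diaper', 'beer'}]

def candidates(prev_level, k):
    """Join step: union pairs of (k-1)-itemsets into k-candidates, then
    prune any candidate with an infrequent (k-1)-subset (Apriori law 2)."""
    prev = {frozenset(s) for s in prev_level}
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return [c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))]

for c in candidates(L2, 3):
    print(sorted(c))
# only {bread, diaper, milk} and {beer, diaper, milk} survive as candidates;
# {milk, bread, beer} is pruned because {bread, beer} is not in L2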

Python implementation code:

(The listing below is apriori/apriori.py from the taizilongxu/datamining repository on GitHub.)

#-*- encoding: utf-8 -*-
#---------------------------------Import-------------------------------------
import itertools
#-----------------------------------------------------------------------------

class Apriori(object):

    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support  # minimum support, as a percentage
        self.min_confidence = 50        # minimum confidence, as a percentage
        self.line_num = 0               # number of transaction records
        self.item_start = item_start    # first column holding items
        self.item_end = item_end        # last column holding items

        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set([j for i in self.location for j in i])))  # items still in play

        self.pre_support = []  # previous level's support, location and num
        self.pre_location = []
        self.pre_num = []

        self.item_name = []  # item (column) names
        self.find_item_name()
        self.loop()
        self.confidence_sup()

    def deal_line(self, line):
        """Extract the required item columns from one line."""
        return [i.strip() for i in line.split(',') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        """Read the item names from the first (header) line."""
        with open(self.filename, 'r') as f:
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    self.item_name = self.deal_line(line)
                    break

    def sut(self, location):
        """
        Input:  a list of itemsets, e.g. [[1, 2, 3], [2, 3, 4], [1, 3, 5], ...]
        Output: the support count of each itemset, e.g. [123, 435, 234, ...]
        """
        with open(self.filename, 'r') as f:
            support = [0] * len(location)
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    continue
                # extract the items of this record
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    flag = 0
                    for j in i:
                        if item_line[j] != 'T':
                            flag = 1
                            break
                    if not flag:
                        support[index_num] += 1
            self.line_num = index  # total number of records, excluding the item_name header line
        return support

    def select(self, c):
        """Return the level-c candidate locations."""
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # deduplicate the list of lists
        s = sorted([sorted(i) for i in stack])
        location = [k for k, _ in itertools.groupby(s)]
        return location

    def del_location(self, support, location):
        """Remove candidate sets that do not meet the criteria."""
        # cull itemsets below the minimum support
        for index, i in enumerate(support):
            if i < self.line_num * self.min_support / 100:
                support[index] = 0
        # cull by the second Apriori law
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # remove the zeroed-out entries
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        """Iterate level by level over the frequent itemsets."""
        s = 2
        while True:
            print('-' * 80)
            print('The', s - 1, 'loop')
            print('location', self.location)
            print('support', self.support)
            print('num', self.num)
            print('-' * 80)

            # generate the next level's candidate sets
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set([j for i in location for j in i])))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support

                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        """Compute the confidence of each rule."""
        if sum(self.pre_support) == 0:
            print('min_support error')  # the very first iteration failed
        else:
            for index_location, each_location in enumerate(self.location):
                # generate the (k-1)-subsets of this frequent itemset
                del_num = [each_location[:index] + each_location[index + 1:]
                           for index in range(len(each_location))]
                # keep only subsets that are frequent at the previous level
                del_num = [i for i in del_num if i in self.pre_location]
                # look up each subset's support count at the previous level
                del_support = [self.pre_support[self.pre_location.index(i)]
                               for i in del_num if i in self.pre_location]
                for index, i in enumerate(del_num):  # compute support and confidence of each rule
                    index_support = 0
                    if len(self.support) != 1:
                        index_support = index
                    support = float(self.support[index_location]) / self.line_num * 100  # support
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            print(', '.join(s), '->>', self.item_name[each_location[index]],
                                  'min_support:', str(support) + '%',
                                  'min_confidence:', str(confidence) + '%')

def main():
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)

if __name__ == '__main__':
    main()

Apriori algorithm

Apriori(filename, min_support, item_start, item_end)

Parameter description

filename: path to the data file
min_support: minimum support, as a percentage
item_start: position of the first item column
item_end: position of the last item column
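
For reference, deal_line above assumes a comma-separated file whose first line holds the item names and whose item columns contain 'T' when the item is present in the record. The excerpt below is a hypothetical illustration of that shape (the real basket.txt ships with the repository); with item_start = 3, the items begin at the third column:

id,value,fruitveg,freshmeat,dairy,cannedveg,frozenmeal,beer,...
1,42.7,F,T,T,F,T,T,...
2,25.1,T,F,F,T,T,F,...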

Example of use:

import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)

Output:

--------------------------------------------------------------------------------
The 1 loop
location [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
support [299, 183, 177, 303, 204, 302, 293, 287, 184, 292, 276]
num [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 2 loop
location [[0, 9], [3, 5], [3, 6], [5, 6], [7, 10]]
support [145, 173, 167, 170, 144]
num [0, 3, 5, 6, 7, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 3 loop
location [[3, 5, 6]]
support [146]
num [3, 5, 6]
--------------------------------------------------------------------------------
frozenmeal, beer ->> cannedveg min_support: 14.6% min_confidence: 85.8823529412%
cannedveg, beer ->> frozenmeal min_support: 14.6% min_confidence: 87.4251497006%
cannedveg, frozenmeal ->> beer min_support: 14.6% min_confidence: 84.3930635838%
--------------------------------------------------------------------------------