Data mining: detailed explanation of the Apriori algorithm with Python implementation code
Association rule mining is one of the most active research areas in data mining. It was originally motivated by the desire to discover relationships between different commodities in supermarket transaction databases (the classic "beer and diapers" example).
Basic Concepts
1. Definition of support: support(X --> Y) = |X ∪ Y| / N = (number of records in which the items of X and Y appear together) / (total number of records). For example: support({beer} --> {diapers}) = (records containing both beer and diapers) / (total records) = 3/5 = 60%.
2. Definition of confidence: confidence(X --> Y) = |X ∪ Y| / |X| = (number of records in which the items of X and Y appear together) / (number of records containing X). For example: confidence({beer} --> {diapers}) = (records containing both beer and diapers) / (records containing beer) = 3/3 = 100%; confidence({diapers} --> {beer}) = (records containing both beer and diapers) / (records containing diapers) = 3/4 = 75%.
Rules that satisfy both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) are called strong rules. An item set that meets the minimum support is called a frequent item set.
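To make the two definitions concrete, here is a minimal sketch that computes them over a hypothetical five-transaction list (the transactions below are invented so that the counts 3/5, 3/3 and 3/4 from the beer-and-diapers example come out):

transactions = [
    {'beer', 'diapers', 'bread'},
    {'beer', 'diapers'},
    {'beer', 'diapers', 'milk'},
    {'diapers', 'bread'},
    {'milk', 'bread'},
]

def support(x, y):
    # support(X --> Y): records containing X and Y together / all records
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)

def confidence(x, y):
    # confidence(X --> Y): records containing X and Y together / records containing X
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / sum(1 for t in transactions if x <= t)

print(support({'beer'}, {'diapers'}))     # 0.6  -> 60%
print(confidence({'beer'}, {'diapers'}))  # 1.0  -> 100%
print(confidence({'diapers'}, {'beer'}))  # 0.75 -> 75%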
"How to mine association rules from large databases ?" Association rule mining is a two-step process:
1. Find all frequent item sets: by definition, each of these item sets must occur at least as frequently as the predefined minimum support count.
2. Generate strong association rules from the frequent item sets: by definition, these rules must satisfy both the minimum support and the minimum confidence (see the sketch after this list).
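A minimal sketch of step 2, assuming the frequent item sets and their support counts have already been found in step 1 (the counts below are the hypothetical ones from the beer-and-diapers example, and min_conf is set to 80% for illustration): for each frequent item set, every non-empty proper subset is tried as an antecedent, and a rule is kept only if its confidence clears the threshold.

from itertools import combinations

N = 5                 # hypothetical number of records
min_conf = 0.8        # hypothetical minimum confidence
support_count = {     # hypothetical support counts from step 1
    frozenset(['beer', 'diapers']): 3,
    frozenset(['beer']): 3,
    frozenset(['diapers']): 4,
}

itemset = frozenset(['beer', 'diapers'])
for r in range(1, len(itemset)):
    for antecedent in combinations(itemset, r):
        x = frozenset(antecedent)
        conf = support_count[itemset] / support_count[x]
        if conf >= min_conf:
            print(set(x), '-->', set(itemset - x),
                  'support:', support_count[itemset] / N,
                  'confidence:', conf)

Only {beer} --> {diapers} survives here; {diapers} --> {beer} has confidence 0.75 and is dropped.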
Apriori Laws
To reduce the time needed to generate frequent item sets, candidate sets that cannot possibly be frequent should be eliminated as early as possible. The two Apriori laws make this possible.
Apriori Law 1: If a set is a frequent item set, then all of its subsets are frequent item sets as well. For example, if the set {A, B} is a frequent item set, that is, A and B appear together in at least min_support records, then each of its subsets {A} and {B} must also appear in at least min_support records; every subset is therefore a frequent item set.
Apriori Law 2: If a set is not a frequent item set, then none of its supersets is a frequent item set. For example, if the set {A} is not a frequent item set, that is, A appears in fewer than min_support records, then any superset such as {A, B} must also appear in fewer than min_support records; no superset can therefore be a frequent item set.
The figure above demonstrates the process of the Apriori algorithm. Note that when the level-3 candidate item sets are generated from the level-2 frequent item sets, {milk, bread, beer} does not appear, because {bread, beer} is not a level-2 frequent item set; this is Apriori Law 2 at work. Once the level-3 frequent item sets have been generated, no higher-level candidate item sets can be formed, so the algorithm terminates. {milk, bread, diapers} is the largest frequent item set.
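The candidate-generation-plus-pruning step can be written compactly. The sketch below is a simplified, self-contained illustration (not the implementation given later); the level-2 frequent item sets are assumed from the walkthrough above, and Law 2 discards every candidate that has an infrequent subset:

from itertools import combinations

def candidates(prev_frequent, k):
    # join the level-(k-1) frequent item sets, then prune with Apriori Law 2
    items = sorted({i for s in prev_frequent for i in s})
    result = []
    for cand in combinations(items, k):
        cand = frozenset(cand)
        # Law 2: every (k-1)-subset of a frequent set must itself be frequent
        if all(frozenset(sub) in prev_frequent for sub in combinations(cand, k - 1)):
            result.append(cand)
    return result

# level-2 frequent item sets assumed from the walkthrough above
level2 = {frozenset(p) for p in [('milk', 'bread'), ('milk', 'diapers'),
                                 ('bread', 'diapers'), ('milk', 'beer')]}
print(candidates(level2, 3))   # only {milk, bread, diapers}; no {milk, bread, beer}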
Python implementation code:
(The code below comes from datamining/apriori.py in the GitHub repository taizilongxu/datamining.)
# -*- coding: utf-8 -*-
# --------------------------------- Import ----------------------------------
import itertools
# ----------------------------------------------------------------------------


class Apriori(object):
    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support          # minimum support, as a percentage
        self.min_confidence = 50                # minimum confidence, as a percentage
        self.line_num = 0                       # number of transactions
        self.item_start = item_start            # first column that holds an item
        self.item_end = item_end                # last column that holds an item
        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set([j for i in self.location for j in i])))  # items in play
        self.pre_support = []   # support, location and num of the previous level
        self.pre_location = []
        self.pre_num = []
        self.item_name = []     # item names
        self.find_item_name()
        self.loop()
        self.confidence_sup()

    def deal_line(self, line):
        """Extract the item fields from one line."""
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        """Read the item names from the first (header) line."""
        with open(self.filename, 'r') as f:
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    self.item_name = self.deal_line(line)
                    break

    def sut(self, location):
        """
        Input:  [[1, 2, 3], [2, 3], [1, 3, 5], ...]
        Output: the support count [123, 435, 234, ...] of each set in location
        """
        with open(self.filename, 'r') as f:
            support = [0] * len(location)
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    continue  # skip the header line
                # extract the items of this transaction
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    flag = 0
                    for j in i:
                        if item_line[j] != 'T':
                            flag = 1
                            break
                    if not flag:
                        support[index_num] += 1
            self.line_num = index  # total number of transactions, header excluded
        return support

    def select(self, c):
        """Generate the level-c candidate sets (returns a new location list)."""
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # deduplicate the candidate lists
        s = sorted([sorted(i) for i in stack])
        location = list(s for s, _ in itertools.groupby(s))
        return location

    def del_location(self, support, location):
        """Prune candidate sets that cannot be frequent."""
        # drop candidates whose support is below the minimum
        for index, i in enumerate(support):
            if i < self.line_num * self.min_support / 100:
                support[index] = 0
        # apply Apriori Law 2: every (c-1)-subset must itself be frequent
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # delete the pruned entries
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        """Iterate level by level over the frequent item sets."""
        s = 2
        while True:
            print('-' * 80)
            print('The', s - 1, 'loop')
            print('location', self.location)
            print('support', self.support)
            print('num', self.num)
            print('-' * 80)
            # generate the candidate sets of the next level
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set([j for i in location for j in i])))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support
                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        """Compute the confidence of each rule."""
        if sum(self.pre_support) == 0:
            print('min_support error')  # even the first iteration failed
        else:
            for index_location, each_location in enumerate(self.location):
                # all (c-1)-subsets of this frequent item set ...
                del_num = [each_location[:index] + each_location[index + 1:]
                           for index in range(len(each_location))]
                # ... kept only if they were frequent at the previous level
                del_num = [i for i in del_num if i in self.pre_location]
                # look up their support counts at the previous level
                del_support = [self.pre_support[self.pre_location.index(i)]
                               for i in del_num if i in self.pre_location]
                # compute the support and confidence of each association rule
                for index, i in enumerate(del_num):
                    support = float(self.support[index_location]) / self.line_num * 100
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            print(','.join(s), '->', self.item_name[each_location[index]],
                                  'min_support:', str(support) + '%',
                                  'min_confidence:', str(confidence) + '%')


def main():
    Apriori('basket.txt', 14, 3, 13)
    Apriori('simple.txt', 50, 2, 6)


if __name__ == '__main__':
    main()
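A note on the input format, as inferred from deal_line and sut above: the data file is space-separated, its first line lists the item names, and each following line marks an item with 'T' in the corresponding column when the transaction contains it. A hypothetical simple.txt for Apriori('simple.txt', 50, 2, 6), with the items in columns 2 through 6, could look like this:

id beer diapers bread milk eggs
1 T T F F T
2 T F T T F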
Apriori algorithm
Apriori(filename, min_support, item_start, item_end)
Parameter description:
filename: path of the data file
min_support: minimum support (as a percentage)
item_start: start column of the items
item_end: end column of the items
Example:
import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)
Output:
--------------------------------------------------------------------------------
The 1 loop
location [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
support [299, 183, 177, 303, 204, 302, 293, 287, 184, 292, 276]
num [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 2 loop
location [[0, 9], [3, 5], [3, 6], [5, 6], [7, 10]]
support [145, 173, 167, 170, 144]
num [0, 3, 5, 6, 7, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 3 loop
location [[3, 5, 6]]
support [146]
num [3, 5, 6]
--------------------------------------------------------------------------------
frozenmeal,beer -> cannedveg min_support: 14.6% min_confidence: 85.8823529412%
cannedveg,beer -> frozenmeal min_support: 14.6% min_confidence: 87.4251497006%
cannedveg,frozenmeal -> beer min_support: 14.6% min_confidence: 84.3930635838%
--------------------------------------------------------------------------------
Who has code for the Apriori data mining algorithm in Java? Needed urgently.
For a good implementation, look at the WEKA source code, or see www.helsinki.fi/...s.html ~
In practice, though, it is tedious to work through code other people have written, and the idea behind Apriori is quite basic. Java also has plenty of useful collection classes; a usable implementation can be written in a day ~
Apriori algorithm Data Mining
I think weka should be perfect for you ^
It is very convenient to run your own algorithms on it or to build directly on its APIs for further development, as you mentioned ~ Comparing the original algorithm with your own is not difficult: initialize both algorithm objects in your own code, train and test them together, and put the final results side by side. As for how to organize the graphical interface, just do whatever fits your needs.
If you do not want to write code, run weka's Explorer or KnowledgeFlow a few times from its GUI; weka's graphical output is varied and intuitive. ^
Here is a book recommendation:
Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian Witten
It is the companion textbook for weka, with many examples, and it is simple and easy to follow.
If you have further questions, search the weka mailing list for answers. It is a great discussion group; at the least, it has helped me a lot (link in the reference below).
Hope this helps ^
Reference: list.scms.waikato.ac.nz/mailman/htdig/wekalist/