Data mining: detailed explanation of the Apriori algorithm and Python implementation code


Association rule mining is one of the most active research areas in data mining. It was originally motivated by the discovery of relationships between different commodities in supermarket transaction databases (the classic "beer and diapers" example).

Basic Concepts

1. Definition of support: support(X --> Y) = |X ∪ Y| / N = (number of records in which the items of X and Y appear together) / (total number of records). For example: support({beer} --> {diapers}) = (records containing both beer and diapers) / (total records) = 3/5 = 60%.

2. Definition of confidence: confidence(X --> Y) = |X ∪ Y| / |X| = (number of records in which the items of X and Y appear together) / (number of records containing X). For example: confidence({beer} --> {diapers}) = (records with both beer and diapers) / (records with beer) = 3/3 = 100%; confidence({diapers} --> {beer}) = (records with both beer and diapers) / (records with diapers) = 3/4 = 75%.

Rules that meet both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) are called strong rules. An itemset that meets the minimum support is called a frequent itemset. Both measures are easy to compute directly, as the sketch below shows.
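To make the two measures concrete, here is a minimal Python sketch; the five transactions are hypothetical data chosen to reproduce the counts in the examples above:

# Hypothetical transaction data reproducing the 3/5 and 3/4 counts above.
transactions = [
    {'beer', 'diapers'},
    {'beer', 'diapers', 'bread'},
    {'beer', 'diapers', 'milk'},
    {'diapers', 'milk'},
    {'bread', 'milk'},
]

def support(X, Y):
    """Fraction of all transactions containing every item of X and Y."""
    both = X | Y
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """Fraction of the transactions containing X that also contain Y."""
    return (sum((X | Y) <= t for t in transactions)
            / sum(X <= t for t in transactions))

print(support({'beer'}, {'diapers'}))     # 0.6  -> 60%
print(confidence({'beer'}, {'diapers'}))  # 1.0  -> 100%
print(confidence({'diapers'}, {'beer'}))  # 0.75 -> 75%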

"How to mine association rules from large databases ?" Association rule mining is a two-step process:

1. Identify all frequent item sets: as defined, these item sets are at least as frequent as the predefined minimum supported count.
2. Strong association rules generated by frequent item sets: as defined, these rules must meet the minimum support and minimum confidence level.

Apriori Laws

To reduce the time spent generating frequent itemsets, sets that cannot possibly be frequent should be eliminated as early as possible. The two Apriori laws do exactly this.

Apriori law 1: If a set is a frequent itemset, then all of its subsets are frequent itemsets. For example, if the set {A, B} is a frequent itemset, i.e. A and B appear together in at least min_support of the records, then its subsets {A} and {B} must each appear at least min_support times as well, so every subset is a frequent itemset.

Apriori law 2: If a set is not a frequent itemset, then none of its supersets is a frequent itemset. For example, if the set {A} is not a frequent itemset, i.e. A appears fewer than min_support times, then any superset such as {A, B} must appear fewer than min_support times too, so no superset can be a frequent itemset. Law 2 is what allows a candidate to be pruned before its support is ever counted, as in the sketch below.
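A minimal sketch of that pruning step; the set L2 of second-level frequent itemsets is hypothetical and chosen to match the example in the next paragraph:

from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori law 2: if any (k-1)-subset of a k-item candidate is not
    in the previous level's frequent itemsets, the candidate can be
    pruned without counting its support."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# hypothetical second-level frequent itemsets
L2 = {frozenset({'milk', 'bread'}), frozenset({'milk', 'diapers'}),
      frozenset({'bread', 'diapers'}), frozenset({'milk', 'beer'})}

print(has_infrequent_subset({'milk', 'bread', 'diapers'}, L2))  # False: keep
print(has_infrequent_subset({'milk', 'bread', 'beer'}, L2))     # True: prune, {bread, beer} is infrequent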

The figure above demonstrates the process of the Apriori algorithm. Note that when the third-level candidate itemsets are generated from the second-level frequent itemsets, {milk, bread, beer} does not appear, because {bread, beer} is not a second-level frequent itemset; this is the Apriori law at work. After the third-level frequent itemsets are generated, no higher-level candidate itemsets can be formed, so the whole algorithm ends. {milk, bread, diapers} is the largest frequent itemset. The entire level-wise process can be condensed into a few lines, as sketched below.
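Here is a compact, self-contained sketch of that level-wise search, independent of the class below. The transactions are hypothetical and chosen so that the run mirrors the walkthrough: the candidate {milk, bread, beer} is pruned because {bread, beer} is infrequent, and {milk, bread, diapers} emerges as the largest frequent itemset.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Count 1-itemsets, then repeatedly join k-itemsets into (k+1)-item
    candidates, prune by Apriori law 2, and count support, until no
    candidates survive."""
    items = {i for t in transactions for i in t}
    level = {frozenset({i}) for i in items}
    frequent = {}
    while level:
        # count the support of the current candidates in one pass
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(survivors)
        # join step: unions that grow by exactly one item, pruned by law 2
        keys = list(survivors)
        level = {a | b for a, b in combinations(keys, 2)
                 if len(a | b) == len(a) + 1
                 and all(frozenset(s) in survivors
                         for s in combinations(a | b, len(a | b) - 1))}
    return frequent

transactions = [{'milk', 'bread', 'diapers'},
                {'milk', 'bread', 'diapers'},
                {'milk', 'bread', 'diapers', 'beer'},
                {'milk', 'beer'},
                {'milk', 'beer'},
                {'bread', 'diapers'}]
for itemset, count in sorted(apriori_frequent_itemsets(transactions, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)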

Python implementation code:

The code is as follows:
(Source: the GitHub repository taizilongxu/datamining, file apriori.py)
# -*- coding: utf-8 -*-
import itertools


class Apriori(object):

    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support          # minimum support, as a percentage
        self.min_confidence = 50                # minimum confidence, as a percentage
        self.line_num = 0                       # number of data records
        self.item_start = item_start            # first column that holds an item
        self.item_end = item_end                # last column that holds an item

        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set([j for i in self.location for j in i])))  # items in play

        self.pre_support = []                   # support, location, num of the previous level
        self.pre_location = []
        self.pre_num = []

        self.item_name = []                     # item names
        self.find_item_name()
        self.loop()
        self.confidence_sup()

    def deal_line(self, line):
        """Extract the item columns from one line."""
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        """Extract item_name from the first (header) line."""
        with open(self.filename, 'r') as f:
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    self.item_name = self.deal_line(line)
                    break

    def sut(self, location):
        """
        Input:  [[1, 2, 3], [2, 3], [1, 3, 5], ...]
        Output: the support count of each itemset in location, e.g. [123, 435, 234, ...]
        """
        with open(self.filename, 'r') as f:
            support = [0] * len(location)
            for index, line in enumerate(f.readlines()):
                if index == 0:
                    continue                    # skip the header line
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    flag = 0
                    for j in i:
                        if item_line[j] != 'T':  # 'T' marks the item as present
                            flag = 1
                            break
                    if not flag:
                        support[index_num] += 1
            self.line_num = index               # total rows minus the item_name header row
            return support

    def select(self, c):
        """Generate the next level's candidate itemsets (as locations)."""
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # deduplicate the list of lists
        s = sorted([sorted(i) for i in stack])
        location = list(s for s, _ in itertools.groupby(s))
        return location

    def del_location(self, support, location):
        """Remove candidate itemsets that cannot be frequent."""
        # remove candidates below the minimum support
        for index, i in enumerate(support):
            if i < self.line_num * self.min_support / 100:
                support[index] = 0
        # prune by Apriori law 2: a candidate with an infrequent subset cannot be frequent
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # drop the pruned entries
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        """Level-wise iteration over the s-level frequent itemsets."""
        s = 2
        while True:
            print('-' * 80)
            print('The', s - 1, 'loop')
            print('location', self.location)
            print('support', self.support)
            print('num', self.num)
            print('-' * 80)

            # generate and count the next level's candidate set
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set([j for i in location for j in i])))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support

                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        """Compute the support and confidence of each rule."""
        if sum(self.pre_support) == 0:
            print('min_support error')          # even the first iteration produced nothing
        else:
            for index_location, each_location in enumerate(self.location):
                # generate the next-lower-level itemsets
                del_num = [each_location[:index] + each_location[index + 1:]
                           for index in range(len(each_location))]
                # keep only the subsets that are frequent at the previous level
                del_num = [i for i in del_num if i in self.pre_location]
                # look up the support of each such subset
                del_support = [self.pre_support[self.pre_location.index(i)]
                               for i in del_num if i in self.pre_location]
                for index, i in enumerate(del_num):  # support and confidence of each rule
                    support = float(self.support[index_location]) / self.line_num * 100
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            print(','.join(s), '->', self.item_name[each_location[index]],
                                  ' min_support: ', str(support) + '%',
                                  ' min_confidence:', str(confidence) + '%')


def main():
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)


if __name__ == '__main__':
    main()

Apriori algorithm

Apriori(filename, min_support, item_start, item_end)

Parameter description

filename: (path) name of the data file
min_support: minimum support, given as a percentage (e.g. 14 means 14%)
item_start: first column that holds an item
item_end: last column that holds an item
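The expected input format is not spelled out in the original post; the following is inferred from deal_line and sut: the first line holds the item names, each subsequent line is one record, fields are separated by spaces, and a field equal to 'T' marks that item as present in the record. A hypothetical fragment of such a file, with the items in columns 2 through 5 (item_start=2, item_end=5):

id item_A item_B item_C item_D
1  T      F      T      T
2  F      T      T      F
3  T      T      F      T

With this hypothetical file saved as data.txt, Apriori('data.txt', 50, 2, 5) would mine the four item columns at a 50% minimum support.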

Example:

The code is as follows:
import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)

Output:

Copy codeThe Code is as follows:
--------------------------------------------------------------------------------
The 1 loop
location [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
support [299, 183, 177, 303, 204, 302, 293, 287, 184, 292, 276]
num [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 2 loop
location [[0, 9], [3, 5], [3, 6], [5, 6], [7, 10]]
support [145, 173, 167, 170, 144]
num [0, 3, 5, 6, 7, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 3 loop
location [[3, 5, 6]]
support [146]
num [3, 5, 6]
--------------------------------------------------------------------------------
frozenmeal,beer -> cannedveg  min_support: 14.6%  min_confidence: 85.8823529412%
cannedveg,beer -> frozenmeal  min_support: 14.6%  min_confidence: 87.4251497006%
cannedveg,frozenmeal -> beer  min_support: 14.6%  min_confidence: 84.3930635838%
--------------------------------------------------------------------------------


Who has Java code implementing the data mining Apriori algorithm? Urgently needed

For a better implementation, look at the WEKA source code, or www.helsinki.fi/...s.html ~

But honestly, reading other people's code is tedious, and the idea behind Apriori is very basic; Java also has plenty of useful collection classes, so you could write a usable implementation in just a day ~

Apriori algorithm Data Mining

I think Weka should suit you perfectly ^

It is very convenient to run your own algorithms there or to build on its APIs for secondary development, as you mentioned ~ Comparing the original algorithm with your own is not difficult either: instantiate both algorithm models in your own code, train and test them together, and put the final results side by side. As for how to organize the graphical interface, just do whatever you need.

If you do not want to write code, run Weka's Explorer or workflow GUI a few times; Weka's graphical output is varied and intuitive. ^

Here is a recommended book:
Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian Witten
It is Weka's companion textbook, with many examples that are simple and easy to follow.

If you have further questions, go to the Weka mailing list and search for answers. It is a great discussion group; at least it has helped me a lot (link in the reference below).

Hope this helps ^
Reference: list.scms.waikato.ac.nz/mailman/htdig/wekalist/
