An introduction to the principle of the Apriori algorithm
The algorithm consists of two main steps:
1. Search for the frequent itemsets
2. Generate the association rules
Core formulas:
Support(A⇒B) = P(A∪B)
Confidence(A⇒B) = P(B|A) = Support(A∪B) / Support(A)
Let's take a look at the processed data.
   Java  PHP  Python crawler  Spark  Data analysis  Machine learning
0   1.0  1.0             1.0    1.0            1.0               1.0
1   1.0  1.0             0.0    1.0            0.0               1.0
2   0.0  0.0             1.0    0.0            0.0               0.0
3   0.0  0.0             1.0    0.0            1.0               0.0
4   0.0  0.0             1.0    0.0            1.0               1.0
5   0.0  0.0             0.0    0.0            1.0               0.0
6   0.0  0.0             1.0    0.0            0.0               1.0
7   1.0  0.0             1.0    1.0            0.0               0.0
8   1.0  1.0             0.0    0.0            1.0               0.0
9   0.0  0.0             1.0    0.0            0.0               1.0
Each row represents a student; 1.0 means the student purchased that course, and 0.0 means they did not.
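For reference, here is a minimal sketch that rebuilds this table as a pandas DataFrame; the variable name data and the English column names are taken from the snippets and table above:

import pandas as pd

# 1.0 = course purchased, 0.0 = not purchased; one row per student
data = pd.DataFrame(
    [[1, 1, 1, 1, 1, 1],
     [1, 1, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0, 0],
     [1, 1, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1]],
    columns=['Java', 'PHP', 'Python crawler', 'Spark',
             'Data analysis', 'Machine learning'], dtype=float)

# Checking the core formulas on this data:
# Support(PHP⇒Java) = P(PHP∪Java) = 3/10 = 0.3
# Confidence(PHP⇒Java) = Support(PHP∪Java) / Support(PHP) = 0.3 / 0.3 = 1.0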
First, define the join function:

import pandas as pd

# Custom join function implementing the L_{k-1} -> C_k join step
def connect_string(x, ms):
    x = list(map(lambda i: sorted(i.split(ms)), x))
    l = len(x[0])
    r = []
    for i in range(len(x)):
        for j in range(i, len(x)):
            if x[i][:l-1] == x[j][:l-1] and x[i][l-1] != x[j][l-1]:
                r.append(x[i][:l-1] + sorted([x[j][l-1], x[i][l-1]]))
    return r
The parameter x is L_{k-1}, the set of all frequent (k-1)-itemsets; ms is the delimiter used to join item names, here '--'.
The idea of the loop: itemsets are compared pairwise, and if the first k-1 items of two itemsets are identical but their k-th items differ, the two are spliced together to form a candidate itemset. The candidates together form C_k.
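A quick illustration of the join, using the connect_string defined above on three single-item sets (the inputs are just column names picked from the data):

>>> connect_string(['Java', 'PHP', 'Python crawler'], u'--')
[['Java', 'PHP'], ['Java', 'Python crawler'], ['PHP', 'Python crawler']]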
Next, define the rule-search function. d is the data, support is the support threshold, and confidence is the confidence threshold.

def find_rule(d, support, confidence, ms=u'--'):
    support_series = 1.0 * d.sum() / len(d)  # support series of the single items
>>> support_series
Java                0.4
PHP                 0.3
Python crawler      0.7
Spark               0.3
Data analysis       0.5
Machine learning    0.5
dtype: float64
    column = list(support_series[support_series > support].index)  # preliminary filtering by support

>>> column
['Java', 'Python crawler', 'Data analysis', 'Machine learning']  # here the support threshold is set to 0.3
    while len(column) > 1:
        column = connect_string(column, ms)  # join step
        sf = lambda i: d[i].prod(axis=1, numeric_only=True)  # row-wise product function
        d_2 = pd.DataFrame(list(map(sf, column)), index=[ms.join(i) for i in column]).T  # build the joined data
>>> column
[['Java', 'Python crawler'], ['Java', 'Data analysis'], ['Java', 'Machine learning'], ['Python crawler', 'Data analysis'], ['Python crawler', 'Machine learning'], ['Data analysis', 'Machine learning']]

>>> d_2
   Java--Python crawler  Java--Data analysis  Java--Machine learning  \
0                   1.0                  1.0                     1.0
1                   0.0                  0.0                     1.0
2                   0.0                  0.0                     0.0
3                   0.0                  0.0                     0.0
4                   0.0                  0.0                     0.0
5                   0.0                  0.0                     0.0
6                   0.0                  0.0                     0.0
7                   1.0                  0.0                     0.0
8                   0.0                  1.0                     0.0
9                   0.0                  0.0                     0.0

   Python crawler--Data analysis  Python crawler--Machine learning  \
0                            1.0                                1.0
1                            0.0                                0.0
2                            0.0                                0.0
3                            1.0                                0.0
4                            1.0                                1.0
5                            0.0                                0.0
6                            0.0                                1.0
7                            0.0                                0.0
8                            0.0                                0.0
9                            0.0                                1.0
...
DataFrame.prod() multiplies the values in the frame; with axis=1 the product is taken across each row, which makes it easy to test whether a transaction contains every item of an itemset.
For example:
>>> data[['Java', 'Python crawler']].prod(axis=1, numeric_only=True)
0 1.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 1.0
8 0.0
9 0.0
dtype: float64
        support_series_2 = 1.0 * d_2[[ms.join(i) for i in column]].sum() / len(d)  # compute the support after joining
        column = list(support_series_2[support_series_2 > support].index)  # new round of support filtering
        support_series = support_series.append(support_series_2)  # append the new frequent itemsets to the support series
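Computed from the data table above, the pair supports at this point come out as follows (a reconstruction from the data, not output reproduced from the source):

>>> support_series_2
Java--Python crawler                0.2
Java--Data analysis                 0.2
Java--Machine learning              0.2
Python crawler--Data analysis       0.3
Python crawler--Machine learning    0.4
Data analysis--Machine learning     0.2
dtype: float64

With the 0.3 threshold used above, only 'Python crawler--Machine learning' survives this round of filtering.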
Generating the association rules:
        column2 = []
        for i in column:  # enumerate the possible rules
            i = i.split(ms)
            for j in range(len(i)):
                column2.append(i[:j] + i[j + 1:] + i[j:j + 1])
        cofidence_series = pd.Series(index=[ms.join(i) for i in column2])  # define the confidence series
        for i in column2:  # compute the confidence of each rule
            cofidence_series[ms.join(i)] = support_series[ms.join(sorted(i))] / support_series[ms.join(i[:len(i) - 1])]
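In each rearrangement the last item plays the role of the consequent. For instance, for the frequent pair Python crawler--Machine learning, column2 would contain both orderings (an illustration of the enumeration above):

>>> column2
[['Machine learning', 'Python crawler'], ['Python crawler', 'Machine learning']]

The first entry encodes the rule Machine learning ⇒ Python crawler, the second Python crawler ⇒ Machine learning.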
Confidence filtering:

    result = pd.DataFrame(index=['support', 'confidence'])  # define the output
    for i in cofidence_series[cofidence_series > confidence].index:  # confidence filtering
        result[i] = 0.0
        result[i]['confidence'] = cofidence_series[i]
        result[i]['support'] = support_series[ms.join(sorted(i.split(ms)))]
    result = result.T.sort_values(['confidence', 'support'], ascending=False)  # sort the result for output
    print(result)
    return result
Example:
spt = 0.2
cfd = 0.5
find_rule(data, spt, cfd, u'-->')
Results:

                                    support  confidence
PHP-->Java                              0.3    1.000000
Spark-->Java                            0.3    1.000000
Machine learning-->Python crawler       0.4    0.800000
Java-->PHP                              0.3    0.750000
Java-->Spark                            0.3    0.750000
Data analysis-->Python crawler          0.3    0.600000
Python crawler-->Machine learning       0.4    0.571429

Reading the first row: every student who bought PHP also bought Java (confidence 1.0), and 30% of all students bought both courses (support 0.3).
Full code:
#-*-Coding:utf-8-*-from __future__ import print_function import pandas as PD # Custom connection function, for implementing L_{K-1} to C_k connection Def Conne Ct_string (x, ms): x = List (map (Lambda i:sorted (I.split (MS), x)) L = Len (x[0)) R = [] for i in range (Len (
x): For j in range (I, Len (x)): if x[i][:l-1] = = X[j][:l-1] and x[i][l-1]!= X[j][l-1]: R.append (X[i][:l-1] + sorted ([X[j][l-1], x[i][l-1])) return R # Find the function def Find_rule for association rules (d, Support, Confidence, Ms=u '--'): result = PD. Dataframe (index=[' support ', ' confidence ']) # define OUTPUT Support_series = 1.0 * D.sum ()/Len (d) # support degree sequence column = Li St (Support_series[support_series > Support].index) # preliminary filtering k = 0 while len (column) > 1:k = k based on support degree + 1 print (U ' \ n is doing the first%s search ... '% k) column = connect_string (column, ms) print (U ' number:%s ... '% len (colu MN)) SF = Lambda I:d[i].prod (Axis=1, Numeric_only=none) # New batch of support COMPUTE function # Create connection data, this step is time-consuming, consumingSave the most serious.
When the dataset is large, parallel operation optimization can be considered. D_2 = PD. Dataframe (Map (SF, column), Index=[ms.join (i) for I in column]). T support_series_2 = 1.0 * D_2[[ms.join (i) for I in Column]].sum ()/Len (d) # COMPUTE the support column = list after connection (su Pport_series_2[support_series_2 > Support].index) # New round of support screening support_series = Support_series.append (support_se
ries_2) Column2 = [] for i in column: # traverse possible inference, such as whether {a,b,c} is a+b-->c or B+c-->a or c+a-->b.
i = I.split (ms) for J in range (Len (i)): Column2.append (I[:j] + i[j + 1:] + i[j:j + 1]) Cofidence_series = PD. Series (Index=[ms.join (i) for I in Column2]) # define confidence sequence for I in Column2: # Calculate confidence Sequence cofidence_series[ Ms.join (i)] = Support_series[ms.join (sorted (i))]/Support_series[ms.join (I[:len (i)-1])] for I in Cofidence_ser Ies[cofidence_series > Confidence].index: # confidence Screening Result[i] = 0.0 result[i][' confidence '] = Co Fidence_Series[i] result[i][' support ' = Support_series[ms.join (sorted (ms))] result = result. T.sort_values ([' confidence ', ' support '], ascending=false) # result collation, output print (U ' \ n Result: ') print (results) return re
Sult
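One portability note (a fact about pandas itself, not from the source): Series.append was removed in pandas 2.0, so on a recent pandas the accumulation line would need to become

support_series = pd.concat([support_series, support_series_2])

and the chained assignments such as result[i]['confidence'] = ... are more safely written as result.loc['confidence', i] = ....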
The data and source code come from a handout of the "Data Mining and Analysis" course; this article is only my notes taken while studying it.