Apriori Algorithm: Source Code Walkthrough


For an introduction to the principles of the Apriori algorithm, see any standard reference on association rule mining.


The algorithm consists of two main steps:

1. Searching for frequent itemsets
2. Generating association rules

Core formulas:

Support(A⇒B) = P(A∪B)

Confidence(A⇒B) = P(B|A) = Support(A∪B) / Support(A)
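For example, using the data below: the itemset {python crawler, machine learning} appears in 4 of 10 transactions and {python crawler} appears in 7, so Support(python crawler⇒machine learning) = 0.4 and Confidence(python crawler⇒machine learning) = 0.4 / 0.7 ≈ 0.571, which matches the final results at the end of this post.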

Let's take a look at the processed data.

   Java  PHP  python crawler  Spark  data analysis  machine learning
0   1.0  1.0             1.0    1.0            1.0               1.0
1   1.0  1.0             0.0    1.0            0.0               1.0
2   0.0  0.0             1.0    0.0            0.0               0.0
3   0.0  0.0             1.0    0.0            1.0               0.0
4   0.0  0.0             1.0    0.0            1.0               1.0
5   0.0  0.0             0.0    0.0            1.0               0.0
6   0.0  0.0             1.0    0.0            0.0               1.0
7   1.0  0.0             1.0    1.0            0.0               0.0
8   1.0  1.0             0.0    0.0            1.0               0.0
9   0.0  0.0             1.0    0.0            0.0               1.0

Each row represents a student, and a 1 indicates that the student purchased the corresponding course.
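For reference, here is a minimal sketch of how this one-hot table could be built as a pandas DataFrame. The variable name data matches the snippets below; the construction itself is my assumption, since the original post loads the data from course files that are not included:

import pandas as pd

# One-hot purchase matrix: rows are students, columns are courses (assumed construction)
data = pd.DataFrame(
    [[1, 1, 1, 1, 1, 1],
     [1, 1, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0, 0],
     [1, 1, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1]],
    columns=['Java', 'PHP', 'python crawler', 'Spark', 'data analysis', 'machine learning'],
    dtype=float)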

First, define the join function:

import pandas as pd

# Custom join function implementing the L_{k-1} -> C_k join step
def connect_string(x, ms):
    x = list(map(lambda i: sorted(i.split(ms)), x))
    l = len(x[0])
    r = []
    for i in range(len(x)):
        for j in range(i, len(x)):
            if x[i][:l-1] == x[j][:l-1] and x[i][l-1] != x[j][l-1]:
                r.append(x[i][:l-1] + sorted([x[j][l-1], x[i][l-1]]))
    return r
The parameter x is L_{k-1}, the set of all frequent (k-1)-itemsets; ms is the delimiter used to join item names within an itemset, here '--'.

The loop logic: itemsets are compared pairwise; if the first k-1 items of two itemsets are identical but their k-th items differ, the two are spliced together to form a candidate itemset, and all such candidates together form C_k.
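As a quick sanity check (the input list here is chosen just for illustration), joining three frequent 1-itemsets yields every candidate 2-itemset, each sorted internally:

>>> connect_string(['Java', 'python crawler', 'data analysis'], '--')
[['Java', 'python crawler'], ['Java', 'data analysis'], ['data analysis', 'python crawler']]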

def find_rule(d, support, confidence, ms=u'--'):
    support_series = 1.0 * d.sum() / len(d)  # support of the single items

Here d is the data, support is the support threshold, and confidence is the confidence threshold.

>>> support_series
Java                0.4
PHP                 0.3
python crawler      0.7
Spark               0.3
data analysis       0.5
machine learning    0.5
dtype: float64
    column = list(support_series[support_series > support].index)  # preliminary filtering by support

>>> column
['Java', 'python crawler', 'data analysis', 'machine learning']  # the support threshold here is 0.3
    while len(column) > 1:
        column = connect_string(column, ms)  # join step
        sf = lambda i: d[i].prod(axis=1, numeric_only=True)  # row-product function for the candidates
        d_2 = pd.DataFrame(list(map(sf, column)), index=[ms.join(i) for i in column]).T  # build the candidate columns
>>> column
[['Java', 'python crawler'], ['Java', 'data analysis'], ['Java', 'machine learning'],
 ['python crawler', 'data analysis'], ['python crawler', 'machine learning'],
 ['data analysis', 'machine learning']]

>>> d_2
   Java--python crawler  Java--data analysis  Java--machine learning  \
0                   1.0                  1.0                     1.0
1                   0.0                  0.0                     1.0
2                   0.0                  0.0                     0.0
3                   0.0                  0.0                     0.0
4                   0.0                  0.0                     0.0
5                   0.0                  0.0                     0.0
6                   0.0                  0.0                     0.0
7                   1.0                  0.0                     0.0
8                   0.0                  1.0                     0.0
9                   0.0                  0.0                     0.0

   python crawler--data analysis  python crawler--machine learning  \
0                            1.0                               1.0
1                            0.0                               0.0
2                            0.0                               0.0
3                            1.0                               0.0
4                            1.0                               1.0
5                            0.0                               0.0
6                            0.0                               1.0
7                            0.0                               0.0
8                            0.0                               0.0
9                            0.0                               1.0

   data analysis--machine learning
0                              1.0
1                              0.0
2                              0.0
3                              0.0
4                              1.0
5                              0.0
6                              0.0
7                              0.0
8                              0.0
9                              0.0

DataFrame.prod() multiplies values together; with axis=1 the product is taken across each row. Since the data is one-hot encoded, the row product is 1.0 exactly when the transaction contains every item in the itemset, so this determines whether a transaction contains the itemset.

For example:

>>> data[['Java', 'python crawler']].prod(axis=1, numeric_only=True)
0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
5    0.0
6    0.0
7    1.0
8    0.0
9    0.0
dtype: float64
        support_series_2 = 1.0 * d_2[[ms.join(i) for i in column]].sum() / len(d)  # support of the joined itemsets
        column = list(support_series_2[support_series_2 > support].index)  # new round of support filtering
        support_series = support_series.append(support_series_2)  # add the new frequent itemsets to the support series
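Continuing the walkthrough, the supports of the candidate pairs follow directly from d_2 above (each column sum divided by 10):

>>> support_series_2
Java--python crawler                0.2
Java--data analysis                 0.2
Java--machine learning              0.2
python crawler--data analysis       0.3
python crawler--machine learning    0.4
data analysis--machine learning     0.2
dtype: float64

With the 0.3 threshold used above, only 'python crawler--machine learning' has support strictly greater than 0.3, so it alone survives this round.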

Generating the association rules:

        column2 = []
        for i in column:  # enumerate the possible rules from each frequent itemset
            i = i.split(ms)
            for j in range(len(i)):
                column2.append(i[:j] + i[j+1:] + i[j:j+1])  # move item j to the end: the last item is the consequent

        cofidence_series = pd.Series(index=[ms.join(i) for i in column2])  # confidence series

        for i in column2:  # compute the confidence of each rule
            cofidence_series[ms.join(i)] = support_series[ms.join(sorted(i))] / support_series[ms.join(i[:len(i)-1])]
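For example, for the rule python crawler --> machine learning, the antecedent is ['python crawler'] and the consequent is 'machine learning', so the confidence is Support({python crawler, machine learning}) / Support({python crawler}) = 0.4 / 0.7 ≈ 0.571429, which is exactly the value in the final results.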

Finally, filter by confidence and assemble the output:

    result = pd.DataFrame(index=['support', 'confidence'])  # output DataFrame
    for i in cofidence_series[cofidence_series > confidence].index:  # confidence filtering
        result[i] = 0.0
        result[i]['confidence'] = cofidence_series[i]
        result[i]['support'] = support_series[ms.join(sorted(i.split(ms)))]

    result = result.T.sort_values(['confidence', 'support'], ascending=False)  # sort the results for output
    print(result)
    return result

Example:

spt = 0.2
cfd = 0.5
find_rule(data, spt, cfd, '-->')
Results:

                                   support  confidence
PHP-->Java                             0.3    1.000000
Spark-->Java                           0.3    1.000000
machine learning-->python crawler      0.4    0.800000
Java-->PHP                             0.3    0.750000
Java-->Spark                           0.3    0.750000
data analysis-->python crawler         0.3    0.600000
python crawler-->machine learning      0.4    0.571429

Full code:

# -*- coding: utf-8 -*-
from __future__ import print_function
import pandas as pd

# Custom join function implementing the L_{k-1} -> C_k join step
def connect_string(x, ms):
    x = list(map(lambda i: sorted(i.split(ms)), x))
    l = len(x[0])
    r = []
    for i in range(len(x)):
        for j in range(i, len(x)):
            if x[i][:l-1] == x[j][:l-1] and x[i][l-1] != x[j][l-1]:
                r.append(x[i][:l-1] + sorted([x[j][l-1], x[i][l-1]]))
    return r

# Search for association rules
def find_rule(d, support, confidence, ms=u'--'):
    result = pd.DataFrame(index=['support', 'confidence'])  # output DataFrame

    support_series = 1.0 * d.sum() / len(d)  # support of the single items
    column = list(support_series[support_series > support].index)  # preliminary filtering by support
    k = 0

    while len(column) > 1:
        k = k + 1
        print(u'\nRunning search number %s...' % k)
        column = connect_string(column, ms)
        print(u'Number of candidates: %s...' % len(column))
        sf = lambda i: d[i].prod(axis=1, numeric_only=True)  # support computation for the new batch

        # Build the candidate columns. This step is the most time- and
        # memory-intensive; for large datasets, consider parallelizing it.
        d_2 = pd.DataFrame(list(map(sf, column)), index=[ms.join(i) for i in column]).T

        support_series_2 = 1.0 * d_2[[ms.join(i) for i in column]].sum() / len(d)  # support of the joined itemsets
        column = list(support_series_2[support_series_2 > support].index)  # new round of support filtering
        # Note: Series.append was removed in pandas 2.0; pd.concat is the modern equivalent.
        support_series = support_series.append(support_series_2)

        column2 = []
        for i in column:  # enumerate possible rules: is {A,B,C} A+B-->C, B+C-->A, or C+A-->B?
            i = i.split(ms)
            for j in range(len(i)):
                column2.append(i[:j] + i[j+1:] + i[j:j+1])

        cofidence_series = pd.Series(index=[ms.join(i) for i in column2])  # confidence series

        for i in column2:  # compute the confidence of each rule
            cofidence_series[ms.join(i)] = support_series[ms.join(sorted(i))] / support_series[ms.join(i[:len(i)-1])]

        for i in cofidence_series[cofidence_series > confidence].index:  # confidence filtering
            result[i] = 0.0
            result[i]['confidence'] = cofidence_series[i]
            result[i]['support'] = support_series[ms.join(sorted(i.split(ms)))]

    result = result.T.sort_values(['confidence', 'support'], ascending=False)  # sort the results for output
    print(u'\nResults:')
    print(result)
    return result
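For completeness, a hypothetical driver showing how the function might be called end to end. The file name courses.csv and the loading step are my assumptions; the original post reads the data from course materials that are not included here:

# Hypothetical usage sketch: load the one-hot course data and mine the rules.
data = pd.read_csv('courses.csv', index_col=0).astype(float)
find_rule(data, support=0.2, confidence=0.5, ms='-->')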
The data and source code come from the course materials for "Data Mining and Analysis"; this article is just my notes from studying them.
