An introduction to the principle of the Apriori algorithm
The algorithm consists of two main steps:
1. Search for the frequent itemsets
2. Generate the association rules
Core formulas:
Support(A⇒B) = P(A∪B)
Confidence(A⇒B) = P(B|A) = Support(A∪B) / Support(A)
Let's take a look at the processed data.
   Java  PHP  Python crawler  Spark  Data analysis  Machine learning
0   1.0  1.0             1.0    1.0            1.0               1.0
1   1.0  1.0             0.0    1.0            0.0               1.0
2   0.0  0.0             1.0    0.0            0.0               0.0
3   0.0  0.0             1.0    0.0            1.0               0.0
4   0.0  0.0             1.0    0.0            1.0               1.0
5   0.0  0.0             0.0    0.0            1.0               0.0
6   0.0  0.0             1.0    0.0            0.0               1.0
7   1.0  0.0             1.0    1.0            0.0               0.0
8   1.0  1.0             0.0    0.0            1.0               0.0
9   0.0  0.0             1.0    0.0            0.0               1.0
Each row represents a student; 1.0 means the student purchased that course, and 0.0 means they did not.
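For reference, here is a minimal sketch that rebuilds this table as a pandas DataFrame; the variable name data and the English column names are taken from the snippets and table above:

import pandas as pd

# 1.0 = course purchased, 0.0 = not purchased; one row per student
data = pd.DataFrame(
    [[1, 1, 1, 1, 1, 1],
     [1, 1, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 0],
     [0, 0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0, 0],
     [1, 1, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 1]],
    columns=['Java', 'PHP', 'Python crawler', 'Spark',
             'Data analysis', 'Machine learning'], dtype=float)

# Checking the core formulas on this data:
# Support(PHP⇒Java) = P(PHP∪Java) = 3/10 = 0.3
# Confidence(PHP⇒Java) = Support(PHP∪Java) / Support(PHP) = 0.3 / 0.3 = 1.0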
First, define the join function:

import pandas as pd

# Custom join function implementing the L_{k-1} -> C_k join step
def connect_string(x, ms):
    x = list(map(lambda i: sorted(i.split(ms)), x))
    l = len(x[0])
    r = []
    for i in range(len(x)):
        for j in range(i, len(x)):
            if x[i][:l-1] == x[j][:l-1] and x[i][l-1] != x[j][l-1]:
                r.append(x[i][:l-1] + sorted([x[j][l-1], x[i][l-1]]))
    return r
The parameter x is L_{k-1}, the set of all frequent (k-1)-itemsets; ms is the delimiter used to join item names, here '--'.
The idea of the loop: itemsets are compared pairwise, and if the first k-1 items of two itemsets are identical but their k-th items differ, the two are spliced together to form a candidate itemset. The candidates together form C_k.
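A quick illustration of the join, using the connect_string defined above on three single-item sets (the inputs are just column names picked from the data):

>>> connect_string(['Java', 'PHP', 'Python crawler'], u'--')
[['Java', 'PHP'], ['Java', 'Python crawler'], ['PHP', 'Python crawler']]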
Next, define the rule-search function. d is the data, support is the support threshold, and confidence is the confidence threshold.

def find_rule(d, support, confidence, ms=u'--'):
    support_series = 1.0 * d.sum() / len(d)  # support series of the single items
>>> support_series
Java                0.4
PHP                 0.3
Python crawler      0.7
Spark               0.3
Data analysis       0.5
Machine learning    0.5
dtype: float64
    column = list(support_series[support_series > support].index)  # preliminary filtering by support

>>> column
['Java', 'Python crawler', 'Data analysis', 'Machine learning']  # here the support threshold is set to 0.3
    while len(column) > 1:
        column = connect_string(column, ms)  # join step
        sf = lambda i: d[i].prod(axis=1, numeric_only=True)  # row-wise product function
        d_2 = pd.DataFrame(list(map(sf, column)), index=[ms.join(i) for i in column]).T  # build the joined data
>>> column
[['Java', 'Python crawler'], ['Java', 'Data analysis'], ['Java', 'Machine learning'], ['Python crawler', 'Data analysis'], ['Python crawler', 'Machine learning'], ['Data analysis', 'Machine learning']]

>>> d_2
   Java--Python crawler  Java--Data analysis  Java--Machine learning  \
0                   1.0                  1.0                     1.0
1                   0.0                  0.0                     1.0
2                   0.0                  0.0                     0.0
3                   0.0                  0.0                     0.0
4                   0.0                  0.0                     0.0
5                   0.0                  0.0                     0.0
6                   0.0                  0.0                     0.0
7                   1.0                  0.0                     0.0
8                   0.0                  1.0                     0.0
9                   0.0                  0.0                     0.0

   Python crawler--Data analysis  Python crawler--Machine learning  \
0                            1.0                                1.0
1                            0.0                                0.0
2                            0.0                                0.0
3                            1.0                                0.0
4                            1.0                                1.0
5                            0.0                                0.0
6                            0.0                                1.0
7                            0.0                                0.0
8                            0.0                                0.0
9                            0.0                                1.0
...
DataFrame.prod() multiplies the values in the frame; with axis=1 the product is taken across each row, which makes it easy to test whether a transaction contains every item of an itemset.
For example:
>>> data[['Java', 'Python crawler']].prod(axis=1, numeric_only=True)
0 1.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 1.0
8 0.0
9 0.0
dtype: float64
        support_series_2 = 1.0 * d_2[[ms.join(i) for i in column]].sum() / len(d)  # compute the support after joining
        column = list(support_series_2[support_series_2 > support].index)  # new round of support filtering
        support_series = support_series.append(support_series_2)  # append the new frequent itemsets to the support series
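Computed from the data table above, the pair supports at this point come out as follows (a reconstruction from the data, not output reproduced from the source):

>>> support_series_2
Java--Python crawler                0.2
Java--Data analysis                 0.2
Java--Machine learning              0.2
Python crawler--Data analysis       0.3
Python crawler--Machine learning    0.4
Data analysis--Machine learning     0.2
dtype: float64

With the 0.3 threshold used above, only 'Python crawler--Machine learning' survives this round of filtering.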
Generating the association rules:
        column2 = []
        for i in column:  # enumerate the possible rules
            i = i.split(ms)
            for j in range(len(i)):
                column2.append(i[:j] + i[j + 1:] + i[j:j + 1])
        cofidence_series = pd.Series(index=[ms.join(i) for i in column2])  # define the confidence series
        for i in column2:  # compute the confidence of each rule
            cofidence_series[ms.join(i)] = support_series[ms.join(sorted(i))] / support_series[ms.join(i[:len(i) - 1])]
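In each rearrangement the last item plays the role of the consequent. For instance, for the frequent pair Python crawler--Machine learning, column2 would contain both orderings (an illustration of the enumeration above):

>>> column2
[['Machine learning', 'Python crawler'], ['Python crawler', 'Machine learning']]

The first entry encodes the rule Machine learning ⇒ Python crawler, the second Python crawler ⇒ Machine learning.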
Confidence filtering:

    result = pd.DataFrame(index=['support', 'confidence'])  # define the output
    for i in cofidence_series[cofidence_series > confidence].index:  # confidence filtering
        result[i] = 0.0
        result[i]['confidence'] = cofidence_series[i]
        result[i]['support'] = support_series[ms.join(sorted(i.split(ms)))]
    result = result.T.sort_values(['confidence', 'support'], ascending=False)  # sort the result for output
    print(result)
    return result
Example:
spt = 0.2
cfd = 0.5
find_rule(data, spt, cfd, u'-->')
Results:

                                    support  confidence
PHP-->Java                              0.3    1.000000
Spark-->Java                            0.3    1.000000
Machine learning-->Python crawler       0.4    0.800000
Java-->PHP                              0.3    0.750000
Java-->Spark                            0.3    0.750000
Data analysis-->Python crawler          0.3    0.600000
Python crawler-->Machine learning       0.4    0.571429

Reading the first row: every student who bought PHP also bought Java (confidence 1.0), and 30% of all students bought both courses (support 0.3).
Full code:
#-*-Coding:utf-8-*-from __future__ import print_function import pandas as PD # Custom connection function, for implementing L_{K-1} to C_k connection Def Conne Ct_string (x, ms): x = List (map (Lambda i:sorted (I.split (MS), x)) L = Len (x[0)) R = [] for i in range (Len (
x): For j in range (I, Len (x)): if x[i][:l-1] = = X[j][:l-1] and x[i][l-1]!= X[j][l-1]: R.append (X[i][:l-1] + sorted ([X[j][l-1], x[i][l-1])) return R # Find the function def Find_rule for association rules (d, Support, Confidence, Ms=u '--'): result = PD. Dataframe (index=[' support ', ' confidence ']) # define OUTPUT Support_series = 1.0 * D.sum ()/Len (d) # support degree sequence column = Li St (Support_series[support_series > Support].index) # preliminary filtering k = 0 while len (column) > 1:k = k based on support degree + 1 print (U ' \ n is doing the first%s search ... '% k) column = connect_string (column, ms) print (U ' number:%s ... '% len (colu MN)) SF = Lambda I:d[i].prod (Axis=1, Numeric_only=none) # New batch of support COMPUTE function # Create connection data, this step is time-consuming, consumingSave the most serious.
When the dataset is large, parallel operation optimization can be considered. D_2 = PD. Dataframe (Map (SF, column), Index=[ms.join (i) for I in column]). T support_series_2 = 1.0 * D_2[[ms.join (i) for I in Column]].sum ()/Len (d) # COMPUTE the support column = list after connection (su Pport_series_2[support_series_2 > Support].index) # New round of support screening support_series = Support_series.append (support_se
ries_2) Column2 = [] for i in column: # traverse possible inference, such as whether {a,b,c} is a+b-->c or B+c-->a or c+a-->b.
i = I.split (ms) for J in range (Len (i)): Column2.append (I[:j] + i[j + 1:] + i[j:j + 1]) Cofidence_series = PD. Series (Index=[ms.join (i) for I in Column2]) # define confidence sequence for I in Column2: # Calculate confidence Sequence cofidence_series[ Ms.join (i)] = Support_series[ms.join (sorted (i))]/Support_series[ms.join (I[:len (i)-1])] for I in Cofidence_ser Ies[cofidence_series > Confidence].index: # confidence Screening Result[i] = 0.0 result[i][' confidence '] = Co Fidence_Series[i] result[i][' support ' = Support_series[ms.join (sorted (ms))] result = result. T.sort_values ([' confidence ', ' support '], ascending=false) # result collation, output print (U ' \ n Result: ') print (results) return re
Sult
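One portability note (a fact about pandas itself, not from the source): Series.append was removed in pandas 2.0, so on a recent pandas the accumulation line would need to become

support_series = pd.concat([support_series, support_series_2])

and the chained assignments such as result[i]['confidence'] = ... are more safely written as result.loc['confidence', i] = ....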
The data and source code come from a handout of the "Data Mining and Analysis" course; this article is only my notes taken while studying it.