Introduction to the Apriori algorithm:
Presumably we all know the principle of the Apriori algorithm, the most famous association rule mining method, proposed by R. Agrawal.
The basic idea of the Apriori algorithm
The basic idea of the Apriori algorithm is to compute the support of itemsets through multiple scans of the database, find the frequent itemsets, and generate association rules from them. The first scan yields the set of frequent 1-itemsets L_1. For k > 1, the k-th scan first uses the result of the (k-1)-th scan to produce the set of candidate k-itemsets C_k, then determines the support of each candidate during the scan, and finally computes the set of frequent k-itemsets L_k at the end of the scan. The algorithm ends when the set of candidate k-itemsets C_k is empty.
The process of generating frequent itemsets by the Apriori algorithm
Generating frequent itemsets is mainly divided into two steps: joining (connection) and pruning.
(1) Join step. To find L_k (k >= 2), a set of candidate k-itemsets C_k is generated by joining L_(k-1) with itself. Let l1 and l2 be itemsets in L_(k-1), and let l_i[j] denote the j-th item of l_i. The Apriori algorithm assumes that the items in a transaction or itemset are sorted in lexicographic order, so that for a (k-1)-itemset l_i the items are ordered as l_i[1] < l_i[2] < ... < l_i[k-1]. If the first (k-2) items of l1 and l2 are equal, then l1 and l2 can be joined. That is, if (l1[1] = l2[1]) and (l1[2] = l2[2]) and ... and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]), then l1 and l2 can be joined. The condition l1[k-1] < l2[k-1] guarantees that no duplicates are produced, and searching for frequent itemsets in order avoids searching and counting itemsets that cannot occur in the transaction database. The itemset produced by joining l1 and l2 is {l1[1], l1[2], ..., l1[k-1], l2[k-1]}.
(2) Pruning step. By the Apriori property, every subset of a frequent k-itemset must itself be a frequent itemset. The candidate set C_k produced by the join therefore needs to be validated, removing the infrequent k-itemsets that do not satisfy the minimum support.
The main steps of the Apriori algorithm
(1) Scan all the data to produce the set of candidate 1-itemsets C_1.
(2) According to the minimum support, generate the set of frequent 1-itemsets L_1 from the set of candidate 1-itemsets C_1.
(3) Starting with k = 1, repeat steps (4), (5) and (6).
(4) Generate the set of candidate (k+1)-itemsets C_(k+1) from L_k by performing the join and prune operations.
(5) According to the minimum support, generate the set of frequent (k+1)-itemsets L_(k+1) from the set of candidate (k+1)-itemsets C_(k+1).
(6) If L_(k+1) is not empty, set k = k+1 and jump to step (4); otherwise, go to step (7).
(7) According to the minimum confidence, generate strong association rules from the frequent itemsets, and end.
Apriori algorithm description
Input: database D, minimum support threshold min_sup.
Output: the frequent itemsets L in D.
(1) begin
(2) L_1 = frequent 1-itemsets;
(3) for (k = 2; L_(k-1) is not empty; k++) do begin
(4) C_k = apriori_gen(L_(k-1)); {call apriori_gen() to produce the candidate k-itemsets from the frequent (k-1)-itemsets}
(5) for all transactions t in D do begin {scan D for counting}
(6) C_t = subset(C_k, t); {use subset to find all candidates contained in transaction t}
(7) for all candidates c in C_t do
(8) c.count++;
(9) end
(10) L_k = {c in C_k | c.count >= min_sup}
(11) end
(12) end
(13) return L = union of all L_k {the set of all frequent itemsets}
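The apriori_gen call in the pseudocode combines the join and prune steps described earlier. A minimal MATLAB sketch of that combination follows. The function name apriori_gen_sketch and the matrix layout (one sorted itemset per row, rows in lexicographic order) are assumptions made for illustration; they are not taken from the implementation presented later.

```matlab
function Ck = apriori_gen_sketch(Lprev)
% Lprev: rows are the frequent (k-1)-itemsets, each row sorted ascending
%        (assumed layout, for illustration only).
% Ck:    rows are the candidate k-itemsets that survive join and prune.
[numSets, km1] = size(Lprev);
Ck = zeros(0, km1 + 1);
for a = 1:numSets-1
    for b = a+1:numSets
        % Join condition: first k-2 items equal, last item of a < last item of b.
        samePrefix = isequal(Lprev(a, 1:km1-1), Lprev(b, 1:km1-1));
        if samePrefix && Lprev(a, km1) < Lprev(b, km1)
            cand = [Lprev(a, :), Lprev(b, km1)];
            % Prune: every (k-1)-subset of cand must be a row of Lprev.
            keep = true;
            for drop = 1:km1+1
                sub = cand;
                sub(drop) = [];
                if ~ismember(sub, Lprev, 'rows')
                    keep = false;
                    break;
                end
            end
            if keep
                Ck(end+1, :) = cand;
            end
        end
    end
end
end
```

For example, with the made-up input [1 2; 1 3; 1 5; 2 3; 2 5], the join proposes {1,2,3}, {1,2,5}, {1,3,5} and {2,3,5}; the last two are pruned because {3,5} is not frequent, leaving [1 2 3; 1 2 5].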
The process of implementing Apriori algorithm in MATLAB:
Algorithm one: after scanning the database to generate a Boolean matrix, all operations are carried out on this Boolean matrix. The Apriori join and prune steps produce the candidate sets, and each candidate itemset is then turned into a candidate itemset vector. For instance, suppose the transaction database has 5 items and 5 transactions:
Transaction matrix T
This is the generated Boolean matrix, where T represents a transaction and I represents an item (attribute). How is the transaction matrix scanned? In this algorithm, a candidate itemset vector is matched against each row of the transaction matrix. For example, if the candidate 2-itemset consists of the second and third items, its candidate 2-itemset vector is s = [0 1 1 0 0]. To find the support of the candidate, take the inner product of s with each row of the transaction matrix, namely sum = s * T(i,:)', where i is the row number of T.
If sum == 2, the support count of this candidate is increased by 1.
Repeating this for every row gives the support of the candidate.
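The inner-product test can be tried on a small example. The transaction matrix of the original example is not reproduced here, so the matrix T below is a made-up 5-by-5 Boolean matrix used only for illustration; s is the candidate vector from the text.

```matlab
% Hypothetical 5-transaction, 5-item Boolean matrix (not the original one).
T = [1 1 1 0 0;
     0 1 1 0 1;
     1 0 1 1 0;
     0 1 1 1 0;
     1 1 0 0 1];
s = [0 1 1 0 0];             % candidate 2-itemset: items 2 and 3

count = 0;
for i = 1:size(T, 1)
    if s * T(i, :)' == 2     % inner product equals the itemset size
        count = count + 1;   % transaction i contains both items
    end
end
support = count / size(T, 1);

% The same count in one step: T * s' holds all inner products at once.
countVec = sum(T * s' == sum(s));
```

With this made-up matrix, rows 1, 2 and 4 contain both items, so count is 3 and the support is 0.6. The vectorized form is the one the implementation uses when it evaluates T*ss.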
Apriori.m
function Apriori(T, minsup)
M = size(T,1);            % number of transactions
N = size(T,2);            % number of attributes
C = cell(1,N);
stcount = sum(T)/M;       % support of every candidate 1-itemset
for r = 1:N
    C{r} = r;
end
L = C(stcount >= minsup); % keep the items whose support is >= minsup
disp(numel(L));
% Initialize counter
k = 1;                    % number of items in the current frequent itemsets
B = [];                   % frequent k-itemsets for k >= 2, one per column
BB = reshape(cell2mat(L), 1, numel(L)); % frequent 1-itemsets as a row vector
% Iterations: each pass of the while loop turns L{k} into L{k+1}
while ~isempty(L)
    C = {};
    u = 0;
    for r = 1:numel(L)
        for i = r:(numel(L)-1)
            x1 = L{r};
            x2 = L{i+1};
            if k == 1
                c = 0;
            else
                y1 = x1;
                y2 = x2;
                y1(k) = [];
                y2(k) = [];
                c = sum(y1 == y2); % number of equal items among the first k-1
            end
            if (c == k-1)          % the first k-1 items agree, so join
                new = x1;
                new(k+1) = x2(k);
                sub_set = subset(new); % all k-item subsets of the candidate
                len = length(sub_set);
                % check whether every one of these subsets is frequent
                p = 1;
                idx = 0;
                while (p && idx < len)
                    idx = idx + 1;
                    if k == 1
                        p = in(sub_set{idx}, BB);
                    else
                        p = in(sub_set{idx}, B);
                    end
                end
                if p               % every subset is frequent
                    u = u + 1;
                    C{u} = new;    % add the new itemset to the candidates
                end
            else
                break
            end
        end
    end
    L = {};
    w = 0;
    for r = 1:numel(C)
        ss = zeros(N,1);
        ss(C{r}) = 1;                 % candidate itemset vector
        sup = sum(T*ss == k+1)/M;     % rows whose inner product is k+1
        if sup >= minsup
            w = w + 1;
            L{w} = C{r};
        end
    end
    B = reshape(cell2mat(L), k+1, numel(L));
    disp(numel(L));
    clear C;
    k = k + 1;
end
end
boolematrix.m Converts the data into a Boolean matrix. Note that every row of the input data must contain the same number of entries.
function [B] = boolematrix(A)
% A: one transaction per row, each entry an item number
% B: Boolean matrix with B(i,j) = 1 when transaction i contains item j
m = size(A,1);
n = size(A,2);
B = zeros(m,119); % 119 columns are hard-coded for the data set used here
for i = 1:m
    for j = 1:n
        B(i, A(i,j)) = 1;
    end
end
end
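The conversion can also be written without the hard-coded column count. The sketch below is an alternative formulation, not the code from the original listing; the input matrix A is made up, and the assumption is that item numbers run from 1 to max(A(:)).

```matlab
% Hypothetical input: each row is a transaction listing item numbers.
A = [1 3 4;
     2 3 5;
     1 2 3];
m = size(A, 1);
numItems = max(A(:));                  % assumption: items are numbered 1..max
B = zeros(m, numItems);
rows = repmat((1:m)', 1, size(A, 2));  % row index for every entry of A
B(sub2ind(size(B), rows(:), A(:))) = 1;
```

Here B comes out as [1 0 1 1 0; 0 1 1 0 1; 1 1 1 0 0], one row per transaction. Sizing B from the data avoids having to change the constant when a different data set is loaded.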
in.m Determines whether a subset of a candidate itemset is a frequent itemset
function [re] = in(a, b)
% a: itemset to look for (row vector)
% b: frequent itemsets, one per column
re = 0;
b = b';
if any(all(b == repmat(a, size(b,1), 1), 2))
    re = 1;
end
end
main.m The main function; run this file to start the program
function main()
load Mushroom.mat;
mushroomboolematrix = boolematrix(mushroom);
minsup = 0.2;
Apriori(mushroomboolematrix, minsup);
end
subset.m Finds the subsets of a candidate itemset
function [A] = subset(b)
% For a set containing k elements, generate all of its (k-1)-subsets:
% removing one element from the complete set gives one (k-1)-subset.
m = length(b);
A{1} = b(2:m);
for i = 2:m
    new = b;
    new(i) = [];
    A{i} = new;
end
end
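The subset generation and the frequency check performed by subset.m and in.m can also be expressed with built-in functions. The following is only a sketch of an equivalent formulation; the itemset b and the matrix Lprev of frequent itemsets (one per row) are made-up values.

```matlab
b = [2 3 5];                        % hypothetical candidate 3-itemset
subs = nchoosek(b, numel(b) - 1);   % each row is one 2-item subset of b
% nchoosek treats a scalar first argument as a count, so this form is
% only meant for itemsets with at least two items.

Lprev = [1 2; 2 3; 2 5; 3 5];       % hypothetical frequent 2-itemsets
allFrequent = all(ismember(subs, Lprev, 'rows'));
```

Here subs is [2 3; 2 5; 3 5] and allFrequent is true, so the candidate would survive the pruning step.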
Algorithm two: this algorithm differs from the previous one in how the support of a candidate is computed. The columns of the transaction matrix that correspond to the items of the candidate are extracted and combined with a logical AND. For example, for a candidate 2-itemset, take the two corresponding columns of the transaction matrix and AND them, namely [1 1 0 1 1] & [1 1 1 0 0] = [1 1 0 0 0], so the support count of this candidate 2-itemset is 2. The code is as follows:
Apriori.m The function that finds the frequent itemsets
function Apriori(T, minsup)
M = size(T,1); % number of transactions
N = size(T,2); % number of attributes in the data set
% Find frequent item sets of size 1 (all items with support >= minsup)
L = {};
for i = 1:N
    S = sum(T(:,i))/M;
    if S >= minsup
        L{end+1} = i;
    end
end
LL = L;
% Find frequent item sets of size >= 2
disp(numel(LL));
% Initialize counter
k = 1; % number of items in the current frequent itemsets
% Iterations: each pass of the while loop turns L{k} into L{k+1}
while ~isempty(LL)
    C = {};
    L = {};
    w = 0;
    for r = 1:numel(LL)
        for i = r:(numel(LL)-1)
            % count how many of the first k-1 items agree
            ecount = 0;
            for j = 1:(k-1)
                if (LL{r}(j) == LL{i+1}(j))
                    ecount = ecount + 1;
                else
                    break;
                end
            end
            if (ecount == (k-1))
                w = w + 1;
                new = LL{r};
                new(k+1) = LL{i+1}(k);
                C{w} = new;
            else
                break;
            end
        end
    end
    w = 0;
    for r = 1:numel(C)
        S = T(:, C{r});      % columns of the items in the candidate
        [~, x] = size(S);
        ss = ones(M,1);
        for i = 1:x
            ss = ss & (S(:,i)); % AND the columns together
        end
        sup = sum(ss)/M;
        if sup >= minsup
            w = w + 1;
            L{w} = C{r};
        end
    end
    LL = L;
    disp(numel(LL));
    % Increment counter
    k = k + 1;
end
end
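The column-AND support count that algorithm two relies on can be checked with the two column vectors from the description. This is a small sketch: col1 and col2 are the example vectors, and the two-column matrix T2 built from them is a stand-in for a real transaction matrix.

```matlab
col1 = logical([1 1 0 1 1])';   % column of the first item
col2 = logical([1 1 1 0 0])';   % column of the second item
both = col1 & col2;             % transactions containing both items
supCount = sum(both);

% The same count for any candidate, without an explicit loop:
T2 = [col1, col2];              % stand-in transaction matrix
cand = [1 2];                   % column indices of the candidate's items
supCountAll = sum(all(T2(:, cand), 2));
```

Both supCount and supCountAll equal 2, matching the count given in the description. The all(..., 2) form performs the same AND across however many columns the candidate has.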
boolematrix.m Generates the Boolean matrix
function [B] = boolematrix(A)
% A: one transaction per row, each entry an item number
% B: Boolean matrix with B(i,j) = 1 when transaction i contains item j
m = size(A,1);
n = size(A,2);
B = zeros(m, max(A(:))); % one column per item number
for i = 1:m
    for j = 1:n
        B(i, A(i,j)) = 1;
    end
end
end
main.m The main function; execute this function to run the program
% To see memory usage:
% profile on -memory
% myprog
% profile viewer
function main()
load Mushroom.mat;
mushroomboolematrix = boolematrix(mushroom);
minsup = 0.2;
Apriori(mushroomboolematrix, minsup);
end
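Neither listing carries out step (7) of the main steps, generating strong association rules according to the minimum confidence. A minimal sketch of that step for a single frequent 2-itemset is given below. The matrix T, the itemset and the threshold minconf are made-up values, and the confidence of a rule A => B is taken as support(A and B) / support(A).

```matlab
% Hypothetical Boolean transaction matrix (5 transactions, 5 items).
T = [1 1 1 0 0;
     0 1 1 0 1;
     1 0 1 1 0;
     0 1 1 1 0;
     1 1 0 0 1];
itemset = [2 3];     % hypothetical frequent 2-itemset
minconf = 0.7;       % hypothetical minimum confidence

supItemset = mean(all(T(:, itemset), 2));   % support of the whole itemset
rules = zeros(0, 3); % columns: antecedent item, consequent item, confidence
for i = 1:numel(itemset)
    consequent = itemset(i);
    antecedent = itemset;
    antecedent(i) = [];
    conf = supItemset / mean(all(T(:, antecedent), 2));
    if conf >= minconf
        rules(end+1, :) = [antecedent, consequent, conf];
    end
end
```

With these values both rules, item 3 => item 2 and item 2 => item 3, have confidence 0.75 and are kept. Larger itemsets would need every non-empty proper subset as a possible antecedent, which this sketch does not attempt.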