Data mining is a technique for analyzing large amounts of data and deducing patterns from them. It has a wide range of applications, such as friend recommendations on social networks and product recommendations on shopping websites. Many data mining algorithms have been developed so far. Among them, Apriori is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. This article implements the Apriori algorithm with a single minimum support threshold in C.
Before starting, here is a brief introduction to the related concepts. Let I = {i1, i2, ..., im} be a set of items and T = {t1, t2, ..., tn} a set of transactions. Every transaction t is a set of items, and t is a subset of I. An association rule has the form X -> Y, where X and Y are both subsets of I and the intersection of X and Y is empty. The number of transactions in T that contain X is called the count of X (X.count). The support of a rule is sup = (X ∪ Y).count / n, where n is the number of transactions in T. Its confidence is conf = (X ∪ Y).count / X.count. The support measures how often the rule appears in T, and the confidence indicates how predictable the rule is. minsup and minconf are the minimum thresholds specified by the user. A frequent itemset is an itemset whose support is at least minsup. Finding the qualifying rules efficiently is a basic problem in data mining. If we enumerated every possible rule over I and then filtered them, the number of rules to check would grow exponentially with the size of I, which is clearly impractical. Apriori greatly reduces the computation through a pruning strategy.
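The two formulas above can be sketched in C as follows. This is only an illustration under stated assumptions: each item is a single character, each transaction is a string of items, and the function names (contains_all, union_count, rule_support, rule_confidence) are hypothetical and not part of the implementation later in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if every item of `set` occurs in transaction `t`. */
bool contains_all(const char *t, const char *set)
{
    for (; *set != '\0'; set++)
        if (strchr(t, *set) == NULL)
            return false;
    return true;
}

/* (X union Y).count: number of transactions containing every item of x and of y. */
int union_count(const char *tran[], int n, const char *x, const char *y)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (contains_all(tran[i], x) && contains_all(tran[i], y))
            count++;
    return count;
}

/* sup(X -> Y) = (X union Y).count / n */
double rule_support(const char *tran[], int n, const char *x, const char *y)
{
    assert(n > 0);
    return (double)union_count(tran, n, x, y) / n;
}

/* conf(X -> Y) = (X union Y).count / X.count; defined as 0 here when X never occurs. */
double rule_confidence(const char *tran[], int n, const char *x, const char *y)
{
    int x_count = union_count(tran, n, x, "");  /* an empty Y leaves only X.count */
    if (x_count == 0)
        return 0.0;
    return (double)union_count(tran, n, x, y) / x_count;
}
```

For the four transactions "abc", "abd", "ac" and "bd", the rule a -> b has support 2/4 = 0.5 and confidence 2/3, because two transactions contain both a and b while three contain a.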
The Apriori algorithm can be roughly divided into two steps: 1. generate all frequent itemsets, and 2. generate all confident association rules from the frequent itemsets (that is, rules whose confidence is at least minconf). Apriori generates the frequent itemsets based on the downward closure property: if an itemset meets the minimum support requirement, then all of its non-empty subsets meet it as well. Conversely, an itemset with an infrequent subset cannot be frequent, which is what allows candidates to be pruned and simplifies the processing. The algorithm assumes that the items in each itemset are sorted alphabetically. It searches level by level: it first generates the frequent 1-itemsets, then generates the frequent 2-itemsets from the frequent 1-itemsets, and so on, up to the frequent k-itemsets. The pseudocode of the Apriori algorithm for generating the frequent k-itemsets is given below:
Algorithm Apriori(T)
C1 <-- init-pass(T); // first pass over the transactions
F1 <-- {f | f belongs to C1, f.count/n >= minsup};
for (k = 2; Fk-1 is not empty; k++) do
    Ck <-- candidate-gen(Fk-1); // candidate k-itemset generation, in two steps: join and prune
    for each transaction t in T do
        for each candidate c in Ck do
            if c is contained in t then
                c.count++;
        endfor
    endfor
    Fk <-- {c in Ck | c.count/n >= minsup}
endfor
return the union of all Fk;
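The counting and filtering part of the loop can be sketched compactly in C. The sketch below rests on an assumption that the rest of this article does not make: each item is one bit of an unsigned mask, so "c is contained in t" becomes a single AND. The name filter_frequent is hypothetical, and the two loops are swapped relative to the pseudocode (candidates outside, transactions inside) so that no separate count array is needed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption of this sketch: item j is bit j of an unsigned mask, so an
 * itemset or a transaction is one unsigned value.
 * Counts every candidate in cand[] over the transactions and moves the
 * frequent ones to the front of cand[]; returns how many were kept.
 */
size_t filter_frequent(const unsigned tran[], size_t n_tran,
                       unsigned cand[], size_t n_cand, double min_sup)
{
    assert(n_tran > 0);
    size_t kept = 0;
    for (size_t c = 0; c < n_cand; c++) {
        size_t count = 0;
        for (size_t t = 0; t < n_tran; t++)
            if ((tran[t] & cand[c]) == cand[c])  /* c is contained in t */
                count++;
        if ((double)count / n_tran >= min_sup)   /* c.count/n >= minsup */
            cand[kept++] = cand[c];
    }
    return kept;
}
```

With a = bit 0, b = bit 1, c = bit 2 and d = bit 3, the transactions {a,b,c}, {a,b,d}, {a,c}, {b,d} are the masks 7, 11, 5 and 10. Of the six candidate 2-itemsets, only {a,b}, {a,c} and {b,d} reach a support of 0.5.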
When the frequent k-itemsets are generated from the frequent (k-1)-itemsets, two conditions are checked for every candidate k-itemset: every (k-1)-item subset of the candidate must be in Fk-1, and the support of the candidate must reach minsup. If either condition is not met, the candidate is removed. The pseudocode of candidate-gen and init-pass is not given in detail here. In this implementation, init-pass also generates the frequent 1-itemsets. The code is as follows (a test main function is not included):
#include <stdbool.h>
#include <string.h>

#define MAX_ITEMS 16   /* maximum number of distinct items */
#define MAX_SETS  256  /* maximum number of itemsets kept per level */

/* An itemset or a transaction is a NUL-terminated string of distinct item
   characters in ascending (alphabetical) order. */
typedef char itemset_t[MAX_ITEMS + 1];

/* Return the fraction of the n_tran transactions that contain every item of set. */
static float support(const char *set, itemset_t tran[], int n_tran)
{
    int count = 0;
    for (int j = 0; j < n_tran; j++) {
        bool all = true;
        for (const char *p = set; *p != '\0' && all; p++)
            all = strchr(tran[j], *p) != NULL;
        if (all)
            count++;
    }
    return (float)count / n_tran;
}

/* First scan of the transactions: store the frequent 1-itemsets in res and
   return how many there are. item lists all items in ascending order. */
int init_pass(const char *item, itemset_t tran[], int n_tran, itemset_t res[], float min_sup)
{
    int number = 0;
    for (int i = 0; item[i] != '\0'; i++) {
        char single[2] = { item[i], '\0' };
        if (support(single, tran, n_tran) >= min_sup)
            strcpy(res[number++], single);
    }
    return number;
}

/* Return true if every (k-1)-item subset of the k-itemset cand is in prev. */
bool judge(const char *cand, int k, itemset_t prev[], int n_prev)
{
    for (int skip = 0; skip < k; skip++) {
        itemset_t sub;
        int len = 0;
        for (int j = 0; j < k; j++)   /* copy cand without the item at position skip */
            if (j != skip)
                sub[len++] = cand[j];
        sub[len] = '\0';
        bool found = false;
        for (int p = 0; p < n_prev && !found; p++)
            found = strcmp(sub, prev[p]) == 0;
        if (!found)
            return false;
    }
    return true;
}

/* Generate the candidate k-itemsets from the frequent (k-1)-itemsets in prev,
   store them in cand and return how many were generated. */
int candidate_gen(itemset_t prev[], int n_prev, int k, itemset_t cand[])
{
    int number = 0;
    for (int i = 0; i < n_prev; i++)
        for (int j = i + 1; j < n_prev; j++) {
            /* Join step: the two sets must agree on their first k-2 items. */
            if (strncmp(prev[i], prev[j], (size_t)(k - 2)) != 0)
                continue;
            itemset_t merged;
            strcpy(merged, prev[i]);
            merged[k - 1] = prev[j][k - 2];   /* prev is sorted, so this item is the largest */
            merged[k] = '\0';
            /* Prune step: drop candidates that have an infrequent subset. */
            if (number < MAX_SETS && judge(merged, k, prev, n_prev))
                strcpy(cand[number++], merged);
        }
    return number;
}

/* Apriori: store all frequent itemsets in res and return how many were found.
   res must be large enough to hold every frequent itemset. */
int apriori(const char *item, itemset_t tran[], int n_tran, itemset_t res[], float min_sup)
{
    itemset_t prev[MAX_SETS], cand[MAX_SETS];
    int total = 0;
    int n_prev = init_pass(item, tran, n_tran, prev, min_sup);
    for (int k = 2; n_prev > 0; k++) {
        for (int i = 0; i < n_prev; i++)   /* F(k-1) is final: copy it to the result */
            strcpy(res[total++], prev[i]);
        int n_cand = candidate_gen(prev, n_prev, k, cand);
        n_prev = 0;
        for (int i = 0; i < n_cand; i++)   /* keep the candidates that reach min_sup */
            if (support(cand[i], tran, n_tran) >= min_sup)
                strcpy(prev[n_prev++], cand[i]);
    }
    return total;
}
You can write a test main function yourself. Generating association rules from the frequent k-itemsets is relatively simple: for each frequent itemset, compute the confidence of every rule that can be formed from it, and keep the rules whose confidence reaches the minimum confidence.
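The rule generation step can be sketched as follows. This is one possible approach and not code from the implementation above: it assumes each item is one bit of an unsigned mask, and the names struct rule, count_of and gen_rules are hypothetical. For a frequent itemset f, it tries every non-empty proper subset X of f as the left-hand side, takes Y = f \ X as the right-hand side, and keeps the rule if f.count / X.count reaches the minimum confidence.

```c
#include <assert.h>
#include <stddef.h>

/* One rule X -> Y with its confidence; items are bits of a mask (assumption). */
struct rule {
    unsigned x, y;
    double conf;
};

/* Number of transactions that contain every item of `set`. */
size_t count_of(const unsigned tran[], size_t n_tran, unsigned set)
{
    size_t count = 0;
    for (size_t t = 0; t < n_tran; t++)
        if ((tran[t] & set) == set)
            count++;
    return count;
}

/* Writes up to max_out rules X -> (f \ X) with confidence >= min_conf into
   out[] and returns how many were written. */
size_t gen_rules(const unsigned tran[], size_t n_tran, unsigned f,
                 double min_conf, struct rule out[], size_t max_out)
{
    size_t n = 0;
    size_t f_count = count_of(tran, n_tran, f);
    assert(f_count > 0);  /* f is expected to be a frequent itemset */
    /* (x - 1) & f steps through the proper subsets of f in decreasing order. */
    for (unsigned x = (f - 1) & f; x != 0 && n < max_out; x = (x - 1) & f) {
        /* X is a subset of f, so X.count >= f.count > 0 and the division is safe. */
        double conf = (double)f_count / (double)count_of(tran, n_tran, x);
        if (conf >= min_conf) {
            out[n].x = x;
            out[n].y = f & ~x;
            out[n].conf = conf;
            n++;
        }
    }
    return n;
}
```

With a = bit 0, b = bit 1, c = bit 2 and d = bit 3, and the transactions {a,b,c}, {a,b,d}, {a,c}, {b,d} (masks 7, 11, 5, 10), the frequent itemset {a,c} yields c -> a with confidence 2/2 = 1 and a -> c with confidence 2/3, so only c -> a survives a minimum confidence of 0.8.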