C implementation of frequent itemset generation in the Apriori algorithm

Source: Internet
Author: User

Data mining is a technology for analyzing large amounts of data and discovering the patterns in it. It has a wide range of applications, such as friend recommendation on social networks and product recommendation on shopping websites. Many data mining algorithms have been developed so far. Among them, Apriori is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. This article uses C to implement the Apriori algorithm with a single minimum support threshold.

Before starting, here is a brief introduction to the related concepts. Let I = {I1, I2, I3, ..., Im} be a set of items and T = {T1, T2, ..., Tn} be a set of transactions. Each transaction t is a set of items, with t ⊆ I. An association rule has the form X -> Y, where X and Y are both subsets of I and the intersection of X and Y is empty. The number of transactions in T that contain X is called the count of X (X.count). The support of a rule is sup = (X ∪ Y).count / n, where n is the number of transactions in T. The confidence of a rule is conf = (X ∪ Y).count / X.count. Support measures how often the rule applies in T, and confidence indicates how predictable the rule is. minsup and minconf are the minimum thresholds specified by the user. A frequent itemset is an itemset whose support is no lower than minsup. How to find the qualifying rules efficiently is a basic problem in data mining. If we enumerated every possible rule over I and then checked each one, the work would grow exponentially with the size of I, which is clearly impractical. Apriori uses a pruning strategy to greatly reduce the amount of computation.

The Apriori algorithm can be roughly divided into two steps: 1. generate all frequent itemsets, and 2. generate all confident association rules from the frequent itemsets (that is, rules whose confidence is no lower than minconf). Apriori generates the frequent itemsets based on the downward closure property: if an itemset meets the minimum support requirement, all of its non-empty subsets meet it as well. This greatly simplifies the processing, because a candidate with an infrequent subset can be discarded without counting it. The algorithm assumes that the items within each itemset are kept in lexicographic order. It searches level by level: it first generates the frequent 1-itemsets, then generates the frequent 2-itemsets from the frequent 1-itemsets, and so on, until the frequent k-itemsets are generated. The pseudocode of the Apriori algorithm that generates the frequent k-itemsets is given below:

Algorithm Apriori(T)
    C1 <-- init-pass(T);                              // first pass over the transactions
    F1 <-- {f | f belongs to C1, f.count/n >= minsup};   // n is the number of transactions in T
    for (k = 2; F(k-1) is not empty; k++) do
        Ck <-- candidate-gen(F(k-1));                 // generate the candidate k-itemsets in two steps: join, prune
        for each transaction t in T do
            for each candidate c in Ck do
                if c is contained in t then
                    c.count++;
            endfor
        endfor
        Fk <-- {c in Ck | c.count/n >= minsup}
    end
    return the union of all Fk

When the frequent k-itemsets are generated from the frequent (k-1)-itemsets, two conditions are checked for each candidate k-itemset: every (k-1)-item subset of the candidate must be in F(k-1), and the support of the candidate must reach minsup. If either condition is not met, the candidate is removed. The pseudocode does not spell out candidate-gen and init-pass; both are implemented in the C code, and init_pass produces the frequent 1-itemsets directly. The code is as follows (a main function for testing is not included):

#include <stdbool.h>
#include <string.h>

#define LEN    16   // maximum items per transaction or itemset, including the terminating '\0'
#define LENGTH 64   // maximum number of itemsets kept per level and in the result

// Items are single characters. Transactions and itemsets are '\0'-terminated
// strings, and the items of every itemset are kept in ascending order.

// Return true if transaction tran contains every item of itemset set.
static bool contains(const char *tran, const char *set)
{
    for (; *set != '\0'; set++)
        if (strchr(tran, *set) == NULL)
            return false;
    return true;
}

// Perform the first scan of the transactions, store the frequent 1-itemsets
// in res_item and return how many there are. item lists the distinct items
// in ascending order.
int init_pass(const char *item, char tran[][LEN], int n_tran,
              char res_item[][LEN], float min_sup)
{
    int number = 0;
    for (int i = 0; item[i] != '\0'; i++) {
        int count = 0;
        for (int j = 0; j < n_tran; j++)
            if (strchr(tran[j], item[i]) != NULL)
                count++;
        if ((float)count / n_tran >= min_sup) {
            res_item[number][0] = item[i];
            res_item[number][1] = '\0';
            number++;
        }
    }
    return number;
}

// Prune step: return true if every (k-1)-item subset of the k-item
// candidate srcstr is among the n frequent (k-1)-itemsets in desstr.
bool judge(const char *srcstr, char desstr[][LEN], int n, int k)
{
    char temp[LEN];
    for (int i = 0; i < k; i++) {          // build the subset without item i
        int m = 0;
        for (int j = 0; j < k; j++)
            if (j != i)
                temp[m++] = srcstr[j];
        temp[m] = '\0';
        bool found = false;
        for (int p = 0; p < n && !found; p++)
            found = strcmp(temp, desstr[p]) == 0;
        if (!found)
            return false;
    }
    return true;
}

// Generate the candidate k-itemsets from the n frequent (k-1)-itemsets in
// ktran, store them in kktran and return how many there are.
int candidate_gen(char ktran[][LEN], int n, int k, char kktran[][LEN])
{
    int number = 0;
    if (k >= LEN)                           // a k-itemset would not fit in LEN chars
        return 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            // Join step: the two itemsets must share their first k-2 items.
            if (strncmp(ktran[i], ktran[j], (size_t)(k - 2)) != 0)
                continue;
            char ktemp[LEN];
            strcpy(ktemp, ktran[i]);
            ktemp[k - 1] = ktran[j][k - 2]; // ktran is sorted, so this item is larger
            ktemp[k] = '\0';
            if (number < LENGTH && judge(ktemp, ktran, n, k))
                strcpy(kktran[number++], ktemp);
        }
    }
    return number;
}

// Apriori algorithm: store all frequent itemsets in res_tran (room for
// LENGTH entries) and return how many there are.
int apriori(const char *item, char tran[][LEN], int n_tran,
            char res_tran[][LEN], float min_sup)
{
    char fprev[LENGTH][LEN], cand[LENGTH][LEN];
    int total = 0;
    int number = init_pass(item, tran, n_tran, fprev, min_sup);
    for (int k = 2; number != 0; k++) {
        for (int i = 0; i < number && total < LENGTH; i++)
            strcpy(res_tran[total++], fprev[i]);     // save F(k-1)
        int n_cand = candidate_gen(fprev, number, k, cand);
        number = 0;
        for (int i = 0; i < n_cand; i++) {
            // Count the transactions that contain candidate i.
            int count = 0;
            for (int j = 0; j < n_tran; j++)
                if (contains(tran[j], cand[i]))
                    count++;
            if ((float)count / n_tran >= min_sup)
                strcpy(fprev[number++], cand[i]);    // candidate i joins Fk
        }
    }
    return total;
}

You can write a main function to test the code yourself. Generating association rules from the frequent k-itemsets is relatively simple: for each frequent itemset, compute the confidence of the rules it yields and keep the rules whose confidence is no lower than minconf.
