Introduction
Finding frequent item-sets and association rules is an important part of data mining, with wide applications in optimizing marketing strategies, improving the performance of recommendation, and detecting outliers. This report introduces some related concepts and the A-priori algorithm, which effectively discovers frequent item-sets by making one pass over the data set per iteration. Experiments are then conducted and analyzed. Finally, a short conclusion is drawn for further understanding of finding frequent item-sets and association rules.
Related Concepts
Let U = {u1, u2, ..., um} and I = {i1, i2, ..., in} be the sets of baskets and items, respectively. Here, m and n are the numbers of baskets and items. More notation related to frequent item-sets and association rules is listed in Table 1:
Given item-set I and item-set J, s_ij denotes the number of baskets that contain both item-set I and item-set J. s_min is the threshold, also called the critical value, that separates frequent item-sets from non-frequent item-sets. c_ij represents the probability that item-set J occurs given that item-set I already exists. int_ij represents the difference between c_ij and the frequency of item-set J, which is useful when the support of item-set I is much larger than that of item-set J. The formulas for c_ij and int_ij are defined as follows:

c_ij = s_ij / s_i,    int_ij = c_ij - s_j / m
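The two measures just defined can be computed directly from raw baskets. Below is a minimal Python sketch; the function names and the toy baskets are illustrative, not taken from the paper.

```python
def confidence(baskets, I, J):
    """c_ij = s_ij / s_i: the fraction of baskets containing item-set I
    that also contain item-set J."""
    I, J = set(I), set(J)
    s_i = sum(1 for b in baskets if I <= set(b))
    s_ij = sum(1 for b in baskets if (I | J) <= set(b))
    return s_ij / s_i

def interest(baskets, I, J):
    """int_ij = c_ij - s_j / m: confidence minus the overall frequency of J."""
    J = set(J)
    s_j = sum(1 for b in baskets if J <= set(b))
    return confidence(baskets, I, J) - s_j / len(baskets)

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread"}, {"milk", "butter"}]
print(confidence(baskets, {"milk"}, {"bread"}))  # 2 of the 3 milk baskets -> 0.666...
```

Note that the interest can be negative, which happens when J is actually less likely in baskets containing I than overall.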
A-priori
In practice, finding frequent item-sets is the most essential and challenging task, especially when the data set is too huge to fit in main memory. The A-priori algorithm was therefore proposed early on to improve the utilization of main memory and reduce the cost of scanning the data set. It exploits the monotonicity of frequent item-sets: if item-set I is a frequent item-set, then all of its subsets are also frequent item-sets. Conversely, if any subset of item-set I is not a frequent item-set, then item-set I is definitely not a frequent item-set.
First, the A-priori algorithm scans the data set once to find the single frequent item-sets. Owing to monotonicity, dual frequent item-sets can only be derived from the single frequent item-sets. Consequently, on the second scan we only need to consider dual item-sets formed by combining single frequent item-sets, checking each candidate's support against the given threshold s_min to decide whether it is a dual frequent item-set. Ternary and higher-order frequent item-sets are found in the same way. Figure 1 clarifies how A-priori works:
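The level-wise process described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the paper's repository; baskets are sets of items and s_min is an absolute support count.

```python
from collections import Counter
from itertools import combinations

def apriori(baskets, s_min):
    """Return {k: set of frequent k-item-sets} for absolute threshold s_min."""
    baskets = [frozenset(b) for b in baskets]
    # First scan: count single items to get the single frequent item-sets.
    counts = Counter(i for b in baskets for i in b)
    frequent = {1: {frozenset([i]) for i, c in counts.items() if c >= s_min}}
    k = 1
    while frequent[k]:
        # Candidate (k+1)-sets: combinations of items from frequent k-sets
        # whose every k-subset is frequent (the monotonicity prune).
        items = {i for s in frequent[k] for i in s}
        candidates = {frozenset(c) for c in combinations(sorted(items), k + 1)
                      if all(frozenset(sub) in frequent[k]
                             for sub in combinations(c, k))}
        # One more scan over the baskets to count candidate supports.
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        frequent[k + 1] = {c for c, n in counts.items() if n >= s_min}
        k += 1
    frequent.pop(k)  # drop the final, empty level
    return frequent
```

Each iteration of the loop corresponds to one scan of the data set, and the pruning step is exactly where monotonicity saves counting work.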
Experiment and Analysis
To demonstrate the performance of the A-priori algorithm, we implement it in Python and publish it on GitHub (https://www.github.com/quincy1994/apriori), where the code and some related data can be found. A visualization tool called igraph is employed to describe the distribution of frequent item-sets.
The data set from the "SMP CUP" is used to test the A-priori algorithm; it provides 1043 users of the CSDN platform and their related blogs. We collect the titles of each user's related blogs to establish different documents, and each document contains all the titles of the related blogs for that user. An example of these documents is presented in Table 2.
We tokenize these titles and remove stop words, single words, and punctuation marks, keeping only the main terms. We treat users' documents as baskets, while the terms in a document are seen as items.
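The preprocessing step can be sketched as follows; the stop-word list here is a placeholder, since the paper's actual list is not given.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "in", "for"}  # illustrative placeholder

def document_to_basket(document):
    """Tokenize one user's concatenated titles into a basket of terms:
    lowercase, strip punctuation, drop stop words and single characters."""
    tokens = re.findall(r"\w+", document.lower())
    return {t for t in tokens if t not in STOP_WORDS and len(t) > 1}

basket = document_to_basket("The Basics of Data Mining; Mining Frequent Item-sets")
print(sorted(basket))
```

Because baskets are sets, duplicate terms within one document count only once, which matches the support counts used above.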
Considering that a relative support threshold adapts better to different data sets than an absolute one, we use p_min (s_min = p_min * m) to set the value of s_min. L1, L2, and L3 denote the numbers of single, dual, and ternary frequent item-sets, respectively. Table 3 shows the results of the A-priori algorithm with different values of p_min, and the distributions of dual and ternary frequent item-sets are presented in Figure 2 and Figure 3:
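Converting the relative threshold p_min into the absolute s_min is a one-line computation; rounding up is an assumption here, since supports are integer counts and the text does not state the rounding rule.

```python
import math

def absolute_support(p_min, m):
    """s_min = p_min * m, rounded up to an integer count
    (the ceiling is an assumption, not stated in the text)."""
    return math.ceil(p_min * m)

print(absolute_support(0.02, 1043))  # with the 1043 users above -> 21
```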
The experimental results in Table 3 reveal that L2, L3, and the time cost shrink sharply as p_min increases, which demonstrates the advantage of the A-priori algorithm: by virtue of the monotonicity of frequent item-sets, it decreases the cost of counting item-sets.
Association rules are identified using confidence and interest from the top-5 dual frequent item-sets (L2) generated by the A-priori algorithm. Table 4 shows that the confidences generated from the same frequent item-set are not symmetric. In addition, though both confidence and interest can be used to discover association rules, there is no linear relation between them.
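The asymmetry observed in Table 4 is easy to reproduce: conf(I -> J) and conf(J -> I) share the numerator s_ij but divide by different supports. A small Python sketch with toy baskets (not the paper's data):

```python
def rules_from_pair(baskets, i, j):
    """Return {(antecedent, consequent): (confidence, interest)}
    for both directions of the pair {i, j}."""
    m = len(baskets)
    s_i = sum(1 for b in baskets if i in b)
    s_j = sum(1 for b in baskets if j in b)
    s_ij = sum(1 for b in baskets if i in b and j in b)
    return {
        (i, j): (s_ij / s_i, s_ij / s_i - s_j / m),
        (j, i): (s_ij / s_j, s_ij / s_j - s_i / m),
    }

baskets = [{"python", "mining"}, {"python"}, {"python", "mining"},
           {"mining"}, {"python"}]
rules = rules_from_pair(baskets, "python", "mining")
# conf(python -> mining) = 2/4, while conf(mining -> python) = 2/3
```

Here "python" is more common than "mining", so the rule with the rarer antecedent gets the higher confidence even though both directions share the same joint count.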
Conclusion
Because of their simplicity and practicality, frequent item-sets and association rules have been used in various applications and data sets. This report introduces some related concepts and an efficient method, the A-priori algorithm, for discovering frequent item-sets. Experimental results on the survey data reveal that the A-priori algorithm can improve efficiency significantly.
However, several issues remain to be addressed. First, each document regarded as a "basket" contains many items, some of which are rare among all the documents and useless. Second, the A-priori algorithm treats every item with the same weight, while in practice items are often weighted with different values to represent their different importance. Finally, the question of which measure, confidence or interest, better captures the strength of an association rule in different situations is still not clearly answered.