Association rule mining algorithms:

Association rules capture co-occurrence relationships in transactional data. The most typical application is market-basket analysis, which famously found that customers who buy diapers also tend to buy beer.
Association rule mining uses some easily confused terms (see http://blog.sina.com.cn/s/blog_4d8d6303010009kb.html). Two concepts matter most: support and confidence. Informally, the support of the rule {diapers} -> {beer} is the fraction of all transactions that contain both diapers and beer (if support is too low, the rule applies so rarely that it has no practical value); the confidence is the fraction of diaper-buying transactions that also contain beer (if confidence is too low, the rule "diapers implies beer" is unreliable).
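To make the two measures concrete, here is a minimal Python sketch (not from the original post; the toy transactions are invented for illustration):

```python
# Toy basket data, invented for illustration.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "bread"},
    {"diapers", "beer", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"diapers", "beer"}, transactions))       # -> 0.6  (3 of 5)
print(confidence({"diapers"}, {"beer"}, transactions))  # -> 0.75 (3 of 4)
```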
The essence of the algorithm is simple: first find all frequent itemsets containing a single item, then combine them into candidate itemsets containing two items (with pruning along the way), and so on, level by level.
[Figures omitted in this copy: Apriori merge/algorithm flowchart and candidate-gen function flowchart]
Next, a practical example to describe the process of the Apriori algorithm. Assume a minimum support minsup = 30% and a minimum confidence minconf = 80%. Two important terms first: a frequent k-itemset is a set of k items whose support is at least minsup. The downward closure property is the core of the algorithm: if an itemset is frequent, then all of its non-empty subsets are frequent as well. Clearly, if {1, 2} is a frequent itemset, then {1} and {2} must be frequent, because a transaction containing both 1 and 2 necessarily contains 1, while a transaction containing 1 does not necessarily contain both 1 and 2.

The first major step of each Apriori round takes the frequent itemsets F_{k-1} produced in the previous round (initially the frequent 1-itemsets) and applies the candidate-gen function to generate a candidate set C_k. The candidate-gen function consists of a merge step and a prune step. Merge: two frequent (k-1)-itemsets are combined into one candidate k-itemset. The items within each itemset are kept in a fixed order, and two itemsets are merged only if their first k-2 items are identical and only the last item differs. Concretely, the frequent sets {1, 3} and {1, 5} merge into {1, 3, 5}. You might ask: why not also merge {1, 3} and {3, 5} into {1, 3, 5}? The answer is no, because by downward closure, if {1, 3, 5} is frequent then {1, 3}, {1, 5} and {3, 5} must all be frequent, so {1, 3, 5} is already generated from {1, 3} and {1, 5}; allowing merges that differ in an earlier position would only regenerate the same candidate. Hence the merge requires that the first k-2 items be identical and only the last item differ. Prune: for each candidate in C_k, check whether every (k-1)-subset is frequent; if any subset is not, delete the candidate. For example, take {1, 3, 5} generated from {1, 3} and {1, 5} above.
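The merge-and-prune candidate-gen step described above can be sketched in Python (a reconstruction for illustration, not the original author's code; itemsets are represented as sorted tuples):

```python
from itertools import combinations

def candidate_gen(freq_k_minus_1):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    Merge two itemsets whose first k-2 items are identical, then prune
    any candidate that has an infrequent (k-1)-subset."""
    prev = set(freq_k_minus_1)
    candidates = []
    for a, b in combinations(sorted(prev), 2):
        if a[:-1] == b[:-1]:                 # first k-2 items identical
            cand = a + (b[-1],)              # merge, keeping sorted order
            # prune: every (k-1)-subset must itself be frequent
            if all(s in prev for s in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

print(candidate_gen([(1, 3), (1, 5), (3, 5)]))  # -> [(1, 3, 5)]
print(candidate_gen([(1, 3), (1, 5)]))          # -> [] ({3, 5} not frequent)
```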
If {3, 5} is not a frequent set, then {1, 3, 5} certainly cannot be frequent either. The last step of each Apriori round checks the support of every itemset in the new candidate set; the itemsets that meet minsup become the frequent sets for the next round.

Finally, how are rules generated from the frequent itemsets? If f is a frequent itemset produced by the Apriori algorithm, every association rule over f can be written as (f - a) -> a for some non-empty a, a proper subset of f. Furthermore, if (f - a) -> a meets the confidence threshold, then (f - b) -> b must also meet it for any non-empty b that is a subset of a. This is because the confidence is f.count / (f - b).count: the smaller b is, the larger the antecedent f - b is, and by the data-sparsity principle (the larger an itemset, the fewer transactions contain it) the smaller (f - b).count cannot be larger than (f - a).count, so the confidence can only increase. The rule-generation procedure therefore mirrors the Apriori algorithm itself: first find all valid rules whose consequent is a single item, then grow the consequents with the candidate-gen function, and so on, except that the support test is replaced by a confidence test.
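The rule-generation idea can be illustrated with a simplified Python sketch that enumerates all consequents directly rather than growing them level-wise; the support counts and minconf value below are invented for illustration:

```python
from itertools import combinations

def rules_from_itemset(f, count, minconf):
    """Enumerate rules (f - alpha) -> alpha from a frequent itemset f.

    `count` maps each frequent itemset (as a frozenset) to its support
    count; the confidence of (f - alpha) -> alpha is
    count[f] / count[f - alpha]."""
    f = frozenset(f)
    rules = []
    for r in range(1, len(f)):                  # size of the consequent
        for alpha in combinations(sorted(f), r):
            alpha = frozenset(alpha)
            conf = count[f] / count[f - alpha]
            if conf >= minconf:
                rules.append((f - alpha, alpha, conf))
    return rules

# invented counts: diapers in 4 transactions, beer in 4, both in 3
count = {frozenset({"diapers"}): 4,
         frozenset({"beer"}): 4,
         frozenset({"diapers", "beer"}): 3}
for ante, cons, conf in rules_from_itemset({"diapers", "beer"}, count, 0.7):
    print(set(ante), "->", set(cons), round(conf, 2))
```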
An example of a specific implementation of Apriori: http://blog.csdn.net/lskyne/article/details/8302478
Urgent: request for an online implementation of the Apriori association-rule algorithm

Abstract
With the development of the information age, the amount of information grows exponentially. People find it increasingly difficult to obtain useful information from these massive amounts of data, and harder still to discover the rules hidden behind it. Data mining is a new technology for obtaining useful information from large amounts of data, and association rule mining is one of its methods. This article discusses in detail the design and development of an association rule mining system based on the Apriori algorithm. Building on the classic Apriori algorithm, the system converts the transaction database into a bitmap matrix, which greatly improves search efficiency, and it can mine frequent itemsets and association rules separately.
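The abstract gives no implementation details, but the bitmap-matrix idea it mentions can be sketched as follows (a hypothetical illustration, not the paper's code; each item's column is stored as a bit vector with one bit per transaction, so a support count becomes an AND plus a popcount):

```python
# Toy transaction database, invented for illustration.
transactions = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"bread", "diapers"},
    {"beer", "bread"},
    {"beer", "bread", "diapers"},
]

# Build one bit vector per item: bit t is set if transaction t contains it.
bitmap = {}
for t, basket in enumerate(transactions):
    for item in basket:
        bitmap[item] = bitmap.get(item, 0) | (1 << t)

def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    bits = (1 << len(transactions)) - 1      # start from "all transactions"
    for item in itemset:
        bits &= bitmap.get(item, 0)          # AND the item columns
    return bin(bits).count("1")              # popcount

print(support_count({"beer", "diapers"}))    # -> 3
```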
The paper is organized as follows: Firstly, it introduces the generation, definition and application of data mining, then describes the basic concepts of association rule mining, then analyzes the requirements of the system, and puts forward the design scheme; the system is followed by the specific implementation of the system. Finally, the system is tested and used to mine the drug groups in the traditional Chinese medicine prescription library, which verifies the correctness and practicability of the system.
Key words: data mining; Association Rules; Apriori algorithm

Requirement Analysis and Design Scheme
4.1 Requirement Analysis
A transaction database generally provides only access and retrieval over large amounts of data, which satisfies ordinary usage. But precisely because the database stores so much data across many different data items, there are also many hidden, previously unknown and meaningful relationships among those items, and these relationships can be very valuable to users; data mining arose in this situation. Association rule mining is an important task in data mining, and the Apriori algorithm is a classic algorithm for it: it can discover interesting associations and correlations between items in large amounts of data. As large volumes of data are continuously collected and stored, more and more practitioners are interested in mining association rules from their databases. Discovering interesting associations in large numbers of business transaction records can support many business decisions, such as catalog design, cross-selling and promotion analysis.

1 Introduction
With the rapid development of database technology and the wide application of database management systems, people accumulate more and more data. This surge of data hides a great deal of important information, and people want to analyze it at a higher level to make better use of it. Current database systems can efficiently implement data entry, query, statistics, and other functions, but they cannot discover the relationships and rules in the data, nor predict future trends from the existing data. The lack of means to find the knowledge hidden behind the data has led to the phenomenon of "data explosion but knowledge poverty". As a result, data mining technology came into being and has shown great vitality. Data mining is the process of extracting potentially useful, previously unknown information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data; it extends the human ability to analyze problems and discover knowledge.
2. Data Mining Overview
2.1 The Emergence of Data Mining
With the development of the information age, the amount of information increases exponentially, yet there are few tools for analyzing and processing the data: people have massive data but suffer from a lack of information. This surge of data hides a great deal of important information, and people want to analyze it at a higher level to make better use of it. Current database systems can efficiently implement data entry, query, statistics, and other functions, but they cannot discover the relationships and rules in the data, nor predict future trends from the existing data. The lack of means to discover the knowledge hidden behind the data has led to the phenomenon of "data explosion but knowledge poverty". The information explosion is a double-edged sword: a huge amount of information is both the most important wealth and the most dangerous killer, and it also brings a crisis of decision-making and understanding. In the face of the challenge that "everyone is overwhelmed by data, yet people are hungry for knowledge", data mining and knowledge discovery technologies have emerged and are booming, showing their powerful vitality.
Data mining is the result of the natural evolution of information technology. This evolution is visible in the development of the following capabilities in the database industry: data collection and database creation; data management (including data storage and retrieval, and database transaction processing); and data analysis and understanding (involving data warehouses and data mining). For example, the early development of data collection and database creation mechanisms became an essential basis for the later development of effective mechanisms for data storage and retrieval, query, and transaction processing. As large numbers of database systems providing query and transaction processing were widely put into practice, [... the remainder of this text is truncated in the source]

A Matlab program implementing Apriori association-rule mining, with detailed annotations

The following is the part of the Apriori program that generates frequent k-itemsets, starting from the frequent 2-itemsets. There are two problems with the program:
1. It seems that k in the while loop stays fixed at the number of frequent 2-itemsets. Shouldn't k change after the frequent 3-itemsets are generated? Where is that reflected?
2. There are two large for loops in the program, but as soon as one frequent 3-itemset is found, the second for loop ends, even though there should actually be three more frequent sets. Shouldn't the for loops run unconditionally until j reaches k? In my run k was 15, but at the end of the program i = 2 and j = 3, and j never went on to 4 and beyond. Why? Please advise. Urgent ...
while (k > 0)
    le = length(candidate{1});
    num = 2;
    nl = 0;
    for i = 1:k-1
        for j = i+1:k
            x1 = candidate{i};  % candidate is initialized to the frequent 2-itemsets; x1 is its i-th entry
            x2 = candidate{j};
            c = intersect(x1, x2);
            m = 0;
            r = 1;
            nn = 0;
            l1 = 0;
            if (length(c) == le-1) && (sum(c == x1(1:le-1)) == le-1)
                houxuan = union(x1(1:le), x2(le));
                % pruning: drop the candidate if any of its (k-1)-item subsets is not frequent
                sub_set = subset(houxuan);
                % generate all (k-1)-item subsets of this candidate
                nn = length(sub_set);
                % check whether all of these (k-1)-item subsets are frequent
                while (r && m < nn)
                    m = m + 1;
                    r = in(sub_set{m}, candidate);
                end
                if m == nn
                    nl = nl + 1;
                    % record the candidate k-itemset
                    cand{nl} = houxuan;
                    % count how many transactions contain this candidate k-itemset
                    le = length(cand{1});
                    for i = 1:m  % NOTE: reuses the outer loop variable i and the subset counter m; this is likely what breaks the outer loops
                        s = cand{nl};
                        x = X(i, :);  % X is the transaction matrix
                        if sum(x(s)) == le
                            nn = nn + 1;
                        end
                    end
                end
            end
[... the remainder of the code is truncated in the source]
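For comparison, here is a minimal level-wise Apriori loop in Python (a reconstruction for illustration, not the poster's Matlab, with invented toy data). It shows the two points raised in the question: the level size k must be recomputed at the start of every pass, and the i/j loops must run over all pairs before the next level begins:

```python
from itertools import combinations

def apriori_levels(transactions, minsup_count):
    """Level-wise Apriori sketch: k is recomputed every level, and the
    pair loops always run to completion before moving to the next level."""
    items = sorted({i for t in transactions for i in t})
    freq = [(i,) for i in items
            if sum(1 for t in transactions if i in t) >= minsup_count]
    all_freq = list(freq)
    while freq:
        prev = set(freq)
        nxt = []
        k = len(freq)                        # recomputed at every level
        for a_i in range(k - 1):             # both loops cover every pair
            for b_i in range(a_i + 1, k):
                a, b = freq[a_i], freq[b_i]
                if a[:-1] != b[:-1]:         # first k-2 items must match
                    continue
                cand = a + (b[-1],)
                # prune: every (len-1)-subset must already be frequent
                if not all(s in prev
                           for s in combinations(cand, len(cand) - 1)):
                    continue
                if sum(1 for t in transactions if set(cand) <= t) >= minsup_count:
                    nxt.append(cand)
        freq = nxt
        all_freq.extend(freq)
    return all_freq

print(apriori_levels([{"beer", "diapers", "milk"}, {"beer", "diapers"},
                      {"bread", "diapers"}, {"beer", "bread"},
                      {"beer", "bread", "diapers"}], 3))
# -> [('beer',), ('bread',), ('diapers',), ('beer', 'diapers')]
```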
 
