Association is in fact a very simple idea: a few things or events often occur together. "Beer + diapers" is a very typical pair of associated products.
An association reflects knowledge about the dependency or connection between one event and other events. In the English literature, two words are used to express this meaning: relevance and association. Both can describe the degree of connection between events. The former is mainly used for Internet content and documents; for example, the connection between documents in a search engine algorithm is described as relevance. The latter is usually used for real-world things; for example, the relationship between goods on an e-commerce site is described as association, and it is the word used in "association rules".
If there is an association between two or more attributes, the value of one attribute can be predicted from the values of the others. In simple terms, an association rule can be written as A→B, where A is called the precondition or left-hand side (LHS) and B is called the result or right-hand side (RHS). To describe the rule about diapers and beer (people who buy diapers also buy beer), we can write: buy diapers → buy beer.
Two concepts of association algorithms
An important concept in association algorithms is support, which is the probability that a transaction in the dataset contains a given set of items.
For example, if beer and diapers appear together in 50 of 1000 transactions, the support of this association is 5%.
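As a minimal sketch of the support calculation described above, the snippet below counts the transactions that contain every item of an itemset and divides by the total. The function name `support` and the sample data are assumptions made for illustration; only the 50-in-1000 ratio comes from the text.

```python
def support(transactions, itemset):
    """Return the fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


# Hypothetical data set: 1000 transactions, 50 of which contain both items.
transactions = (
    [frozenset({"beer", "diapers"})] * 50
    + [frozenset({"beer"})] * 200
    + [frozenset({"milk"})] * 750
)

beer_diaper_support = support(transactions, {"beer", "diapers"})
print(f"support = {beer_diaper_support:.0%}")
```

Representing each transaction as a set keeps the containment test (`items <= t`) a one-line subset check.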
Another closely related concept is confidence, which is the probability that B occurs given that A is already present in a transaction. It is calculated as: confidence(A→B) = (probability that A and B appear together) / (probability that A appears).
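Confidence is a conditional probability, so it can be estimated as a ratio of two counts: transactions containing both A and B, divided by transactions containing A. The following is an illustrative sketch; the function name `confidence` and the five sample transactions are hypothetical.

```python
def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs): of the transactions containing lhs, the share that also contain rhs."""
    lhs = frozenset(lhs)
    both = lhs | frozenset(rhs)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    if lhs_count == 0:
        return 0.0  # the rule never applies, so report zero rather than divide by zero
    both_count = sum(1 for t in transactions if both <= t)
    return both_count / lhs_count


# Hypothetical transactions for illustration only.
transactions = [
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "milk"}),
    frozenset({"beer"}),
    frozenset({"milk"}),
]

# Three transactions contain diapers, and two of those also contain beer.
diapers_to_beer = confidence(transactions, {"diapers"}, {"beer"})
beer_to_diapers = confidence(transactions, {"beer"}, {"diapers"})
```

Confidence is not symmetric in general: swapping the two sides of a rule changes the denominator and can change the value.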
Data association is an important kind of knowledge that can be discovered in a database. If there is a regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, sequential associations, causal associations, and so on. The purpose of association analysis is to find the hidden network of links in the database. Sometimes the association function of the data in the database is not known, or is uncertain even when it is known, so the rules generated by association analysis carry a confidence value.
Association rule mining finds interesting associations or related links between itemsets in a large amount of data. It is an important subject in data mining and has been extensively studied by the industry in recent years.
A typical example of association rule mining is shopping basket analysis. Association rule research can help find relationships between different items in a transaction database and reveal customer buying patterns, such as the effect of buying one product on the purchase of other goods. The analysis results can be applied to shelf layout, inventory arrangement, and the classification of users according to their purchasing patterns.
The discovery process of association rules can be divided into the following two steps:
The first step is to iteratively identify all frequent itemsets, requiring that the support of each frequent itemset be no less than the minimum value set by the user;
The second step is to construct, from the frequent itemsets, the rules whose confidence is not lower than the minimum value set by the user. Identifying or discovering all frequent itemsets is the core of association rule discovery algorithms, and it is also the most computationally expensive part.
The thresholds for support and confidence are the two most important concepts in describing association rules. The frequency with which an itemset occurs is called its support, which reflects the importance of an association rule in the database, while confidence measures the credibility of the rule. If a rule satisfies both the minimum support (min-support) and the minimum confidence (min-confidence), it is called a strong association rule.
Phases of association rule mining
The first phase must identify all frequent itemsets from the raw data collection. Frequent means that the frequency with which an itemset occurs must reach a certain level relative to all records. Taking a 2-itemset containing the two items A and B as an example, we can compute the support of the itemset {A,B}; if that support is greater than or equal to the minimum support threshold that has been set, {A,B} is called a frequent itemset. A k-itemset that satisfies the minimum support is called a frequent k-itemset, generally denoted Large k or Frequent k. The algorithm then attempts to generate the longer itemsets Large k+1 from the itemsets in Large k, and repeats until no longer frequent itemset can be found.
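The level-by-level search of the first phase can be sketched as follows. The text does not name a specific algorithm, so this sketch assumes an Apriori-style approach: candidates of size k are built only from frequent itemsets of size k-1, and a candidate is dropped if any of its (k-1)-subsets is not frequent. The function name `frequent_itemsets` and the sample transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support is >= min_support."""
    n = len(transactions)
    if n == 0:
        return {}

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: every single item that meets the threshold.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {s: count(s) for s in singles}
    current = {s: c for s, c in current.items() if c / n >= min_support}

    result = {}
    k = 1
    while current:
        result.update({s: c / n for s, c in current.items()})
        k += 1
        # Join pairs of frequent (k-1)-itemsets into candidates of size k,
        # keeping only candidates whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        counted = {c: count(c) for c in candidates}
        current = {c: v for c, v in counted.items() if v / n >= min_support}
    return result


# Hypothetical transactions for illustration only.
transactions = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"beer", "bread"}),
    frozenset({"diapers", "beer", "bread"}),
]
found = frequent_itemsets(transactions, min_support=0.4)
```

With these five transactions and a 40% threshold, the search stops at size 2 because every size-3 candidate has at least one infrequent pair inside it.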
The second phase of association rule mining is to generate the association rules. Rules are generated from the frequent k-itemsets found in the previous phase, subject to the minimum confidence threshold: if the confidence of a rule meets the minimum confidence, the rule is called an association rule.
For example, the rule A→B generated from the frequent itemset {A,B} is called an association rule if its confidence is greater than or equal to the minimum confidence.
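Rule generation can be sketched as splitting each frequent itemset into a left and a right part and keeping the splits whose confidence reaches the threshold. The sketch below assumes the supports of the frequent itemsets are already known; the function name `generate_rules` and the support values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def generate_rules(supports, min_confidence):
    """Return (lhs, rhs, confidence) for every split of a frequent itemset that is confident enough.

    `supports` maps frozenset itemsets to their support values.
    """
    rules = []
    for itemset, itemset_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), size):
                lhs = frozenset(lhs_items)
                lhs_support = supports.get(lhs)
                if not lhs_support:
                    continue  # left part unknown or never observed
                conf = itemset_support / lhs_support
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


# Hypothetical supports: diapers 7%, beer 30%, both together 5%.
supports = {
    frozenset({"diapers"}): 0.07,
    frozenset({"beer"}): 0.30,
    frozenset({"diapers", "beer"}): 0.05,
}
rules = generate_rules(supports, min_confidence=0.65)
for lhs, rhs, conf in rules:
    print(f"{set(lhs)} -> {set(rhs)} (confidence {conf:.1%})")
```

Under these assumed values only diapers → beer survives: its confidence is 0.05 / 0.07, while the reverse rule beer → diapers has confidence 0.05 / 0.30, which is below the threshold.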
In the case of "beer + diapers", to mine the records in a transaction database with association rule mining, we must first set the two thresholds of minimum support and minimum confidence. Here we assume a minimum support of min-support = 5% and a minimum confidence of min-confidence = 65%. A rule that meets the requirements must therefore satisfy both conditions. If the rule {diaper, beer} found during mining satisfies the following conditions, it is accepted as an association rule. The conditions can be written as:
Support(diaper, beer) ≥ 5% and Confidence(diaper, beer) ≥ 65%.
Here, Support(diaper, beer) ≥ 5% means that, in all transaction records, at least 5% of transactions show diapers and beer being purchased together. Confidence(diaper, beer) ≥ 65% means that, of all transaction records containing diapers, at least 65% also contain beer.
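The two threshold conditions can be checked directly from transaction counts. In the sketch below the 5% and 65% thresholds come from the text, while the function name `is_strong_rule` and all of the counts are hypothetical.

```python
def is_strong_rule(n_total, n_lhs, n_both, min_support=0.05, min_confidence=0.65):
    """Check a rule LHS -> RHS against both thresholds.

    n_total: number of transactions in the database
    n_lhs:   transactions containing the left-hand side (diapers)
    n_both:  transactions containing both sides (diapers and beer)
    """
    if n_total == 0 or n_lhs == 0:
        return False
    rule_support = n_both / n_total
    rule_confidence = n_both / n_lhs
    return rule_support >= min_support and rule_confidence >= min_confidence


# Hypothetical counts: 1000 transactions, 70 with diapers, 50 with diapers and beer.
# Support is 50/1000 and confidence is 50/70, so both thresholds are met.
accepted = is_strong_rule(n_total=1000, n_lhs=70, n_both=50)

# Same support, but diapers appear in 100 transactions: confidence drops to 50/100.
rejected = is_strong_rule(n_total=1000, n_lhs=100, n_both=50)
```

The second case shows why both conditions are needed: the pair is just as frequent overall, but buying diapers is a weaker predictor of buying beer.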
Therefore, in the future, if a consumer buys diapers, we can recommend that the consumer buy beer as well. This recommendation is based on the {diaper, beer} association rule, which is supported by past transaction records showing that most transactions that include diapers also include beer.
It can also be seen from the above that association rule mining is usually better suited to records whose attributes take discrete values.
If the attribute values in the original database are continuous, the data should be discretized appropriately before association rule mining is carried out. Discretization is an important step before mining, and how well it is done directly affects the resulting association rules.
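One simple way to discretize a continuous attribute is to map each value to a labelled interval and then treat the label as an ordinary item. The sketch below assumes a customer age attribute with bin edges at 30 and 50; the edges, labels, records, and the function name `discretize` are all hypothetical, and other binning strategies (equal frequency, clustering) are equally possible.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a numeric value to the label of the interval it falls in.

    `edges` must be sorted; interval i covers edges[i-1] <= value < edges[i].
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


# Hypothetical bins: below 30, 30 to 49, 50 and above.
AGE_EDGES = [30, 50]
AGE_LABELS = ["age<30", "age30-49", "age>=50"]

# Hypothetical records mixing a continuous attribute with purchased items.
records = [
    {"age": 24, "items": {"beer"}},
    {"age": 30, "items": {"diapers", "beer"}},
    {"age": 57, "items": {"milk"}},
]

# Each record becomes a transaction whose items include the age label.
transactions = [
    frozenset(r["items"]) | {discretize(r["age"], AGE_EDGES, AGE_LABELS)}
    for r in records
]
```

The choice of edges matters: moving a boundary changes which transactions share a label, and therefore changes the support and confidence of every rule that involves that label.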