Chapter 11: Association Analysis with the Apriori Algorithm
1. Introduction
The "beer and diapers" problem is a classic example of association analysis. In industries such as retail and pharmaceuticals, we often need to perform association analysis. We use it to find interesting relationships in large amounts of data; these relationships can then guide decisions in our work and daily life.
2. Basic Concepts of Association Analysis
Association analysis is the task of finding interesting relationships in very large datasets. It has two goals: finding frequent itemsets and discovering association rules.
Four concepts come up constantly in association analysis: frequent itemsets, association rules, support, and confidence. A frequent itemset is a set of items that often appear together in the data, where "often" is usually defined by a support threshold (other metrics can of course be used). The support of an itemset is the proportion of records in the dataset that contain every item in the set. An association rule expresses a relationship between itemsets and is usually measured by confidence. The confidence of a rule is the conditional probability that the consequent appears given that the antecedent appears; in the rule diapers → beer, for example, the condition (antecedent) is diapers.
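To make the two measures concrete, here is a minimal sketch in Python. The transaction data, item names, and function names are invented for illustration, not taken from any library:

```python
# Toy transactions: each set is one shopper's basket.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent:
    support(antecedent ∪ consequent) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"diaper", "beer"}))       # 2 of 4 baskets -> 0.5
print(confidence({"diaper"}, {"beer"}))  # 0.5 / 0.75 -> about 0.667
```

So the rule diapers → beer has support 0.5 and confidence about 0.667 on this toy data.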
3. The Apriori Principle
As its name implies, Apriori uses prior (a priori) knowledge to reason about what is not yet known.
We know that if an itemset is frequent, then all of its subsets must also be frequent. For example, if { beer, diapers } is a frequent itemset, then { diapers } and { beer } must each be frequent itemsets. Read in reverse, this means that if an itemset is not frequent, then none of its supersets can be frequent. The Apriori algorithm exploits this property to greatly reduce the number of itemsets that must be examined during association analysis.
4. The Apriori Algorithm
Enumerating every possible itemset directly is far too expensive, so the Apriori algorithm, built on the Apriori principle, was proposed to reduce the complexity of the problem. In other words, the Apriori algorithm is a practical method for finding frequent itemsets.
Characteristics of the Apriori algorithm:
Advantages: the algorithm is simple and easy to implement.
Disadvantages: it is not suitable for large datasets.
Applicable data type: nominal (categorical) data.
The Apriori algorithm first generates a candidate itemset for each single item, then removes the itemsets that fall below the minimum support. It then forms pairwise combinations of the remaining items, again removes the itemsets below the minimum support, and so on, until no itemset that fails the minimum support remains.
Here is a Python implementation of the Apriori algorithm:
1. Load the data:
2. Generate the candidate itemsets containing a single item:
3. Remove the itemsets that do not meet the minimum support:
4. Write a function that builds the candidate itemsets containing k items:
5. The Apriori algorithm itself:
5. Mining Association Rules from Frequent Itemsets
Frequent itemsets are measured by support; association rules are measured by confidence. When the confidence of a rule reaches a chosen threshold, we call the rule an association rule. Association rules have a property similar to that of frequent itemsets: if a rule does not meet the minimum confidence, then no rule formed by moving more items into its consequent can meet it either. In other words, we can start from rules whose consequent has size 1 and keep generating new rules only from the ones that survive. (The consequent is the conclusion part of a rule: in diapers → beer, diapers is the antecedent and beer is the consequent.)
The algorithm for finding association rules is similar to the Apriori algorithm itself and can be described as a graded, level-wise approach. First create all rules whose consequent has size 1 and delete those that do not meet the minimum confidence; then use the surviving consequents to build rules whose consequent has size 2, delete the ones below the minimum confidence, and so on.
Here's the Python code:
1. First we create the main function:
2. Next we filter candidate rules by the minimum confidence:
3. We generate rules with larger consequents:
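The three steps above can be sketched as follows. To keep the block self-contained and runnable on its own, the frequent itemsets and supports of a small toy transaction database are hard-coded rather than recomputed, and the function names are my own choices:

```python
# Hard-coded supports for a small toy transaction database
# (the same values the Apriori step would produce for it).
SUPPORT = {
    frozenset([1]): 0.5, frozenset([2]): 0.75,
    frozenset([3]): 0.75, frozenset([5]): 0.75,
    frozenset([1, 3]): 0.5, frozenset([2, 3]): 0.5,
    frozenset([2, 5]): 0.75, frozenset([3, 5]): 0.5,
    frozenset([2, 3, 5]): 0.5,
}

LEVELS = [
    [frozenset([1]), frozenset([2]), frozenset([3]), frozenset([5])],
    [frozenset([1, 3]), frozenset([2, 3]), frozenset([2, 5]), frozenset([3, 5])],
    [frozenset([2, 3, 5])],
]

def calc_confidence(freq_set, consequents, support_data, rules, min_conf):
    """Step 2: keep the consequents whose rule meets min_conf; the
    survivors are returned so larger consequents can be grown from them."""
    pruned = []
    for conseq in consequents:
        conf = support_data[freq_set] / support_data[freq_set - conseq]
        if conf >= min_conf:
            rules.append((freq_set - conseq, conseq, conf))
            pruned.append(conseq)
    return pruned

def rules_from_conseq(freq_set, consequents, support_data, rules, min_conf):
    """Step 3: recursively merge surviving consequents into larger ones,
    mirroring the level-wise candidate generation used for itemsets."""
    m = len(consequents[0])
    if len(freq_set) > m + 1:
        merged = []
        for i in range(len(consequents)):
            for j in range(i + 1, len(consequents)):
                if sorted(consequents[i])[:m - 1] == sorted(consequents[j])[:m - 1]:
                    merged.append(consequents[i] | consequents[j])
        merged = calc_confidence(freq_set, merged, support_data, rules, min_conf)
        if len(merged) > 1:
            rules_from_conseq(freq_set, merged, support_data, rules, min_conf)

def generate_rules(freq_levels, support_data, min_conf=0.7):
    """Step 1: for every frequent itemset of size >= 2, start from
    1-item consequents and grow them while the confidence holds."""
    rules = []
    for level in freq_levels[1:]:          # skip the single-item sets
        for freq_set in level:
            singles = [frozenset([item]) for item in freq_set]
            survivors = calc_confidence(freq_set, singles, support_data,
                                        rules, min_conf)
            if len(freq_set) > 2 and len(survivors) > 1:
                rules_from_conseq(freq_set, survivors, support_data,
                                  rules, min_conf)
    return rules

rules = generate_rules(LEVELS, SUPPORT, min_conf=0.7)
```

Each rule is returned as a triple (antecedent, consequent, confidence); on this data the survivors include {1} → {3} and {3, 5} → {2}, both with confidence 1.0.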
6. Summary
The purpose of association analysis is to find interesting relationships in data. "Interesting" has two meanings here: items that frequently appear together, which is what we call finding frequent itemsets, and items whose presence can be inferred from other items, which is what we call finding association rules. Frequent itemsets are measured by support, and association rules by confidence.
Association analysis requires examining combinations of items, and enumerating all combinations is expensive. To simplify the computation and shrink the search space, we use the Apriori algorithm. Its basic idea is that any superset of an infrequent itemset is also infrequent. The same idea extends to association rules: if a rule fails the confidence threshold, the rules derived from it by enlarging the consequent fail as well, which is what makes the level-wise rule-generation method work.
Although the Apriori algorithm reduces the amount of computation to some extent, it must rescan the database every time the candidate itemsets change, so it is not well suited to large datasets. The FP-growth algorithm was proposed to address this problem: unlike Apriori, FP-growth only needs to scan the database twice, which makes it much faster.