Article series: Learning Notes for Machine Learning
I recently read chapter 11 ("Association analysis with the Apriori algorithm") and chapter 12 ("Efficiently finding frequent itemsets with FP-growth") of "Machine Learning in Action". As the chapter titles suggest, both chapters deal with association analysis, an unsupervised machine learning problem. Association analysis can answer questions such as "which items are often purchased together?". The book lists several example applications of association analysis:
- Seeing which items are often purchased together helps a store understand its customers' buying behavior. This kind of knowledge, extracted from the ocean of data, can be used for commodity pricing, marketing promotions, inventory management, and other purposes.
- Finding association rules in congressional voting records. Mining a dataset of congressional votes for associations between votes may sound like mere entertainment, but the results can also be used to serve political campaigns or to predict how elected officials will vote.
- Finding common characteristics of poisonous mushrooms. Here we are only interested in itemsets that contain a specific feature (being poisonous); the goal is to find features that poisonous mushrooms share and use them to avoid eating poisonous mushrooms.
- Finding words that co-occur in tweets. For a given search term, find the sets of words that frequently appear together in tweets.
- Mining news trends from news-site clickstreams to discover which stories are widely read by users.
- Search-engine recommendations: when a user enters a query term, suggest related query terms.
Finding hidden relationships between items in a large-scale data set is called association analysis or association rule learning. The main difficulty is that searching through the different combinations of items is time-consuming and computationally expensive, and brute-force search does not scale, so a smarter approach is needed to find frequent itemsets within a reasonable amount of time. This article describes how the Apriori algorithm and the FP-growth algorithm solve this problem.
1. Association analysis
Association analysis is the task of finding interesting relationships in a large-scale data set. These relationships can take two forms:
- Frequent itemsets
- Association rules
Frequent itemsets are collections of items that often appear together, while an association rule implies that a strong relationship may exist between two items.
Here's an example to illustrate these two concepts: Figure 1 shows a list of transactions for a grocery store.
| Transaction number | Items |
| --- | --- |
| 0 | Soy milk, lettuce |
| 1 | Lettuce, diapers, wine, beets |
| 2 | Soy milk, diapers, wine, orange juice |
| 3 | Lettuce, soy milk, diapers, wine |
| 4 | Lettuce, soy milk, diapers, orange juice |

Figure 1 Transaction list for a grocery store
Frequent itemsets are collections of items that often appear together; in the table above, the set {wine, diapers, soy milk} is one example of a frequent itemset. Association rules can also be found in this data set, for example {diapers} -> {wine}, meaning that if someone buys diapers, they are likely to buy wine as well. (Extended reading: diapers and beer?)
We measure these interesting relationships with support and confidence. The support of an itemset is defined as the proportion of records in the dataset that contain that itemset. For example, {soy milk} has a support of 4/5, and {soy milk, diapers} has a support of 3/5. Because support is defined on itemsets, we can set a minimum support and keep only the itemsets that meet it.
Confidence is defined for association rules. The confidence of the rule {diapers} -> {wine} is defined as support({diapers, wine}) / support({diapers}). Since the support of {diapers, wine} is 3/5 and the support of {diapers} is 4/5, the confidence of "{diapers} -> {wine}" is 3/4. This means that the rule applies to 75% of all records that contain "diapers".
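To make these two measures concrete, here is a minimal Python sketch (my own code, not a listing from the book; the helper names support and confidence are just illustrative) that computes them directly on the Figure 1 transactions:

```python
# Transactions from Figure 1, each represented as a set of items.
transactions = [
    {'soy milk', 'lettuce'},
    {'lettuce', 'diapers', 'wine', 'beets'},
    {'soy milk', 'diapers', 'wine', 'orange juice'},
    {'lettuce', 'soy milk', 'diapers', 'wine'},
    {'lettuce', 'soy milk', 'diapers', 'orange juice'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({'soy milk'}, transactions))              # 4/5 = 0.8
print(support({'soy milk', 'diapers'}, transactions))   # 3/5 = 0.6
print(confidence({'diapers'}, {'wine'}, transactions))  # 3/4 = 0.75 (up to rounding)
```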
2. Apriori principle
Suppose we run a grocery store that carries only 4 items (items 0, 1, 2, and 3). Figure 2 shows all possible combinations of these items:
Figure 2 All possible itemset combinations from the set {0, 1, 2, 3}
The support of a single itemset can be computed by scanning every record and checking whether the record contains that set. However, repeating this scan for every possible combination is unrealistic: a dataset with N items has \(2^N - 1\) possible itemsets, so a store carrying only 100 items already yields about \(1.27 \times 10^{30}\) combinations.
Researchers discovered the so-called Apriori principle, which helps reduce the amount of computation. The Apriori principle states that if an itemset is frequent, then all of its subsets are also frequent. More useful in practice is its contrapositive: if an itemset is infrequent, then all of its supersets are also infrequent.
In Figure 3, the shaded itemset {2,3} is known to be infrequent. From this we know that the itemsets {0,2,3}, {1,2,3}, and {0,1,2,3} are also infrequent. In other words, once we have computed the support of {2,3} and found it infrequent, we can immediately exclude {0,2,3}, {1,2,3}, and {0,1,2,3} from consideration.
Figure 3 All possible itemsets, with the infrequent itemsets shown in gray
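The pruning that Figure 3 illustrates can be written in a few lines. The sketch below is illustrative only (the helper name has_infrequent_subset is mine, not the book's): a candidate set of size k is worth counting only if every one of its subsets of size k-1 is already known to be frequent.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Return True if some (k-1)-subset of `candidate` is not in the frequent sets."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Suppose {2, 3} turned out to be infrequent (the shaded set in Figure 3),
# while the other two-item sets are frequent:
frequent_2 = {frozenset(s) for s in [{0, 1}, {0, 2}, {0, 3}, {1, 2}, {1, 3}]}

print(has_infrequent_subset(frozenset({0, 2, 3}), frequent_2))  # True  -> can be pruned
print(has_infrequent_subset(frozenset({0, 1, 2}), frequent_2))  # False -> must be counted
```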
3. Using the Apriori algorithm to discover frequent itemsets
As mentioned earlier, association analysis has two goals: discovering frequent itemsets and discovering association rules. The frequent itemsets must be found first before the association rules can be derived.
The Apriori algorithm is a method for discovering frequent itemsets. Its two inputs are a minimum support and a data set. The algorithm first generates a list of candidate itemsets containing one element each. It then scans the data set to see which of these itemsets meet the minimum support and removes those that do not. The remaining itemsets are combined to produce candidate itemsets containing two elements. The transactions are scanned again, and itemsets that do not meet the minimum support are removed. This process repeats until all itemsets have been removed.
3.1 Generating candidate itemsets
The pseudo-code for the data set scan is roughly as follows:
For each transaction tran in the dataset:
    For each candidate itemset can:
        Check whether can is a subset of tran
        If so, increment the count of can
For each candidate itemset:
    If its support is not below the minimum support, keep the itemset
Return the list of all frequent itemsets
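A rough Python rendering of that pseudo-code might look like the sketch below (my own naming; it corresponds in spirit to the scan function used in "Machine Learning in Action", but it is not the book's exact listing):

```python
def scan_dataset(dataset, candidates, min_support):
    """Count candidate itemsets in the transactions and keep the frequent ones."""
    counts = {}
    for tran in dataset:                       # for each transaction tran
        for can in candidates:                 # for each candidate itemset can
            if can.issubset(tran):             # is can a subset of tran?
                counts[can] = counts.get(can, 0) + 1
    num_trans = len(dataset)
    frequent, support_data = [], {}
    for can, count in counts.items():
        supp = count / num_trans
        support_data[can] = supp
        if supp >= min_support:                # keep itemsets meeting the minimum support
            frequent.append(can)
    return frequent, support_data

# Usage on the Figure 1 data with single-item candidates and a minimum support of 0.5:
dataset = [{'soy milk', 'lettuce'},
           {'lettuce', 'diapers', 'wine', 'beets'},
           {'soy milk', 'diapers', 'wine', 'orange juice'},
           {'lettuce', 'soy milk', 'diapers', 'wine'},
           {'lettuce', 'soy milk', 'diapers', 'orange juice'}]
c1 = [frozenset([item]) for item in sorted({i for t in dataset for i in t})]
l1, supports = scan_dataset(dataset, c1, 0.5)
print(l1)  # frequent single items: diapers, lettuce, soy milk, wine
```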