Frequent Pattern Mining (Data Mining)

Source: Internet
Author: User

Frequent pattern mining is one of the most common tasks in data mining, and Apriori is the classic algorithm for it. First, what is a frequent pattern? It is a pattern that frequently appears in the data. "Pattern" here is an abstract notion, so consider a concrete example: the famous "beer and diapers" story.

In American families with a baby, the story goes, the mother usually stays home to look after the child while the young father goes to the supermarket to buy diapers. While buying diapers, the father often picks up some beer for himself, so these two seemingly unrelated items frequently appear in the same shopping basket. If the father could buy only one of the two items in a store, he might well give up and go to another store where he could get both at once. Wal-Mart noticed this peculiar phenomenon and began placing beer and diapers in the same area of the store, so that these fathers could find both products at the same time and finish shopping quickly, while Wal-Mart sold these customers two items per visit instead of one and earned good sales revenue. That is the origin of the "beer and diapers" story.

Similarly, supermarket sales records often show that milk and bread are bought together. Since milk and bread frequently appear together in the records, the pair {milk, bread} can be seen as a frequent pattern; of course, milk alone and bread alone are frequent patterns too.

Frequent pattern mining, then, is the task of finding these frequently occurring patterns. How "frequent" is defined depends on a setting passed to the algorithm. Let us walk through the Apriori algorithm. First, a few concepts that make the algorithm easier to follow:

1. Support: the proportion of records in the data set that contain a given itemset.

2. k-candidate set: the set of candidate k-itemsets, generated by combining the (k-1)-item frequent sets (whose support is already at or above the specified minimum); it is used to compute the k-item frequent set.

3. k-item frequent set: the k-itemsets whose support is greater than or equal to the specified minimum support, computed from the k-candidate set.

If these three definitions still feel abstract after a first reading, don't worry. Examples make things much easier to understand, so let's work through one, and the terminology above will become clear.

Suppose we have a table of transactions, where each record lists the items it contains. What an "item" is depends on the specific problem: it might be goods sold, user tags, and so on. We want to find out which items often appear together. Start with the 1-itemsets: for each item, count how many records it appears in; the fraction of all records that contain the itemset is its support.
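As a concrete sketch, here is how the 1-item supports could be computed in Python. The five-transaction table below is hypothetical (the original article's table is not reproduced here); the item names I1-I5 and the values are chosen to line up with the 0.25 threshold and the itemsets discussed later.

```python
# Hypothetical transaction table: each set lists the items in one record.
transactions = [
    {"I2", "I3", "I4"},
    {"I2", "I3", "I4"},
    {"I1", "I3"},
    {"I1", "I3"},
    {"I1", "I5"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Support of each 1-itemset: I1 -> 0.6, I2 -> 0.4, I3 -> 0.8,
# I4 -> 0.4, I5 -> 0.2 (only I5 falls below a 0.25 threshold).
for item in sorted({i for t in transactions for i in t}):
    print(item, support({item}, transactions))
```

The same `support` function works for itemsets of any size, since `itemset <= t` tests subset containment.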

The itemsets just computed form the k-candidate set for k = 1, i.e. the 1-item candidate set. The algorithm takes a minimum support threshold to filter out infrequent itemsets. Suppose the minimum support is set to 0.25: every itemset whose support is below 0.25 is filtered out.

The itemsets that survive the filter form the k-item frequent set for k = 1, the 1-item frequent set: under our minimum support of 0.25, these items are considered frequent. After computing the 1-item frequent set, we continue to expand and compute the 2-item frequent set, that is, which pairs of items frequently appear together. When moving from k items to k+1, note a key property: if a set of k+1 items is frequent, then every subset of k of its elements is also a frequent itemset. For example, if milk and bread frequently appear together, then milk alone appears frequently, and bread alone appears frequently. What is this property good for? A great deal: it lets us generate the (k+1)-candidate set from the k-item frequent sets, because every k-subset of a frequent (k+1)-itemset must be one of the frequent itemsets already computed. The (k+1)-candidates are therefore built by pairwise combination of the k-item frequent sets: two k-itemsets are joined when their first k-1 elements are identical and the last element of the first precedes the last element of the second (this ordering avoids generating the same combination twice). Applying this join to the 1-item frequent set yields the 2-item candidate set.
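The join step just described can be sketched as follows. `generate_candidates` is a hypothetical helper name; it joins two sorted k-itemsets whenever their first k-1 items agree, and also applies the subset pruning that the Apriori property justifies.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Build (k+1)-candidates by joining frequent k-itemsets that share
    their first k-1 items, then prune any candidate that has an
    infrequent k-item subset (allowed by the Apriori property)."""
    sorted_sets = sorted(sorted(s) for s in frequent_k)
    frequent = {frozenset(s) for s in sorted_sets}
    k = len(sorted_sets[0])
    candidates = set()
    for a, b in combinations(sorted_sets, 2):
        # Join: first k-1 items equal, last item of a precedes last of b.
        if a[:-1] == b[:-1] and a[-1] < b[-1]:
            cand = frozenset(a) | frozenset(b)
            # Prune: every k-subset of the candidate must be frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(sorted(cand), k)):
                candidates.add(cand)
    return candidates

# The 2-item frequent sets from the walkthrough yield a single
# 3-item candidate, {I2, I3, I4}.
freq2 = [{"I1", "I3"}, {"I2", "I3"}, {"I2", "I4"}, {"I3", "I4"}]
print(generate_candidates(freq2))
```

Only {I2, I3} and {I2, I4} share their first element, so only one candidate survives the join; its three 2-subsets are all frequent, so it also survives the pruning.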

Selecting the itemsets with support greater than or equal to 0.25 gives the 2-item frequent set: {I1, I3}, {I2, I3}, {I2, I4}, and {I3, I4}.

The same join is then used to compute the 3-item candidate set.

Selecting the 3-itemsets with support greater than or equal to 0.25 gives the 3-item frequent set: {I2, I3, I4}.

The calculation continues until the computed frequent set is empty. In this example the 4-item frequent set is empty, so the largest frequent itemsets have three items. The 2-item frequent set tells us that I1 and I3, I2 and I3, I2 and I4, and I3 and I4 each appear together with probability greater than or equal to the minimum support we set, i.e. at least 0.25. Likewise, the 3-item frequent set {I2, I3, I4} means that I2, I3, and I4 appear together with probability greater than or equal to 0.25.
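Putting the pieces together, the whole level-by-level loop might look like the sketch below. This is a minimal illustration, not a production implementation: for brevity the join simply unions pairs of frequent k-itemsets and keeps those of size k+1, rather than the sorted-prefix join described earlier. The transaction table is the same hypothetical one used above.

```python
def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset, growing the
    itemset size level by level until no frequent set remains."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    result = {s: support(s) for s in current}
    k = 1
    while current:
        # Simplified join: union pairs of frequent k-itemsets, keep size k+1.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        current = {c for c in candidates if support(c) >= min_support}
        result.update((s, support(s)) for s in current)
        k += 1
    return result

# Hypothetical table matching the walkthrough; minimum support 0.25.
transactions = [
    {"I2", "I3", "I4"},
    {"I2", "I3", "I4"},
    {"I1", "I3"},
    {"I1", "I3"},
    {"I1", "I5"},
]
frequent = apriori(transactions, 0.25)
print(frequent[frozenset({"I2", "I3", "I4"})])  # 0.4
```

On this table the loop stops at level 4 (no 4-item candidates survive), and the result contains the four frequent items, the four frequent pairs, and the single frequent triple from the walkthrough.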

This algorithm is very useful: we often want to know which things tend to appear together, and Apriori provides exactly that.
