I. Frequent Patterns in Association Rules
Association rules (Association Rule) are an important model that originated in, and has been widely studied by, the database and data mining communities. The main purpose of association rule mining is to find:
[Frequent pattern] (Frequent Pattern): co-occurrence relationships, that is, items that appear together. Such frequent co-occurrence relationships are also called associations (Association).
II. A Typical Application of Association Rules: the Classic "Beer and Diapers" Marketing Case at Walmart
Basket analysis: by analyzing the associations between items in customers' shopping baskets, you can uncover customers' shopping habits and help retailers develop better-targeted marketing strategies.
The following is the simplest and most classic example of an association rule: baby diapers -> beer [support = 10%, confidence = 70%]
This rule says that 10% of all customers bought baby diapers and beer together, and that among all customers who bought baby diapers, 70% also bought beer. After discovering this rule, the retailer decided to shelve baby diapers and beer together, which significantly increased sales; this is the classic "beer and diapers" marketing case from the Walmart supermarket.
III. Support and Confidence
Support and confidence are the two key indicators used to measure the strength of an association rule. They reflect, respectively, the usefulness and the certainty of a discovered rule.
[Support] The support of a rule X -> Y is the percentage of transactions in the complete transaction set that contain X ∪ Y: support(X -> Y) = P(X ∪ Y). Support measures a rule's usefulness: if the support is too small, the rule reflects little more than chance co-occurrences, which in commercial practice usually have no business value.
[Confidence] The confidence of a rule X -> Y is the percentage of transactions containing X that also contain Y: confidence(X -> Y) = P(Y | X). Confidence measures a rule's certainty (predictability): if the confidence is too low, Y cannot be reliably inferred from X, and such rules are of little use in practical applications.
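To make the two measures concrete, here is a minimal Python sketch (the transaction data and item names are made up for illustration) that computes the support and confidence of a diapers -> beer rule:

```python
# Toy transaction set, purely illustrative
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread"},
    {"beer", "bread"},
]

X = {"diapers"}   # left-hand side of the rule
Y = {"beer"}      # right-hand side of the rule

n = len(transactions)
count_xy = sum(1 for t in transactions if (X | Y) <= t)  # transactions containing X and Y
count_x = sum(1 for t in transactions if X <= t)         # transactions containing X

support = count_xy / n            # P(X ∪ Y)
confidence = count_xy / count_x   # P(Y | X)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.40, confidence = 0.67 for this toy data
```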
IV. The Apriori Algorithm
[Basic Concepts]
1. [Database]: the set of records D, stored as a two-dimensional structure.
2. [All items]: the set I of all items.
3. [Transaction]: a single record T in the database (T ∈ D).
4. [Itemset]: a set of items that appear together. An itemset with k items is called a k-itemset, and a k-itemset ⊆ T. Unless otherwise noted, k below denotes the number of items.
5. [Candidate itemset]: the itemsets obtained by joining (merging) the frequent itemsets of the previous level; denoted C[k].
6. [Strong rule]: a rule found by association rule analysis that is strong enough to act on; the ratio of sales achieved by promoting to customers according to the rule to sales achieved by blind promotion over the whole population should be high, and the higher this ratio, the better.
7. [Pruning step]: the filtering step that keeps a candidate itemset only if all of its subsets are frequent (see the sketch after this list).
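As an illustration of the candidate itemset and pruning step defined above, the following sketch (function names and data are illustrative, not taken from the article) joins frequent (k-1)-itemsets into candidate k-itemsets and then prunes any candidate that has a non-frequent subset:

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Join step: combine frequent (k-1)-itemsets whose union has k items
    (i.e. that share k-2 items) to form the candidate set C[k]."""
    return {a | b for a in frequent_prev for b in frequent_prev if len(a | b) == k}

def prune(candidates, frequent_prev):
    """Pruning step: keep a candidate only if every (k-1)-subset is frequent."""
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_prev for s in combinations(c, len(c) - 1))
    }

# Example: frequent 2-itemsets
L2 = {frozenset({"beer", "diapers"}), frozenset({"beer", "milk"}),
      frozenset({"diapers", "milk"})}
C3 = prune(generate_candidates(L2, 3), L2)
print(C3)  # {frozenset({'beer', 'diapers', 'milk'})} -- all of its 2-subsets are frequent
```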
The Apriori algorithm is the most famous of the many association rule mining algorithms. Its core is a recursive algorithm built on the two-stage frequent itemset idea. In terms of classification, the rules it mines are single-dimensional, single-level, Boolean association rules.
The algorithm consists of the following two steps:
- Generate all frequent itemsets. A frequent itemset is an itemset whose support is not lower than the minimum support threshold (min_sup).
- Generate all confident association rules from the frequent itemsets. Here, a confident association rule is a rule whose confidence is not lower than the minimum confidence threshold (min_conf).
The frequent itemsets found in the first step are then used to generate candidate rules: for each frequent itemset, all rules containing only items from that itemset are produced, each with a single item on the right-hand side (following the rule definition above). Once these rules are generated, only those whose confidence exceeds the user-specified minimum confidence are kept. The frequent itemsets themselves are produced by the level-wise procedure described below.
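A minimal sketch of this rule-generation step, assuming the frequent itemsets and their support counts have already been computed (all names and the toy counts are illustrative):

```python
def generate_rules(frequent_itemsets, support_counts, min_conf):
    """From each frequent itemset, emit rules with a single item on the
    right-hand side and keep only those whose confidence >= min_conf."""
    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue
        for rhs_item in itemset:
            rhs = frozenset({rhs_item})
            lhs = itemset - rhs
            # confidence(lhs -> rhs) = support(itemset) / support(lhs)
            conf = support_counts[itemset] / support_counts[lhs]
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
    return rules

# Example usage with made-up support counts (out of 5 transactions):
counts = {frozenset({"diapers"}): 3, frozenset({"beer"}): 3,
          frozenset({"diapers", "beer"}): 2}
print(generate_rules([frozenset({"diapers", "beer"})], counts, min_conf=0.6))
# Two rules: beer -> diapers and diapers -> beer, each with confidence 2/3 ≈ 0.67
```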
[Apriori algorithm]
The Apriori algorithm exploits prior knowledge about frequent itemsets and performs a level-wise, iterative search, using the frequent k-itemsets to explore the (k+1)-itemsets.
Step 1: scan the transaction records to find all frequent 1-itemsets; together they form L1;
Step 2: use L1 to find L2, the set of frequent 2-itemsets;
Step 3: use L2 to find L3;
...
Step N: continue in this way until no more frequent k-itemsets can be found.
Finally, strong rules are extracted from all the frequent itemsets to generate the association rules that interest the user. The Apriori algorithm relies on the property that every non-empty subset of a frequent itemset must also be frequent. Equivalently, if P(I) is below the minimum support threshold, then adding any item A to I cannot make the resulting itemset (A ∪ I) appear more often than I, so A ∪ I is not frequent either.
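Putting the steps together, the following is a minimal, self-contained sketch of the level-wise loop, with the subset-pruning property applied when building candidates (a simplified illustration, not a production implementation):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets with their support counts.
    For simplicity, min_sup is an absolute count here."""
    # Step 1: count items and keep the frequent 1-itemsets (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset({item})
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join + prune: keep candidates whose every (k-1)-subset is frequent
        prev = set(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Scan the transactions again to count each surviving candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [{"diapers", "beer", "milk"}, {"diapers", "beer"},
                {"milk", "bread"}, {"diapers", "bread"}, {"beer", "bread"}]
print(apriori(transactions, min_sup=2))
# All frequent 1-itemsets plus {diapers, beer} with a count of 2
```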
However, the large number of candidate itemsets it generates and the need to scan the database repeatedly are the two major drawbacks of the Apriori algorithm.
V. Uses of Association Rule Algorithms
Association rule algorithms are used not only for analyzing numeric datasets; they also play an important role in mining plain text and web documents, for example in finding co-occurrence relationships between words and web usage patterns, which form the basis of web data mining, search, and recommendation.