Association Rules
- Item and item set
The smallest indivisible unit of information in a database is called an item, denoted by the symbol $i$. A collection of items is called an itemset. Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of all items; an itemset that contains $k$ items is called a $k$-itemset. For example, the collection {beer, diaper, milk powder} is a 3-itemset.
- Transaction
Let $I$ be the set of all items in the database. The transaction database $D = \{t_1, t_2, \ldots, t_n\}$ consists of a series of uniquely identified transactions, and each transaction $t_i$ contains a subset of $I$. For example, when a customer buys several items at once in a shopping mall, the purchase receives a unique identifier in the database indicating that the goods were bought at the same time by the same customer. That purchase corresponds to one database transaction.
- Frequency of itemsets (support count)
The number of transactions that contain an itemset is called the frequency of the itemset, or its support count.
- Association Rules
An association rule is an implication of the form $X \Rightarrow Y$, where $X$ and $Y$ are proper subsets of $I$ and $X \cap Y = \emptyset$. $X$ is called the precondition (antecedent) of the rule and $Y$ is called the result (consequent) of the rule. An association rule reflects the pattern that when the items in $X$ occur in a transaction, the items in $Y$ tend to occur as well.
- Support for Association Rules
The support of an association rule $X \Rightarrow Y$ is the ratio of the number of transactions in the transaction set $D$ that contain both $X$ and $Y$ to the total number of transactions. It reflects how frequently the items in $X$ and $Y$ occur together in the transaction set, and is written $\mathrm{support}(X \Rightarrow Y)$:
$$\mathrm{support}(X \Rightarrow Y) = P(X \cup Y) = \frac{|\{t \in D : X \cup Y \subseteq t\}|}{|D|} \tag{1}$$
- Confidence level of association rules (confidence)
The confidence of an association rule $X \Rightarrow Y$ is the ratio of the number of transactions that contain both $X$ and $Y$ to the number of transactions that contain $X$. It is written $\mathrm{confidence}(X \Rightarrow Y)$ and reflects the conditional probability that $Y$ occurs in a transaction that contains $X$. That is,
$$\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{|\{t \in D : X \cup Y \subseteq t\}|}{|\{t \in D : X \subseteq t\}|} \tag{2}$$
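The definitions of support count, support, and confidence can be sketched in a few lines of Python. This is only an illustration: the function names and the five-transaction toy database `D` are assumptions made for the example and do not come from the text above.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= t)


def support(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of all transactions that contain both X and Y."""
    return support_count(set(x) | set(y), transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    x_count = support_count(x, transactions)
    if x_count == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support_count(set(x) | set(y), transactions) / x_count


# Hypothetical toy database, invented for this example.
D = [
    frozenset({"beer", "diaper"}),
    frozenset({"beer", "diaper", "milk powder"}),
    frozenset({"diaper", "milk powder"}),
    frozenset({"beer"}),
    frozenset({"diaper"}),
]

print(support({"diaper"}, {"beer"}, D))     # 2 of 5 transactions -> 0.4
print(confidence({"diaper"}, {"beer"}, D))  # 2 of the 4 diaper transactions -> 0.5
```

Support is symmetric in $X$ and $Y$, while confidence is not: in this toy data the rule {beer} ⇒ {diaper} has confidence 2/3.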
- Minimum support and minimum confidence level
Typically the user specifies the support and confidence thresholds that a rule must meet. These are called the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf). min_sup describes the minimum importance of an association rule, and min_conf specifies the minimum reliability that it must meet.
- Strong Association Rules
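If $\mathrm{support}(X \Rightarrow Y) \ge$ min_sup and $\mathrm{confidence}(X \Rightarrow Y) \ge$ min_conf, the association rule $X \Rightarrow Y$ is called a strong association rule; otherwise it is called a weak association rule. In general, the term "association rule" refers to strong association rules.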
- Frequent itemsets
Given the item set $I$, the itemsets in the transaction database $D$ that meet the minimum support specified by the user, that is, the non-empty subsets of $I$ whose support is not less than min_sup, are called frequent itemsets or large itemsets.
- Itemset space theory
Agrawal et al. established the itemset space theory for mining transaction databases. The core of the theory is that every non-empty subset of a frequent itemset is itself a frequent itemset, and every superset of a non-frequent itemset is a non-frequent itemset.
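The property can be checked on a small example. The sketch below is an illustration under assumptions: the helper names and the four-transaction toy database are made up, and the code verifies the property on that data only; it does not prove it.

```python
from itertools import combinations
from typing import FrozenSet, List


def support_count(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def all_subsets_frequent(itemset: FrozenSet[str],
                         transactions: List[FrozenSet[str]],
                         min_count: int) -> bool:
    """True if every non-empty proper subset of `itemset` meets `min_count`."""
    items = sorted(itemset)
    return all(
        support_count(frozenset(sub), transactions) >= min_count
        for size in range(1, len(items))
        for sub in combinations(items, size)
    )


# Hypothetical toy database over the items a, b, c.
D = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
MIN_COUNT = 2

ab = frozenset("ab")
bc = frozenset("bc")
abc = frozenset("abc")

# {a, b} is frequent, and so are its subsets {a} and {b}.
print(support_count(ab, D), all_subsets_frequent(ab, D, MIN_COUNT))
# {b, c} is not frequent, so its superset {a, b, c} cannot be frequent either.
print(support_count(bc, D), support_count(abc, D))
```

The reason is that any transaction containing an itemset also contains all of its subsets, so the support count can only stay equal or shrink as items are added.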
Apriori algorithm principle
- The basic idea of Apriori algorithm
The basic idea of the Apriori algorithm is to compute the support of itemsets through multiple scans of the database, discover all the frequent itemsets, and then generate association rules from them. The first scan yields the collection $L_1$ of frequent 1-itemsets. The $k$-th scan ($k > 1$) first uses the result $L_{k-1}$ of the $(k-1)$-th scan to produce the set $C_k$ of candidate $k$-itemsets, then determines the support of each element of $C_k$ during the scan, and finally computes the collection $L_k$ of frequent $k$-itemsets at the end of the scan. The algorithm ends when the set $C_k$ of candidate $k$-itemsets is empty.
- The process of generating frequent itemsets by Apriori algorithm
The process of generating frequent itemsets is mainly divided into two steps: connection and pruning.
- Connection step.
To find $L_k$, a collection $C_k$ of candidate $k$-itemsets is produced by connecting (joining) $L_{k-1}$ with itself.
Let $l_1$ and $l_2$ be itemsets in $L_{k-1}$, and let $l_i[j]$ denote the $j$-th item of $l_i$. The Apriori algorithm assumes that the items in a transaction or itemset are sorted in lexicographic order, so that for a $(k-1)$-itemset $l_i$ the items satisfy $l_i[1] < l_i[2] < \ldots < l_i[k-1]$. If the first $k-2$ items of $l_1$ and $l_2$ are equal, then $l_1$ and $l_2$ can be connected. That is, $l_1$ and $l_2$ can be connected if $(l_1[1] = l_2[1]) \wedge \ldots \wedge (l_1[k-2] = l_2[k-2]) \wedge (l_1[k-1] < l_2[k-1])$. The condition $l_1[k-1] < l_2[k-1]$ guarantees that no duplicates are generated, while searching for frequent itemsets in order avoids searching and counting itemsets that cannot occur in the transaction database. The itemset produced by connecting $l_1$ and $l_2$ is $\{l_1[1], l_1[2], \ldots, l_1[k-1], l_2[k-1]\}$.
- Pruning step.
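$C_k$ needs to be validated to remove the non-frequent $k$-itemsets that do not meet the minimum support. By the itemset space theory, any candidate that has a $(k-1)$-subset outside $L_{k-1}$ cannot be frequent, so it can be removed before its support is counted.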
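The connection and pruning steps can be sketched together as one candidate-generation function. This is a minimal sketch under assumptions: itemsets are represented as sorted tuples of strings, the function is named `apriori_gen` here for illustration, and the example input `L2` is hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

Itemset = Tuple[str, ...]  # items are kept in sorted (lexicographic) order


def apriori_gen(prev_frequent: Set[Itemset]) -> List[Itemset]:
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    candidates: List[Itemset] = []
    ordered = sorted(prev_frequent)
    for i, l1 in enumerate(ordered):
        for l2 in ordered[i + 1:]:
            # Connection: the first k-2 items agree and the last item of l1
            # precedes the last item of l2, so each candidate is built once.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                candidate = l1 + (l2[-1],)
                # Pruning: every (k-1)-subset must itself be frequent.
                size = len(candidate) - 1
                if all(sub in prev_frequent for sub in combinations(candidate, size)):
                    candidates.append(candidate)
    return candidates


# Hypothetical frequent 2-itemsets over the items a..e.
L2 = {("a", "b"), ("a", "c"), ("a", "e"), ("b", "c"), ("b", "d"), ("b", "e")}
print(apriori_gen(L2))  # [('a', 'b', 'c'), ('a', 'b', 'e')]
```

In this example the connection step also produces (a, c, e), (b, c, d), (b, c, e), and (b, d, e), but each has a 2-subset that is missing from `L2`, so the pruning step discards them.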
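- Main steps of the Apriori algorithm
- Scan all the data to produce the collection $C_1$ of candidate 1-itemsets.
- Based on the minimum support, generate the collection $L_1$ of frequent 1-itemsets from $C_1$.
- For $k > 1$, repeat the following three steps.
- Generate the collection $C_k$ of candidate $k$-itemsets from $L_{k-1}$ by the connection and pruning steps.
- Based on the minimum support, generate the collection $L_k$ of frequent $k$-itemsets from the collection $C_k$ of candidate $k$-itemsets.
- If $L_k$ is not empty, set $k = k + 1$ and return to the candidate generation step; otherwise continue with the next step.
- Based on the minimum confidence, generate the strong association rules from the frequent itemsets, then end.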
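The final step, generating strong rules from the frequent itemsets, can be sketched as follows. The sketch assumes the support counts of all frequent itemsets are already available in a dictionary; the function name `generate_rules` and the counts in the example are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]  # (X, Y, confidence)


def generate_rules(counts: Dict[FrozenSet[str], int], min_conf: float) -> List[Rule]:
    """Return every rule X => Y with confidence >= min_conf.

    `counts` maps each frequent itemset to its support count. Because every
    subset of a frequent itemset is frequent, each antecedent X is a key too.
    """
    rules: List[Rule] = []
    for itemset, count in counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = count / counts[x]  # count(X u Y) / count(X)
                if conf >= min_conf:
                    rules.append((x, itemset - x, conf))
    return rules


# Hypothetical support counts for the frequent itemsets {a}, {b}, {a, b}.
counts = {frozenset({"a"}): 4, frozenset({"b"}): 3, frozenset({"a", "b"}): 3}
for x, y, conf in generate_rules(counts, min_conf=0.8):
    print(sorted(x), "=>", sorted(y), conf)  # ['b'] => ['a'] 1.0
```

Here {a} ⇒ {b} has confidence 3/4 = 0.75 and is rejected, while {b} ⇒ {a} has confidence 3/3 = 1.0 and is kept. No extra support check is needed, since every rule is built from an itemset that is already frequent.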
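- Apriori algorithm description
Input: Database $D$, minimum support threshold min_sup.
Output: The frequent itemsets $L$ in $D$.
- Begin
- $L_1$ = frequent 1-itemsets;
- for ($k = 2$; $L_{k-1} \ne \emptyset$; $k$++) do begin
- $C_k$ = apriori_gen($L_{k-1}$); {call function apriori_gen to generate the candidate $k$-itemsets from the frequent $(k-1)$-itemsets}
- for all transactions $t \in D$ do begin {scan $D$ to count}
- $C_t$ = subset($C_k$, $t$); {use subset to find all the candidates contained in transaction $t$}
- for all candidates $c \in C_t$ do $c$.count++;
- end
- $L_k$ = {$c \in C_k$ | $c$.count $\ge$ min_sup};
- end
- end
- return $L = \bigcup_k L_k$ {form the set of frequent itemsets}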
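The description above can be turned into a short runnable sketch. It is not a reference implementation: for brevity it forms candidates by taking unions of pairs of frequent (k-1)-itemsets instead of the prefix-based connection, it treats min_sup as an absolute count, and the function name and toy database are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: Iterable[Iterable[str]],
            min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent itemset together with its support count."""
    db: List[FrozenSet[str]] = [frozenset(t) for t in transactions]

    # First scan: count the single items and keep the frequent ones (L1).
    counts: Dict[FrozenSet[str], int] = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Simplified join: unions of two frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have a non-frequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Scan the database once to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy database over the items a, b, c.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(toy, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

On this toy data the candidate {a, b, c} is pruned without being counted because {b, c} is not frequent, so the loop stops after the second scan.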
An example analysis of Apriori algorithm
Table 1 List of transactions for the database
Transaction | List of item IDs | Transaction | List of item IDs
T100 | I1, I2, I5 | T600 | I2, I3
T200 | I2, I4 | T700 | I1, I3
T300 | I2, I3 | T800 | I1, I2, I3, I5
T400 | I1, I2, I4 | T900 | I1, I2, I3
T500 | I1, I3 | |
With a minimum support count of 2, that is, min_sup = 2, the process by which the Apriori algorithm generates the candidate itemsets and frequent itemsets is shown below.
- First scan
Scan the database $D$ to count the support of each candidate 1-itemset in $C_1$:
Candidate 1-itemsets $C_1$
Itemset | Support count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2

Frequent 1-itemsets $L_1$
Itemset | Support count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2
Because the minimum support count is 2, no itemsets are deleted. This determines the collection $L_1$ of frequent 1-itemsets, which consists of the candidate 1-itemsets that meet the minimum support.
- Second scan
To discover the collection $L_2$ of frequent 2-itemsets, the algorithm uses the connection $L_1 \bowtie L_1$ to produce the collection $C_2$ of candidate 2-itemsets. No candidates are removed in the pruning step, because every subset of these candidates is also frequent.
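The second scan can be reproduced with a short script. The contents of the nine transactions are an assumption here: the script uses the transactions of the well-known textbook example whose single-item counts (6, 7, 6, 2, 2) match the first scan, with items named I1 to I5. The variable names are hypothetical as well.

```python
from collections import Counter
from itertools import combinations

# Assumed transaction database: nine transactions over the items I1..I5.
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}
MIN_COUNT = 2  # minimum support count

# First scan: support count of every single item, then the frequent ones.
item_counts = Counter(item for items in D.values() for item in items)
L1 = sorted(item for item, n in item_counts.items() if n >= MIN_COUNT)

# Connecting L1 with itself yields every pair of frequent items as C2.
C2 = list(combinations(L1, 2))

# Second scan: count each candidate pair and keep the frequent pairs.
pair_counts = {c: sum(1 for items in D.values() if set(c) <= items) for c in C2}
L2 = {c: n for c, n in pair_counts.items() if n >= MIN_COUNT}

for itemset, n in L2.items():
    print(itemset, n)
```

Under this assumed data, $C_2$ has 10 candidates and 6 of them reach the minimum support count of 2.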
- Third scan
- Fourth scan