Data Mining Algorithms: Association Analysis, Part One (Basic Concepts)

Source: Internet
Author: User

I. Basic Concepts

Consider the transaction database in the table above: this two-dimensional dataset is a market basket transaction database, which records what customers bought. Here TID is the identifier of a purchase, and Items lists the products the customer bought in that purchase.

  Transaction:

Each record in the transaction database is called a transaction. In the market basket data in the table above, each transaction represents one purchase.

  Itemsets (T):

A collection that contains 0 or more items is called an itemset. In a market basket transaction, each product is an item, and the items bought in one purchase together form an itemset.

  Support count:

The number of transactions that contain the itemset. For example, the itemset {Bread,Milk} appears in 3 transactions of the database, so its support count is 3.

  Support (s):

The proportion of all transactions that contain the itemset. In the example above, the support count of {Bread,Milk} is 3 and the database holds 5 transactions, so the support of {Bread,Milk} is 3/5.
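
The two definitions above can be sketched in a few lines of Python. The transaction table is not reproduced in this text, so the five baskets below are a hypothetical dataset, chosen only so that {Bread,Milk} occurs in 3 of 5 transactions as stated; the function names are illustrative as well.

```python
# Hypothetical transactions (an assumption: the article's table is not shown).
# They are chosen so that {Bread, Milk} appears in 3 of the 5 baskets.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


print(support_count({"Bread", "Milk"}, transactions))  # 3
print(support({"Bread", "Milk"}, transactions))        # 0.6
```

The subset test `itemset <= t` is what "the transaction contains the itemset" means: every item of the itemset must be present, and extra items in the basket do not matter.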

  Frequent itemsets:

If we set a minimum support threshold, then every itemset whose support is greater than or equal to this threshold is a frequent itemset.
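
As a rough illustration, frequent itemsets can be found by brute force: enumerate every non-empty combination of items and keep those that reach the threshold. The dataset, the threshold of 0.6, and the function name below are assumptions made for this sketch; practical algorithms avoid enumerating every candidate.

```python
from itertools import combinations

# Hypothetical transactions (an assumption: the article's table is not shown).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def frequent_itemsets(transactions, min_support):
    """Map each non-empty itemset with support >= min_support to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
    return result


# The threshold 0.6 is an arbitrary choice for illustration.
for itemset, s in frequent_itemsets(transactions, 0.6).items():
    print(sorted(itemset), s)
```

With this data and threshold, {Bread,Milk} (support 3/5) is frequent, while {Bread,Milk,Diaper} (support 2/5) is not.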

  Association rules:

With these basic concepts in place, we can introduce association rules, the central notion of association analysis.

An association rule is an implication expression between two itemsets. Given two disjoint itemsets X and Y, we can form the rule X→Y, for example {Bread,Milk}→{Diaper}. Combining itemsets in this way can produce a very large number of rules, but not every rule is useful, so we need some measures to help us find the strong rules.

  Support (s):

The support of an association rule X→Y is the proportion of all transactions that contain both X and Y, that is, the itemset X∪Y:

  s(X→Y) = σ(X∪Y) / N

where σ(X∪Y) is the support count of X∪Y and N is the total number of transactions. Take {Bread,Milk}→{Diaper} as an example: 2 transactions contain the itemset {Bread,Milk,Diaper}, so the support of this rule is 2/5.

A rule with low support may occur purely by chance, so support is often used to remove meaningless rules. Support also has a useful property that can be exploited to discover association rules efficiently.

  Confidence (c):

The confidence of an association rule X→Y is defined as:

  c(X→Y) = σ(X∪Y) / σ(X)

where σ(·) is the support count. This definition measures how frequently Y appears in transactions that contain X. Take {Bread,Milk}→{Diaper} again: 3 transactions contain {Bread,Milk} and 2 transactions contain {Bread,Milk,Diaper}, so the confidence of this rule is 2/3.

Confidence measures the reliability of the inference made by a rule. For a given rule, the higher the confidence, the more likely Y is to appear in a transaction that contains X. Confidence can also be read as an estimate of the conditional probability of Y given X.
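
Both rule measures follow directly from support counts, as this minimal sketch shows. The dataset is hypothetical (the article's table is not reproduced here) and the function names are illustrative; the baskets are chosen to match the counts quoted for {Bread,Milk} and {Bread,Milk,Diaper}.

```python
# Hypothetical transactions (an assumption: the article's table is not shown).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_support(X, Y, transactions):
    """s(X -> Y): fraction of transactions containing both X and Y."""
    return support_count(X | Y, transactions) / len(transactions)


def rule_confidence(X, Y, transactions):
    """c(X -> Y): among transactions containing X, the fraction containing Y."""
    count_x = support_count(X, transactions)
    if count_x == 0:
        raise ValueError("confidence is undefined when X never occurs")
    return support_count(X | Y, transactions) / count_x


X, Y = {"Bread", "Milk"}, {"Diaper"}
print(rule_support(X, Y, transactions))     # 0.4
print(rule_confidence(X, Y, transactions))  # 0.666...
```

Note that the two measures share a numerator and differ only in the denominator: all transactions for support, transactions containing X for confidence.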

Defining these two measures for an association rule is worthwhile. First, a minimum support requirement filters out meaningless rules. From a merchant's point of view, the purpose of data mining is to extract value that informs business decisions. If a rule has low support, customers seldom buy those items together, so basing decisions on that rule brings the merchant little benefit. Second, the higher the confidence, the more reliable the rule.

Association rule discovery:

With these two measures we can filter all candidate rules and keep only those that are meaningful to us. First, minimum thresholds minsup and minconf are set for support and confidence respectively. Then, for a given transaction set T, association rule discovery means finding all rules whose support is greater than or equal to minsup and whose confidence is greater than or equal to minconf.
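
Taken literally, this definition suggests a brute-force search: enumerate every rule X→Y with disjoint, non-empty X and Y, and keep the ones that pass both thresholds. The sketch below does exactly that on a hypothetical dataset (the article's table is not reproduced here); the thresholds 0.4 and 0.6 and all function names are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical transactions (an assumption: the article's table is not shown).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def subsets(items, min_size, max_size):
    """Yield every subset of `items` whose size is in [min_size, max_size]."""
    for k in range(min_size, max_size + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def discover_rules(transactions, minsup, minconf):
    """Brute force: test every candidate rule against both thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    candidates = 0
    rules = []
    for Z in subsets(items, 2, len(items)):           # Z plays the role of X ∪ Y
        count_z = support_count(Z, transactions)
        for X in subsets(sorted(Z), 1, len(Z) - 1):   # non-empty proper subset
            candidates += 1
            count_x = support_count(X, transactions)
            if count_x == 0:
                continue  # confidence undefined; support is 0 anyway
            s, c = count_z / n, count_z / count_x
            if s >= minsup and c >= minconf:
                rules.append((X, Z - X, s, c))
    return rules, candidates


rules, candidates = discover_rules(transactions, minsup=0.4, minconf=0.6)
print(candidates)  # 602 candidate rules for only 6 distinct items
```

Even with 6 distinct items there are 602 candidate rules, and the count grows exponentially with the number of items, which is why this direct approach does not scale.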

One thing to note is that an inference drawn from an association rule does not imply causality. From A→B we can only conclude that A and B tend to occur together; we cannot conclude that A is the cause and B is the effect. In other words, a rule only describes the co-occurrence observed in the data.

A naive approach to mining association rules is to compute the support and confidence of every possible rule, but this is prohibitively expensive. A more efficient approach decouples the support and confidence requirements. Because the support of a rule X→Y depends only on the support of the itemset X∪Y, most association rule mining algorithms decompose the problem into two steps:

  Frequent itemset generation: the goal is to find all itemsets that satisfy the minimum support threshold; these are called frequent itemsets.

  Rule generation: the goal is to extract all high-confidence rules from the frequent itemsets found in the previous step; these are called strong rules.

The computation required for frequent itemset generation is usually much greater than that required for rule generation.
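
The two-step strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not an optimized algorithm: the dataset is hypothetical (the article's table is not reproduced here), the thresholds 0.6 and 0.8 are arbitrary, and step 1 still enumerates itemsets by brute force where a practical miner would prune candidates.

```python
from itertools import combinations

# Hypothetical transactions (an assumption: the article's table is not shown).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def frequent_itemsets(transactions, minsup):
    """Step 1: map each frequent itemset to its support count (brute force)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = sum(1 for t in transactions if set(combo) <= t)
            if c / n >= minsup:
                counts[frozenset(combo)] = c
    return counts


def strong_rules(counts, n, minconf):
    """Step 2: split each frequent itemset Z into X -> Z - X and keep the
    rules with confidence >= minconf. Every subset of a frequent itemset is
    itself frequent, so counts[X] is always available and no further pass
    over the transactions is needed."""
    rules = []
    for Z, count_z in counts.items():
        for k in range(1, len(Z)):
            for combo in combinations(sorted(Z), k):
                X = frozenset(combo)
                conf = count_z / counts[X]
                if conf >= minconf:
                    rules.append((X, Z - X, count_z / n, conf))
    return rules


counts = frequent_itemsets(transactions, minsup=0.6)
rules = strong_rules(counts, len(transactions), minconf=0.8)
for X, Y, s, c in rules:
    print(sorted(X), "->", sorted(Y), s, c)  # ['Beer'] -> ['Diaper'] 0.6 1.0
```

Step 2 only looks up counts computed in step 1, which is consistent with the remark that frequent itemset generation dominates the cost.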
