Association analysis using the Apriori algorithm

Source: Internet
Author: User

Association analysis is the task of finding interesting relationships in a large data set. These relationships can take two forms: frequent itemsets and association rules. A frequent itemset is a collection of items that often appear together, and an association rule suggests that a strong relationship may exist between two sets of items. The support of an itemset is defined as the proportion of records in the data set that contain that itemset. The confidence is defined for an association rule such as {diaper} -> {wine}. The confidence of this rule is "support({diaper, wine}) / support({diaper})", that is, the fraction of records containing diapers that also contain wine.
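The two definitions above can be sketched in a few lines of Python. This is only an illustration: the transactions are made-up toy data, and the function names support and confidence are chosen here, not taken from any library.

```python
# Toy transactions (hypothetical data, not from the original article).
transactions = [
    {"soy milk", "lettuce"},
    {"lettuce", "diaper", "wine", "beet"},
    {"soy milk", "diaper", "wine", "orange juice"},
    {"lettuce", "soy milk", "diaper", "wine"},
    {"lettuce", "soy milk", "diaper", "orange juice"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


diaper_support = support({"diaper"}, transactions)         # 4 of 5 records
pair_support = support({"diaper", "wine"}, transactions)   # 3 of 5 records
rule_confidence = confidence({"diaper"}, {"wine"}, transactions)
```

On this toy data the rule {diaper} -> {wine} has a confidence of about 0.75: of the four records that contain diapers, three also contain wine.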

Although most examples of association rule analysis come from the retail industry, the technique can also be applied in other areas, such as website traffic analysis and the pharmaceutical industry.

Apriori principle

If an itemset is frequent, then all of its subsets are also frequent. Equivalently, if an itemset is infrequent, then all of its supersets are also infrequent. This means that once an itemset is known to be infrequent, none of its supersets needs to be counted.
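The principle follows from the fact that a record containing an itemset also contains every subset of it, so a subset can never have lower support. A small check on toy data (the transactions and helper names here are assumptions made for illustration):

```python
from itertools import combinations

# Hypothetical toy transactions over the items 1..5.
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def subsets_at_least_as_frequent(itemset, transactions):
    """True if every non-empty proper subset has support >= the itemset's."""
    items = sorted(itemset)
    s = support(items, transactions)
    return all(
        support(sub, transactions) >= s
        for size in range(1, len(items))
        for sub in combinations(items, size)
    )


holds = subsets_at_least_as_frequent({2, 3, 5}, transactions)

# {4} is infrequent at a minimum support of 0.5, so its superset
# {1, 3, 4} must be infrequent as well and need not be counted.
sup_4 = support({4}, transactions)
sup_134 = support({1, 3, 4}, transactions)
```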

Apriori algorithm

The Apriori algorithm takes two inputs: the minimum support and the data set. The algorithm first generates a list of candidate itemsets, one for each individual item. The transaction records are then scanned to see which itemsets meet the minimum support requirement, and those that do not are removed. The remaining itemsets are then combined to produce candidate itemsets that contain two items. Next, the transaction records are scanned again, and the itemsets that do not meet the minimum support are removed. The process repeats, with the itemset size growing by one each round, until no itemsets remain.

The pseudocode is as follows:

While the number of itemsets in the list is greater than 0:

    Build a list of candidate itemsets consisting of k items

    Scan the data to confirm which candidate itemsets are frequent

    Keep the frequent itemsets and use them to build a list of candidate itemsets consisting of k+1 items
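The loop above can be written as a short Python function. This is a minimal sketch rather than a tuned implementation: the function name apriori, the nested helper keep_frequent, and the toy transactions are all assumptions made for this example, and candidates are built by a simple pairwise union instead of the more economical prefix-based join.

```python
def apriori(transactions, min_support=0.5):
    """Return {frozenset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # One scan over the data: count each candidate and keep
        # only those whose support reaches min_support.
        kept = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                kept[c] = s
        return kept

    # k = 1: one candidate per individual item.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Combine frequent (k-1)-itemsets whose union has exactly k items.
        candidates = {
            a | b
            for i, a in enumerate(prev)
            for b in prev[i + 1:]
            if len(a | b) == k
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(transactions, min_support=0.5)
```

With a minimum support of 0.5 on this toy data, item 4 is dropped in the first round, so no candidate containing 4 is ever built, and the largest frequent itemset found is {2, 3, 5}.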

Mining association rules from frequent itemsets

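The quantitative measure of an association rule is called its confidence. The confidence of a rule P -> H is defined as support(P | H) / support(P), where P | H denotes the union of P and H, that is, all the items that appear in either set.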

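As with the generation of frequent itemsets, many association rules can be produced from each frequent itemset. Reducing the number of rules that must be evaluated keeps the problem tractable. It can be observed that if a rule does not meet the minimum confidence requirement, then no rule built from the same frequent itemset whose left-hand side is a subset of that rule's left-hand side will meet it either. For example, if {0, 1, 2} -> {3} fails the minimum confidence, then {0, 1} -> {2, 3} also fails, because moving items from the left-hand side to the right-hand side can only raise the support in the denominator.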
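The pruning observation can be sketched as follows. The function name generate_rules and the hard-coded dictionary of supports are assumptions made for this example. The input must contain the support of every subset of each frequent itemset, which the Apriori principle guarantees for the output of a complete frequent-itemset search.

```python
def generate_rules(frequent, min_confidence=0.7):
    """frequent maps frozenset -> support.

    Returns a list of (antecedent, consequent, confidence) tuples.
    """
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        # Start with one-item right-hand sides (consequents).
        consequents = [frozenset([i]) for i in itemset]
        while consequents and len(consequents[0]) < len(itemset):
            passed = []
            for h in consequents:
                conf = sup / frequent[itemset - h]
                if conf >= min_confidence:
                    rules.append((itemset - h, h, conf))
                    passed.append(h)
            # Grow consequents only from those whose rules passed; a failed
            # consequent cannot yield a passing rule when enlarged.
            size = len(consequents[0]) + 1
            consequents = list({
                a | b
                for i, a in enumerate(passed)
                for b in passed[i + 1:]
                if len(a | b) == size
            })
    return rules


# Hypothetical supports for a four-transaction toy data set.
frequent = {
    frozenset({1}): 0.5, frozenset({2}): 0.75,
    frozenset({3}): 0.75, frozenset({5}): 0.75,
    frozenset({1, 3}): 0.5, frozenset({2, 3}): 0.5,
    frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5,
    frozenset({2, 3, 5}): 0.5,
}
rules = generate_rules(frequent, min_confidence=0.7)
```

In this example {1} -> {3} passes with confidence 1.0 while the reverse rule {3} -> {1} fails, which shows that confidence is not symmetric.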
