Association analysis is the task of looking for interesting relationships in large data sets. These relationships take two forms: frequent itemsets and association rules. Frequent itemsets are collections of items that often appear together, and association rules suggest that a strong relationship may exist between two items or sets of items. The support of an itemset is defined as the proportion of records in the data set that contain the itemset. Confidence is defined for an association rule such as {diaper} -> {wine}. The confidence of this rule is defined as support({diaper, wine}) / support({diaper}).
Although most examples of association rule analysis come from the retail industry, the technique can also be applied in other domains, such as website traffic analysis and the pharmaceutical industry.
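The support and confidence definitions can be checked by hand on a small example. The sketch below is only an illustration: the four transactions and the helper names support and confidence are made up for this note, not taken from any library.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"diaper", "wine", "milk"},
    {"diaper", "wine"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent united with consequent) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"diaper"}, transactions))          # 0.75 (3 of 4 records)
print(support({"diaper", "wine"}, transactions))  # 0.5  (2 of 4 records)
print(confidence({"diaper"}, {"wine"}, transactions))  # 0.5 / 0.75, about 0.67
```

On this data, two out of the three records that contain diaper also contain wine, which is what the confidence value expresses.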
Apriori principle
If an itemset is frequent, then all of its subsets are also frequent. Conversely, if an itemset is infrequent, then all of its supersets are also infrequent.
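One way to use this principle in practice is to discard a candidate itemset as soon as one of its smaller subsets is known to be infrequent, without scanning the data. The sketch below assumes the frequent 2-item sets are already known; the function name can_be_frequent and the example pairs are hypothetical.

```python
from itertools import combinations

def can_be_frequent(candidate, frequent_smaller):
    """True only if every (k-1)-item subset of candidate is known to be frequent."""
    k = len(candidate)
    return all(frozenset(s) in frequent_smaller
               for s in combinations(candidate, k - 1))

# Assumed frequent pairs (illustrative); note that {1, 2} is not among them.
frequent_pairs = {frozenset(p) for p in [(1, 3), (2, 3), (2, 5), (3, 5)]}

print(can_be_frequent(frozenset({2, 3, 5}), frequent_pairs))  # True
print(can_be_frequent(frozenset({1, 2, 3}), frequent_pairs))  # False: {1, 2} is infrequent
```

A True result only means the candidate has not been ruled out; its support still has to be counted against the data.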
Apriori algorithm
The two inputs of the Apriori algorithm are the minimum support and the data set. The algorithm first generates a list of candidate itemsets, one for each individual item. It then scans the transactions to see which itemsets meet the minimum support requirement, and removes those that do not. The remaining itemsets are then combined to produce candidate itemsets that contain two elements. Next, the transactions are scanned again, and the itemsets that do not meet the minimum support are removed. The process repeats, with the itemset size growing by one each round, until no itemsets remain.
The pseudocode is as follows:
While the number of itemsets in the list is greater than 0:
    Build a list of candidate itemsets consisting of k items
    Scan the data to check whether each candidate itemset is frequent
    Keep the frequent itemsets and build a list of candidate itemsets consisting of k+1 items
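The loop can be sketched in Python roughly as follows. This is a minimal version written for readability rather than speed; the function name apriori, the min_support parameter, and the four-transaction data set are illustrative assumptions, not the API of any particular library.

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_only(candidates):
        # Scan the data and keep the candidates that meet min_support.
        kept = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                kept[c] = s
        return kept

    # k = 1: one candidate per individual item.
    current = frequent_only({frozenset([i]) for t in transactions for i in t})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Apriori principle: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = frequent_only(candidates)
    return frequent

# Hypothetical toy data set.
data = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
result = apriori(data, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a minimum support of 0.5, item 4 is dropped in the first pass because it appears in only one of the four transactions, and the loop stops once no 4-item candidates can be formed.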
Mining association rules from frequent itemsets
Association rules are quantified by a measure called confidence. The confidence of a rule P -> H is defined as support(P | H) / support(P), where P | H denotes the union of P and H, that is, all items that appear in either set.
As with frequent itemsets, many association rules can be generated from each frequent itemset. Reducing the number of rules that have to be evaluated keeps the problem tractable. It can be observed that if a rule does not meet the minimum confidence requirement, then no subset of that rule will meet it either. For example, for the frequent itemset {0, 1, 2, 3}, if {0, 1, 2} -> {3} fails the minimum confidence, then every rule from the same itemset whose right-hand side contains 3, such as {0, 1} -> {2, 3}, fails as well.
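Rule generation with this pruning can be sketched as below. The code assumes a dict that maps every frequent itemset to its support, including all subsets of each itemset; the dict shown here is hard-coded toy data, and the name generate_rules and the min_confidence parameter are hypothetical.

```python
from itertools import combinations

def generate_rules(frequent, min_confidence=0.7):
    """Return a list of (antecedent, consequent, confidence) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        # Start with single-item right-hand sides.
        consequents = [frozenset([i]) for i in itemset]
        while consequents and len(consequents[0]) < len(itemset):
            passed = []
            for h in consequents:
                conf = supp / frequent[itemset - h]
                if conf >= min_confidence:
                    rules.append((itemset - h, h, conf))
                    passed.append(h)
            # Pruning: larger right-hand sides are built only from ones that passed.
            size = len(consequents[0]) + 1
            consequents = list({a | b for a, b in combinations(passed, 2)
                                if len(a | b) == size})
    return rules

# Assumed frequent itemsets and supports (illustrative toy values).
frequent = {
    frozenset({1}): 0.5, frozenset({2}): 0.75,
    frozenset({3}): 0.75, frozenset({5}): 0.75,
    frozenset({1, 3}): 0.5, frozenset({2, 3}): 0.5,
    frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.5,
    frozenset({2, 3, 5}): 0.5,
}
rules = generate_rules(frequent, min_confidence=0.7)
for p, h, conf in rules:
    print(sorted(p), "->", sorted(h), round(conf, 2))
```

For the itemset {2, 3, 5}, the rule {2, 5} -> {3} fails the threshold, so no right-hand side containing 3 is ever tried for that itemset. The join used here is a simplification: it may still evaluate a few candidates that a stricter subset check would skip, but it never emits a rule below the threshold.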
Association analysis using the Apriori algorithm