Correlation analysis
1) What are some of the limitations of the correlation model?
An association model generally has to search for frequent itemsets. This can generate a very large number of candidate itemsets, and the database must be scanned repeatedly to compute the support of every candidate, so the computational cost is high. The model also does little to analyze rare items, since itemsets that fall below the support threshold are dropped.
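To give a feel for how quickly the candidate sets can grow, here is a minimal sketch that counts the possible k-item candidates over a hypothetical catalog of 100 items. The function name and the catalog size are illustrative assumptions, not taken from the chapter. The counts are upper bounds: a level-wise search such as Apriori prunes candidates whose subsets are not frequent, so it usually examines fewer.

```python
from math import comb

def candidate_counts(n_items, max_size):
    """Upper bound on the number of k-item candidate itemsets for k = 1..max_size."""
    return {k: comb(n_items, k) for k in range(1, max_size + 1)}

# Hypothetical catalog of 100 distinct items (an assumption for illustration).
counts = candidate_counts(100, 3)

# Every non-empty subset of the items is a possible itemset.
total_itemsets = 2 ** 100 - 1

for size, count in counts.items():
    print(f"{size}-item candidates: {count}")
print(f"all possible itemsets: {total_itemsets}")
```

Each candidate's support has to be counted against the transactions, which is why the repeated database scans become expensive.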
2) What is the correlation coefficient? How is it interpreted?
A correlation coefficient is a statistical measure of how closely two variables are related. It is calculated by the product-moment method: each variable's deviation from its mean is taken, and the two deviations are multiplied together, so that the products reflect the degree of correlation between the two variables. The focus here is on the simple linear correlation coefficient.
The correlation coefficient ρ takes values between -1 and 1. When ρ = 0, X and Y are said to be uncorrelated. When |ρ| = 1, X and Y are said to be perfectly correlated, and there is an exact linear functional relationship between them. When 0 < |ρ| < 1, a change in X accounts for only part of the change in Y; the larger the absolute value of ρ, the larger the share of the change in Y that is associated with the change in X. |ρ| > 0.8 is called highly correlated, |ρ| < 0.3 is called low correlation, and values in between are moderately correlated.
However, the correlation coefficient has a notable weakness: how close it comes to 1 depends on the number of data pairs n, which can easily give a false impression. When n is small, the sample correlation coefficient fluctuates widely and its absolute value can easily come close to 1; when n is large, its absolute value tends to be smaller. In particular, when n = 2 the absolute value of the correlation coefficient is always 1 (whenever it is defined). Therefore, when the sample size n is small, it is inappropriate to conclude from the correlation coefficient alone that X and Y have a close linear relationship.
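The n = 2 case can be checked directly. The sketch below computes the product-moment coefficient by hand; the function name pearson_r and all data values are made up for illustration and do not come from the chapter's data set.

```python
import math

def pearson_r(xs, ys):
    """Sample product-moment (Pearson) correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations of X from its mean
    dy = [y - mean_y for y in ys]          # deviations of Y from its mean
    sum_products = sum(a * b for a, b in zip(dx, dy))
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)
    if sum_sq_x == 0 or sum_sq_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)

# Two points always lie on a straight line, so |r| is 1 regardless of the values.
r_two = pearson_r([1.0, 4.0], [10.0, 3.0])

# Six made-up pairs with a looser relationship give |r| below 1.
r_six = pearson_r([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5])

print(r_two, r_six)
```

A coefficient of magnitude 1 from two observations therefore says nothing about how X and Y are actually related.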
3) What is the difference between positive and negative associations? If the values of two attributes decline at essentially the same rate, is that a negative association? Why?
Positive and negative associations are loosely analogous to direct and inverse proportion, except that the relationship need not be exactly linear; only the general trend has to match. In a positive correlation, one variable increases as the other increases; in a negative correlation, one variable decreases as the other increases. If the values of two attributes fall at the same rate, that is not a negative association: the two attributes move in the same direction, so the association is positive.
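As a quick check of the last point, the sketch below uses the sign of the sample covariance, which always matches the sign of the correlation coefficient. The three series are invented for illustration.

```python
def covariance(xs, ys):
    """Sample covariance; its sign is the sign of the correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Made-up attribute values: two that fall together, and one that rises.
falling_a = [10, 8, 6, 4, 2]
falling_b = [50, 41, 29, 20, 11]
rising_c = [1, 3, 5, 7, 9]

both_fall = covariance(falling_a, falling_b)   # same direction
opposite = covariance(rising_c, falling_b)     # opposite directions

print("fall together:", "positive" if both_fall > 0 else "negative")
print("move apart:   ", "positive" if opposite > 0 else "negative")
```

Two attributes that decline together have deviations of the same sign at each observation, so their products are positive and so is the association.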
4) How is association strength measured? What range does correlation strength fall in?
Association strength is the degree of similarity between two subjects, and is generally characterized by the number of times they are cited together. The greater the association strength, the higher the similarity and the smaller the "distance" between them. A common similarity measure is the Pearson coefficient, which is suited to normally distributed data. The parameter r, which describes the strength of the linear correlation between two variables, lies in the range [-1, 1].
5) It has been suggested that the number of heating-fuel-consuming devices is an attribute that may be relevant and could be added to this chapter's sample data set. Can you think of other attributes? Why might they be relevant? Do you think the attributes you suggest might be correlated with other attributes in the data set? If you knew there was a correlation between them, how would that help?
Besides the attributes in the example, the average time that family members spend indoors is probably also related to the demand for heating fuel. It directly affects how long the room temperature must be maintained and therefore how much heating fuel is consumed: the longer family members are indoors on average, the greater the demand for heating fuel. Knowing this would let Sarah's company target its customers more precisely.
Association Rules
1) What are association rules? What are their uses?
Association rules are knowledge patterns that describe regularities among the items that occur together in transactions. More specifically, an association rule quantifies how the appearance of item X affects the appearance of item Y. Association rules can be used in market basket analysis, cross-selling, product catalog design, loss-leader analysis, clustering, classification, and other areas.
2) What are the two main indicators calculated in association rules and how are they calculated?
(1) The support of the rule X->Y in the transaction data set D measures the importance of the association rule. It reflects whether the association is a general pattern, that is, how much of the data the rule represents: the frequency with which X and Y appear together across all transactions. It is written Support(X->Y). Calculation method: the ratio of the number of transactions that contain both X and Y to the total number of transactions: Support(X->Y) = P(X∪Y) = |{T : X∪Y ⊆ T, T ∈ D}| / |D| × 100% (where |D| is the number of transactions in the transaction data set D).
(2) The confidence of the rule X->Y in the transaction data set D measures the accuracy of the association rule, that is, its strength. It is the frequency with which Y appears among the transactions that contain X, and so indicates how reliably X implies Y. It is written Confidence(X->Y).
Calculation method: the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X: Confidence(X->Y) = P(Y|X) = |{T : X∪Y ⊆ T, T ∈ D}| / |{T : X ⊆ T, T ∈ D}| × 100%
Association rules that satisfy both the minimum support threshold and the minimum confidence threshold are called strong association rules.
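The two formulas can be sketched in a few lines. The transactions, the item names, and the thresholds below are invented for illustration and are not the chapter's data; the helper names are likewise hypothetical.

```python
def count_containing(transactions, itemset):
    """Number of transactions T in D with itemset ⊆ T."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def support(transactions, x, y):
    """Support(X->Y): share of all transactions that contain both X and Y."""
    return count_containing(transactions, set(x) | set(y)) / len(transactions)

def confidence(transactions, x, y):
    """Confidence(X->Y): share of the transactions containing X that also contain Y."""
    with_x = count_containing(transactions, x)
    if with_x == 0:
        raise ValueError("X never occurs, so the confidence is undefined")
    return count_containing(transactions, set(x) | set(y)) / with_x

# Made-up transaction data set D with five transactions.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

sup = support(D, {"bread"}, {"milk"})      # 3 of 5 transactions
conf = confidence(D, {"bread"}, {"milk"})  # 3 of the 4 transactions with bread

# Hypothetical minimum thresholds; a rule is strong only if it meets both.
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7
is_strong = sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE

print(f"support={sup:.1%} confidence={conf:.1%} strong={is_strong}")
```

Working from counts rather than dividing one support by another keeps the arithmetic exact for small examples like this one.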
3) What data type must a data set's attributes have in order to use the frequent pattern operators in RapidMiner?
They must be binominal (two-valued) data.
4) How should the rule results be interpreted? In the example in this chapter, what is the strongest rule? How do we know?
The support and confidence of each associated pair of items can be read from the result set. In this chapter's example, the rule with the highest association strength is religious->rule, with a support of 0.239 and a confidence of 0.796.