Commodity Correlation Analysis
Association
Relevance: mainly used for Internet content and documents, for example the relationships between documents in search engine algorithms.
Association: used for physical items, for example the correlation between goods on an e-commerce website.
Support: the probability that several specific items appear together in the dataset.
For example, if beer and diapers appear together in 50 out of 1,000 transactions, the support of the association is 50/1,000 = 5%.
Confidence: the probability that B occurs given that A is already present in the dataset. The formula is Confidence(A→B) = P(A and B) / P(A).
Suppose 10,000 people make purchases: 1,000 buy product A, 2,000 buy product B, and 800 buy both A and B.
Support is the proportion of people who bought the associated products (here, products A and B) together out of the total number of buyers, i.e. 800/10,000 = 8%; 8% of users bought both A and B.
Confidence is the probability of buying one product given that the other has already been bought. For example, the confidence of buying B after buying A is 800/1,000 = 80%; that is, 80% of users who bought A also bought B.
Lift is the ratio of the probability of buying B given that A was bought to the probability of buying B with no condition at all. The unconditional probability of buying B is 2,000/10,000 = 20%, so the lift is 80%/20% = 4.
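To make the arithmetic above concrete, here is a minimal sketch of the three measures in plain Java; the class and variable names are invented for this example, and the counts are simply the assumed figures from the paragraph above.

```java
public class AssociationExample {
    public static void main(String[] args) {
        // Counts from the example above (assumed figures, for illustration only).
        double total = 10_000;   // all buyers
        double countA = 1_000;   // bought product A
        double countB = 2_000;   // bought product B
        double countAB = 800;    // bought both A and B

        // Support: fraction of all transactions that contain both A and B.
        double support = countAB / total;            // 0.08 -> 8%

        // Confidence of A -> B: P(A and B) / P(A).
        double confidence = countAB / countA;        // 0.8 -> 80%

        // Lift: confidence of A -> B divided by the unconditional probability of B.
        double lift = confidence / (countB / total); // 0.8 / 0.2 = 4.0

        System.out.printf("support=%.2f confidence=%.2f lift=%.1f%n",
                support, confidence, lift);
    }
}
```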
Data association is an important kind of discoverable knowledge in a database. If there is some regularity between the values of two or more variables, we call it an association. Associations can be divided into simple associations, temporal associations, causal associations, and so on. The purpose of association analysis is to find the hidden association network in the database. The association function of the data is often unknown, or known but uncertain, which is why the rules produced by association analysis carry a confidence measure.
A typical example of association rule mining is market basket analysis (MBA, Market Basket Analysis). Association rule research helps identify relationships between different items in a transaction database and reveals customer purchase behavior patterns, such as the effect of purchasing one commodity on the purchase of other commodities. The results can be applied to shelf layout, inventory arrangement, and user segmentation by purchase pattern.
The discovery process of association rules can be divided into the following two steps:
The first step is to find all frequent itemsets, requiring the support of each frequent itemset to be no lower than the user-defined minimum support;
The second step is to construct, from the frequent itemsets, rules whose confidence is no lower than the user-defined minimum confidence. Identifying all frequent itemsets is the core of an association rule discovery algorithm, and also its most computationally intensive part.
Minimum support (min-support) and minimum confidence level (min-confidence)
Support and confidence are the two most important thresholds for describing association rules. The frequency with which an itemset appears is called its support, which reflects the importance of the association rule in the database. The reliability of an association rule is measured by its confidence. A rule that satisfies both the minimum support (min-support) and the minimum confidence (min-confidence) is called a strong association rule.
Association rule data mining phases
The first phase must identify all frequent itemsets (large itemsets) from the original data set. Frequent here means that the frequency of a given itemset, relative to all records, must reach a certain level. Taking a 2-itemset containing items A and B as an example, we can obtain the support of the itemset {A, B}; if that support is greater than or equal to the minimum support threshold, then {A, B} is a frequent itemset. A k-itemset that satisfies the minimum support is called a frequent k-itemset, generally written as Large k or Frequent k. The algorithm then tries to generate itemsets of length k+1 (Large k+1) from the frequent k-itemsets (Large k), until no longer frequent itemsets can be found.
The second phase of association rule mining is to generate association rules. Rules are generated from the frequent k-itemsets found in the previous step; under the minimum confidence threshold, if the confidence of a rule satisfies the minimum confidence, the rule is called an association rule.
For example, consider the rule A→B generated from the frequent 2-itemset {A, B}: if its confidence is greater than or equal to the minimum confidence, then A→B is an association rule.
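As a small illustration of this second phase, the sketch below takes a frequent 2-itemset {A, B} with assumed support counts and keeps only the candidate rules whose confidence reaches a chosen minimum; the class, method, and numbers are made up for the example.

```java
import java.util.Map;

public class RuleGeneration {
    public static void main(String[] args) {
        // Support counts from an assumed transaction database (illustrative numbers).
        Map<String, Integer> itemSupport = Map.of("A", 1000, "B", 2000);
        int supportAB = 800;       // transactions containing both A and B
        double minConfidence = 0.65;

        // For the frequent 2-itemset {A, B}, test both candidate rules.
        checkRule("A", "B", supportAB, itemSupport.get("A"), minConfidence); // A -> B
        checkRule("B", "A", supportAB, itemSupport.get("B"), minConfidence); // B -> A
    }

    static void checkRule(String lhs, String rhs, int supportBoth, int supportLhs,
                          double minConfidence) {
        double confidence = (double) supportBoth / supportLhs;
        if (confidence >= minConfidence) {
            System.out.printf("%s -> %s accepted (confidence %.2f)%n", lhs, rhs, confidence);
        } else {
            System.out.printf("%s -> %s rejected (confidence %.2f)%n", lhs, rhs, confidence);
        }
    }
}
```

With these numbers, A→B (confidence 0.8) is accepted while B→A (confidence 0.4) is rejected, which is why the direction of a rule matters.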
In the "beer and diapers" case, to mine a transaction database with association rule techniques we must first set the two thresholds, minimum support and minimum confidence; assume min-support = 5% and min-confidence = 65%. A candidate association rule then has to satisfy both conditions. If the mined rule {diapers, beer} meets the following conditions, it is accepted as an association rule. This can be written as:
Support (diapers, beer) ≥5% and Confidence (diapers, beer) ≥65%.
Here, Support(diapers, beer) ≥ 5% means that, among all transactions, at least 5% show diapers and beer being bought together. Confidence(diapers, beer) ≥ 65% means that, among all transactions containing diapers, at least 65% also contain beer.
Therefore, if a consumer buys diapers in the future, we can recommend that they buy beer at the same time. This recommendation is based on the {diapers, beer} association rule, because past transactions show that most transactions that include diapers also include beer.
As can be seen from the introduction above, association rule mining is usually better suited to records whose attributes take discrete values.
If the attribute values in the original database are continuous, the data should be discretized appropriately before association rule mining (in effect, mapping each interval of values to a single value). Discretization is an important step before mining, and how reasonably it is done directly affects the quality of the mined association rules.
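As a minimal sketch of such a discretization step, the snippet below maps a continuous attribute (an assumed "purchase amount") onto interval labels that can then be treated as discrete items; the bin boundaries are arbitrary and only for illustration.

```java
public class Discretization {
    /** Map a continuous purchase amount to a discrete interval label (illustrative bins). */
    static String bin(double amount) {
        if (amount < 50)  return "low";     // [0, 50)
        if (amount < 200) return "medium";  // [50, 200)
        return "high";                      // [200, +inf)
    }

    public static void main(String[] args) {
        double[] amounts = {12.5, 75.0, 320.0};
        for (double a : amounts) {
            // Each continuous value becomes a categorical item usable in association rule mining.
            System.out.printf("%.1f -> amount=%s%n", a, bin(a));
        }
    }
}
```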
FP-Growth algorithm
The FP-Growth (Frequent Pattern Growth) algorithm is an association analysis algorithm proposed by Jiawei Han in 2000. It uses a divide-and-conquer strategy: it compresses the database that provides the frequent itemsets into a frequent pattern tree (FP-tree) while still retaining the itemset association information. The two biggest differences from the Apriori algorithm are: first, it does not generate candidate sets; second, it only needs to scan the database twice, which greatly improves efficiency.
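To make the "two scans of the database" point concrete, here is a hedged sketch of just the FP-tree construction half of FP-Growth; the sample transactions, the minimum support, and the class names are invented, and the conditional-pattern-base mining step is omitted.

```java
import java.util.*;

public class FpTreeSketch {
    /** A node of the FP-tree: an item, a count, a parent link and child links. */
    static class Node {
        final String item;
        int count;
        final Node parent;
        final Map<String, Node> children = new HashMap<>();
        Node(String item, Node parent) { this.item = item; this.parent = parent; }
    }

    public static void main(String[] args) {
        List<List<String>> transactions = List.of(
            List.of("beer", "diaper", "bread"),
            List.of("beer", "diaper"),
            List.of("bread", "milk"),
            List.of("beer", "diaper", "milk"));
        int minSupport = 2;

        // Scan 1: count the support of every single item.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> t : transactions)
            for (String item : t) counts.merge(item, 1, Integer::sum);

        // Scan 2: insert each transaction into the tree, keeping only frequent
        // items and ordering them by descending support so shared prefixes compress.
        Node root = new Node(null, null);
        for (List<String> t : transactions) {
            List<String> filtered = new ArrayList<>();
            for (String item : t)
                if (counts.get(item) >= minSupport) filtered.add(item);
            filtered.sort((x, y) -> {
                int c = counts.get(y) - counts.get(x); // descending support
                return c != 0 ? c : x.compareTo(y);    // deterministic tie-break
            });
            Node node = root;
            for (String item : filtered) {
                Node parent = node;
                node = parent.children.computeIfAbsent(item, i -> new Node(i, parent));
                node.count++;
            }
        }
        System.out.println("Items under root: " + root.children.keySet());
    }
}
```

A full implementation would also keep a header table linking all tree nodes of each item; that table is what the omitted mining step walks to build conditional pattern bases.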
Apriori algorithm
If an itemset is infrequent, then all of its supersets must also be infrequent.
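The sketch below applies exactly this pruning property (often called downward closure): a candidate k-itemset is discarded as soon as any of its (k-1)-subsets is not frequent. The candidate itemsets and the set of frequent 2-itemsets are invented for the example.

```java
import java.util.*;

public class AprioriPruning {
    /** A candidate can only be frequent if every (k-1)-subset of it is frequent. */
    static boolean survivesPruning(Set<String> candidate, Set<Set<String>> frequentKMinus1) {
        for (String item : candidate) {
            Set<String> subset = new HashSet<>(candidate);
            subset.remove(item);                  // one (k-1)-subset
            if (!frequentKMinus1.contains(subset)) {
                return false;                     // downward closure violated: prune
            }
        }
        return true;                              // worth counting its support
    }

    public static void main(String[] args) {
        // Frequent 2-itemsets found in an earlier pass (illustrative).
        Set<Set<String>> frequent2 = Set.of(
            Set.of("beer", "diaper"),
            Set.of("beer", "bread"),
            Set.of("diaper", "bread"),
            Set.of("diaper", "milk"));

        // {beer, diaper, bread}: all three 2-subsets are frequent -> keep it as a candidate.
        System.out.println(survivesPruning(Set.of("beer", "diaper", "bread"), frequent2));
        // {beer, diaper, milk}: {beer, milk} is not frequent -> prune it without counting.
        System.out.println(survivesPruning(Set.of("beer", "diaper", "milk"), frequent2));
    }
}
```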
Reference
An Association Algorithm for E-commerce Data Mining (I)
How to understand the Pearson correlation coefficient?
LLR
```java
private double doItemSimilarity(long itemID1, long itemID2, long preferring1, long numUsers)
    throws TasteException {
  DataModel dataModel = getDataModel();
  long preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
  if (preferring1and2 == 0) {
    return Double.NaN;
  }
  long preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
  double logLikelihood = LogLikelihood.logLikelihoodRatio(
      preferring1and2,
      preferring2 - preferring1and2,
      preferring1 - preferring1and2,
      numUsers - preferring1 - preferring2 + preferring1and2);
  return 1.0 - 1.0 / (1.0 + logLikelihood);
}
```
```java
long preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
long preferring1 = dataModel.getNumUsersWithPreferenceFor(itemID1);
long preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
long numUsers = dataModel.getNumUsers();
```
k11: preferring1and2
k12: preferring2 - preferring1and2
k21: preferring1 - preferring1and2
k22: numUsers - preferring1 - preferring2 + preferring1and2
|                  | Event A | Everything but A |
|------------------|---------|------------------|
| Event B          | k11     | k12              |
| Everything but B | k21     | k22              |
LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
H = function(k) {N = sum(k); return (sum(k/N * log(k/N + (k==0))))}
```java
/**
 * Calculates the raw log-likelihood ratio for two events, call them A and B. Then we have:
 *
 *                     | Event A                 | Everything but A
 *   Event B           | A and B together (k_11) | B, but not A (k_12)
 *   Everything but B  | A without B (k_21)      | Neither A nor B (k_22)
 *
 * @param k11 the number of times the two events occurred together
 * @param k12 the number of times the second event occurred without the first event
 * @param k21 the number of times the first event occurred without the second event
 * @param k22 the number of times something else occurred (i.e. was neither of these events)
 * @return the raw log-likelihood ratio
 *
 * Credits to http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
 * for the table and the descriptions.
 */
public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
  Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
  // Note that we have counts here, not probabilities, and that the entropy is not normalized.
  double rowEntropy = entropy(k11 + k12, k21 + k22);
  double columnEntropy = entropy(k11 + k21, k12 + k22);
  double matrixEntropy = entropy(k11, k12, k21, k22);
  if (rowEntropy + columnEntropy < matrixEntropy) {
    // round off error
    return 0.0;
  }
  return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
}
```
```java
/**
 * Merely an optimization for the common two-argument case of {@link #entropy(long...)}
 * @see #logLikelihoodRatio(long, long, long, long)
 */
private static double entropy(long a, long b) {
  return xLogX(a + b) - xLogX(a) - xLogX(b);
}
```
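Putting the pieces together, the sketch below feeds the A/B purchase counts from the earlier example (10,000 users, 1,000 bought A, 2,000 bought B, 800 bought both) into Mahout's LogLikelihood.logLikelihoodRatio and then applies the same 1 - 1/(1 + LLR) squashing used in doItemSimilarity; it assumes the mahout-math library is on the classpath, and the wrapper class is invented for the example.

```java
import org.apache.mahout.math.stats.LogLikelihood;

public class LlrExample {
    public static void main(String[] args) {
        // Counts taken from the A/B purchase example earlier in this post:
        // 10,000 users, 1,000 bought A, 2,000 bought B, 800 bought both.
        long k11 = 800;                          // A and B together
        long k12 = 2000 - 800;                   // B but not A
        long k21 = 1000 - 800;                   // A but not B
        long k22 = 10000 - 1000 - 2000 + 800;    // neither A nor B

        double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);

        // Same transformation as doItemSimilarity: squash the ratio into (0, 1).
        double similarity = 1.0 - 1.0 / (1.0 + llr);
        System.out.printf("LLR = %.2f, similarity = %.4f%n", llr, similarity);
    }
}
```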
Information entropy measures how chaotic or dispersed a distribution is. The more dispersed (or the more uniform) the distribution, the larger the information entropy; the more ordered (or the more concentrated) the distribution, the smaller the information entropy.
Information retrieval
Entropy (information theory)
Reference
Mahout Recommender Documentation: Non-distributed
Similarity measurement in machine learning
Mahout on Spark: What's New in Recommenders
Mahout on Spark: What's New in Recommenders, Part 2
Intro to Cooccurrence recommenders with Spark
Mahout: Scala & Spark Bindings
Surprise and coincidence
How to create an App using Mahout
FAQ for using Mahout with Spark
Mahout on Spark: What's New in Recommenders, Part 2
Here similar means that they were liked by the same people. We'll use another technique to narrow the items down to ones of the same genre later.
Intro to Cooccurrence recommenders with Spark
RP = Recommendations for a given user
HP = History of purchases for a given user
A = The matrix of all purchases by all users
RP = [A^TA]HP
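A tiny numeric sketch of this formula with made-up data: A is a user-by-item purchase matrix, A^T A is the item cooccurrence matrix, and multiplying it by a user's history HP scores every item for that user; the plain-array arithmetic here is only for illustration.

```java
public class CooccurrenceSketch {
    public static void main(String[] args) {
        // A: 3 users x 4 items, 1 = purchased (illustrative data).
        int[][] a = {
            {1, 1, 0, 0},
            {1, 1, 1, 0},
            {0, 1, 0, 1},
        };
        int users = a.length, items = a[0].length;

        // A^T A: item-by-item cooccurrence counts.
        int[][] ata = new int[items][items];
        for (int i = 0; i < items; i++)
            for (int j = 0; j < items; j++)
                for (int u = 0; u < users; u++)
                    ata[i][j] += a[u][i] * a[u][j];

        // HP: history of one user who has bought only item 0.
        int[] hp = {1, 0, 0, 0};

        // RP = [A^T A] HP: a score for every item given that history.
        int[] rp = new int[items];
        for (int i = 0; i < items; i++)
            for (int j = 0; j < items; j++)
                rp[i] += ata[i][j] * hp[j];

        // Items that cooccur with item 0 get the highest scores.
        System.out.println(java.util.Arrays.toString(rp));
    }
}
```

In a real system the diagonal of A^T A (each item co-occurring with itself) would be ignored, and, as described next, the raw counts would be replaced by LLR weights.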
This would produce reasonable recommendations, but is subject to skewed results due to the dominance of popular items. To avoid this, we can apply a weighting called the log likelihood ratio (LLR), which is a probabilistic measure of the importance of a cooccurrence.
The magnitude of the value in the matrix determines the strength of similarity of the row item to the column item. We can use the LLR weights as a similarity measure that's nicely immune to unimportant similarities.
ItemSimilarityDriver
Creating the indicator matrix [AtA] is the core of this type of recommender. We have a quick, flexible way to create this using text log files, producing output in a form that's easy to digest. The job of data prep is greatly streamlined in the Mahout 1.0 snapshot. In the past a user would have to do all the data prep themselves: translating their own user and item IDs into Mahout IDs, putting the data into text files, one element per line, and feeding them to the recommender. On the other end you'd get a Hadoop binary file called a sequence file, and you'd have to translate the Mahout IDs into something your application could understand. No more.
Part 4: Tuning Your Recommender
Two improvements
MAP
https://www.kaggle.com/wiki/MeanAveragePrecision
What you wanted to know about Mean Average Precision
"Mahout on Spark + elastic Search Build Item recommendation System"