Collaborative Filtering in Recommender Systems


Reposted from CSDN; it is very close to our project.

Collaborative filtering (CF) can fairly be called the standard algorithm for recommender systems.

Any discussion of recommendation inevitably turns to collaborative filtering, so today we share some practical experience with KNN-based collaborative filtering in a production recommender.

We start with the two assumptions behind collaborative filtering.

Two assumptions:

    1. Users generally like items that are similar to their favorite items
    2. Users tend to like items liked by other users who are similar to themselves

The above assumptions correspond to the two implementations of collaborative filtering: item-based similarity (ITEM_CF) and user-based similarity (USER_CF).

Therefore, the most essential task in implementing collaborative filtering is computing similarity, either between items or between users. To compute similarity, we first need a representation on which a similarity measure can be defined.

Input data:

The input to collaborative filtering is a user-item rating matrix, for example:
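For illustration only (a made-up toy example, not data from the original post), a rating matrix for three users and four items might look like this, with blanks meaning "unrated":

            i1   i2   i3   i4
    u1       5         3    1
    u2            4         2
    u3       2    1    5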

There are no additional descriptive features (side information) for items or users in this input data.

So items and users can only be expressed in terms of each other: an item is described by the ratings all users gave it, and a user is described by the ratings he or she gave to all items. These correspond to the column vectors and row vectors of the matrix, respectively.

With that in place, we only need to compute the similarity between these vectors.

Calculation of similarity:

Taking item similarity as an example, we generally use the following formula, a smoothed cosine restricted to co-rating users:

    w_ij = [ sum_{u in U(i,j)} r_ui * r_uj ] / [ lambda + sqrt(sum_{u in U(i,j)} r_ui^2) * sqrt(sum_{u in U(i,j)} r_uj^2) ]

where:

w_ij denotes the similarity of the two items labeled i and j

U(i,j) denotes the set of users who have rated both i and j

r_ui denotes user u's rating of item i

lambda is a smoothing (penalty) parameter

Of course, there are many options for the similarity measure, such as directly computing the cosine of the angle between the two vectors. In our experiments, however, the formula above performed slightly better in the actual recommendation application.
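A minimal single-machine sketch of this similarity computation, assuming the ratings sit in a dense numpy array R of shape (n_users, n_items) with 0 meaning "unrated"; the function name, the lam default, and the dense representation are our own illustrative choices, not from the original post:

    import numpy as np

    def item_similarity(R, lam=10.0):
        """Smoothed cosine similarity between the item columns of R.

        R: (n_users, n_items) rating matrix, 0 = unrated.
        lam: the smoothing (penalty) parameter from the formula above.
        """
        n_items = R.shape[1]
        W = np.zeros((n_items, n_items))
        for i in range(n_items):
            for j in range(n_items):
                # U(i, j): users who rated both item i and item j
                both = (R[:, i] > 0) & (R[:, j] > 0)
                ri, rj = R[both, i], R[both, j]
                num = np.dot(ri, rj)
                den = lam + np.sqrt(np.sum(ri ** 2)) * np.sqrt(np.sum(rj ** 2))
                W[i, j] = num / den
        return W

The double loop makes the O(m*m*n) cost discussed below explicit; a production version would exploit the sparsity of the data.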

For user similarity, simply transpose the input matrix and apply the same procedure.

Score prediction process:

Still taking ITEM_CF as an example: once the item similarities are available, a matrix multiplication fills in predicted values for the blanks in the input data.

Regard the input matrix as an n*m matrix, recorded as A, with n the number of users and m the number of items. The computed similarity matrix is an m*m matrix, recorded as B.

Predicting scores is then the process of computing A*B, and the result is again an n*m matrix.

The specific formula is as follows:

    p_ui = b_i + sum_j w_ji * r_uj

where b_i is the popularity (bias) of the song itself; it can be computed separately in other ways.
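A sketch of the prediction step under the same assumptions as the previous sketch; the popularity term b is computed here as a simple mean observed rating purely for illustration, since the original post only says it can be computed in other ways:

    import numpy as np

    def predict_scores(R, W):
        """Predict all user-item scores as the matrix product A*B plus a bias.

        R: (n_users, n_items) rating matrix (A in the text), 0 = unrated.
        W: (n_items, n_items) item similarity matrix (B in the text).
        """
        # b_i: item popularity bias; here simply the mean observed rating
        counts = np.maximum((R > 0).sum(axis=0), 1)
        b = R.sum(axis=0) / counts
        # p_ui = b_i + sum_j r_uj * w_ji, i.e. one row of A times B
        return b[np.newaxis, :] + R @ W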

Complexity of similarity calculation:

Complexity of the item similarity calculation: with n users and m items in the input matrix A (n*m), computing the m*m similarity matrix requires O = m*m*n*d operations, where d is the density of the data.

Suppose n = 2,000,000 and m = 50,000, with a data density of 1%: then O = 5x10^13, i.e. 50 trillion operations. Fortunately, the process is easily distributed over data on HDFS using MapReduce.

Assuming the job runs on 1,000 compute nodes, the complexity per node is O1 = 5x10^10, i.e. 50 billion operations.

Complexity of the user similarity calculation: O = n*n*m*d, which for the same figures is 2x10^15, i.e. 2,000 trillion operations in total. Still assuming 1,000 compute nodes, the complexity per node is O1 = 2x10^12, i.e. 2 trillion operations.
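Spelled out in pure Python, just to make the orders of magnitude concrete (figures taken from the text above):

    n, m, d, nodes = 2_000_000, 50_000, 0.01, 1_000

    item_ops = m * m * n * d    # 5.0e13 -> 50 trillion in total
    user_ops = n * n * m * d    # 2.0e15 -> 2,000 trillion in total
    print(item_ops / nodes)     # 5.0e10 -> 50 billion per node
    print(user_ops / nodes)     # 2.0e12 -> 2 trillion per node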

Two implementations of similarity calculation:

In general, the number of items is relatively small and fixed, while the number of users is relatively large and may change from day to day.

The practical strategies used in the computation therefore differ as well; in general there are the following two:

1. Inverted-index computation:

This approach is particularly useful when the number of users is large, the number of items is small, and each user has rated relatively few items.

The specific approach is:

In the MapReduce map phase, each user's rated items are combined pairwise and emitted as <left, right, leftscore, rightscore> tuples, with left as the partition key and left+right as the sort key.

In the reduce phase, the data emitted by the map is scanned once to obtain the similarities of all item pairs.

(Figure: the detailed Hadoop workflow for the inverted-index computation.)

The advantage of this approach is that there is no computational overhead for pairs of items with no correlation, where "no correlation" means that not a single user has rated both items.

The downside is that if some users have a lot of rating data, the emitted pairs become huge, causing significant IO overhead between the map and the reduce stages.
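A minimal single-machine sketch of this pairing scheme, with plain Python functions standing in for Hadoop's map and reduce; the data layout (a dict of {item: score} per user) and the lam value are our own illustrative assumptions:

    from collections import defaultdict
    from itertools import combinations
    from math import sqrt

    def map_phase(user_ratings):
        """Emit <left, right, leftscore, rightscore> for each item pair a user rated."""
        for ratings in user_ratings.values():         # ratings: {item: score}
            for (li, ls), (ri, rs) in combinations(sorted(ratings.items()), 2):
                yield (li, ri), (ls, rs)              # (left, right) acts as the key

    def reduce_phase(pairs):
        """One scan over the mapped pairs accumulates every item pair's similarity."""
        num, den_l, den_r = defaultdict(float), defaultdict(float), defaultdict(float)
        for key, (ls, rs) in pairs:
            num[key] += ls * rs
            den_l[key] += ls * ls
            den_r[key] += rs * rs
        lam = 10.0                                    # smoothing parameter
        return {k: num[k] / (lam + sqrt(den_l[k]) * sqrt(den_r[k])) for k in num}

Note that a user who rated k items emits k*(k-1)/2 pairs, which is exactly the IO blow-up described above.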

2. Matrix chunking calculation:

This method suits the computation of user similarity, where the number of users is much larger than the number of items.

Specific approach: split the rating matrix M into small blocks; each small block is multiplied against the transpose of the original matrix in a matrix-multiplication-like operation (applying the similarity formula above rather than the plain inner product); finally, the partial results are merged.

(Figure: the detailed Hadoop workflow for the matrix chunking computation.)

The advantage of this approach is that it avoids large amounts of caching between map and reduce; in fact it needs no reduce phase at all.

The downside is that transpose(M) must be cached on every Hadoop compute node before the task starts computing, which hurts job startup efficiency when the matrix M is large.
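A numpy sketch of the chunked computation; for brevity the denominators here use full per-user norms rather than restricting to co-rated items, which is a simplification of the similarity formula given earlier:

    import numpy as np

    def user_similarity_chunked(M, lam=10.0, block=1000):
        """User-user similarity computed block by block against transpose(M).

        M: (n_users, n_items) rating matrix. In the Hadoop version, the
        transposed matrix is what every compute node caches up front.
        """
        n = M.shape[0]
        norms = np.sqrt((M ** 2).sum(axis=1))         # per-user rating norms
        out = np.zeros((n, n))
        for start in range(0, n, block):
            blk = M[start:start + block]              # one small block of users
            num = blk @ M.T                           # dot products vs. all users
            den = lam + np.outer(norms[start:start + block], norms)
            out[start:start + block] = num / den      # merge the partial result
        return out

Because each block is independent, every map task can process one block on its own, matching the no-reduce-phase property described above.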

Both methods have their pros and cons; computing the similarity between users or items exactly requires choosing the appropriate method according to the characteristics of the actual data.

However, as the business grows, the rating matrix becomes ever larger and denser.

Beyond a certain scale, neither approach can compute the similarity between users or items exactly.

Moreover, considering the actual recommendation scenario, an exact similarity computation is not necessary once there is enough data.

Calculation of similarity based on Simhash:

When the data volume is very large, it is often enough to obtain an approximate solution close to the optimum, and the same holds for similarity computation.

Computing the similarity between users or items based on Simhash is a common technique in recommendation.

The method works mainly because of two points: 1. the randomness of the hash, and 2. a sufficient amount of data.

These two points determine whether the similarity results obtained by the Simhash calculation are reliable.

As for the detailed principle of Simhash, this article does not go into it; unfamiliar readers can refer to write-ups of the Simhash algorithm or other related material.

Below is mainly an introduction to our practice in actual work:

First, the choice of hash function:

I have tried two hash functions in actual work:

1. djbx33a, which maps a string to a 64-bit unsigned integer. For details, see any reference DJBX33A hash function implementation.

2. MD5, which maps a string to a 128-bit stream; in actual work I represent it with char[16].
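A sketch of the first option, DJBX33A ("times 33 with addition") truncated to a 64-bit unsigned integer; the Python rendering is ours:

    def djbx33a_64(s):
        """DJBX33A: start from 5381, then h = h * 33 + byte, kept to 64 bits."""
        h = 5381
        for byte in s.encode("utf-8"):
            h = (h * 33 + byte) & 0xFFFFFFFFFFFFFFFF  # keep 64 unsigned bits
        return h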

Hamming distance and similarity:

After computing the hashsign of each object, the Hamming distance between two hashsigns can be obtained. Assuming the final sign is n bits, the Hamming distance is at most n and at least 0, corresponding to similarities of -1 and 1 respectively.

When the Hamming distance is n/2, the similarity between the objects behind the two hashsigns is 0, which can also be understood as an angle of pi/2.

The Hamming distance and the similarity can be converted with the following formula:

    sim(i,j) = cos(pi * H_ij / n)

where n is the sign dimension and H_ij is the Hamming distance between i and j.

Calculation of Hamming distance:

1. If the hashcode is at most 64 bits, the computed hashsign can be represented with a built-in machine type such as unsigned long.

XOR the hashsigns of the two objects bitwise; the number of bits set to 1 in the result is the Hamming distance.

2. If the hashcode is longer than 64 bits, as with MD5, a composite structure is needed to represent it; we can use char[16] to represent the 128 bits.

The Hamming distance is then computed by using two unsigned longs to hold the high 64 bits and the low 64 bits.
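A sketch covering both cases, plus the distance-to-similarity conversion from the formula above; the function names are ours, and bin(x).count("1") stands in for a hardware popcount:

    from math import cos, pi

    def hamming64(a, b):
        """Case 1: signatures fit in 64 bits; XOR, then count the 1 bits."""
        return bin(a ^ b).count("1")

    def hamming128(a_hi, a_lo, b_hi, b_lo):
        """Case 2: 128-bit signatures (e.g. MD5) split into 64-bit halves."""
        return hamming64(a_hi, b_hi) + hamming64(a_lo, b_lo)

    def similarity(h, n):
        """Map a Hamming distance in [0, n] to a similarity in [-1, 1]."""
        return cos(pi * h / n)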

Practice shows that the similarity computed via Simhash is almost identical to the true similarity.

Furthermore, in practical applications, recommendations based on Simhash are more robust.
