Guess You Like: An Introduction to the Principles of Recommendation Systems

Last Update:2015-07-05 Source: Internet

Author: User


Foreword

I recently built a recommendation system and gave a talk about it within the project team. Today I have some time, so I am going over the logic once more and putting the slide content into writing, which should make further study of recommendation systems easier. Recommendation systems are indeed extremely complex, and the road ahead is long.

A First Glance

Why a recommendation system is needed: information overload

As the Internet industry has developed, there are more and more ways to obtain information. People have shifted from actively seeking information to passively receiving it, and the volume of information is growing exponentially. For example, in the PC era Google Reader would often show thousands of unread blog updates, and today's WeChat official accounts show plenty of unread red dots. With more and more junk information, the cost for users of finding valuable information has risen sharply. To deal with this, I personally took a rather extreme approach: ignoring all push notification entry points outright. But in many situations, the speed at which useful information is obtained matters a great deal.

Because of this explosive growth, a natural demand emerged for getting information efficiently and in a targeted way. The recommendation system came into being.

Amazon's recommendation system

The earliest recommendation system was probably invented by Amazon, to increase how often users reach long-tail products. Data already suggests that the combined sales and profits of long-tail products are roughly on par with those of best-sellers. Amazon sells millions of items online, but the number of items that can be shown on the home page is extremely limited, so recommending products that users may like is important. Of course, product search is also a big piece of the pie; Amazon's product search has already begun to erode Google's core business.

On Amazon's product pages you can often see: "Customers who viewed this item also viewed."

This is a very typical recommendation system. As an aside, the rise of the "hand-chopping clan" (compulsive online shoppers) probably has something to do with recommendation systems, haha.

Recommender Systems and Big Data

Big data and cloud computing are very popular at the moment. Big data is a topic everyone talks about, whether colleagues in the industry or friends in other industries. It is like a hot topic in adolescence: "sex." None of us really understands it, but we all want to say a few words. The industry's use of big data is still at a fairly primitive, exploratory stage. I once heard the CEO of a genomics company say that human genes can now be fully exported as data, but no patterns have been found in that data yet, and they do not even know what to do with it. A recommendation system likewise uses user data to find patterns; it started relatively early and its use is more mature.

Cold start problem

A recommendation system needs data to support it. But when Amazon first started making recommendations, it did not have a large body of useful user behavior data. That is when it faced the "cold start" problem. Without user behavior data, the content data of the products themselves is used instead. This was the early approach of recommendation systems.

Content-based recommendations:

Tags: label each product with tags such as sports goods, FMCG, and so on. The finer the granularity, the more accurate the recommendations.
Keywords from the product name and description: extract keywords from the product's text description and use keyword similarity to make recommendations.
Same store: customers who bought a product from a store are recommended other best-selling items from that store.
Manual associations based on experience: a classic example is a store placing diapers next to the beer shelf. So can diapers also be recommended to people who buy beer online?

Because content is extremely complex, the rules can be extended indefinitely. Content-based recommendation does not depend on user behavior data, so it was a fairly reliable strategy in Amazon's early days. But precisely because content is so complex, it also produces many wrong recommendations. For example, Xiao Ming searches online for a Porsche scale model, and the recommendation system, going by keywords, recommends him a Porsche 911 worth 2 million ...
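To make the keyword approach concrete, here is a minimal sketch that scores products by the Jaccard similarity of keyword sets. The catalog, the keywords, and the function names are all made up for illustration; this is not how Amazon does it. It also shows how the Porsche mistake can happen: the real car shares a keyword with the query, so it still ends up in the result list.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical catalog: product name -> keywords extracted from its description.
catalog = {
    "Porsche 911 scale model": {"porsche", "911", "model", "toy", "collectible"},
    "Porsche 911 sports car": {"porsche", "911", "car", "coupe"},
    "Ferrari scale model": {"ferrari", "model", "toy", "collectible"},
}

def recommend_by_keywords(query_keywords, catalog, top_n=3):
    """Return up to top_n product names with non-zero keyword overlap, best first."""
    scored = [(jaccard(query_keywords, kws), name) for name, kws in catalog.items()]
    # Sort by descending score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_n] if score > 0]

# Xiao Ming searches for a "porsche model".
results = recommend_by_keywords({"porsche", "model"}, catalog)
```

The scale model ranks first, but the sports car is still recommended because keyword overlap alone cannot tell a toy from a car.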

User behavior data: what exactly is being recorded

In a game, our character is just a pile of complex data; how it is kept is called data storage, and the way it is organized is called a data structure. Similarly, in Amazon's eyes each of us is a large set of complex numbers in a table. For example:

Xiao Ming opens Amazon at 9 in the morning. He first browses the home page and clicks on a few best-selling suits, then types "Nike basketball shoes" into the search box. After browsing 8 pairs of sneakers and reading some buyer reviews, he finally picks the latest Air Jordan model.

This is typical user behavior data. Amazon splits this behavior into a set of data blocks, organizes them in a certain data structure, and stores them in its user behavior data warehouse. Large numbers of users produce such behavior data every day; the more data there is, the more can be done with it and the more powerful the results.

User-item preference matrix

Data is collected in order to analyze user preferences and form a user preference matrix. For example, during online shopping a user views, buys, and shares products. Because these behaviors are so varied, a weighting algorithm is needed to compute the user's preference for each product and form the user-item preference matrix.
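As a rough sketch of such a weighting scheme, the snippet below sums a weight per behavior into a user-item preference score. The weights (view 1, share 3, buy 5), the event format, and the names are assumptions chosen for illustration; a real system would tune them.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each behavior signals preference.
BEHAVIOR_WEIGHTS = {"view": 1.0, "share": 3.0, "buy": 5.0}

def build_preference_matrix(events, weights=BEHAVIOR_WEIGHTS):
    """events: iterable of (user, item, behavior) tuples.

    Returns a sparse matrix as {user: {item: score}}, where the score is the
    sum of the weights of all behaviors the user performed on the item.
    Unknown behaviors contribute 0.
    """
    prefs = defaultdict(lambda: defaultdict(float))
    for user, item, behavior in events:
        prefs[user][item] += weights.get(behavior, 0.0)
    return {user: dict(items) for user, items in prefs.items()}

events = [
    ("xiaoming", "air_jordan", "view"),
    ("xiaoming", "air_jordan", "view"),
    ("xiaoming", "air_jordan", "buy"),
    ("xiaoming", "suit", "view"),
    ("xiaohong", "suit", "share"),
]
prefs = build_preference_matrix(events)
```

Storing only the cells that have data keeps the matrix sparse, which matters once the number of items grows.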

Data cleanup

Once we start deliberately recording user behavior, the amount of user data gradually explodes. Just as a recording picks up noise, the collected user data contains a large amount of junk. So the first step after obtaining the data is to clean it. The core tasks are noise reduction and normalization:

Noise reduction: user behavior data is generated while users use the product, so it contains a lot of noise and accidental operations. For example, because of a network interruption a user may generate a large number of clicks in a short period. Strategies and data mining algorithms are used to remove such noise from the data.
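One very simple noise-reduction strategy for the repeated-click case is to drop clicks that follow too closely after the previous kept click by the same user on the same item. The sketch below assumes a hypothetical event format of (user, item, timestamp in seconds) and an arbitrary 2-second threshold.

```python
def drop_rapid_repeats(clicks, min_gap_seconds=2.0):
    """clicks: list of (user, item, timestamp_seconds).

    Processes clicks in time order and keeps a click only if the last kept
    click for the same (user, item) pair is at least min_gap_seconds older.
    """
    last_kept = {}
    cleaned = []
    for user, item, ts in sorted(clicks, key=lambda c: c[2]):
        key = (user, item)
        if key in last_kept and ts - last_kept[key] < min_gap_seconds:
            continue  # treat as an accidental repeat
        last_kept[key] = ts
        cleaned.append((user, item, ts))
    return cleaned

raw_clicks = [
    ("u1", "shoe", 0.0),
    ("u1", "shoe", 0.3),   # burst caused by a stalled page
    ("u1", "suit", 0.4),
    ("u1", "shoe", 0.6),
    ("u1", "shoe", 10.0),  # a genuine later click
]
cleaned_clicks = drop_rapid_repeats(raw_clicks)
```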

Normalization: the purpose of cleaning the data is to form a reasonable user preference matrix by weighting different behaviors. Users produce several kinds of behavior, and the value ranges of those behaviors can differ enormously. For example, the number of clicks may be far greater than the number of purchases; applying the weighting algorithm directly could let clicks influence the result too much. A normalization algorithm is therefore needed to keep the value ranges of the different behaviors roughly consistent. The simplest one divides each value of a given behavior by the maximum value of that behavior, which guarantees that all data falls within the range [0, 1].
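The divide-by-maximum rule can be sketched in a few lines. The counts below are invented; the point is that clicks and purchases end up on the same [0, 1] scale before any weights are applied.

```python
def normalize_by_max(counts):
    """Scale non-negative counts into [0, 1] by dividing by the largest value."""
    peak = max(counts.values(), default=0)
    if peak == 0:
        # Nothing to scale: every count is zero (or there are no counts).
        return {key: 0.0 for key in counts}
    return {key: value / peak for key, value in counts.items()}

# Hypothetical per-item counts for one user.
clicks = {"air_jordan": 40, "suit": 10, "socks": 0}
purchases = {"air_jordan": 2, "suit": 1, "socks": 0}

norm_clicks = normalize_by_max(clicks)
norm_purchases = normalize_by_max(purchases)
```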

Dimensionality reduction: SVD (singular value decomposition)

By recording user behavior data, we obtain a huge user preference matrix. As the number of items grows, so does the number of columns in this matrix, but any single user has behavior data for only a very limited number of items. As a result, this huge user preference matrix is actually quite sparse and contains very little effective data. SVD is used to address this problem.

As an illustration, imagine extracting features from a large number of items and abstracting them into 3 categories: vegetables, fruits, and casual wear. The sparse matrix is thereby reduced in dimension, which greatly cuts the amount of computation. This example only illustrates the idea behind SVD, though. A real implementation has no manual feature extraction step; the dimensionality reduction is done purely by mathematical means. Through repeated fitting of a matrix product and parameter tuning, the original huge sparse matrix is decomposed into smaller matrices whose product approximates the original matrix. This not only reduces the amount of computation but also fills in the empty cells of the original matrix.
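The "repeated fitting" described above can be sketched as a small matrix factorization trained by stochastic gradient descent. Note that this is the latent-factor approach often loosely called SVD in recommender systems, not a classical singular value decomposition. The 4x4 matrix, the number of factors k, the learning rate, and the regularization value are all assumptions picked for a toy example.

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit P (n_users x k) and Q (n_items x k) so that dot(P[u], Q[i]) ~ value.

    ratings: list of (user_index, item_index, value) for the known cells only.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted preference of user u for item i."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Toy preference matrix: users 0-1 like items 0-1, users 2-3 like items 2-3.
full = [
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
hidden = {(0, 1), (3, 2)}  # cells treated as empty
known = [(u, i, full[u][i]) for u in range(4) for i in range(4) if (u, i) not in hidden]

P, Q = factorize(known, n_users=4, n_items=4)
filled_0_1 = predict(P, Q, 0, 1)
filled_3_2 = predict(P, Q, 3, 2)
```

The two hidden cells are never seen during fitting, yet multiplying the factors back together yields a value for them, which is how the empty cells of the sparse matrix get filled.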

Collaborative filtering algorithm

I have been emphasizing user behavior data to pave the way for collaborative filtering. Collaborative filtering (CF) is widely used in today's recommender systems. With a collaborative filtering algorithm, two kinds of similarity can be computed: a user-user similarity matrix and an item-item similarity matrix.

Why is it called collaborative filtering? Because each of the two similarity matrices is computed through the other side: items are compared by the users who bought them, and users are compared by the items they bought. For example, if 100 users bought both item A and item B, the similarity of A and B in the item-item similarity matrix might be 0.8. If 1000 items were bought by both user C and user D, the similarity of C and D in the user-user similarity matrix might be 0.9. Both user-user and item-item similarity are computed from user behavior data.
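A minimal sketch of the item-item side: two items are similar when many of the same users bought both. The formula used here (shared buyers divided by the geometric mean of each item's buyer count) is one common choice and an assumption of this sketch, as is the purchase data. Swapping the roles of users and items gives user-user similarity.

```python
from itertools import combinations
from math import sqrt

def item_similarity(purchases):
    """purchases: {user: set of items bought}.

    Returns {(item_a, item_b): similarity} for every item pair, where
    similarity = |buyers of both| / sqrt(|buyers of a| * |buyers of b|).
    """
    buyers = {}
    for user, items in purchases.items():
        for item in items:
            buyers.setdefault(item, set()).add(user)
    sims = {}
    for a, b in combinations(sorted(buyers), 2):
        both = len(buyers[a] & buyers[b])
        sims[(a, b)] = both / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims

# Hypothetical purchase records.
purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B"},
    "u3": {"A", "C"},
    "u4": {"B"},
}
sims = item_similarity(purchases)
```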

There are several specific measures for computing similarity: Euclidean distance, the Pearson correlation coefficient, cosine similarity, and the Tanimoto coefficient. Interested readers can look up the details.
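For reference, the four measures can be written out directly from their textbook definitions. This sketch treats each user or item as a plain list of numbers of equal length; the function names are my own.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def pearson_correlation(x, y):
    """Linear correlation in [-1, 1]; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def tanimoto_coefficient(x, y):
    """dot / (|x|^2 + |y|^2 - dot); equals Jaccard similarity for 0/1 vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0
```

Euclidean distance is a distance rather than a similarity, so it is usually converted (for example with 1 / (1 + d)) before being placed in a similarity matrix.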

User Profile

Any mention of big data has to include the user profile (also called a user portrait). You often see companies advertising: "We hold behavior data for tens of thousands of users and have built highly valuable user profiles that can supply accurate user data to any app and help promote it." Such marketing does not withstand the slightest scrutiny. User behavior differs from one kind of app to another, and the resulting data differs greatly; for example, a user's behavior data on an e-commerce site is of almost no value to a music app. Much of the difficulty of a recommendation system lies in how hard it is to accumulate user profiles. In short, a user profile is closely tied to the business itself.

LR: logistic regression

Many machine learning algorithms have been built on top of the user preference matrix; here I introduce the idea behind LR. Logistic regression comes in linear and nonlinear variants. Other machine learning algorithms include k-means clustering, canopy clustering, and so on. Interested readers can look at July's article; the link is under "Read the original" at the end.

Logistic regression proceeds in three steps:

Extract the feature values.
Using the user preference matrix, obtain the weight of each feature through repeated fitting.
Predict how much a new user will like an item.

An example:

Xiao Ming has been on about a thousand blind dates, and we collected a large amount of behavioral data; the data below is only the tip of the iceberg.

After a great deal of fitting, the feature "how cheerful her personality is" gets a weight of 30%, and "looks" gets a weight of 70%. Alas, I have lost hope in this looks-obsessed world; once I finish this article I am going to book a ticket to Korea.

Then the fitted weights are used to predict how much Xiao Ming will like his 1001st blind date.

This is the principle of LR logistic regression. Interested readers can look up the specific mathematics.
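The three steps can be sketched with a tiny logistic regression trained by batch gradient descent. Everything here is hypothetical: the two features (cheerfulness and looks, each scaled to [0, 1]), the ten labeled dates, the learning rate, and the number of epochs. The fitted weights are raw coefficients, not the 30%/70% shares quoted above.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train_logistic(samples, lr=0.5, epochs=5000):
    """samples: list of (features, label) with label 0 or 1.

    Returns (weights, bias) fitted by batch gradient descent on log loss.
    """
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in samples:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b

def predict_like(x, w, b):
    """Estimated probability that Xiao Ming likes a date with features x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Step 1: extracted features (cheerfulness, looks) and whether he liked the date.
samples = [
    ((0.2, 0.9), 1), ((0.8, 0.6), 1), ((0.9, 0.9), 1), ((0.5, 0.7), 1), ((0.1, 0.7), 1),
    ((0.9, 0.2), 0), ((0.3, 0.4), 0), ((0.1, 0.1), 0), ((0.6, 0.3), 0), ((1.0, 0.1), 0),
]
# Step 2: fit the weights.
w, b = train_logistic(samples)
# Step 3: predict for a new date.
p_next = predict_like((0.2, 0.95), w, b)
```

On this made-up data the weight learned for looks comes out larger than the one for cheerfulness, which mirrors the lament above.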

How to make money with a recommendation system

Take Amazon again as an example. Xiao Ming is a basketball fan who buys a few pairs of basketball shoes every month. From a few months of purchase records, Amazon has learned Xiao Ming's preferences and is ready to recommend basketball shoes to him. But there are so many basketball shoe brands; which one should it recommend? Amazon smiles and says: whichever brand pays me more is the one I recommend. This is the simplest form of selling traffic. Rules like this are called business rules.

But before adding business rules, users need to have experienced the accuracy of the recommendations. If you start out by pushing pinned VIP placements, it will badly damage the user experience and make users feel that the recommendations are completely inaccurate. The consequences for the continued development of the recommendation system would be devastating.

Filtering rules

Collaborative filtering relies purely on user behavior data, but a real recommendation system has many business factors to consider. Take a music app as an example. Jay releases a new album A and most young people click to listen to it, so album A ends up appearing among the similar albums of every other album. Recommending such a popular album to users is pointless at that moment. Filtering out popular items is therefore one of the most common practices in recommender systems. There are many rules of this kind, depending on the business scenario.
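A popularity filter of this kind can be sketched as a post-processing step on the candidate list. The rule used here (drop anything that more than half of all users have already listened to) and the play log are assumptions for illustration only.

```python
from collections import Counter

def filter_popular(candidates, play_log, max_share=0.5):
    """Drop candidates that more than max_share of all users have listened to.

    candidates: list of items in recommendation order.
    play_log: {user: set of items the user listened to}.
    """
    n_users = len(play_log) or 1  # avoid dividing by zero on an empty log
    counts = Counter(item for items in play_log.values() for item in items)
    return [c for c in candidates if counts[c] / n_users <= max_share]

# Hypothetical play log: everyone has listened to album A.
play_log = {
    "u1": {"A", "X"},
    "u2": {"A", "Y"},
    "u3": {"A"},
    "u4": {"A", "X"},
}
filtered = filter_popular(["A", "X", "Y"], play_log)
```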

Recommendation diversity

The counterpart of recommendation accuracy is recommendation diversity. Take music: if recommendations follow user behavior data exclusively, the candidate set will always stay within a fairly narrow range, and someone who listens to light indie music will never be recommended rock. This is a very complicated problem. While keeping recommendations accurate, the scope should be gradually broadened according to some strategy to give the recommendations a degree of diversity, so that users do not grow tired of them.
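One simple strategy, among many possible ones, is to cap how many items of the same genre may appear in the final list, so lower-ranked items from other genres get a slot. The cap, the genres, and the track names below are hypothetical.

```python
def diversify(ranked, genre_of, max_per_genre=2, top_n=4):
    """Pick up to top_n items from ranked (best first), at most max_per_genre per genre."""
    picked = []
    per_genre = {}
    for item in ranked:
        genre = genre_of[item]
        if per_genre.get(genre, 0) >= max_per_genre:
            continue  # this genre already fills its share of the list
        picked.append(item)
        per_genre[genre] = per_genre.get(genre, 0) + 1
        if len(picked) == top_n:
            break
    return picked

# Candidates sorted by predicted preference for an indie listener.
ranked = ["indie1", "indie2", "indie3", "indie4", "rock1", "jazz1"]
genre_of = {
    "indie1": "indie", "indie2": "indie", "indie3": "indie", "indie4": "indie",
    "rock1": "rock", "jazz1": "jazz",
}
playlist = diversify(ranked, genre_of)
```

Raising max_per_genre trades diversity back for accuracy, which is the balance the paragraph above describes.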

Continuous improvement

Recommendation systems are highly complex and require continuous improvement. Different recommendation algorithms can be run online at the same time as an A/B test. Based on how users behave toward the recommended results, the algorithms are continuously optimized and improved. There is still a long way to go: "The road ahead is long and far; I shall search high and low."
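A common way to run such a test is to assign each user to a variant by hashing the user ID, so the same user always sees the same algorithm, and then compare a metric such as click-through rate per variant. The salt, the variant names, and the metric are assumptions of this sketch.

```python
import hashlib

def assign_variant(user_id, variants=("A", "B"), salt="rec-algo-test"):
    """Deterministically map a user to one variant by hashing salt + user ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def click_through_rate(impressions, clicks):
    """Share of recommended items that were clicked; 0.0 when nothing was shown."""
    return clicks / impressions if impressions else 0.0

# Split 1000 hypothetical users between the two algorithms.
assignment = {f"user{n}": assign_variant(f"user{n}") for n in range(1000)}
```

Changing the salt for each new experiment reshuffles users, so earlier tests do not bias later ones.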


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
