Introduction to Recommendation Engine Algorithms: Collaborative Filtering, Clustering, and Classification

Source: Internet
Author: User



Introduction

Yesterday I came across several keywords: semantic analysis, collaborative filtering, and intelligent recommendation, and they got me excited. So from yesterday afternoon until this morning I studied recommendation engines and reached a preliminary understanding; I will look at them more closely in the future (they are also relevant to my work). Of course, this article will be gradually supplemented and improved.

This article is an introduction to recommendation engines, so it skips most of the details; I will briefly explain how a recommendation engine works and the ideas behind its related algorithms in the simplest possible language. To make the article easier to follow and keep it short, I have quoted some of the posts I published on Weibo. Counterproductively, though, the more the article is supplemented and improved later, the longer it grows.

All of the algorithms touched on here will be elaborated one by one in future articles. In this article I only aim to give an introductory explanation; the details will come later. If you have any questions, please feel free to comment or correct me. Thank you.

1. Recommendation Engine Principle

The recommendation engine tries its best to collect as much information about users and their behavior as possible — the traces users share and leave all over the Internet — and then, based on similarity, works out what each user is likely to "especially love". The principle is shown in the figure below (taken from one of this article's references, "Exploring the secrets inside the recommendation engine"):

2. Recommendation Engine Classification

Recommendation engines can be classified according to several different criteria:

  1. By whether different users receive different recommendations: mass-behavior-based popular-item recommendation (picked by site administrators, or computed from the feedback of all users in the system) versus personalized recommendation engines (which find like-minded users with shared interests and recommend on that basis);

  2. By data source: demographic-based (users are considered similar because they share attributes such as age or gender), content-based (items share the same keywords and tags, with no human factor involved), and collaborative-filtering-based recommendation (which uses the relevance between items, content, or users, and splits into three sub-categories described below);

  3. By how the recommendation is built: item-based and user-based (a two-dimensional user-item matrix describes user preferences; clustering algorithms apply here, and the Apriori algorithm is the most influential algorithm for mining frequent itemsets of Boolean association rules), and model-based recommendation (machine learning, a subfield of artificial intelligence).

A word on collaborative filtering from the second criterion above (2. By data source): as the Web developed, sites increasingly encouraged user participation and user contribution, and the collaborative-filtering mechanism was born out of exactly that. The principle is very simple: discover the relevance between items or content, or discover the relevance between users based on their preferences for items or information, and then recommend based on that relevance.

Collaborative Filtering-based recommendations are divided into three sub-categories:

  1. User-based recommendation (find neighbor users with similar tastes and preferences, e.g. with a k-nearest-neighbor algorithm: what your friends like, you may like too);

  2. Item-based recommendation (find the similarity between items and recommend similar ones: you like item A, item C is similar to A, so you may also like C);

  3. Model-based recommendation (build a recommendation model from sample user preferences, then predict recommendations from real-time user preferences).

As you can see, collaborative filtering comes down to maximizing the similarity between users or between items and recommending on that basis. Collaborative filtering is described further below. In practice, however, we usually divide recommendation engines into two broad categories:

The first category is collaborative filtering: recommendation based on similar users (using all the information and traces users leave while interacting with the system or the Internet, i.e. the links between users) and recommendation based on similar items (finding as much similarity between items as possible). The second category is recommendation based on content analysis (questionnaires, email, or, say, the engine analyzing the content of this blog).

3. Sina Weibo's Recommendation Mechanism

Sina Weibo's friend-recommendation mechanism works roughly like this: 1. I am not a friend of A, but many of my friends are friends with A, so the system recommends A to me (Sina calls these common friends); 2. Many of the people I follow also follow B, so the system guesses I may like B too and recommends B to me (Sina calls these indirect followers).

In practice, however, Sina mixes the two methods: for example, many of the people I follow also follow B, but some of the people who follow B are also my friends. Both recommendation methods above belong to collaborative filtering based on similar users (they amount to finding links between users, starting either from your friends or from the people you follow).

There is, of course, also popular-user recommendation, i.e. the mass-behavior-based recommendation described above: follow the crowd, and the system guesses that since everyone likes it, you may like it too. As everyone knows, Yao Chen ranks first among Sina Weibo accounts by follower count, and her followers keep increasing accordingly. The two recommendation methods look like this:

However, neither the user-based methods above nor the mass-behavior-based method really uncovers the common interests, preferences, and tastes between users: friends of friends are often not your own friends, and some accounts that everyone else chases may leave you cold. So the right approach is to start from the content of the posts users publish and find their common concerns and points of interest. Sina Weibo recently started asking users to tag the posts they publish, precisely so that it can later search for the tags and keywords shared by related users; that is recommendation based on content analysis. For example:

But the question is: who will spare the effort to tag their posts? So Sina Weibo has to find another way to analyze post content properly; otherwise, scanning the posts of its vast sea of users one by one would be prohibitively expensive.

Personally, I think a good starting point is the keywords in posts (a tag cloud) together with the tags each user assigns to themselves (the more tags two users share, the more similar they can be considered), as shown in the left and right panels below:

In other words, defining similar users through mutual friends and indirect followers is unreliable, while finding similar users by analyzing post content is feasible. We can go a step further: after obtaining a tag cloud from content analysis, finding users who share the same or similar tags is undoubtedly more reliable than the existing friend recommendation (which defines similar users through mutual friends and indirect followers).

3.1 Combining Multiple Recommendation Methods

Web sites today rarely adopt a single recommendation mechanism or strategy; they usually combine multiple methods to achieve better recommendation results.

Amazon, for example, uses user-based recommendation alongside content-based recommendation (items with the same keywords and tags), such as new-product recommendation, and item-based collaborative filtering (you like A, C is similar to A, so you may also like C), such as bundled purchases and "customers who bought/viewed this item also bought/viewed".

In short, multiple recommendation methods can be combined: weighted (a linear formula combines several different recommendation results according to weights; the specific weights must be tuned through repeated experiments on a test dataset to achieve the best results), switching, mixed, and layered combinations. Whichever combination is used, the individual methods are generally among those described above.
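As a rough sketch of the weighted approach, the linear combination can look like the following (the engine names, scores, and weights here are invented purely for illustration):

```python
def hybrid_scores(scores_by_engine, weights):
    """Weighted hybrid: linearly combine the item scores of several engines.
    In practice the weights are tuned by experiments on a test dataset."""
    combined = {}
    for engine, scores in scores_by_engine.items():
        w = weights[engine]
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + w * score
    return combined

scores = {
    "user_cf": {"item1": 0.9, "item2": 0.4},   # user-based CF scores
    "content": {"item1": 0.2, "item2": 0.8},   # content-based scores
}
result = hybrid_scores(scores, {"user_cf": 0.7, "content": 0.3})
# item1: 0.7*0.9 + 0.3*0.2 = 0.69 ; item2: 0.7*0.4 + 0.3*0.8 = 0.52
```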

4. Collaborative Filtering Recommendation

Collaborative filtering is a typical application of collective wisdom. To understand collaborative filtering (CF), first consider a simple question: if you want to watch a movie but don't know which one, what do you do? Most people ask their friends, or their "neighbors" in a broad sense, what movies are worth watching lately, and we generally prefer recommendations from friends whose tastes resemble our own. That is the core idea of collaborative filtering.

4.1 Collaborative Filtering Steps

Collaborative filtering proceeds in the following steps:

1) For collaborative filtering, collecting user preferences is critical. Preferences can be inferred from user behaviors such as ratings (different users score works differently; similar scores mean similar tastes, so the users can be judged similar), reposts, saves, bookmarks, tags, comments, click streams, time spent on a page, whether a purchase was made, and so on. As described in step 2 below, all such information can be digitized and expressed as a two-dimensional matrix.

2) After collecting the behavior data, we need to denoise and normalize it to obtain a two-dimensional user-preference matrix: one dimension is the list of users, the other the list of items, and each value is a user's preference for an item, generally a floating-point value in [0, 1] or [-1, 1]. Briefly, the two operations are:

Noise reduction: behavior data is generated while users use an application, so it contains plenty of noise and accidental operations. Classic data-mining algorithms can filter the noise out of the behavior data, making our analysis more accurate (similar to de-noising a web page).

Normalization: the data for each type of behavior is rescaled to the same value range so that the weighted sum of overall preference is more accurate. The simplest normalization divides every value of a behavior type by the maximum value in that type, guaranteeing normalized values in [0, 1]. Weighting is equally easy to understand, because different behaviors carry different weights: much like a singing competition where an audience vote counts 1 point, an expert judge's vote counts 5 points, and the contestant with the highest total advances.
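A minimal sketch of this max-division normalization and weighted sum (the behavior types and weights below are illustrative, not from any particular system):

```python
def normalize(values):
    """Scale one behavior type's scores into [0, 1] by dividing by the
    maximum value in that type (assumes non-negative scores)."""
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]

# Weighted sum of normalized behaviors for four items, echoing the
# singing-competition analogy: a rating counts 5x, a click counts 1x.
ratings = normalize([4, 5, 2, 0])    # e.g. 1-5 star ratings
clicks  = normalize([12, 3, 7, 1])   # e.g. raw click counts
preference = [5 * r + 1 * c for r, c in zip(ratings, clicks)]
```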

3) How do we then find similar users and items? By computing the similarity between users and between items.

4) There are several similarity-calculation methods, but all of them are vector-based: in effect they compute the distance between two vectors, and the closer the distance, the higher the similarity. In the two-dimensional user-item preference matrix, we can treat one user's preferences for all items as a vector and compute similarity between users, or treat all users' preferences for one item as a vector and compute similarity between items.

Similarity algorithms can compute either user or item similarity. Taking item similarity computation as the example: for two items i and j, select from the rating matrix the users who rated both, and compute the similarity s(i, j) over those co-rating users' rating vectors, as shown in the figure, where rows represent items and columns represent users (note that the ratings are extracted from the i and j rows to form the pair of vectors used in the similarity calculation):

So finding item similarity is straightforward: the item is held fixed, and multiple users' ratings of it form the vector. User similarity is the same idea with the roles swapped: the user is held fixed, and their ratings of the items form the vector.

5) The computed similarities drive both user-based and item-based collaborative filtering. Common similarity measures include Euclidean distance, the Pearson correlation coefficient (for example, given two users' ratings of several movies, the Pearson correlation tells you whether their tastes agree), cosine similarity, and the Tanimoto coefficient. Below we briefly introduce the Euclidean distance and the Pearson correlation coefficient:

The Euclidean distance was originally the distance between two points in Euclidean space. If x and y are two points in n-dimensional space, the Euclidean distance between them is: d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 ).

When n = 2, this is simply the distance between two points in the plane. To turn Euclidean distance into a similarity, the following conversion is generally used, so that a smaller distance means a larger similarity (and the denominator is never 0): sim(x, y) = 1 / (1 + d(x, y)).

Cosine similarity: two items i and j are regarded as two vectors in the m-dimensional user space, and similarity is computed as the cosine of the angle between them. For an m x n rating matrix, the similarity of items i and j is: sim(i, j) = cos(i, j) = (i · j) / (||i|| × ||j||).

("·" denotes the inner product of two vectors.) The Pearson correlation coefficient is generally used to measure how closely two interval-scaled variables move together. For the result to be accurate, only users who rated both items are used: the user set U below contains exactly the users who rated both i and j. The Pearson correlation is: sim(i, j) = Σ_{u∈U} (R(u,i) − R̄(i)) (R(u,j) − R̄(j)) / ( sqrt(Σ_{u∈U} (R(u,i) − R̄(i))²) × sqrt(Σ_{u∈U} (R(u,j) − R̄(j))²) ).

Here R(u, i) is user u's rating of item i, and the barred R̄(i) is the average rating of item i over the users in U.
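The three measures above can be sketched in a few lines of Python (an illustrative sketch, not a library API; in the Pearson version a 0 is treated as "no rating", matching the matrix convention used later):

```python
import math

def euclidean_sim(x, y):
    """1 / (1 + d): distance 0 gives similarity 1, and never divides by 0."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)

def cosine_sim(x, y):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def pearson_sim(x, y):
    """Pearson correlation over co-rated positions only (0 = no rating)."""
    pairs = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    n = len(pairs)
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in pairs)
    den = (math.sqrt(sum((a - mx) ** 2 for a in xs))
           * math.sqrt(sum((b - my) ** 2 for b in ys)))
    return num / den if den else 0.0
```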

6) Computing similar neighbors. There are two kinds: 1. a fixed number of neighbors — k-neighborhoods (or fixed-size neighborhoods) — where, regardless of how "far away" they are, only the k nearest points count as neighbors, as in panel A; 2. threshold-based neighbors, where every point falling within distance K of the current point counts as its neighbor, as in panel B.

Here it is worth introducing the k-nearest-neighbor (KNN) classification algorithm, a theoretically mature method and one of the simplest machine-learning algorithms. The idea: if the majority of the k samples most similar to a sample in feature space (i.e. its nearest neighbors) belong to some category, then the sample belongs to that category too.
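A toy sketch of KNN classification (the points and labels are made up; `math.dist` requires Python 3.8+):

```python
import math
from collections import Counter

def knn_classify(query, samples, k=3):
    """samples: list of (vector, label). Return the majority label
    among the k samples nearest to query (Euclidean distance)."""
    by_distance = sorted(samples, key=lambda s: math.dist(query, s[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

samples = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"),
           ((6, 5), "B"), ((2, 1), "A")]
print(knn_classify((1.5, 1.5), samples, k=3))  # → A
```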

7) With the similarities from step 4) computed, we obtain user-based CF (user-based recommendation: find neighbor users via common tastes and preferences with a k-nearest-neighbor algorithm; what your friends like, you may like too) and item-based CF (item-based recommendation: discover similarity between items and recommend similar ones; item C is similar to item A, so if you like A you may also like C).

Generally, social web sites such as Facebook prefer user-based CF, while shopping sites such as Amazon prefer item-based CF for book purchases ("you have bought books like this before" is more convincing to you than "user XX has bought this book", because you do not know XX).

4.2 Differences Between Item Similarity and User Similarity

The three similarity formulas in section 4.1 were presented in the item-similarity setting. The basic difference between the two computations is this: in a rating matrix whose rows are users and columns are items, user similarity is computed between row vectors, while item similarity is computed between column vectors; all three formulas apply to either case. For example:

(0 means no rating)

Item similarity: compute the similarity between two columns, e.g. item3 and item4;

User similarity: compute the similarity between two row vectors, e.g. user3 and user4.

5. Clustering Algorithms

Clustering, loosely speaking, is "birds of a feather flock together". It is a classic data-mining problem: divide the data into multiple clusters so that objects within the same cluster are highly similar while objects in different clusters differ greatly.

5.1 K-means clustering algorithm

The K-means clustering algorithm is related to the expectation-maximization algorithm for mixtures of normal distributions, in that both try to find natural cluster centers in the data. The algorithm assumes object attributes come from a vector space, and the goal is to minimize the total mean squared error within each group.

K-means first randomly picks K centers (points in the space representing the cluster centers), then assigns each data item to the nearest center. Once assignment is complete, each cluster center moves to the average position of all the nodes assigned to it, and the whole assignment process starts over. This repeats until the assignments no longer change. The figure shows a K-means clustering process with two clusters:

The following code shows a Python implementation of the K-means clustering algorithm (a Pearson-based distance function is included so that the default `distance` parameter is defined):

# K-means clustering
import random

def pearson_distance(v1, v2):
    # 1 - Pearson correlation, so identical vectors have distance 0
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))
    num = p_sum - (sum1 * sum2 / n)
    den = ((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n)) ** 0.5
    if den == 0:
        return 0.0
    return 1.0 - num / den

def kcluster(rows, distance=pearson_distance, k=4):
    # Determine the minimum and maximum value of each dimension
    ranges = [(min(row[i] for row in rows), max(row[i] for row in rows))
              for i in range(len(rows[0]))]

    # Randomly create k centers within those ranges
    clusters = [[random.random() * (ranges[i][1] - ranges[i][0]) + ranges[i][0]
                 for i in range(len(rows[0]))]
                for j in range(k)]

    lastmatches = None
    for t in range(100):
        print('Iteration %d' % t)
        bestmatches = [[] for i in range(k)]

        # Find the nearest center for each row
        for j in range(len(rows)):
            row = rows[j]
            bestmatch = 0
            for i in range(k):
                d = distance(clusters[i], row)
                if d < distance(clusters[bestmatch], row):
                    bestmatch = i
            bestmatches[bestmatch].append(j)

        # If the assignment is the same as last time, the process is complete
        if bestmatches == lastmatches:
            break
        lastmatches = bestmatches

        # Move each center to the average position of all its members
        for i in range(k):
            avgs = [0.0] * len(rows[0])
            if len(bestmatches[i]) > 0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                clusters[i] = avgs

    # Return k lists of row indices, one per cluster
    return bestmatches

K-means is a form of unsupervised learning in the machine-learning field. Briefly, supervised versus unsupervised learning:

The task of supervised learning is to learn from labeled training data in order to predict the label of any valid input. Common examples include classifying email as spam, labeling web pages by category, and recognizing handwriting. Many algorithms can be used to build a supervised learner; the most common are neural networks, support vector machines (SVMs), and naive Bayes classifiers. The task of unsupervised learning is to make sense of data without any labels. It is most often used to group similar inputs into logical clusters; it can also reduce the dimensionality of a dataset so as to focus only on the most useful attributes, or detect trends. Common unsupervised methods include K-means, hierarchical clustering, and self-organizing maps.

5.2 canopy Clustering Algorithm

The basic principle of the canopy clustering algorithm: first, a cheap approximate distance measure is used to efficiently divide the data into multiple groups, each called a canopy, and canopies may overlap. Then a strict distance measure is used to compute precise distances between points inside the same canopy and assign each point to the most appropriate cluster. Canopy clustering is often used to preprocess for K-means, to find a suitable K and the initial cluster centers.

5.3 Fuzzy K-means clustering algorithm

Fuzzy K-means is an extension of K-means. Its basic principle is the same, but its clustering result allows an object to belong to several clusters at once: it is one of the overlapping clustering algorithms introduced earlier. To understand how fuzzy K-means differs from K-means, we need the concept of the fuzziness factor.

Like K-means, fuzzy K-means loops over the set of vectors to be clustered, but instead of assigning each vector to the single nearest cluster it computes each vector's degree of membership in every cluster. Suppose a vector v and k clusters whose centers lie at distances d1, d2, ..., dk from v; then v's membership u1 in the first cluster can be computed with the formula below:

To compute v's membership in the other clusters, simply replace d1 with the corresponding distance. The formula also shows how the fuzziness factor m controls the result: m must be greater than 1 (m = 2 is a common choice), and the larger m is, the fuzzier the memberships become; m is the fuzziness factor just mentioned.
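A sketch using the standard fuzzy c-means membership formula u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)); the distances below are made up for illustration:

```python
def fuzzy_memberships(distances, m=2.0):
    """Memberships of one vector in each of k clusters, given its
    distances d1..dk to the cluster centers (standard fuzzy c-means)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for d_i in distances:
        if d_i == 0.0:  # the vector sits exactly on a center
            return [1.0 if d == 0.0 else 0.0 for d in distances]
        denom = sum((d_i / d_k) ** exponent for d_k in distances)
        memberships.append(1.0 / denom)
    return memberships

u = fuzzy_memberships([1.0, 3.0], m=2.0)
# the closer center gets the larger membership: u ≈ [0.9, 0.1], summing to 1
```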

Other clustering algorithms are beyond the scope of this article. Questions about cold start, data sparsity, scalability, portability, interpretability, diversity, and the value of recommended information will be elaborated later.

6. Classification Algorithms

There are many classification algorithms; this article introduces decision-tree learning and Bayes' theorem.

6.1 Decision Tree Learning

Start from the name: a decision tree, as its name implies, is a tree built to make decisions.

In machine learning, a decision tree is a predictive model representing a mapping between object attributes and object values. Each internal node in the tree tests an attribute, each branch represents a possible attribute value, and each leaf corresponds to the value of the objects reached along the path from root to leaf. A decision tree has a single output; for multiple outputs, build an independent decision tree per output. The machine-learning technique for generating decision trees from data is called decision-tree learning.

The theory is abstract, so here are two simple examples:

Example 1: decision-tree classification works much like matchmaking. Imagine a girl whose mother wants to introduce her to a boyfriend; the conversation goes like this:

Daughter: How old is he? Mother: 26. Daughter: Is he handsome? Mother: Very handsome. Daughter: High income? Mother: Not very high, moderate. Daughter: Is he a civil servant? Mother: Yes, he works at the tax bureau. Daughter: OK, I'll meet him.

The girl's decision process is a typical classification-tree decision. It is equivalent to dividing men into two classes by age, looks, income, and whether they are civil servants: meet and don't meet. Suppose her requirements are: under 30, at least average-looking, and either high-income or a middle-income civil servant. Then her decision logic can be drawn as a tree:
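Her decision logic, written as nested if/else — one branch per tree node (the attribute encoding is invented for illustration):

```python
def will_meet(age, looks, income, civil_servant):
    """The girl's decision tree from the example: each `if` is a node,
    each branch a possible attribute value, each `return` a leaf."""
    if age > 30:
        return False
    if looks == "below average":
        return False
    if income == "high":
        return True
    # moderate income: only worth meeting if he is a civil servant
    return civil_servant

print(will_meet(26, "handsome", "medium", True))  # → True (the conversation above)
```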

In other words, the simple strategy of a decision tree is like screening résumés during recruitment: if the candidate's credentials are excellent, say a Tsinghua PhD, invite them to interview without further questions; if they graduated from a non-elite university but have rich practical project experience, invite them too. That is, analyze and decide the specific case on its specifics.

The second example comes from Tom M. Mitchell's book Machine Learning:

Mr. Smith's goal is to use next week's weather forecast to predict when people will come out to play. He understands that the most important factor in the decision is the weather: the outlook may be sunny, overcast, or rain; temperature is in Fahrenheit; relative humidity is a percentage; and there is either wind or no wind. From these we can construct a decision tree that classifies days by weather and decides whether to play tennis on a given day:

The preceding decision tree corresponds to the expression: (Outlook = Sunny ∧ Humidity <= 70) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak). The figure below shows how the best classification attribute is chosen:

Information gain is computed for two candidate attributes, Humidity and Wind. Humidity's information gain, 0.151, beats Wind's 0.048; in plain terms, Humidity is the better classification attribute when deciding whether Saturday morning is suitable for tennis.
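These two gains can be reproduced from the label counts in Mitchell's 14-example dataset (9 play / 5 don't; Humidity splits into high [3+, 4-] and normal [6+, 1-]; Wind into weak [6+, 2-] and strong [3+, 3-]) — a sketch of the standard entropy / information-gain computation:

```python
import math

def entropy(pos, neg):
    """Entropy of a two-class sample with pos/neg label counts."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

def info_gain(parent, splits):
    """parent and each split are (pos, neg) label counts."""
    total = sum(p + n for p, n in splits)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(*parent) - remainder

gain_humidity = info_gain((9, 5), [(3, 4), (6, 1)])  # high, normal: ≈ 0.151
gain_wind     = info_gain((9, 5), [(6, 2), (3, 3)])  # weak, strong: ≈ 0.048
```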

Decision tree formation with the ID3 algorithm

The figure shows part of the decision tree formed after the first step of the ID3 algorithm, which makes two things easy to see: 1. the Overcast samples are all positive, so that branch becomes a leaf node that always answers Yes; 2. ID3 does not backtrack — it optimizes locally, not globally — and a decision tree, once grown, is pruned afterwards rather than regrown.

6.2 The Basis of Bayesian Classification: Bayes' Theorem

Bayes' theorem answers this question: knowing the probability of one event conditioned on another, how do we get the probability with the two events exchanged — that is, knowing P(A|B), how do we obtain P(B|A)? First, what a conditional probability is:

P(A|B) is the probability of event A given that event B has occurred, called the conditional probability of A given B. The basic formula is: P(A|B) = P(A∩B) / P(B).

Bayes' theorem is useful because this situation comes up constantly in practice: P(A|B) is easy to obtain directly while P(B|A) is hard, yet P(B|A) is what we actually care about. Bayes' theorem is the road from P(A|B) to P(B|A).

Without proof, Bayes' theorem states: P(B|A) = P(A|B) · P(B) / P(A).
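A one-line worked example of the theorem (the spam-filter numbers are invented purely to exercise the formula):

```python
def bayes(p_a_given_b, p_b, p_a):
    """P(B|A) = P(A|B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# Hypothetical numbers: 60% of spam contains the word "free" (P(A|B)),
# 20% of all mail is spam (P(B)), 16% of all mail contains "free" (P(A)).
p_spam_given_free = bayes(p_a_given_b=0.6, p_b=0.2, p_a=0.16)
print(round(p_spam_given_free, 2))  # → 0.75
```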

7. Recommendation Examples and Extensions

7.1 Reading Recommendation

First, a passage (from 36kr):

"The Beijing company is also very optimistic about reading-recommendation applications. They invested a great deal of effort (60 people over a year) and today launched 'Cool Cloud Reading' for the iPhone."

Why invest so many people in a reading application? CEO Li Peng told me that more than half the team works on back-end matters: semantic analysis, machine learning, and other algorithms. Their goal is to make the Internet "semantic" — to work out what people are interested in and finally recommend to each person the content they care about. On the iPhone, Cool Cloud's general approach resembles Zite on the iPad: the user's actions are "like" and "dislike", plus tapping a media source or a related tag to tell Cool Cloud you want to see more of it.

This much is common to most recommendation applications, but Cool Cloud goes further. Besides crawling more than 100,000 articles from the Internet every day, they have indexed the video content broadcast by 200 TV stations nationwide, letting users search video by text and receive the same kind of recommendations on video content. The general approach is to record the programs, convert the audio to text, and finally build summaries and indexes."

In general, are the algorithms used by recommendation systems as complex as the collaborative filtering described above? Here is a post I published on Weibo on January 21:

  1. Most reading-recommendation applications tag articles by content — e.g. "algorithms", "iPhone" (clicking a tag adds weight to it) — and invite feedback on each article: like or dislike. The recommender records every click, and the user's tag cloud gradually takes shape (meanwhile, users with the same or similar tags can be matched for user-based recommendation). When a new article arrives, the system extracts its keywords, matches them against the user's tag preferences, and pushes the article.
  2. Mobile news readers currently classify content into channels such as technology and education, but they generally lack the rating mechanisms of the web, so they cannot record user behavior characteristics and cannot recommend new articles on that basis — hence the new crop of mobile recommendation-reading services, such as @coyun Reading.
  3. But users typically just finish a piece of news and move on: how many are willing to register an account just to rate an article? How to get users to adopt this type of reader at little extra cost — how to change user habits — is, I personally think, the key.

Back to recording all those video programs and converting speech to text: that raised a question for me. For music, we already know roughly how Douban FM might work:

You like some songs and I like some songs; if many of the songs we like are similar, the system defines you and me as similar users — friends — and does user-based collaborative filtering: what your friends like, you may like too. There is also song-to-song recommendation: you like song A, song B is similar to A (say both are about love and melancholy), so the system guesses you may also like B and recommends B to you. That is item-based collaborative filtering.

Judging friendship from the overlap in the songs we listen to and then doing user-based CF is fine, and item-based CF needs similar songs — but here the problem appears. To repeat: how do we define and judge that two songs are similar? Have the system analyze each song's spectrum? Each song's tempo and pitch? That sounds workable but is impractical in practice.

I think tags should be attached to the music instead (presumably the same holds for videos, so that they can be indexed and searched later; transcribing full video content is still unreliable). For example, if two songs both carry the tags "love" and "sentimental", they can be judged similar. But the key question is: how are the tags produced? Speech recognition?
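Once songs carry tags, "same tag means similar" can be made concrete with a set-overlap measure. A minimal sketch, assuming each song is just a set of tags (the songs and tags here are made up); Jaccard similarity is one simple choice:

```python
def jaccard(tags_a, tags_b):
    """Overlap of two tag sets: |A ∩ B| / |A ∪ B|, in [0, 1]."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical tagged songs, for illustration only.
song_a = {"love", "sentimental", "pop"}
song_b = {"love", "sentimental", "ballad"}
song_c = {"rock", "fast"}

print(jaccard(song_a, song_b))  # 0.5  -> similar, candidate for recommendation
print(jaccard(song_a, song_c))  # 0.0  -> unrelated
```

A real system would weight tags (e.g. by how often users clicked them) rather than treating them all equally, but the principle is the same.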

7.2 How to Tag

In the initial stage, tags can be produced by humans, by crawlers, or by purchased databases; once traffic picks up, you can consider UGC, i.e. user-generated content. However, users are generally unwilling to tag their own music; it is too cumbersome (for example, Sina Weibo recently added a "tag" prompt to every microblog, but how many users bother with it?). Of course, some systems will automatically generate tags for you (and you can also add your own), as Sina Blog does:

How can this be done? My idea: the system scans your article behind the scenes and extracts some keywords as candidate tags for you to choose from. Which keywords? High-frequency words, of course: scan the entire article, count the frequency of each word, and take the top K. For instance, if "algorithm" appears four times in an article and "blog" appears three times, the system automatically proposes these tags for you.

Which data structure or method should be used to compute these keyword frequencies? A common approach is hash table + heap (see "Thoroughly parse the hash table algorithm from start to end"), or a trie (see "From the trie tree to the suffix tree"). However, a trie is rather troublesome for Chinese text, so hash + heap is the more practical choice.
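The hash + heap scheme above can be sketched directly with Python's standard library: `Counter` plays the hash table and `heapq.nlargest` plays the heap. The sample text and the tiny stopword set are illustrative only; real word segmentation (especially for Chinese) is a separate problem, as noted.

```python
import re
import heapq
from collections import Counter

def top_k_keywords(text, k=2, stopwords=frozenset({"the", "a", "is"})):
    """Hash table (Counter) counts word frequencies; a heap picks the top k."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in stopwords]
    freq = Counter(words)  # hash table: word -> count
    # heapq.nlargest keeps a heap of size k while scanning all counts.
    return heapq.nlargest(k, freq.items(), key=lambda kv: kv[1])

# Mirrors the example above: "algorithm" x4, "blog" x3.
article = "algorithm blog algorithm blog algorithm blog algorithm tag"
print(top_k_keywords(article))  # [('algorithm', 4), ('blog', 3)]
```

With n distinct words this runs in O(n log k) for the heap step instead of O(n log n) for a full sort, which is why hash + heap is the standard answer for top-K frequency problems.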

Similarly, for videos it should work like this: 1. Have the system or machine read the video content, convert the speech into text, and extract the frequently occurring keywords (keyword extraction involves a key problem, word segmentation, which this blog will elaborate on later); use the extracted keywords as the video's tags. 2. Build an index over these tags. What kind of index? An inverted index (for more on inverted indexes, see the chapters of The Art of Programming series on Young's matrix search and inverted-index keyword hashing). This makes it convenient for users or the system to search later. (This section was discussed and summarized with friends in the programming-art group.)

Details will be elaborated later.

8. References

- My microblog post of January 7 (shown on the left side of this blog).
- Exploring the Secrets Inside the Recommendation Engine.
- Zhao Chenting, Ma Chune, et al. A Survey of Collaborative Filtering in Recommendation Systems.
- http://www.cnblogs.com/leoo2sk/.
- Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997 (the classic machine learning textbook).
- http://zh.wikipedia.org/wiki/%e5%86%b3%e7%ad%96%e6%a0%91 (decision tree).
- http://www.36kr.com/p/75415.html.
- Algorithms of the Intelligent Web, Chapter 3: recommendation systems (which covers computing the similarity between users and items; worth reading).

Postscript

A call to experts: if you have previously worked on recommendation or retrieval, or on machine learning, data mining, massive data processing, or search/recommendation engines, please contact me. You can leave a comment at any time, send a Weibo private message, or email [email protected]. If you have experience building large websites, or rich programming experience, feel free to contact me as well. Materials related to this blog will be shared freely with kindred technical spirits at any time.

Finally, over the past year this blog ranked first in CSDN's 2011 blog-of-the-year ranking (http://blog.csdn.net/ranking.html) and in the top 10 of the CSDN feed rankings.

OK. This article is only a first step; it still has many problems and gaps to improve. Moreover, everything here is only my own understanding and has not been applied in practical work, so it cannot be taken as the full truth; everything remains to be tested in practice. If you find any problems or errors in this article or this blog, please don't hesitate to correct me at any time. Thank you very much. July, 2011.01.12.

Update: next, I plan to write a series on recommendation systems: Recommendation Systems: Introduction, Recommendation Systems: Advanced, and Recommendation Systems: Final. Along the way I will need to study and reference many materials and papers; if you have good materials or papers to recommend, feel free to let me know.

