News Recommendation System: Content-Based Recommendation Algorithm (Recommender System: Content-Based Recommendation)


I have been developing a module for a news recommendation system, and the recommendation part involves the content-based recommendation algorithm. I am taking this opportunity to write down, in language as clear as I can manage, my understanding of the material I found online, combined with how I actually built the recommendation module. This is for my own review and also for everyone's reference.

Contents

1. Content-based recommendation algorithm + TF-IDF

2. Implementation techniques in the recommendation system

1. Content-based recommendation algorithm + TF-IDF

Mainstream recommendation algorithms can be broadly divided into:

    • Recommendations based on content (similarity)

    • Collaborative filtering based on user/item similarity

    • Hot News Recommendations (what you see in the headlines)

    • Model-based recommendations (user features are fed into a model, which outputs the recommendation results)

    • Hybrid recommendation (all of the techniques above used together)

(This article covers only content-based recommendation in detail; the other methods are easy to look up separately.)

Concept

Recommendation based on content similarity: as the name implies, the system recommends news whose content is similar to what you like to read. The main advantage of the content-based recommendation algorithm is that it largely avoids the cold-start problem: as soon as a user has produced some initial browsing history, recommendations can be computed. As the user's browsing history grows, the recommendations generally become more accurate.

There are two key questions to understand first:

    1. How do we know what kind of news a user likes to read?

Users have a browsing history. From the news a user has read, we can "extract" keywords that represent the main content of each article and see which keywords occur most often, for example "mobile phone", "computer games", or "conference".

Alternatively, we could look at which categories the news belongs to, such as international politics, society, people's livelihood, or entertainment, and find the categories the user reads most. But judging interest this way is too coarse: even news within the same category can differ greatly. For example, a user may like actress A but not actress B. If we only conclude that the user likes entertainment news and keep pushing news about actress B, the result will certainly be poor. Keywords avoid this problem much better.

    2. How do we judge whether two news articles are similar in content?

Now that we have a way to describe user preferences, namely keywords, a natural idea follows: can we extract the keywords of two news articles and check whether they are the same? The idea is right, but a news article can have several keywords, and it is rare for all of them to match. So we need a reasonable way to quantify how well the keywords of two articles match.

This brings us to the TF-IDF algorithm.

The detailed principles of TF-IDF are explained in many other places, so here is only a brief description: for a piece of text, the TF-IDF algorithm returns a set of "keyword: TF-IDF value" pairs. These keywords best represent the core content of the text, and each keyword's importance relative to the text is quantified by its TF-IDF value.
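To make the "keyword: TF-IDF value" pairs concrete, here is a minimal Python sketch of one common TF-IDF variant (term frequency times a smoothed logarithmic inverse document frequency). It is not the ANSJ implementation, and real libraries may weight terms differently. The function name `tfidf_keywords`, the smoothing formula, and the assumption that every document is already split into tokens are choices made only for this illustration.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k (keyword, tfidf) pairs for docs[doc_index].

    docs is a list of token lists (word segmentation is assumed done already).
    Uses tf = count / len(doc) and idf = log((1 + N) / (1 + df)) + 1,
    which is one common smoothed variant, not the only possible one.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    tokens = docs[doc_index]
    counts = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Hypothetical, already-tokenized toy corpus.
docs = [
    ["phone", "launch", "phone", "price"],
    ["game", "launch", "console"],
    ["conference", "phone", "speech"],
]
keywords = tfidf_keywords(docs, 0)
```

In this toy corpus "phone" ranks first for the first document because it occurs twice there, while "price" outranks "launch" because it appears in fewer documents overall.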

Now that we can extract keywords and quantify their importance, we can compare the similarity of two texts. The formula is as follows:
$$Similarity(A, B) = \sum_{i \in M} tfidf_{A,i} \cdot tfidf_{B,i}$$
Here $M$ is the set of keywords that the two articles have in common, and $tfidf_{A,i}$ is the TF-IDF value of keyword $i$ in article $A$. For each shared keyword we multiply its two TF-IDF values, and the sum of these products is the similarity of the two texts.

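Example: two news articles have just been crawled into the system, and the keywords and TF-IDF values extracted from them are as follows.

News A: "beautiful model": 100, "women's clothing": 80, "Mercedes-Benz": 40

News B: "programmer": 100, "women's clothing": 90, "programming": 30

The two articles share only one keyword, "women's clothing", so their similarity is 80 * 90 = 7200.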
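The similarity formula and the worked example above can be checked with a few lines of Python. This is only a sketch: the function name `similarity` and the choice of plain dicts to hold the keyword tables are assumptions made for the illustration, with the example keywords rendered in English.

```python
def similarity(keywords_a, keywords_b):
    """Sum of tfidf_a * tfidf_b over the keywords shared by both articles."""
    shared = keywords_a.keys() & keywords_b.keys()
    return sum(keywords_a[k] * keywords_b[k] for k in shared)

news_a = {"beautiful model": 100, "women's clothing": 80, "Mercedes-Benz": 40}
news_b = {"programmer": 100, "women's clothing": 90, "programming": 30}
print(similarity(news_a, news_b))  # 7200
```

Two articles with no keyword in common get a similarity of 0, because the sum runs over an empty set.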

Measuring user preferences: the preference keyword table

In practice, however, the idea above has a problem: a user has read a great many news articles, each with several keywords. Do we really have to compare every article newly crawled into the system against each of them, one by one?

To solve this, we introduce something new: the preference keyword table.

It is easy to understand: for each user we maintain a map in the database that holds "preferred keyword: preference value" pairs. The map starts out empty. From any point in time we can begin tracking a user's browsing behavior: whenever the user reads a new article, we "insert" that article's "keyword: TF-IDF value" pairs into the user's preference keyword table. The "insert" has to handle the case where a keyword is already in the table; in that case we simply add the new TF-IDF value to the value already stored.
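The "insert" operation on the preference keyword table might look like the Python sketch below. The names `record_visit` and `score_article` are hypothetical. Scoring a fresh article against the table with the same sum-of-products formula used for two articles is my reading of how the table replaces article-by-article comparison; the text does not spell that step out.

```python
def record_visit(preferences, article_keywords):
    """Merge an article's keyword -> TF-IDF map into the user's table.

    A keyword already in the table has the new TF-IDF value added to its
    stored preference value; a new keyword is inserted as-is.
    """
    for keyword, tfidf in article_keywords.items():
        preferences[keyword] = preferences.get(keyword, 0.0) + tfidf
    return preferences

def score_article(preferences, article_keywords):
    """Assumed scoring step: sum of products over the shared keywords."""
    shared = preferences.keys() & article_keywords.keys()
    return sum(preferences[k] * article_keywords[k] for k in shared)

prefs = {}  # the table starts out empty
record_visit(prefs, {"phone": 80.0, "launch": 40.0})
record_visit(prefs, {"phone": 60.0, "game": 50.0})
score = score_article(prefs, {"phone": 10.0, "conference": 30.0})
```

After the two visits, "phone" has accumulated 80 + 60 = 140, so a fresh article whose only matching keyword is "phone" with a TF-IDF of 10 scores 1400.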

To keep storage under control, we can cap the size of each user's preference keyword table, for example at 1000 words. The exact limit should be tuned according to the results observed in practice.
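The text does not say which keywords to drop once the cap is reached. One plausible policy, assumed here purely for illustration, is to keep the keywords with the highest preference values. The function name `trim_preferences` is hypothetical.

```python
import heapq

def trim_preferences(preferences, max_size=1000):
    """Keep only the max_size keywords with the highest preference values.

    Dropping the lowest-valued keywords is an assumed policy, not one
    stated in the article.
    """
    if len(preferences) <= max_size:
        return dict(preferences)
    kept = heapq.nlargest(max_size, preferences.items(), key=lambda kv: kv[1])
    return dict(kept)

prefs = {"phone": 140.0, "launch": 40.0, "game": 50.0}
trimmed = trim_preferences(prefs, max_size=2)
```

With a cap of 2, the lowest-valued keyword "launch" is the one removed.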

Interest drift: the decay mechanism

One last question.

Our interests may change over time. For example, Apple has just released a new product and I am following it closely, but a month later I may no longer care at all. The Apple-related keywords, however, are still in my keyword table. Won't I keep receiving recommendations for news I no longer care about? How do we handle this interest drift?

To solve this, we introduce a decay mechanism: every keyword preference in a user's keyword table is decayed at regular intervals. Since the TF-IDF values of different words may already differ by orders of magnitude, we use exponential decay, which treats all values relatively fairly. We introduce a coefficient $\lambda$ with $0 < \lambda < 1$, and at each interval we multiply every keyword preference of every user by $\lambda$. This simulates the drift of user interest.

Of course, if values only ever decay, a keyword the user has completely lost interest in may shrink to 0.0000001 and keep decaying while still stubbornly occupying a slot in the table. So we set a threshold $L$: after each decay update, any keyword whose preference value is below $L$ is removed from that user's table.
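The decay-and-prune step can be sketched in Python as follows. The function name `decay_preferences` and the default values `lam=0.9` and `threshold=1.0` are arbitrary choices for this illustration; the article gives no concrete values for $\lambda$ or $L$.

```python
def decay_preferences(preferences, lam=0.9, threshold=1.0):
    """Multiply every preference by lam, then drop entries below threshold.

    lam is the decay coefficient (0 < lam < 1) and threshold plays the
    role of L in the text. Returns a new dict; the input is not modified.
    """
    if not 0 < lam < 1:
        raise ValueError("lam must be strictly between 0 and 1")
    decayed = {k: v * lam for k, v in preferences.items()}
    return {k: v for k, v in decayed.items() if v >= threshold}

prefs = {"apple": 100.0, "conference": 1.05}
after_one = decay_preferences(prefs)
```

After $n$ rounds a preference $v$ that is never refreshed becomes $v \cdot \lambda^n$. With these example settings, a value of 100 drops below the threshold of 1 after about 44 rounds, so a keyword the user stops reading about disappears from the table after roughly six weeks of daily decay.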

2. Implementation techniques in the recommendation system

My own implementation of the recommendation system, which includes collaborative filtering, content-based recommendation, and hot-news recommendation, is on GitHub. Criticism is welcome!

For TF-IDF extraction I use ANSJ, which provides a TF-IDF library function that can be called directly, so there is no need to do word segmentation yourself.

To store and read each user's keyword table in the database, I use JSON. Fastjson and Jackson are both suitable tools; pick whichever you prefer.
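The same round trip can be illustrated with Python's standard `json` module; this stands in for Fastjson or Jackson only to show the idea and is not the author's Java code. The helper names `dump_preferences` and `load_preferences` are hypothetical, and the storage column is assumed to be a plain text field.

```python
import json

def dump_preferences(preferences):
    """Serialize the keyword table to a JSON string for a text column."""
    # ensure_ascii=False keeps non-ASCII keywords (e.g. Chinese) readable.
    return json.dumps(preferences, ensure_ascii=False, sort_keys=True)

def load_preferences(text):
    """Parse a stored JSON string; empty or missing input means no history."""
    return json.loads(text) if text else {}

stored = dump_preferences({"手机": 140.0, "game": 50.0})
restored = load_preferences(stored)
```

Treating an empty or missing value as an empty table matches the earlier statement that every user's map starts out empty.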

The recommendation process runs as a scheduled task at midnight (0:00) every day, using the Quartz job scheduling library. The task covers the decay mechanism as well as the generation of results by each recommendation algorithm. So the recommendations are not real-time. Making them real-time is entirely possible, as long as the server can handle the load.
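What the content-based part of such a nightly task might do is sketched below in Python. The scheduler that triggers it (Quartz in my setup) is deliberately left out. The function name `nightly_job`, its parameters, and the in-memory dicts standing in for database tables are all assumptions for this illustration.

```python
def nightly_job(user_prefs, fresh_articles, lam=0.9, threshold=1.0, top_n=2):
    """Decay and prune every user's table, then rank fresh articles per user.

    user_prefs: {user_id: {keyword: preference}}, updated in place
    fresh_articles: {article_id: {keyword: tfidf}}
    Returns {user_id: [article_id, ...]} with the best match first.
    """
    recommendations = {}
    for user_id, prefs in list(user_prefs.items()):
        # Step 1: exponential decay, dropping values that fall below threshold.
        decayed = {k: v * lam for k, v in prefs.items() if v * lam >= threshold}
        user_prefs[user_id] = decayed
        # Step 2: score each fresh article against the decayed table.
        scores = {}
        for article_id, keywords in fresh_articles.items():
            shared = decayed.keys() & keywords.keys()
            score = sum(decayed[k] * keywords[k] for k in shared)
            if score > 0:
                scores[article_id] = score
        # Step 3: keep the top_n articles; ties are broken by article id.
        ranked = sorted(scores, key=lambda a: (-scores[a], a))
        recommendations[user_id] = ranked[:top_n]
    return recommendations

users = {"u1": {"phone": 100.0, "game": 1.0}}
articles = {
    "n1": {"phone": 50.0},
    "n2": {"game": 80.0},
    "n3": {"phone": 20.0, "launch": 30.0},
}
result = nightly_job(users, articles)
```

In this example the weak "game" preference decays below the threshold and is pruned before scoring, so article n2 is not recommended even though it matches a keyword the user once read about.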

Final remarks

This article only presents my own approach. It came from reading a fair amount of academic literature on recommender systems, then summarizing and adapting it myself. It is not an authoritative method, and corrections are welcome.

I heard a few years ago that ACM holds an annual academic conference on recommender systems called RecSys; interested readers may want to follow it.

If you have questions, feel free to send me a private message!
