Data Mining recommendation system implementation

Source: Internet
Author: Zhang Ronghua

Let's start with the problem. I don't know if you have had this experience, but I run into it often.

Example 1: Some websites send me emails every few days, and each email is full of content I have no interest in at all. The disturbance is annoying, and I have come to hate it.
Example 2: I added an MSN robot with a certain function. Several times a day a window pops up, recommending a pile of content I don't care about at all, until I finally had to block it.

Every reader only wants to see what he is interested in, not things unrelated to him. But how can we know a reader's interests? Data mining? After some thought, I arrived at an idea: predict a user's future behavior from his previous browsing history; that is, content-based recommendation.

Content-based recommendation is a continuation and development of information filtering technology. It is based on the content of the items themselves rather than on users' ratings of them: machine learning is used to derive a user's interest profile from feature descriptions of the content. In a content-based recommendation system, items are described by their feature attributes; the system learns a user's interests from the features of the items the user has evaluated, and then measures how well the user profile matches a candidate item. The user profile model depends on the learning method used; commonly used models include decision trees, neural networks, and vector-space representations. Content-based profiles require historical user data, and they may change as the user's preferences change.

The advantages of the content-based recommendation method are:
1) No data from other users is needed, so there are no cold-start or sparsity problems caused by missing ratings.
2) Users with special or niche interests can still receive recommendations.
3) New or unpopular items can be recommended; there is no new-item problem.
4) By listing the content features of a recommended item, the system can explain why it was recommended.
5) Mature techniques already exist, such as those from classification learning.

The disadvantages are that the content must be reducible to meaningful features, those features must be well structured, the user's taste must be expressible in terms of those content features, and the system cannot exploit the judgments of other users.

There are four major steps in implementing a content-based recommendation system:
1. Collect data, that is, collect user behavior data. There are many ways to do this; based on what I have found and my previous experience, Web logs can serve as our starting point, i.e. our data source.

2. Filter the data. Web logs contain a lot of useless information; we need to remove it and associate log entries with individual users.

3. Analyze the data, using classification and clustering techniques to find the associations between the log data and users. This is the most important step.

4. Output the results.

With this idea in place, we can proceed to the first step: log data collection.
We know that most Web servers keep their own log records. For example, after Apache is installed, there is a logs directory containing its log files. These generally follow a fixed format, recording for each request:
1. the IP address of the host where the browser runs;
2. the date and time of the request;
3. the method used for client-server communication (GET or POST);
4. the URL of the page the client requested;
5. the status code returned by the server;
6. the browser type (user agent) of the client.
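As a sketch of what reading such a log looks like, here is a minimal parser for the Apache combined log format; the sample line and the exact regex are illustrative, not taken from a real server configuration:

```python
import re

# One line of Apache "common" log format, extended with the optional
# referer and user-agent fields of the "combined" format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict with the six fields described above, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = ('192.168.1.10 - - [10/Oct/2023:13:55:36 +0800] '
          '"GET /article/42 HTTP/1.1" 200 2326 '
          '"http://example.com/" "Mozilla/5.0"')
record = parse_log_line(sample)
```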

But this log file has some problems that are hard to overcome, or at least I don't know how to overcome them, so let me describe them first. First, the log records IP addresses, but many computers on a network share the same IP address because they sit behind a common route; the proportion may reach 25%. In this case, an IP address cannot uniquely identify a user. Second, a Web server usually hosts multiple applications, so the access records of the other applications are redundant for our purposes. Furthermore, the Web server's log format is relatively simple, with little flexibility or room for customization, so the proportion of valid data in the logs is small. In addition, requests for static files, such as JS, CSS, and image files, are also recorded by the Web server, and these are useless for content recommendation.

For the above reasons, I think the log data should be customized. To solve user uniqueness, we let the application generate a clientid for each browser and store it in that browser. Then, whenever the browser visits the website, we can identify the browser uniquely. Of course, this still does not uniquely identify the person using the browser, but we can go one step further: if the user logs on to the website, the user ID identifies him uniquely. However, most visitors probably do not log in when using a website; I am like that myself. It does not matter much: even with only the clientid, the error is not large. As computers become more common, a person generally uses one fixed computer, especially at work, so I think the clientid solution is feasible. Some may ask: what if someone's browser blocks cookies? I can only say there is no way around that, but fortunately most people do not block them.
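The clientid idea can be sketched with Python's standard library as follows; the cookie name, the helper's name, and the ten-year lifetime are my own assumptions:

```python
import uuid
from http import cookies

def assign_clientid(request_cookie_header):
    """Return (clientid, set_cookie_header_or_None).

    If the browser already sent a clientid cookie, reuse it; otherwise
    generate a fresh one and return the Set-Cookie header to send back.
    """
    jar = cookies.SimpleCookie(request_cookie_header or "")
    if "clientid" in jar:
        return jar["clientid"].value, None
    clientid = uuid.uuid4().hex
    jar = cookies.SimpleCookie()
    jar["clientid"] = clientid
    jar["clientid"]["max-age"] = 10 * 365 * 24 * 3600  # keep for ~10 years
    jar["clientid"]["path"] = "/"
    return clientid, jar["clientid"].OutputString()
```

On the first visit the application sends the Set-Cookie header back; on every later visit the same clientid comes in with the request and can be written to the log.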

Next we can define the format of the log data we need, for example:
IP, clientid, userid, URL, datetime, method (GET or POST), and so on.
This greatly increases the proportion of valid data.
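A minimal sketch of writing and reading this custom format, assuming tab-separated fields and a "-" placeholder for anonymous visitors (both are my own conventions, not fixed by anything above):

```python
import datetime

FIELDS = ("ip", "clientid", "userid", "url", "datetime", "method")

def format_log_record(ip, clientid, userid, url, method):
    """Serialize one access as a tab-separated line in the custom format.

    userid may be None for visitors who have not logged in; it is logged as "-".
    """
    ts = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    values = (ip, clientid, userid or "-", url, ts, method)
    return "\t".join(values)

def parse_log_record(line):
    """Inverse of format_log_record: return a field-name -> value dict."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))
```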

After obtaining valid data, we need to filter it again:
1. Remove non-content URLs, whose data is useless for recommendation. We must enumerate these non-content URLs by hand, compare them against the log data, and clear the non-content entries from the logs.
2. We also need to remove POST requests from the log data, or simply not record POST requests when logging.
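These two filtering rules might be sketched like this; the static-file extensions and the non-content path prefixes are hypothetical examples that would have to be enumerated by hand for a real site:

```python
import re

# Drop requests for static assets and anything that is not a GET.
STATIC_RE = re.compile(r'\.(js|css|png|jpg|jpeg|gif|ico)(\?.*)?$', re.I)
NON_CONTENT_PREFIXES = ("/admin", "/api", "/static")  # assumed non-content paths

def is_content_request(record):
    """record is a dict with at least "url" and "method" keys."""
    if record["method"].upper() != "GET":
        return False
    url = record["url"]
    if STATIC_RE.search(url):
        return False
    return not url.startswith(NON_CONTENT_PREFIXES)

def filter_records(records):
    """Keep only the log records that point at real content pages."""
    return [r for r in records if is_content_request(r)]
```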

After completing the preceding steps, we can begin the third phase: for each user, collect the URLs he accessed, fetch those URLs, and extract the text contained in the corresponding HTML. From that text we keep the useful parts and then cluster them. In this way we obtain the several categories each user prefers.
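Extracting the useful text from a fetched page can be done with a small parser. This sketch uses Python's standard html.parser and simply drops the contents of script and style tags; a real site would add its own rules for navigation, ads, and so on:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```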

After clustering, we can classify each new article or piece of content into the corresponding category; once it matches a category, we can recommend the new article to the users who prefer that category.
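A minimal sketch of this matching step, assuming each user's preferred category is represented simply by the tokens of the texts he has read: build TF-IDF vectors, then recommend the new article to every user whose profile is cosine-similar enough to it. The 0.1 threshold and the smoothed IDF formula are my own choices:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Return one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) *
                           (math.log((1 + n) / (1 + df[t])) + 1)
                        for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user_profiles, new_doc, threshold=0.1):
    """user_profiles: {user: token list of the texts the user read}.
    Return the users whose profile is similar enough to the new article."""
    users = list(user_profiles)
    vecs = tf_idf_vectors([user_profiles[u] for u in users] + [new_doc])
    new_vec = vecs[-1]
    return [u for u, v in zip(users, vecs) if cosine(v, new_vec) >= threshold]
```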

Problem: the above process only applies to systems that do not use a cache, but large websites generally use Varnish, Squid, and so on. Once a cache is in place, requests served from it never reach the application, so we lose the per-user access logs and have to fall back on the Web server's log data again.

Without Varnish or Squid, the above recommendation system can be implemented using Lucene + jamon + htmlparse.
