A data mining project: implementing a content recommendation system by mining Web logs

Source: Internet
Author: User
Tags: varnish

First, let me describe the problem. I don't know whether everyone has had this kind of experience, but I run into it often.

Example 1: a website sends me e-mails every few days, and none of the content interests me at all. It is more than a minor disturbance; I have come to loathe it.
Example 2: an MSN robot I added pops up a window several times a day, recommending a pile of things I have no desire to know about. It was so annoying that I had to block it.

Every reader wants to see only what interests him, not unrelated material. So how do we learn a reader's interests? Through data mining. After some thought, I arrived at an idea: predict a user's future behavior from his past browsing history, that is, content-based recommendation.
Content-based recommendation is a continuation and development of information-filtering technology. It makes recommendations based on the content of items, without needing other users' opinions or ratings of those items; instead it uses machine learning to derive a user's interests from feature descriptions of the content. In a content-based recommender system, an item or object is defined by the attributes of its relevant features. The system learns the user's interests from the user's evaluations of object features, and then measures how well the user profile matches the items to be scored. The form of the user profile depends on the learning method used, for example decision trees, neural networks, or vector-based representations. Content-based recommendation requires the user's historical data, and the user profile model may change as the user's preferences change.
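The matching between a user profile and candidate items can be sketched with bag-of-words vectors and cosine similarity. This is only a minimal illustration of the idea described above; the texts, item names, and the simple word-count weighting are all made up for the example (a real system would use something like TF-IDF):

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a text into a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# A user profile built from pages the user has read (hypothetical data)
profile = vectorize("python web log mining recommendation python data")

# Candidate items to score against the profile (hypothetical data)
items = {
    "a": "python data mining tutorial",
    "b": "cooking recipes for beginners",
}
scores = {k: cosine(profile, vectorize(v)) for k, v in items.items()}
best = max(scores, key=scores.get)
print(best)  # item "a" overlaps the profile, item "b" does not
```

The item with the highest similarity to the profile is the one to recommend; a threshold can be added so that nothing is recommended when no item matches well.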


The advantages of the content-based recommendation approach are:
1) It needs no data from other users, so there are no cold-start or sparsity problems.
2) It can make recommendations for users with niche interests and hobbies.
3) It can recommend new or unpopular items, so there is no new-item problem.
4) By listing the content features of a recommended item, it can explain why the item was recommended.
5) Relatively mature techniques already exist, such as classification learning.


The disadvantages: the content must be easy to reduce to meaningful features, those features must be well structured, and the user's tastes must be expressible in terms of content features; the method also cannot draw on the explicit judgments of other users.

In general, implementing a content recommendation system takes four major steps:
1. Collect data, that is, collect user behavior data. There are many ways to do this; based on the information I found and my past experience, the Web log can serve as our entry point, i.e. our data source.

2. Filter the data. Web logs contain a great deal of useless information; we need to exclude it and to establish the link between users and log records.

3. Analyze the data, using classification and clustering techniques to find the correlations among log records and between log records and users. This is the most important step.

4. Output the results.

With this idea in mind, we can start with step 1, collecting log data.
Most Web servers keep their own logs. Apache, for example, creates a logs directory at installation that holds its log files, which generally follow a fixed format recording fields such as:
1) the IP address of the host running the browser (IP); 2) the access date and time (datetime); 3) the method the client used to communicate with the server (method, GET or POST); 4) the URL of the page the client requested; 5) the status the server returned (status); 6) the type of the client browser.
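The fields above can be pulled out of a standard Apache "combined" format line with a regular expression. This is a sketch against a made-up sample line; the group names simply follow the field list above:

```python
import re

# Regex for the Apache "combined" log format; group names follow the
# field list above (ip, datetime, method, url, status, agent).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# A hypothetical log line in the combined format
line = ('192.168.0.5 - - [10/Oct/2023:13:55:36 +0800] '
        '"GET /article/42 HTTP/1.1" 200 5120 '
        '"http://example.com/" "Mozilla/5.0"')

m = LOG_RE.match(line)
record = m.groupdict()
print(record["ip"], record["method"], record["url"], record["status"])
```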


But this log file has some problems that are hard to overcome, or at least I do not know how to overcome them, so let me state them. First, the log records IP addresses, and it is well known that many computers on a network can share the same IP address because they sit behind the same router; this proportion may reach 25%. So we cannot uniquely identify a user by IP address. Second, a typical Web server hosts multiple applications, so access records from the other applications are noise for our purposes. Third, the Web server's log format is simple and inflexible, with little room for customization, so valid data makes up only a small share of the log. Finally, the server also logs requests for static files such as JS files, CSS files, and images, all of which are useless for content recommendation.

Based on the reasons above, I think we can customize the log data ourselves. To solve the user-uniqueness problem, we have the application generate a clientid for each browser and store it in that browser, so that whenever the browser visits the site we can identify the browser uniquely. Of course this still does not uniquely identify the browser's user, but we can go one step further: if the user logs in to the site, we can use the user ID to identify the user uniquely. Most visitors may never log in while using the site, but that does not matter much; using the clientid alone is not a big problem, because computer ownership keeps growing and a person generally uses one fixed computer, especially at work. So I think the clientid plan is feasible. Someone may ask: what if a user's browser blocks cookies? Then I can only say there is no way around it, but fortunately most people do not do that.
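The clientid scheme can be sketched with a cookie and a random UUID. This is only an illustration under my own assumptions about the framework-free handling (the function name and cookie lifetime are invented for the example); any web framework would wrap the same logic:

```python
import uuid
from http.cookies import SimpleCookie

def ensure_clientid(cookie_header):
    """Return (clientid, set_cookie_header_or_None).

    If the browser already sent a clientid cookie, reuse it; otherwise
    mint a new id and return the Set-Cookie value needed to persist it.
    """
    cookies = SimpleCookie(cookie_header or "")
    if "clientid" in cookies:
        return cookies["clientid"].value, None
    clientid = uuid.uuid4().hex
    out = SimpleCookie()
    out["clientid"] = clientid
    out["clientid"]["max-age"] = 10 * 365 * 24 * 3600  # long-lived
    out["clientid"]["path"] = "/"
    return clientid, out["clientid"].OutputString()

# First visit: no cookie yet, so a new id is issued
cid, header = ensure_clientid(None)
# Later visit: the browser sends the cookie back, same id is recognized
cid2, header2 = ensure_clientid("clientid=%s" % cid)
print(cid == cid2, header2 is None)
```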

Next we can define the format of the log data we need, for example:
ip, clientid, userid, url, datetime, method (GET or POST), and so on.
This greatly improves the usability of the data.
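Writing and reading this custom format can be sketched with a simple comma-separated line per request. The helper name and the "-" placeholder for anonymous users are my own choices for the example:

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["ip", "clientid", "userid", "url", "datetime", "method"]

def log_line(ip, clientid, userid, url, method):
    """Serialize one access record in the custom comma-separated format.
    An empty userid (user not logged in) is recorded as '-'."""
    buf = io.StringIO()
    csv.writer(buf).writerow(
        [ip, clientid, userid or "-", url,
         datetime.now(timezone.utc).isoformat(), method])
    return buf.getvalue().strip()

line = log_line("10.0.0.1", "c1f3", "u42", "/article/7", "GET")
record = dict(zip(FIELDS, next(csv.reader([line]))))
print(record["url"], record["userid"])
```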

After obtaining this more useful data, we still need to filter it again:
1. Remove non-content URLs, which are also invalid data. These non-content URLs have to be enumerated by ourselves and then matched against the records so they can be removed from the log data.
2. Remove POST requests from the log data, or better, do not record POST requests in the first place.
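The two filtering rules above, plus the static-file rule from earlier, can be combined into one predicate. The extension list and the non-content URL prefixes here are invented examples; in practice they come from the site's own inventory:

```python
from urllib.parse import urlparse

# Example lists only; a real site enumerates these itself.
STATIC_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".gif", ".ico")
NON_CONTENT_PREFIXES = ("/login", "/admin", "/search")

def is_content_request(method, url):
    """Keep only GET requests for actual content pages."""
    if method.upper() != "GET":
        return False          # rule 2: drop POST and other requests
    path = urlparse(url).path.lower()
    if path.endswith(STATIC_EXTENSIONS):
        return False          # drop static assets (JS/CSS/images)
    if path.startswith(NON_CONTENT_PREFIXES):
        return False          # rule 1: drop known non-content URLs
    return True

requests = [
    ("GET", "/article/7"),
    ("GET", "/static/site.css"),
    ("POST", "/comment"),
    ("GET", "/login"),
]
kept = [url for method, url in requests if is_content_request(method, url)]
print(kept)  # ['/article/7']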


After the steps above, we can begin phase 3: for each URL a user visited, fetch the corresponding HTML, extract the useful text from it, and then cluster that text. This yields the handful of categories each user likes.
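The clustering step can be sketched with a greedy single-pass grouping by cosine similarity over bag-of-words vectors. This is a deliberately simple stand-in for the clustering the text describes (a real system might use k-means or hierarchical clustering); the page texts and the 0.3 threshold are made up:

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(pages, threshold=0.3):
    """Greedy single-pass clustering: put each page into the first
    cluster whose seed page is similar enough, else start a new cluster."""
    clusters = []  # list of (seed_vector, member_pages)
    for page in pages:
        v = vectorize(page)
        for seed, members in clusters:
            if cosine(v, seed) >= threshold:
                members.append(page)
                break
        else:
            clusters.append((v, [page]))
    return [members for _, members in clusters]

pages = [
    "python data mining basics",
    "advanced python data mining",
    "football match results today",
]
groups = cluster(pages)
print(len(groups))  # the two mining pages form one group, football another
```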

Once clustering is complete we can start classifying, that is, matching the latest articles or content against the corresponding categories. When a match succeeds, we can assume the new article or content can be recommended to the corresponding users.
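The matching step can then be sketched as: score each new article against every user's interest categories and recommend it to users with a good enough match. The user names, category texts, and threshold below are all hypothetical:

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Each user's interest categories, as produced by the clustering step
# (hypothetical data: one category vector per user here).
user_categories = {
    "alice": [vectorize("python data mining clustering")],
    "bob": [vectorize("football league match results")],
}

def recommend_to(article_text, threshold=0.3):
    """Return the users whose interest categories match the new article."""
    v = vectorize(article_text)
    return sorted(
        user for user, cats in user_categories.items()
        if any(cosine(v, c) >= threshold for c in cats)
    )

print(recommend_to("new python mining tutorial"))  # ['alice']
```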

Problem: the process above only applies to systems without a cache layer. Large websites generally use Varnish, Squid, and the like, and once those are in place we can no longer capture user-access logs in the application, so with Varnish or Squid we have to go back to facing the Web server's log data.

In the Varnish or Squid case, the recommendation system above can still be implemented using Lucene + jamon + htmlparse.

http://www.iteye.com/topic/169512

