Come with Me: Data Mining (20) - Site Log Mining


The Purpose of Collecting Web Logs

Web log mining applies data mining techniques to the log data generated as users access a Web server, in order to discover users' access patterns and interests. Such previously unknown, potentially useful, and understandable information and knowledge about the site can then be used to analyze how the site is visited and to support site management and decision making. Common goals include:

1. To improve site design: by mining user clusters and users' frequent access paths, the links between the site's pages can be modified to match users' browsing habits; at the same time, targeted e-commerce activities and personalized information services can be offered, applying information push/pull technology to build an intelligent website.

2. To analyze site performance: mainly from a statistical point of view, performing rough statistical analysis on the log data items, such as the pages users visit most frequently, the number of visits per unit of time, and the distribution of visits over time. Most existing Web log analysis tools fall into this class.

3. To understand user intent: collecting information about users through their interactions with the site, so that the Web server can tailor pages to each user's request and return customized pages, with the aim of improving user satisfaction and providing personalized service.

Collection Methods

There are three main ways of collecting web analytics data: Web logs, JavaScript tags, and packet sniffers.

1. Web logs

Web log processing flow:

The collection of site analytics data starts with an HTTP request sent when a site visitor enters a URL for the site's server. When the Web server receives the request, it appends a record to its own log file containing the remote hostname (or IP address), the remote logname, the authenticated user name, the date and time of the request, the request details (method, address, protocol), the returned status code, and the size of the requested document. The Web server then returns the page to the visitor's browser for display.

2. JavaScript tags

JavaScript tag processing flow:

JavaScript tag collection, like Web log collection, starts with the site visitor's HTTP request. The difference is that the page returned to the visitor contains a special piece of JavaScript code that is executed when the page is displayed. This code reads detailed information from the visitor's cookie (access time, browser information, the user ID the tool vendor has assigned to the current visitor, and so on) and sends it to the tool vendor's data collection server. The collection server processes the data and stores it in a database, and the website operator views it through an analytics reporting system.

3. Packet sniffers

Packet sniffer processing flow:

A request from a site visitor passes through the packet sniffer before it reaches the Web server, and the sniffer then forwards the request on to the Web server. The data the sniffer collects is processed by the tool vendor's server and stored in a database; the website operator again views it through the analytics reporting system.

The Web Log Mining Process

The overall process is as follows:

1. Data Preprocessing Phase
According to the purpose of the mining task, the data in the raw Web log files is extracted, decomposed, merged, and finally converted into user session files. This is the most critical phase of Web access mining; it includes preprocessing of user access information, of content, and of structure.
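As an illustration, here is a minimal Python sketch of this phase. It assumes each log line has already been parsed into a dict with 'ip', 'agent', 'time' (a datetime), and 'url' fields (parsing itself is sketched in the Nginx section below); the user key used here is a deliberately naive placeholder, refined in the user identification section.

    from collections import defaultdict

    # Requests for embedded resources are normally dropped: they are not page views.
    IGNORED_SUFFIXES = ('.jpg', '.png', '.gif', '.css', '.js', '.ico')

    def to_user_files(records):
        """Extract page views and merge them into one time-ordered list per user."""
        per_user = defaultdict(list)
        for rec in records:
            if rec['url'].lower().endswith(IGNORED_SUFFIXES):
                continue                            # keep page views only
            key = (rec['ip'], rec['agent'])         # naive user key (see below)
            per_user[key].append(rec)
        for requests in per_user.values():
            requests.sort(key=lambda r: r['time'])  # time-ordered per user
        return per_user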

2. Session Identification Phase
This phase is logically part of data preprocessing, but it is treated as a separate stage because the user session sequences it produces are fed directly into the mining algorithms; its precision directly determines the quality of the mining results, making it the most important stage of the process.
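A widely used heuristic starts a new session whenever the gap between two consecutive requests from the same user exceeds a timeout, with 30 minutes as a common default. A minimal sketch, operating on one user's time-ordered requests from the preprocessing step above:

    from datetime import timedelta

    def split_sessions(requests, timeout=timedelta(minutes=30)):
        """Split one user's time-ordered requests into sessions."""
        sessions = []
        for rec in requests:
            # A long silence means the previous session has ended.
            if not sessions or rec['time'] - sessions[-1][-1]['time'] > timeout:
                sessions.append([])
            sessions[-1].append(rec)
        return sessions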

3. Pattern Discovery Phase
Pattern discovery applies a variety of methods and techniques to uncover the potential regularities and patterns in users' Web usage from the Web log data. The algorithms and methods used here come not only from data mining but also from related fields such as machine learning, statistics, and pattern recognition.

The main techniques for pattern discovery are: statistical analysis, association rules, clustering, classification, sequential patterns, and dependency modeling.

(1) Statistical analysis: common statistical techniques include Bayes' theorem, predictive regression, logistic regression, log-linear regression, and so on. They can be used to analyze the frequency of page visits, the time spent on pages, and access paths, and they support system performance analysis, identification of security vulnerabilities, site modification, and marketing decisions.
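For example, the two statistics most log analyzers report, page hit counts and the distribution of visits by hour, reduce to simple counting; a sketch over the parsed records used above:

    from collections import Counter

    def traffic_stats(records):
        """Rough per-page and per-hour visit counts."""
        page_hits = Counter(rec['url'] for rec in records)
        hour_hits = Counter(rec['time'].hour for rec in records)
        return page_hits.most_common(10), sorted(hour_hits.items())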

(2) Association rules: association rule mining is the most basic mining technique and the most common method in WUM (Web usage mining). It is typically used to find pages that are frequently accessed together, which helps optimize the site's organization and assists site designers, content managers, and market analysts; market basket analysis, for example, reveals which products are frequently purchased together and which customers are potential buyers.
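The core of association rule mining is support counting. Here is a sketch that finds pairs of pages viewed together in a sufficient fraction of sessions; a full algorithm such as Apriori also generates larger itemsets and confidence values:

    from collections import Counter
    from itertools import combinations

    def frequent_pairs(sessions, min_support=0.01):
        """Page pairs that co-occur in at least min_support of all sessions."""
        pair_counts = Counter()
        for session in sessions:
            pages = sorted(set(r['url'] for r in session))  # unique pages per session
            pair_counts.update(combinations(pages, 2))
        n = len(sessions)
        return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}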

(3) Clustering: clustering finds groups of similar objects in large volumes of data, using a distance function to measure the similarity between objects. In WUM, users with similar access patterns can be grouped together, which supports market segmentation in e-commerce and personalized services for users.
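As a sketch (using scikit-learn, with made-up page-visit profiles), users can be described by how often they visit each page and grouped with k-means:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction import DictVectorizer

    # Each user profile: pages visited -> visit counts (toy data).
    profiles = [
        {'/home': 5, '/cart': 2, '/checkout': 1},
        {'/home': 4, '/blog': 7},
        {'/home': 1, '/cart': 3, '/checkout': 2},
        {'/blog': 9, '/about': 1},
    ]
    X = DictVectorizer().fit_transform(profiles)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # users with similar visiting patterns share a cluster label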

(4) Classification: the main purpose of classification is to assign user data to one of several predefined classes, and it is closely related to machine learning. Available techniques include decision trees, k-nearest neighbors, naïve Bayes classifiers, and support vector machines.
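A sketch of the decision tree option (scikit-learn again, with made-up behavioral features and class labels):

    from sklearn.tree import DecisionTreeClassifier

    # Toy features per user: [pages per session, minutes on site, total visits]
    X = [[12, 30, 9], [2, 1, 1], [8, 22, 5], [1, 2, 1]]
    y = ['buyer', 'bouncer', 'buyer', 'bouncer']  # classes from past behavior

    model = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(model.predict([[10, 25, 6]]))  # assign a new user to a class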

(5) Sequential patterns: given a set of sequences, where each sequence is an ordered list of elements and each element consists of items, and given a user-specified minimum support threshold, sequential pattern mining finds all frequent subsequences, that is, all subsequences whose frequency of occurrence in the sequence set is no less than the minimum support threshold.
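The heart of this task is testing whether a candidate subsequence occurs in a session (order preserved, gaps allowed) and counting its support; full miners such as GSP or PrefixSpan organize the search over candidates. A sketch:

    def is_subsequence(candidate, sequence):
        """True if candidate occurs in sequence in order, gaps allowed."""
        it = iter(sequence)
        return all(page in it for page in candidate)

    def support(candidate, sequences):
        """Fraction of sequences containing the candidate subsequence."""
        return sum(is_subsequence(candidate, s) for s in sequences) / len(sequences)

    sessions = [['/home', '/list', '/item', '/cart'],
                ['/home', '/item', '/cart', '/pay'],
                ['/home', '/list', '/about']]
    print(support(['/home', '/item', '/cart'], sessions))  # 2/3: frequent if min_sup <= 2/3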

(6) Dependency modeling: a dependency exists between two elements when the value of one element A allows the value of another element B to be inferred; B is then said to depend on A.

4. Pattern Analysis Phase
Pattern analysis is the last step of Web usage mining. Its main purpose is to filter the rules and patterns produced in the pattern discovery phase, removing useless ones, and to present the discovered patterns in an understandable form. Because Web usage mining is in most cases unbiased learning, it can dig out all possible patterns and rules, so it is inevitable that some of the patterns are common sense, trivial, or of no interest to end users; pattern analysis methods are therefore needed to make the rules and knowledge readable and ultimately understandable. Common pattern analysis methods include graphics and visualization techniques, database query mechanisms, mathematical statistics, and usability analysis.

What the Collected Data Includes

The data collected mainly include:

    • a global UUID
    • the access date and time
    • the IP address of the server that generated the log entry
    • the action the client attempted to perform (the request method)
    • the server resource the client accessed
    • the query the client tried to execute
    • the port number the client connected to
    • the authenticated user name on the server
    • the IP address of the client that sent the request
    • the client's operating system, browser, and other agent information
    • the status code of the operation (e.g. 200), along with the sub-status and Win32 status codes

User identification

For site operators, identifying users effectively and accurately is critical: it greatly helps site operations, for example by enabling targeted recommendations.

The user identification method is as follows:
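One common heuristic prefers a persistent cookie identifier when available (for example, one set by a JavaScript tag) and falls back to the combination of IP address and user agent; a minimal sketch, with assumed field names:

    def user_key(record):
        """Best-effort visitor key: prefer a tracking cookie, then IP + user agent."""
        cookie_id = record.get('cookie_id')        # e.g. set by a JavaScript tag
        if cookie_id:
            return ('cookie', cookie_id)
        # Fallback: different users behind one proxy IP may still collide here.
        return ('ip+agent', record['ip'], record.get('agent', ''))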

Using HDFS Storage

After the data has been collected on the server, you can consider storing it in HDFS on Hadoop, depending on the data volume.

If you are unfamiliar with HDFS, you can refer to:

http://www.niubua.com/?p=1107

In today's enterprises, logs are generally produced by more than one server, and they include both logs generated by Nginx and custom-format logs produced in application code with log4j.

The usual architecture is as follows:

Analyzing Nginx Logs with MapReduce

The default log format for Nginx is as follows:

222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

The variables are interpreted as follows:

    • remote_addr: the client's IP address, e.g. 222.68.172.190
    • remote_user: the client user name, "-" when not available
    • time_local: the access time and time zone, e.g. [18/Sep/2013:06:49:57 +0000]
    • request: the requested URL and HTTP protocol, e.g. "GET /images/my.jpg HTTP/1.1"
    • status: the request status; 200 on success
    • body_bytes_sent: the size of the response body sent to the client, e.g. 19939
    • http_referer: the page from which the request was linked, e.g. "http://www.angularjs.cn/A00n"
    • http_user_agent: information about the client's browser, e.g. "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
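Based on the field list above, each line in this format can be parsed with a regular expression; here is a sketch whose group names mirror the Nginx variable names:

    import re
    from datetime import datetime

    NGINX_LOG = re.compile(
        r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) '
        r'\[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
        r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"')

    def parse_line(line):
        """Turn one access-log line into a dict, or None if malformed."""
        m = NGINX_LOG.match(line)
        if not m:
            return None
        rec = m.groupdict()
        # '18/Sep/2013:06:49:57 +0000' -> datetime (offset ignored in this sketch)
        rec['time'] = datetime.strptime(rec['time_local'].split()[0],
                                        '%d/%b/%Y:%H:%M:%S')
        parts = rec['request'].split()
        rec['url'] = parts[1] if len(parts) == 3 else ''
        return rec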

You can use MapReduce directly for log analysis:
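Here is a sketch using Hadoop Streaming, which lets the mapper and reducer stay in Python (a native Java MapReduce job follows the same shape): the mapper emits one record per page view, and the reducer sums the hits per URL.

    #!/usr/bin/env python
    # mapper.py: emit "url <TAB> 1" for every well-formed log line on stdin
    import re
    import sys

    REQUEST = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*"')

    for line in sys.stdin:
        m = REQUEST.match(line)
        if m:
            print('%s\t1' % m.group(1))

    #!/usr/bin/env python
    # reducer.py: sum the counts per URL (Hadoop feeds input sorted by key)
    import sys

    current, count = None, 0
    for line in sys.stdin:
        url, n = line.rstrip('\n').rsplit('\t', 1)
        if url != current and current is not None:
            print('%s\t%d' % (current, count))
            count = 0
        current = url
        count += int(n)
    if current is not None:
        print('%s\t%d' % (current, count))

A job like this is typically submitted with the hadoop-streaming JAR, passing mapper.py and reducer.py via the -mapper, -reducer, and -file options.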

The results computed in Hadoop are then exported on a schedule to a relational database for presentation.

For a detailed analysis, you can refer to this article:

http://www.tuicool.com/articles/2ANJZz

You can also use Hive instead of MapReduce for the analysis.

Summary

Web log collection is a process that every Internet company must handle. Collecting the data and then mining it appropriately improves the site's overall operation and optimization, truly achieving data-driven analysis and operations.
