Website statistical analysis tools are something webmasters and operators use all the time; common ones include Google Analytics, Baidu Tongji, and Tencent Analysis. The first step for all of these tools is collecting website access data. The current mainstream collection methods are basically JavaScript-based. This article briefly analyzes the principle behind this kind of data collection, and then builds a practical data collection system step by step.
Analysis of the Data Collection Principle
In short, a website statistical analysis tool needs to collect the behavior of users browsing the target website (such as opening a page, clicking a button, adding goods to the shopping cart) plus the data attached to each behavior (such as the order amount produced by a single action). Early website statistics often collected only one user behavior: opening a page; behaviors performed afterwards on the page could not be collected. This collection strategy can satisfy basic analysis perspectives such as traffic analysis, source analysis, content analysis, and visitor attributes. However, with the wide use of Ajax and the increasingly strong demand of e-commerce websites for statistical analysis of e-commerce goals, this traditional collection strategy has become inadequate.
Later, Google innovatively introduced customizable data collection scripts in its product Google Analytics: through the extensible interface Google Analytics defines, users can implement tracking and analysis of custom events and custom metrics with only a small amount of JavaScript code. At present, products such as Baidu Tongji and Sogou Analytics have copied the Google Analytics model.
In fact, the basic principle and flow of the two data collection patterns are the same; the latter simply collects more information through JavaScript. Let's now look at the basic principle of data collection behind today's website statistics tools.
Process Overview
First, let's look at the basic process of data collection through a diagram.
Figure 1. Basic process of website statistic data collection
First, the user's behavior triggers an HTTP request from the browser to the page being counted; let's say the behavior is opening the page. When the page opens, the JavaScript fragment embedded in it executes. Anyone who has used such tools knows that a website statistics tool generally asks the user to add a small piece of JavaScript to the page, which typically creates a script tag dynamically and points its src to a separate JS file. That separate JS file (the green node in Figure 1) is then requested and executed by the browser, and it is usually the real data collection script. After data collection is complete, the JS requests a back-end data collection script (the backend in Figure 1). This is typically a dynamic script disguised as an image, possibly written in PHP, Python, or another server-side language. The JS passes the collected data to the back-end script as HTTP parameters; the back-end script parses the parameters, writes them to the access log in a fixed format, and may plant some cookies for the client in the HTTP response for visitor tracking.
The above is the general data collection flow. Below, taking Google Analytics as an example, each phase is analyzed in more detail.
Embedding Code Execution Phase
To use Google Analytics (hereinafter GA), you need to insert a piece of JavaScript it provides into the page, often referred to as the embedding code (or tracking code). Here is a screenshot of the GA embedding code placed in my blog:
Figure 2. Google Analytics embedding code
Here _gaq is GA's global array, used to hold various configurations, each of which has the following format:
_gaq.push(['action', 'param1', 'param2', ...]);
action specifies the configuration action, followed by its parameter list. The default embedding code given by GA contains two preset configurations: _setAccount sets the website identification ID, which is assigned when you register with GA, and _trackPageview tells GA to track one page view. For more configuration options, see https://developers.google.com/analytics/devguides/collection/gajs/. In fact, _gaq is used as a FIFO queue, and configuration code does not have to appear before the embedding code; refer to the link above for details.
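As a small illustrative sketch of the queue semantics (the account ID below is a placeholder), configurations can be pushed onto _gaq both before and after ga.js loads; once loaded, ga.js replaces the plain array with an object whose push() executes commands immediately:
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-XXXXXXXX-X']); // website identification ID (placeholder)
_gaq.push(['_trackPageview']);               // track one page view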
As far as this article is concerned, the _gaq mechanism is not the focus; the emphasis is on the anonymous function after it, which is what the embedding code really needs to do. Its main purpose is to introduce an external JS file (ga.js): it creates a script element through document.createElement, points its src to the appropriate ga.js according to the protocol (HTTP or HTTPS), and finally inserts the element into the page's DOM tree.
Note that ga.async = true means the external JS file is loaded asynchronously: it does not block browser parsing, and the script executes asynchronously after the download completes. This attribute was newly introduced by HTML5.
Data Collection Script Execution Phase
The data collection script (ga.js) executes after being requested. The script typically does the following things:
1. Collect information through the browser's built-in JavaScript objects, such as the page title (via document.title), the referrer (the URL jumped from, via document.referrer), the user's display resolution (via window.screen), cookie information (via document.cookie), and so on.
2. Parse the configuration information in _gaq. This may include user-defined event tracking and business data (such as product IDs on an e-commerce website).
3. Parse and splice the data collected in the two steps above according to a predefined format.
4. Request a back-end script, carrying the information to the back end in the parameters of an HTTP request.
The only problem here is in step 4: JavaScript requests to a back-end script are often made with Ajax, but Ajax cannot make cross-domain requests. ga.js executes in the domain of the website being counted, while the back-end script is in another domain (GA's back-end statistics script is http://www.google-analytics.com/__utm.gif), so Ajax does not work. A common method is for the JS script to create an Image object, point the Image object's src attribute at the back-end script, and carry the parameters in it; the cross-domain request to the back end is thereby achieved. This is also why back-end scripts are often disguised as GIF files. You can see ga.js's request to __utm.gif through an HTTP capture:
Figure 3. HTTP packets for backend script requests
You can see that ga.js carries a lot of information when requesting __utm.gif; for example, utmsr=1280x1024 is the screen resolution, and utmac=UA-35712773-1 is my GA identification ID parsed from _gaq, and so on.
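A minimal sketch of this image-beacon technique (the endpoint is GA's real __utm.gif; the parameter values here are illustrative only):
// browsers place no cross-domain restriction on image loading, so the
// query string reaches a back end that lives in another domain
var beacon = new Image(1, 1);
beacon.src = 'http://www.google-analytics.com/__utm.gif?utmsr=1280x1024&utmac=UA-XXXXXXXX-X';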
It is worth noting that __utm.gif is not requested only when the embedding code executes; if event tracking is configured with _trackEvent, this script is also requested each time the event occurs.
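For example (a hedged sketch; the category, action, and label values are made up), an event configured like this causes another __utm.gif request each time it fires:
_gaq.push(['_trackEvent', 'shopping-cart', 'add-item', 'SKU-12345']); // category, action, label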
Because ga.js is compressed and obfuscated, its readability is very poor, so we will not analyze it here; in the implementation phase later I will implement a script with similar functionality.
Back-End Script Execution Phase
GA's __utm.gif is a script disguised as a GIF. A back-end script like this typically completes the following things:
1. Parse the information out of the HTTP request parameters.
2. Obtain from the web server some information the client cannot obtain, such as the visitor's IP.
3. Write the information to a log in the agreed format.
4. Generate a 1x1 empty GIF image as the response content, and set the Content-Type response header to image/gif.
5. Set the required cookies via Set-Cookie in the response header.
Cookies are set because, to track a unique visitor, the common practice is: if the client does not carry the specified tracking cookie on request, generate a globally unique cookie according to some rule and plant it on the user; otherwise, put the tracking cookie obtained from the request back into Set-Cookie to keep the same user's cookie unchanged (see Figure 4).
Figure 4. The principle of tracking unique users through cookies
Although this approach is not perfect (for example, a user who clears cookies or changes browsers is counted as two users), it is the widely used method at present. Note that if there is no need to track the same user across sites, the cookie can be planted in the domain of the website being counted through JS (GA does this); if the whole network is to be uniformly identified, it is planted in the server's domain through the back-end script (our implementation below does this).
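Below is a minimal sketch of this tracking-cookie logic, written in Node.js purely for illustration (an assumption of mine; the real back end in this article uses Nginx and Lua, shown later):
var crypto = require('crypto');

function traceUid(req, res) {
    // look for an existing __utrace cookie in the Cookie request header
    var match = /(?:^|;\s*)__utrace=([^;]+)/.exec(req.headers['cookie'] || '');
    var uid = match ? match[1]
        // none found: generate one, e.g. as md5(timestamp + IP + user agent)
        : crypto.createHash('md5')
                .update(Date.now() + (req.socket.remoteAddress || '') + (req.headers['user-agent'] || ''))
                .digest('hex');
    // plant (or re-plant) the cookie so the same visitor keeps the same id
    res.setHeader('Set-Cookie', '__utrace=' + uid + '; Path=/');
    return uid;
}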
Design and Implementation of the System
Based on the principles above, I built an access log collection system myself. Overall, building the system requires the following work:
Figure 5. Access to data collection system work breakdown
Determining What Information to Collect
For simplicity, I am not going to implement GA's full data collection model, but instead collect the following information.
Name | Collected via | Note
Access time | Web server | Nginx $msec
IP | Web server | Nginx $remote_addr
Domain name | JavaScript | document.domain
URL | JavaScript | document.URL
Page title | JavaScript | document.title
Resolution | JavaScript | window.screen.height & window.screen.width
Color depth | JavaScript | window.screen.colorDepth
Referrer | JavaScript | document.referrer
Browser client | Web server | Nginx $http_user_agent
Client language | JavaScript | navigator.language
Visitor identification | Cookie |
Website identification | JavaScript | custom object
Embedding Code
The embedding code borrows GA's pattern, but for now I will not use the configuration object as a FIFO queue. The template of the embedding code is as follows:
<script type="text/javascript">
var _maq = _maq || [];
_maq.push(['_setAccount', 'website identification']);
(function() {
    var ma = document.createElement('script'); ma.type = 'text/javascript'; ma.async = true;
    ma.src = ('https:' == document.location.protocol ? 'https://analytics' : 'http://analytics') + '.codinglabs.org/ma.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ma, s);
})();
</script>
Here I have enabled the second-level domain analytics.codinglabs.org, and the statistics script is named ma.js. Of course, there is a small problem here: I do not have an HTTPS server, so if the code is deployed on an HTTPS site there will be a problem, but let's ignore that here.
Front-End Statistics Script
I wrote a statistics script, ma.js, which is not perfect but can do the basic work:
(function () {
    var params = {};
    // document object data
    if (document) {
        params.domain = document.domain || '';
        params.url = document.URL || '';
        params.title = document.title || '';
        params.referrer = document.referrer || '';
    }
    // window object data
    if (window && window.screen) {
        params.sh = window.screen.height || 0;
        params.sw = window.screen.width || 0;
        params.cd = window.screen.colorDepth || 0;
    }
    // navigator object data
    if (navigator) {
        params.lang = navigator.language || '';
    }
    // parse the _maq configuration
    if (_maq) {
        for (var i in _maq) {
            switch (_maq[i][0]) {
                case '_setAccount':
                    params.account = _maq[i][1];
                    break;
                default:
                    break;
            }
        }
    }
    // splice the parameter string
    var args = '';
    for (var i in params) {
        if (args != '') {
            args += '&';
        }
        args += i + '=' + encodeURIComponent(params[i]);
    }
    // request the back-end script via an Image object
    var img = new Image(1, 1);
    img.src = 'http://analytics.codinglabs.org/1.gif?' + args;
})();
The entire script is wrapped in an anonymous function to ensure the global environment is not polluted. Its functionality has already been explained in the principle section and is not repeated here. 1.gif is the back-end script.
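For instance, with values borrowed from the test log shown later in this article (URL-encoding omitted for readability), the beacon request assembled by ma.js would look roughly like this:
http://analytics.codinglabs.org/1.gif?domain=www.codinglabs.org&url=http://www.codinglabs.org/&title=CodingLabs&referrer=&sh=1024&sw=1280&cd=24&lang=zh-CN&account=U-1-1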
Log Format
The log stores one record per line, using the invisible character ^A (ASCII 0x01; under Linux it can be entered with Ctrl-V Ctrl-A; below, "^A" stands for the invisible character 0x01) as the field separator, in the following format:
time^AIP^Adomain^AURL^Apage title^Areferrer^Aresolution height^Aresolution width^Acolor depth^Alanguage^Aclient information^Auser identification^Awebsite identification
Back-End Script
To be simple and efficient, I intend to use Nginx's access_log for log collection. One problem is that the logical expressiveness of Nginx configuration itself is limited, so I chose OpenResty to do this. OpenResty is a high-performance application development platform based on and extending Nginx, with many useful modules integrated; at its core is the integration of Lua through the ngx_lua module, which lets business logic be expressed through Lua in Nginx configuration files. I will not introduce this platform at length here; interested readers can refer to its official website http://openresty.org/, or this very nice introductory OpenResty slide deck by its author Zhang Yichun (agentzh): http://agentzh.org/misc/slides/ngx-openresty-ecosystem/. For ngx_lua, see https://github.com/chaoslawful/lua-nginx-module.
First, you need to define the log format in the Nginx configuration file:
log_format tick "$msec^A$remote_addr^A$u_domain^A$u_url^A$u_title^A$u_referrer^A$u_sh^A$u_sw^A$u_cd^A$u_lang^A$http_user_agent^A$u_utrace^A$u_account";
Note that variables beginning with u_ are ones we define ourselves; the others are Nginx built-in variables.
Then come the two core location blocks:
location /1.gif {
    #disguise as a gif file
    default_type image/gif;
    #turn off access_log here; logging is done via a subrequest
    access_log off;

    access_by_lua "
        -- the user tracking cookie is named __utrace
        local uid = ngx.var.cookie___utrace
        if not uid then
            -- if absent, generate a tracking cookie; the algorithm is md5(timestamp + ip + client info)
            uid = ngx.md5(ngx.now() .. ngx.var.remote_addr .. ngx.var.http_user_agent)
        end
        ngx.header['Set-Cookie'] = {'__utrace=' .. uid .. '; path=/'}
        if ngx.var.arg_domain then
            -- log via a subrequest to /i-log, carrying the parameters and the user tracking cookie
            ngx.location.capture('/i-log?' .. ngx.var.args .. '&utrace=' .. uid)
        end
    ";

    #do not cache this request
    add_header Expires "Fri, 01 Jan 1980 00:00:00 GMT";
    add_header Pragma "no-cache";
    add_header Cache-Control "no-cache, max-age=0, must-revalidate";

    #return a 1x1 empty GIF image
    empty_gif;
}

location /i-log {
    #internal location; direct external access is not allowed
    internal;

    #set variables; note that the values need to be unescaped
    set_unescape_uri $u_domain $arg_domain;
    set_unescape_uri $u_url $arg_url;
    set_unescape_uri $u_title $arg_title;
    set_unescape_uri $u_referrer $arg_referrer;
    set_unescape_uri $u_sh $arg_sh;
    set_unescape_uri $u_sw $arg_sw;
    set_unescape_uri $u_cd $arg_cd;
    set_unescape_uri $u_lang $arg_lang;
    set_unescape_uri $u_utrace $arg_utrace;
    set_unescape_uri $u_account $arg_account;

    #enable subrequest logging
    log_subrequest on;
    #log to ma.log in the tick format; in practice it is best to add a buffer
    access_log /path/to/logs/directory/ma.log tick;

    #output an empty string
    echo '';
}
Fully explaining every detail of this configuration is somewhat beyond the scope of this article, and it uses a number of third-party Nginx modules (all included in OpenResty). The key points are annotated in the comments; you do not need to understand every line, just roughly know that this configuration completes the back-end logic we discussed in the principle section.
Log Rotation
A real log collection system receives a great many access logs; over time the log file grows very large, and keeping all logs in one file is hard to manage. So logs are usually split by time period, for example one log per day or per hour. Here, to make the effect obvious, I cut one log per hour. This is done by having crontab periodically call a shell script, which is as follows:
#!/bin/bash
_prefix="/path/to/nginx"
time=`date +%Y%m%d%H`
mv ${_prefix}/logs/ma.log ${_prefix}/logs/ma/ma-${time}.log
kill -USR1 `cat ${_prefix}/logs/nginx.pid`
This script moves ma.log to the specified directory, renames it to ma-{yyyymmddhh}.log, and then sends the USR1 signal to Nginx to make it reopen the log file.
Then add a line to /etc/crontab:
59 * * * * root /path/to/directory/rotatelog.sh
This launches the rotation script at the 59th minute of every hour to rotate the log.
Test
Now we can test whether the system works normally. I planted the embedding code in my blog yesterday, and through an HTTP capture I can see that ma.js and 1.gif are requested correctly:
Figure 6. HTTP packet parsing ma.js and 1.gif requests
You can also look at the request parameters for 1.gif:
Figure 7. Request parameters for 1.gif
The relevant information is indeed placed in the request parameters.
Then I opened the log file with tail and refreshed the page; since there is no access log buffer, a new log record appeared immediately:
1351060731.360^A0.0.0.0^Awww.codinglabs.org^Ahttp://www.codinglabs.org/^ACodingLabs^A^A1024^A1280^A24^Azh-CN^AMozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4^A4d612be64366768d32e623d594e82678^AU-1-1
Note that the ^A is actually invisible in the raw log; here I replaced it with a visible ^A for easy reading. Also, the IP has been replaced with 0.0.0.0 for privacy.
Looking at the log directory, because the embedding code had already been in place for a while, many rotated files have been generated:
Figure 8. Rotation log
About Analysis
Through the analysis and development above, we can roughly understand how a website statistics log collection system works. With these logs, subsequent analysis can be carried out. This article focuses only on log collection, so it will not say much about analysis.
Note that the raw log should retain as much information as possible and avoid excessive filtering and processing. For example, MyAnalytics above keeps the millisecond timestamp rather than a formatted time; formatting the time is the job of downstream systems, not the responsibility of the log collection system. Downstream systems can derive a lot from the raw logs: with an IP library the visitor's region can be located, the user agent yields the visitor's operating system, browser, and other information, and combined with complex analysis models, traffic, source, visitor, region, and path analysis can all be done. Of course, the raw logs are usually not analyzed directly; instead they are cleaned and reformatted and then transferred elsewhere, such as MySQL or HBase, for analysis.
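As a toy illustration only (not part of the system above), a raw line in the format defined earlier can be split into named fields as a first cleaning step before loading into something like MySQL or HBase:
var FIELDS = ['time', 'ip', 'domain', 'url', 'title', 'referrer',
              'sh', 'sw', 'cd', 'lang', 'agent', 'utrace', 'account'];

function parseLine(line) {
    // fields are separated by the invisible character 0x01 (^A)
    var record = {};
    line.split('\u0001').forEach(function (value, i) {
        record[FIELDS[i]] = value;
    });
    return record;
}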
There is plenty of open-source infrastructure for the analysis part; for example, real-time analysis can use Storm, while offline analysis can use Hadoop. Of course, when logs are small, simple analysis can also be done with shell commands; for example, the following three commands compute, respectively, my blog's traffic (PV), visitor count (UV), and unique IP count (IP) from 8 to 9 o'clock this morning:
awk -F^A '{print $1}' ma-2012102409.log | wc -l
awk -F^A '{print $12}' ma-2012102409.log | sort | uniq | wc -l
awk -F^A '{print $2}' ma-2012102409.log | sort | uniq | wc -l
Other interesting things are left for readers to dig into slowly.
References
GA developer documentation: https://developers.google.com/analytics/devguides/collection/gajs/
An article on implementing logging with Nginx: http://blog.linezing.com/2011/11/%E4%BD%BF%E7%94%A8nginx%E8%AE%B0%E6%97%A5%E5%BF%97
About Nginx: http://wiki.nginx.org/Main
OpenResty official website: http://openresty.org
ngx_lua module: https://github.com/chaoslawful/lua-nginx-module
The HTTP captures in this article were made with the Chrome browser's developer tools; the mind map was drawn with XMind; the process and structure diagrams were drawn with TikZ/PGF.