Why do statistical analysis of website traffic data?
With the advent of the big data era, the data generated by every industry is growing explosively. Big data technology has gone from seemingly out of reach to practical reality, and the potential value hidden in the data people generate is gradually being tapped and put to use across industries. Website traffic statistics and analysis is one example: it helps webmasters, operators, and marketing staff obtain website traffic information in real time and provides data-driven analysis of the site in terms of traffic sources, site content, visitor characteristics, and more. This helps increase website traffic, improve the site's user experience, turn more visitors into members or customers, and obtain the greatest revenue from the least investment.
Principles of website traffic log data collection
First, the user's behavior, such as opening a web page, triggers an HTTP request from the browser to the page being measured. When the page is opened, the JavaScript code embedded in the page is executed.

Burying a point means adding a small piece of JavaScript to the web page in advance. This code snippet dynamically creates a script tag and points its src attribute at a separate JS file; that separate JS file is then requested and executed by the browser, and this JS is usually the real data collection script.

After data collection is complete, the JS requests a back-end data collection script (usually a dynamic script disguised as an image). The JS passes the collected data to the back-end script as HTTP request parameters; the back-end script parses the parameters, records them in the access log in a fixed format, and may also plant some cookies in the HTTP response for tracking the client.
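To make this concrete, the collection request that finally reaches the back end is just an ordinary GET request for a (fake) image, with the data attached as query parameters. A hypothetical example follows; the host is a placeholder and the parameter names are taken from the sample collection script shown later in this article:

```
GET http://collector.example.com/log.gif?domain=example.com&url=http%3A%2F%2Fexample.com%2F&title=Home&sh=1080&sw=1920&cd=24&lang=en-US&account=UA-XXXXX-X HTTP/1.1
```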
Design implementation
Based on the analysis of the principle above, and drawing on Google Analytics, building a custom log data collection system requires the following work:
Determine the information to collect
| Name | Collection method | Note |
| --- | --- | --- |
| Access time | Web server | Nginx $msec |
| IP | Web server | Nginx $remote_addr |
| Domain name | JavaScript | document.domain |
| URL | JavaScript | document.URL |
| Page title | JavaScript | document.title |
| Resolution | JavaScript | window.screen.height & width |
| Color depth | JavaScript | window.screen.colorDepth |
| Referrer | JavaScript | document.referrer |
| Browser client | Web server | Nginx $http_user_agent |
| Client language | JavaScript | navigator.language |
| Visitor ID | Cookie | Nginx $http_cookie |
| Website identifier | JavaScript | Custom object |
| Status code | Web server | Nginx $status |
| Bytes sent | Web server | Nginx $body_bytes_sent |
Determine the buried-point code
Burying a point is a common data collection method in website analytics. Its core is to embed statistics code at the key points where data needs to be collected. Taking the Google Analytics prototype as an example, a piece of JavaScript, commonly referred to as the buried-point code, needs to be inserted into the page. (Google's embed code is used as the example here.)
```html
<script type="text/javascript">
var _maq = _maq || [];
_maq.push(['_setAccount', 'UA-XXXXX-X']);
(function () {
    var ma = document.createElement('script');
    ma.type = 'text/javascript';
    ma.async = true;
    ma.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ma.js';
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(ma, s);
})();
</script>
```
Here _maq is a global array used to hold the various configuration items; each item is pushed in the following format:
_maq.push(['Action', 'param1', 'param2', ...]);
The mechanics of _maq are not the focus here; the key part is the anonymous function that follows. Its main purpose is to load an external JS file (ma.js): it creates a script element with document.createElement, points its src at the appropriate ma.js according to the protocol (HTTP or HTTPS), and finally inserts the element into the page's DOM tree.

Note that ma.async = true means the external JS file is loaded asynchronously, i.e. it does not block the browser's parsing and is executed asynchronously once the download completes. This attribute was newly introduced in HTML5.
Front-end data collection scripts
When the data collection script (ma.js) is requested, it is executed. It typically does the following things:

- Collect information through the browser's built-in JavaScript objects, such as the page title (via document.title), the referrer (the URL of the previous page, via document.referrer), the user's screen resolution (via window.screen), cookie information (via document.cookie), and so on.
- Parse the _maq array and collect the configuration information. This may include custom event tracking and business data (such as product IDs on an e-commerce site).
- Encode and splice the data collected in the previous two steps into a predefined format (a GET request parameter string).
- Request a back-end script, passing the information to it as HTTP request parameters.

The only problem is step 4: JavaScript usually requests back-end scripts via Ajax, but Ajax cannot make cross-domain requests. A common approach is to create an Image object in the JS script and point its src attribute at the back-end script with the parameters attached; this achieves a cross-domain request to the back end. It is also why back-end scripts are often disguised as GIF files.
Sample code
```javascript
(function () {
    var params = {};
    // Document object data
    if (document) {
        params.domain = document.domain || '';
        params.url = document.URL || '';
        params.title = document.title || '';
        params.referrer = document.referrer || '';
    }
    // Window object data
    if (window && window.screen) {
        params.sh = window.screen.height || 0;
        params.sw = window.screen.width || 0;
        params.cd = window.screen.colorDepth || 0;
    }
    // Navigator object data
    if (navigator) {
        params.lang = navigator.language || '';
    }
    // Parse the _maq configuration
    if (_maq) {
        for (var i in _maq) {
            switch (_maq[i][0]) {
                case '_setAccount':
                    params.account = _maq[i][1];
                    break;
                default:
                    break;
            }
        }
    }
    // Splice the parameters into a query string
    var args = '';
    for (var i in params) {
        if (args != '') {
            args += '&';
        }
        args += i + '=' + encodeURIComponent(params[i]);
    }
    // Request the back-end script via an Image object
    var img = new Image(1, 1);
    img.src = 'http://xxx.xxxxx.xxxxx/log.gif?' + args;
})();
```
The entire script is wrapped in an anonymous function to ensure it does not pollute the global environment. Here log.gif is the back-end script.
Back-end scripts
log.gif is the back-end script, a script disguised as a GIF image. A back-end script typically needs to do the following things:

- Parse the HTTP request parameters to obtain the information.
- Obtain from the web server the information the client cannot provide, such as the visitor's IP.
- Write the information to the access log in a fixed format.
- Generate a 1x1 empty GIF image as the response body and set the response's Content-Type header to image/gif.
- Set any required cookies in the response header via Set-Cookie.

The reason for setting a cookie is unique-visitor tracking. The common practice is: if the request carries no designated tracking cookie, generate a globally unique cookie according to some rule and plant it on the user; otherwise, put the tracking cookie received in the request back into Set-Cookie so that the same user's cookie stays unchanged. Although this approach is not perfect (for example, a user who clears cookies or switches browsers is counted as two users), it is widely used at present.
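As an illustration only (this is not the implementation used in this article, which is built on OpenResty below), a minimal back end of this kind could be sketched in Node.js roughly as follows. The port, the log file name ma.log, and the parameter names are assumptions borrowed from the sample collection script above; the cookie name and the md5(timestamp + IP + client info) rule follow the description in this article.

```javascript
// Hypothetical sketch (not this article's implementation): a minimal Node.js
// back end that mimics the behavior described above. Port, file name and
// parameter names are assumptions based on the sample collection script.
const http = require('http');
const crypto = require('crypto');
const fs = require('fs');
const url = require('url');

// 1x1 transparent GIF used as the response body
const EMPTY_GIF = Buffer.from(
  'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64');

http.createServer(function (req, res) {
  const parsed = url.parse(req.url, true);
  if (parsed.pathname !== '/log.gif') {
    res.writeHead(404);
    return res.end();
  }

  // 1. Parse the HTTP request parameters sent by the collection script
  const params = parsed.query;
  const q = function (v) { return v || ''; };

  // 2. Add information only the server can see, e.g. the visitor IP
  const ip = req.socket.remoteAddress || '';
  const ua = req.headers['user-agent'] || '';

  // 3. Unique-visitor cookie: reuse __utrace if present, otherwise generate
  //    one as md5(timestamp + ip + client info), as described in the article
  const match = (req.headers.cookie || '').match(/__utrace=([^;]+)/);
  const uid = match ? match[1]
    : crypto.createHash('md5').update(Date.now() + ip + ua).digest('hex');

  // 4. Write one ||-delimited line to the log file
  const line = [Date.now() / 1000, ip, q(params.domain), q(params.url),
    q(params.title), q(params.referrer), q(params.sh), q(params.sw),
    q(params.cd), q(params.lang), ua, q(params.account), uid].join('||');
  fs.appendFile('ma.log', line + '\n', function () {});

  // 5. Respond with the empty GIF and plant / refresh the tracking cookie
  res.writeHead(200, {
    'Content-Type': 'image/gif',
    'Set-Cookie': '__utrace=' + uid + '; path=/',
    'Cache-Control': 'no-cache, max-age=0, must-revalidate'
  });
  res.end(EMPTY_GIF);
}).listen(8080);
```

Requesting /log.gif with a few query parameters then appends one ||-delimited line to ma.log and plants the __utrace cookie, which is the same flow the OpenResty configuration below implements inside the Nginx layer itself.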
We use Nginx's access_log for log collection, but there is a problem: the logic that can be expressed in Nginx configuration alone is limited, so we choose OpenResty to do this job.

OpenResty is a high-performance application development platform based on Nginx. It integrates many useful modules; at its core is the ngx_lua module, which integrates Lua so that business logic can be expressed in Lua inside the Nginx configuration file.

Lua is a lightweight and compact scripting language, written in standard C and released as open source. It is designed to be embedded in applications, providing flexible extension and customization capabilities.
First, you need to define the log format in the Nginx configuration file:
```nginx
log_format tick "$msec||$remote_addr||$status||$body_bytes_sent||$u_domain||$u_url||$u_title||$u_referrer||$u_sh||$u_sw||$u_cd||$u_lang||$http_user_agent||$u_account";
```
Note that the variables beginning with u_ are ones we define ourselves later; the others are Nginx built-in variables. Then come the two core location blocks:
```nginx
location /log.gif {
    # Disguised as a GIF file
    default_type image/gif;
    # Turn off access_log here; the log is recorded by the subrequest
    access_log off;

    access_by_lua "
        -- User tracking cookie, named __utrace
        local uid = ngx.var.cookie___utrace
        if not uid then
            -- If absent, generate a tracking cookie; the algorithm is md5(timestamp + IP + client info)
            uid = ngx.md5(ngx.now() .. ngx.var.remote_addr .. ngx.var.http_user_agent)
        end
        ngx.header['Set-Cookie'] = {'__utrace=' .. uid .. '; path=/'}
        if ngx.var.arg_domain then
            -- Send a subrequest to /i-log, passing the parameters and the user tracking cookie along
            ngx.location.capture('/i-log?' .. ngx.var.args .. '&utrace=' .. uid)
        end
    ";

    # Do not allow this resource to be cached locally
    add_header Expires "Fri, 01 Jan 1980 00:00:00 GMT";
    add_header Pragma "no-cache";
    add_header Cache-Control "no-cache, max-age=0, must-revalidate";

    # Return a 1x1 empty GIF image
    empty_gif;
}

location /i-log {
    # Internal location; direct external access is not allowed
    internal;

    # Set variables; note that unescaping is needed (set_unescape_uri comes from the ngx_set_misc module)
    set_unescape_uri $u_domain $arg_domain;
    set_unescape_uri $u_url $arg_url;
    set_unescape_uri $u_title $arg_title;
    set_unescape_uri $u_referrer $arg_referrer;
    set_unescape_uri $u_sh $arg_sh;
    set_unescape_uri $u_sw $arg_sw;
    set_unescape_uri $u_cd $arg_cd;
    set_unescape_uri $u_lang $arg_lang;
    set_unescape_uri $u_account $arg_account;

    # Enable subrequest logging
    log_subrequest on;
    # Write the log to ma.log using the tick format
    access_log /path/to/logs/directory/ma.log tick;

    # Output an empty string
    echo '';
}
```
This configuration uses several third-party Nginx modules (all bundled with OpenResty), and the key points are marked with comments. You do not need to fully understand every line; it is enough to know that this configuration completes the back-end logic described above.
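For reference, a line written to ma.log by this configuration would look roughly like the following; all values here are made up, and the field order follows the tick log format defined above:

```
1498123456.789||203.0.113.5||200||43||example.com||http://example.com/index.html||Home||http://referrer.example.com/||1080||1920||24||en-US||Mozilla/5.0 ...||UA-XXXXX-X
```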
Log format
The main consideration for the log format is the choice of field delimiter. Common options include:

a fixed number of characters, a tab delimiter, a space delimiter, one or more other characters, and specific begin/end markers.
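As a small sketch of how a delimiter-based format is consumed, the snippet below splits one line of the tick format defined earlier into named fields. The field list mirrors the log_format directive, and the example line in the comment is hypothetical:

```javascript
// Split one line of the "||"-delimited tick log format into named fields.
// The field order mirrors the log_format directive shown earlier.
const FIELDS = [
  'msec', 'remote_addr', 'status', 'body_bytes_sent',
  'u_domain', 'u_url', 'u_title', 'u_referrer',
  'u_sh', 'u_sw', 'u_cd', 'u_lang',
  'http_user_agent', 'u_account'
];

function parseTickLine(line) {
  const values = line.trim().split('||');
  const record = {};
  FIELDS.forEach(function (name, i) { record[name] = values[i]; });
  return record;
}

// Hypothetical usage:
// parseTickLine('1498123456.789||203.0.113.5||200||43||example.com||...');
```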
Log slicing
As the log collection system runs for a long time, the access log file grows large, and keeping all logs in one file is hard to manage. The usual practice is to split the log by time period, for example producing one log file per day or per hour. The splitting is done by a shell script invoked via crontab, as follows:
```bash
_prefix="/path/to/nginx"
time=`date +%Y%m%d%H`
mv ${_prefix}/logs/ma.log ${_prefix}/logs/ma/ma-${time}.log
kill -USR1 `cat ${_prefix}/logs/nginx.pid`
```
This script moves ma.log into the specified folder, renaming it to ma-{yyyymmddhh}.log, and then sends a USR1 signal to Nginx so that it reopens the log file.
USR1 is often used to tell an application to reload its configuration file. Sending a USR1 signal to the server triggers the following steps: stop accepting new connections, wait for the current connections to finish, reload the configuration file, reopen the log files, and restart the server, giving a relatively smooth change without shutting the service down.

cat ${_prefix}/logs/nginx.pid obtains the process ID of the Nginx master process.
Then add a line to /etc/crontab:

```
59 * * * * root /path/to/directory/rotatelog.sh
```

This starts the script at minute 59 of every hour to perform the log rotation.
Implementation of custom website traffic log data collection