Imperfect Web Analytics Data: Idealized Technology and Idealized Visitors


I. Idealized technology

Each method of data collection has its own technical advantages, but none of them can capture every action of every visitor on the site; each technology has its own limitations, so the data you see is never perfect. Take page dwell time as an example. The following image is a record of access times (the times in the diagram are the moments of entering each page):

(Figure: page entry time records)

The usual way to calculate a page's dwell time is: the next page's entry time minus the current page's entry time. From this, the dwell times of the pages in the example above are:

Page A: 5 minutes

Page B: 1 minute

Page C: 4 minutes

Page D: ?

Why is there no dwell time for page D? Because no collection method can capture the exact moment the visitor left page D: perhaps the visitor sat on the exit page for ages without clicking anything, or simply closed the browser. As a result, different tool vendors define the dwell time of the exit page differently; some uniformly count it as 1 minute, others simply treat it as 0 minutes.
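To make the calculation concrete, here is a minimal sketch in Python, assuming hypothetical entry timestamps like those in the figure and defaulting the exit page's dwell time to 0 minutes, as some vendors do:

```python
from datetime import datetime

# Hypothetical page-entry records, as in the figure above: (page, entry time).
entries = [
    ("A", "2011-11-10 12:00"),
    ("B", "2011-11-10 12:05"),
    ("C", "2011-11-10 12:06"),
    ("D", "2011-11-10 12:10"),  # exit page: no further click is ever recorded
]

times = [datetime.strptime(t, "%Y-%m-%d %H:%M") for _, t in entries]

for i, (page, _) in enumerate(entries):
    if i + 1 < len(entries):
        # Dwell time = next page's entry time minus this page's entry time.
        dwell = (times[i + 1] - times[i]).total_seconds() / 60
    else:
        # The exit page's leave time is unknown; vendors default it to 0 or 1 minute.
        dwell = 0
    print(f"Page {page}: {dwell:.0f} minutes")
```

Running it reproduces the figures above (A: 5, B: 1, C: 4) and makes the arbitrary default for page D explicit.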

At present, several technologies either restrict the acquisition of data or confuse the data that has already been collected.

1. Caching

The cache here does not mean a physical chip such as a CPU cache, but the browser cache or proxy server cache used to conserve network resources and speed up web browsing. Both can be understood simply as storing the content of previously visited web pages (including images and cookie files) on the local computer or on the proxy server. When you open a page you have read before, the content can be pulled straight from the cache without transferring the data again from the web server.

The following image shows the files left in the local cache folder after visiting a website:

(Figure: files left in the local browser cache folder)

When a visitor accesses a page from the local cache, no request is sent to the web server, so the server log naturally contains no record of this visit. That is to say, data collected through web logs is bound to lose this portion of traffic.

2. Web crawler

Explaining search engine crawler principles and algorithms would take a chapter of its own, and it is not the subject of this book, so it will not be covered here.

The following is a search engine crawler record from a web server log:

203.208.60.178 [10/Nov/2011:12:00:00 +0800] "-" "GET /index.php HTTP/1.1" 30000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

From the log above we can see that at 12:00:00 on November 10, 2011, Googlebot (the name of Google's search engine crawler) visited the site and crawled the home page /index.php.

This means that crawler traffic is mixed into the data collected from web logs. It should also be noted that a crawler's visit to the web server merely downloads the main content; the page is not rendered the way it is when a person views it in a browser. In other words, the JavaScript data collection code in the page source is never executed.
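As a rough illustration of separating crawler hits from human traffic, the sketch below checks the User-Agent field of each log line against a small, assumed list of bot keywords; it is a sketch only, not a complete solution:

```python
import re

# Keywords that commonly appear in crawler User-Agent strings (an assumed, incomplete list).
BOT_KEYWORDS = ("googlebot", "baiduspider", "bingbot", "slurp", "spider", "crawler")

def is_crawler(log_line: str) -> bool:
    """Return True if the User-Agent (the last quoted field) looks like a search engine crawler."""
    quoted_fields = re.findall(r'"([^"]*)"', log_line)
    user_agent = quoted_fields[-1].lower() if quoted_fields else ""
    return any(bot in user_agent for bot in BOT_KEYWORDS)

line = ('203.208.60.178 [10/Nov/2011:12:00:00 +0800] "-" "GET /index.php HTTP/1.1" '
        '30000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
print(is_crawler(line))  # True: this hit came from Googlebot, not a human visitor
```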

3. Firewall

Because firewall mechanisms are relatively complex, they are not explained in detail here; if you are interested, consult Wikipedia or other resources.

To understand the firewall's function simply, you can think of it as controlling the data flowing in and out of the network according to levels of trust. It is like a filter screen that monitors the data passing through it.

(Figure: a firewall filtering the data flowing in and out of the network)

A firewall that protects the network may prevent JavaScript scripts from sending data to the data collection server. This inevitably causes the JavaScript tag to lose part of the traffic.

II. Idealized visitors

Web analytics is primarily designed to track visitors' actions on the site, but it is often affected by how visitors configure their own computers. Perhaps this is the gap between the ideal and reality: you cannot ask every user to surf the web the way you would like them to.

1. IP Settings

When web logs are used to collect data, unique visitors are distinguished mainly by the visitor's IP address, but the data collected is inevitably distorted when dynamic IP allocation like the following occurs.

(Figure: the same machine assigned different IP addresses over time)

A single machine using different IPs is likely to cause more visitors to be counted than actually exist. As you can see, what the analytics tool actually counts is not visitors but IPs or browsers, to say nothing of whether several people sharing the same computer can be counted correctly.
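A tiny, fictitious sketch of why counting distinct IPs inflates visitor numbers under dynamic IP allocation:

```python
# Fictitious hits: one physical visitor whose ISP reassigned the IP mid-session.
hits = [
    {"visitor": "alice", "ip": "114.92.10.1"},
    {"visitor": "alice", "ip": "114.92.10.1"},
    {"visitor": "alice", "ip": "114.92.37.8"},  # dynamic IP reassignment
]

unique_ips = {h["ip"] for h in hits}        # what an IP-based count sees
actual_people = {h["visitor"] for h in hits}  # what actually happened

print(len(unique_ips))     # 2 "visitors" according to the IP-based count
print(len(actual_people))  # 1 actual person
```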

2. JavaScript Settings

Some visitors choose to turn off JavaScript in their browser settings for security reasons. For them, the loss is little more than a few page effects; for the maker of a JavaScript-tag based tool, it means losing all of those visitors' action records on the target site.

3. Cookie Settings

(1) Disabling cookies

As the Internet spreads and society becomes ever more information-driven, people's awareness of protecting their personal information keeps growing. Because of the sensitivity of private information, some people choose to disable cookies.

(Figure: browser cookie settings, with separate options for first-party and third-party cookies)

(As the figure above shows, the cookie settings are divided into two options, first-party cookies and third-party cookies; readers interested in the difference between the two can look up information online.)

Without cookies, the JavaScript tag cannot distinguish visits or unique visitors, and without these two basic metrics there is not much web analytics can do. Disabling cookies is therefore a huge blow to data collection via JavaScript tags.

(2) Deleting cookies

People often delete cookies for information protection reasons.

(Figure: deleting cookies in the browser)

Deleting cookies, whether regularly or irregularly, directly results in more unique visitors being counted than actually exist: once a cookie is deleted, a new cookie is created on the next visit, so the same visitor is counted repeatedly.

(3) Multi-browser

Even for the same website, the same computer will hold different cookies depending on which browser is used.

(Figure: the same computer holding different cookies in three different browsers)

As the figure above shows, when the same visitor uses three different browsers to access the site, the JavaScript tag counts that one person as three people because of the different cookies.

Faced with such messy data, what can we do to avoid analysis mistakes caused by data errors?

III. How to face imperfect data

As the preceding discussion shows, not only do different data collection methods directly affect the statistical results, but many technical and human factors also influence them in various ways. Faced with such "bad" data, how can we still gain insights that can guide action?

Let's take a look at the statistics reported by Google Analytics and another analytics tool for the same period:

(Figure: reports from the two tools for the same period)

(Note: the report above is for illustration only and is not real; the data and format are fictitious.)

At first glance, not a single figure in the two reports matches, which is confusing. Should you believe Google Analytics or the other tool? If you are still agonizing over this question, stop: no tool can guarantee that the data it collects is one hundred percent accurate, and a limited amount of error is unavoidable. If you look at the two reports from a different perspective, you may find the information they have in common:

(Figure: daily traffic trends from the two tools for the same period)

As you can see, the statistics from the two tools show a similar trend: after a brief traffic downturn during the October 1 National Day holiday, traffic gradually recovered starting on the 5th. Analyzing the reasons behind this trend is far more meaningful than obsessing over the exact numbers for any given day. Even if you had perfectly accurate numbers, they would be meaningless if you could not extract decision-making information from them; the trend, on the other hand, keeps you on the right course in the ocean of numbers.

IV. How to get what you want

1. Placement of JavaScript tags

The data collection principle of the JavaScript tag determines whether data can be collected at all and whether the collected data is what you want: it depends on the JavaScript tag code executing correctly. This also means that if something goes wrong in the data collection step, it has an irreversible impact on the subsequent analysis (visitors will not replay their historical visits just because your data collection went wrong).

When placing JavaScript tags, you should at least note the following:

(1) Do not miss any page you want to count

JavaScript tags differ from web log data collection: if you miss a page, you lose all of the visitors' action records on that page.

(2) Try to put the tag at the end of the page code

Since visitors download the page code from top to bottom, executing the JavaScript tag code may not only delay the rendering of the page, but may even prevent the page from displaying if the data collection server fails. So, in order not to affect the fast and normal display of the page, you should try to place the tag at the end of the page code (usually just before the closing </body> tag).

Of course, for some special statistics (such as tracking clicks on page links), the tag still needs to be placed in the page head so that functions defined in the tag can be called normally within the page.

2. Unique identification of the page

In principle, a page's URL is what distinguishes it from other pages. But because of dynamic pages and similar factors, even the same page may be counted as several different pages due to different parameters or inconsistent letter case, which directly complicates analysis. The following is a sample report in which the same page is counted as multiple pages:

(Figure: report in which the same page is counted as several different pages)

To uniquely identify the page, you can take the following actions:

(1) Set the default page

If both www.example.com and www.example.com/index.html point to your site's default page, setting the default page prevents them from being counted as two separate pages.

Analytics tools generally provide a setting for the default page; the specific steps are not detailed one by one here.

(2) Uniform URL case

Because Google Analytics and other analytics tools treat URLs of the same page that differ only in letter case as different pages, you can avoid this by setting a filter that converts all URLs to uppercase or lowercase after collection (analytics tools generally provide data filtering settings).

(3) Filter out extra parameters in the URL

Because dynamic web pages may append different parameters to the same page URL, the analytics tool counts URLs with different parameters as several different pages. This statistical error can be avoided by setting a filter on the specific parameters. For example, by excluding the testid parameter in the example above, /item.php?testid=1 and /item.php?testid=2 are counted as the same page.
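Conceptually, the three normalizations above could be expressed as the following sketch (the default-page names and the testid parameter are taken from the examples above; real tools apply these rules through their filter settings rather than custom code):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

EXCLUDED_PARAMS = {"testid"}                    # (3) parameters to strip, per the example above
DEFAULT_PAGES = {"/index.html", "/index.php"}   # (1) assumed default-page file names

def normalize(url: str) -> str:
    parts = urlparse(url.lower())               # (2) unify letter case
    path = parts.path or "/"
    if path in DEFAULT_PAGES:                   # (1) fold the default page into "/"
        path = "/"
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in EXCLUDED_PARAMS]       # (3) drop the excluded parameters
    return urlunparse((parts.scheme, parts.netloc, path, "", urlencode(query), ""))

print(normalize("http://www.example.com/index.html"))
print(normalize("http://www.example.com/Item.php?testid=1"))
print(normalize("http://www.example.com/item.php?testid=2"))
# The last two normalize to the same URL, so they would be counted as one page.
```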

3. Filter excess data

(1) IP filtering

To exclude access traffic from yourself or your testers, this portion of traffic can be excluded by filtering on IP.

(2) Subdomain filtering

When you only care about the traffic of a particular subdomain, you can use the subdomain filter settings to include only that portion of traffic.

These are just two common filter settings (sketched below); tools usually provide a variety of filters to meet different requirements.
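A rough sketch of the two filters applied to raw hit records (the IP addresses and the subdomain blog.example.com are made-up examples; in practice you configure these rules in the tool's filter settings):

```python
INTERNAL_IPS = {"192.168.1.10", "10.0.0.8"}   # your own / testers' IPs (assumed)
TRACKED_SUBDOMAIN = "blog.example.com"        # the only subdomain you care about (assumed)

hits = [
    {"ip": "10.0.0.8",      "host": "blog.example.com", "path": "/post/1"},  # tester traffic
    {"ip": "61.135.169.10", "host": "blog.example.com", "path": "/post/2"},
    {"ip": "61.135.169.10", "host": "shop.example.com", "path": "/cart"},    # other subdomain
]

filtered = [h for h in hits
            if h["ip"] not in INTERNAL_IPS          # (1) exclude internal / tester traffic
            and h["host"] == TRACKED_SUBDOMAIN]     # (2) keep only the tracked subdomain

print(filtered)  # only the external hit on blog.example.com remains
```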

Appendix: technical parameters you should learn from your web analytics tool vendor

1. The effective visit (session) length (typically 30 minutes; see the sketch after this list)

2. The time of day at which visits are forcibly closed, regardless of whether the effective visit length has been exceeded (usually in the early hours of the morning)

3. The validity period of the visitor cookie (repeat visits within this period are recognized as returning visitors; typically one or two years)

4. The dwell time of the last page (generally defaults to 1 minute or 0 minutes; if the tool claims to collect this data, ask about the specific collection method)
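A minimal sketch of how the first two parameters shape visit counting, using fictitious timestamps and assuming a midnight cutoff for the forced close: hits from the same visitor start a new visit after 30 minutes of inactivity, and are also forcibly split when the date changes.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # 1. effective visit (session) length

# Fictitious hit times for one visitor.
hits = [datetime(2011, 11, 10, 23, 40),
        datetime(2011, 11, 10, 23, 50),
        datetime(2011, 11, 11, 0, 5),     # crosses midnight: forced into a new visit
        datetime(2011, 11, 11, 1, 0)]     # gap > 30 minutes: yet another new visit

visits = 1
for prev, cur in zip(hits, hits[1:]):
    if cur - prev > SESSION_TIMEOUT or cur.date() != prev.date():
        visits += 1

print(visits)  # 3 visits under these rules
```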

(Copyright belongs to the Gui Lin website analysis blog. Reprinting is welcome, but please cite the source.)

Original: http://blog.digitalforest.cn/wangzhanfenxi-shuju-buwanmei
