Using PHP to take Web data analysis to a higher level


Design your data analyses to do more than report simple raw counts

Effective, multi-level analysis of Web data is a key survival factor for many Web-based enterprises. The design (and execution) of data analyses is often the job of system administrators and in-house application designers, whose knowledge of statistics may extend no further than tabulating raw counts. In this article, Paul Meagher teaches Web developers the skills and concepts needed to apply inferential statistics to Web data streams.

Dynamic Web sites generate a great deal of data: access logs, poll and survey results, customer profiles, orders, and more. The Web developer's job is not only to create the applications that generate this data, but also to develop the applications and methods that make sense of these data streams.

Typically, Web developers are not well equipped to respond to the growing data-analysis requirements generated by managing a site. As a rule, they go no further than reporting various descriptive statistics to characterize a data stream. Many inferential statistical procedures, methods for estimating population parameters from sample data, could be exploited to good effect, but currently are not being applied.

For example, Web access statistics, as currently compiled, are just frequency counts grouped in various ways. Poll and survey results are likewise reported as raw counts and percentages.

One response is that the superficial statistical analysis developers currently apply to data streams may be sufficient, and that we should not expect more. After all, more sophisticated data-stream analysis is the domain of professionals: statisticians and trained analysts. When an organization needs more than descriptive statistics, it can call them in.

But another response is to acknowledge that a working knowledge of inferential statistics is becoming part of the Web developer's job description. Dynamic sites are generating more and more data, and the job of turning that data into useful knowledge is falling to Web developers and system administrators.

I advocate the latter response. This article aims to help Web developers and system administrators learn, or revisit if the knowledge has faded, the design and analysis skills needed to apply inferential statistics to Web data streams.

Relating Web data to experimental design

Applying inferential statistics to Web data streams requires more than learning the mathematics underlying the various statistical tests. Just as important is the ability to relate your data-collection procedures to the key distinctions of experimental design: What scale of measurement is used? How representative is the sample? What is the population? What hypothesis is being tested?

To apply inferential statistics to a Web data stream, you must treat the results as the outcome of an experimental design, and then select the analytical procedure appropriate to that design. Even though it may seem far-fetched to regard Web polls and access-log data as the results of an experiment, it is genuinely important to do so. Why?

1. It will help you select the appropriate statistical test.
2. It will help you draw appropriate conclusions from the data you collect.

One important aspect of experimental design, and one that determines which statistical tests are appropriate, is the scale of measurement chosen for data collection.

Scales of measurement

A measurement scale simply specifies a procedure for assigning symbols, letters, or numbers to the phenomenon of interest. For example, the kilogram scale lets you assign a number to an object that represents its weight, according to the standardized deflection of a measuring instrument.

There are four important scales of measurement:

The ratio scale: the kilogram scale is an example of a ratio scale. The symbols assigned to an object's attributes have numerical meaning. You can perform operations on those symbols, such as computing ratios, that cannot be applied to values obtained on weaker scales of measurement.

The interval scale: on an interval scale, the distance between any two adjacent units of measurement is equal, but the zero point is arbitrary. Examples of interval scales include longitude, tidal height, and dates measured from the start of different years. Interval values can be added and subtracted, but multiplication and division are meaningless.

The ordinal (rank-order) scale: an ordinal scale applies to a set of data that can be ranked, where observations can be ordered or given ratings. A common example is a like/dislike poll that assigns numbers to an attribute (from 1 = strongly dislike to 5 = strongly like). A set of ordered categories has a natural order, but the differences between adjacent points on the scale need not be equal. Ordinal data can be counted and ranked, but not measured.

The nominal scale: a nominal scale is the weakest form of measurement; it simply assigns items to groups or categories. The measurement carries no quantitative information and implies no ordering among items. The main numerical operation on nominal-scale data is counting the frequency of items in each category.

The following table compares the characteristics of these four scales:

Scale     | Values have absolute numerical meaning? | Operations you can perform
Ratio     | Yes                                     | Most mathematical operations
Interval  | Yes, but the zero point is arbitrary    | Addition and subtraction
Ordinal   | No                                      | Counting and ranking
Nominal   | No                                      | Counting only

In this article I focus on data collected using a nominal scale of measurement, and on the inferential techniques that apply to nominal data.

Using a nominal scale

Almost all Web users, whether designers, customers, or system administrators, are familiar with the nominal scale. Web polls and access logs are alike in that both typically use a nominal scale of measurement. A Web poll commonly measures preferences by asking people to choose among answer options, such as "Do you prefer brand A, brand B, or brand C?" The data is then summarized by counting the frequency of each answer.

Similarly, the usual way to measure site traffic is to assign each hit or visit during a one-week period to the day on which it occurred, then count the number of hits or visits for each day. You can likewise count hits by browser type, by operating system, by the visitor's country or region, or by any other nominal categories you care to define.
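A minimal sketch of such a category tally in PHP (the user-agent strings and classification rules below are my own illustrative assumptions, not part of Meagher's code):

```php
<?php
// Sketch: frequency counts over a nominal scale, here hits per
// browser family. The matching rules are illustrative, not a real
// user-agent parser.
function classifyAgent($agent) {
    if (strpos($agent, "Firefox") !== false) return "Firefox";
    if (strpos($agent, "MSIE") !== false)    return "IE";
    return "Other";
}

function tallyByCategory($agents) {
    $counts = array();
    foreach ($agents as $agent) {
        $category = classifyAgent($agent);
        if (!isset($counts[$category])) {
            $counts[$category] = 0;
        }
        $counts[$category]++;
    }
    return $counts;
}

$hits = array(
    "Mozilla/5.0 (Windows) Firefox/1.0",
    "Mozilla/4.0 (compatible; MSIE 6.0)",
    "Mozilla/5.0 (Windows) Firefox/1.0",
    "Googlebot/2.1",
);
print_r(tallyByCategory($hits));  // Firefox => 2, IE => 1, Other => 1
```

Whatever the category scheme, the end product is the same kind of nominal count data that a poll produces.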

Because Web polls and access statistics both involve counting how often the data falls into particular categories, you can analyze both with similar nonparametric statistical tests (tests that let you make inferences based on the shape of a distribution rather than on population parameters).

In his Handbook of Parametric and Nonparametric Statistical Procedures (1997, p. 19), David Sheskin distinguishes parametric from nonparametric tests as follows:

The distinction used in this book to categorize procedures as parametric versus nonparametric tests is based primarily on the level of measurement represented by the data being analyzed. As a general rule, inferential statistical tests that evaluate categorical/nominal data and ordinal/rank-order data are categorized as nonparametric tests, while those that evaluate interval or ratio data are categorized as parametric tests.

Nonparametric tests are also useful when some of the assumptions underlying a parametric test are questionable; when parametric assumptions are violated, nonparametric tests remain effective at detecting population differences. For the Web poll example, I use a nonparametric analysis because Web polls usually record voter preferences on a nominal scale.

I am not suggesting that Web polls and Web access statistics should always use a nominal scale, or that nonparametric tests are the only way to analyze such data. It is easy to imagine polls and surveys that, for example, ask users to give each option a numeric rating from 1 to 100, for which a parametric test would be more appropriate.

Many Web data streams, however, amount to compiled category counts. Data gathered on a more powerful scale of measurement can even be converted into nominal data by defining an interval (say, ages 17 to 21) and assigning every data point in it to a category (say, "young adult"). The prevalence of frequency data, already a familiar part of Web development, makes nonparametric statistics a good starting point for learning how to apply inferential techniques to data streams.
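As a quick sketch of this interval-to-nominal conversion (the category labels and age boundaries are illustrative assumptions, not from the original article):

```php
<?php
// Sketch: collapse interval-scale ages into nominal categories, then
// count category frequencies. Boundaries and labels are illustrative.
function ageCategory($age) {
    if ($age < 17)  return "minor";
    if ($age <= 21) return "young adult";
    return "adult";
}

$ages = array(16, 18, 21, 35, 19, 42);
$counts = array();
foreach ($ages as $age) {
    $cat = ageCategory($age);
    if (!isset($counts[$cat])) {
        $counts[$cat] = 0;
    }
    $counts[$cat]++;
}
print_r($counts);  // minor => 1, young adult => 3, adult => 2
```

The result is ordinary nominal count data, ready for the same analysis as any poll.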

To keep this article to a reasonable length, I confine the discussion of Web data-stream analysis to Web polls. But keep in mind that many Web data streams can be represented as nominal count data, and the inferential techniques discussed here will let you do more than report simple counts.

Start with a sample

Suppose you run a weekly poll at your site, www.novascotiabeerdrinkers.com, asking members for their opinions on a variety of topics. You have created a poll asking members about their favorite brand of beer, offering three well-known brands from Nova Scotia, Canada, as options: Keiths, Olands, and Schooner. To make the survey as inclusive as possible, you also include "Other" among the answers.

You receive 1,000 responses; the results appear in Table 1. (The results shown in this article are for demonstration purposes only and are not based on any actual survey.)

Table 1. Beer poll results

Keiths        Olands      Schooner     Other
285 (28.5%)   250 (25%)   215 (21.5%)  250 (25%)

The data seem to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. But do these figures entitle you to that conclusion? In other words, can you infer from the sample result to the population of all Nova Scotia beer drinkers?

Many factors related to how the sample was collected could make an inference about relative popularity incorrect. The sample might contain too many Keiths brewery employees; you may not have completely prevented individuals from voting multiple times, and one such person could have skewed the result; perhaps people who chose to vote differ from those who did not; perhaps beer drinkers on the Internet differ from those who are not online.

Most Web polls are open to these alternative explanations, and such explanations are what make it difficult to draw conclusions about population parameters from sample statistics. From an experimental-design standpoint, the first question to ask, before any data is collected, is whether steps can be taken to help ensure that the sample represents the population being studied.

If drawing conclusions about the population is what motivates your Web poll (rather than simply providing a diversion for site visitors), you should implement techniques to ensure one person, one vote (for example, requiring members to log in with a unique identity before voting) and to ensure that the sample of voters is drawn at random (for example, randomly selecting a subset of members, then e-mailing them to encourage them to vote).

Ultimately, the goal is to eliminate, or at least reduce, the biases that would weaken your ability to draw conclusions about the population being studied.

Testing hypotheses

Assuming that your sample of Nova Scotia beer drinkers is unbiased, can you now conclude that Keiths is the most popular brand?

To answer this question, consider a related one: if you drew another sample of Nova Scotia beer drinkers, would you expect to see exactly the same results? In fact, you would expect the results observed in different samples to vary somewhat.

Given this expected variability from sample to sample, you might wonder whether the observed brand preferences are better explained by random sampling variability than by real differences in the population being studied. In statistical parlance, this sampling-variability explanation is called the null hypothesis (denoted H0). In this case it can be expressed as the statement that the expected number of responses is the same for all answer categories:

H0: # Keiths = # Olands = # Schooner = # Other

If you can rule out the null hypothesis, you have made some progress toward answering the original question of whether Keiths is the most popular brand. The alternative hypothesis you can then accept is that the response percentages differ in the population.

This "test the null hypothesis first" logic applies at many stages in the analysis of poll data. Having ruled out the null hypothesis that the data show no differences at all, you can go on to test more specific null hypotheses, such as whether Keiths differs from Schooner, or whether Keiths differs from all the other brands.

You test the null hypothesis rather than evaluating the alternative hypothesis directly because it is easier to model what you would expect to observe under the null hypothesis. In the next section I demonstrate how to model expectations under the null hypothesis, so that observed results can be compared with the results expected under it.

Modeling the null hypothesis: the chi-square statistic

So far, you have summarized the Web poll results with a table reporting the frequency counts (and percentages) for each answer option. To test the null hypothesis (that there is no difference among the cell frequencies), it is easier to compute a single overall measure of how much the table cells deviate from what you would expect under the null hypothesis.

In the sample beer-popularity poll, the expected frequencies under the null hypothesis are as follows:

Expected frequency = number of observations / number of answer options
Expected frequency = 1000 / 4
Expected frequency = 250

To compute an overall measure of how much the answers deviate from the expected frequencies, you might simply sum the differences between observed and expected frequencies across cells: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you will find that the sum is 0, because deviations from the expected frequency always cancel out. To solve this problem, square each difference (this is the origin of the "square" in chi-square). Finally, to make values from samples with different numbers of observations comparable (in other words, to standardize them), divide each squared difference by the expected frequency. The formula for the chi-square statistic is therefore as follows (where O is the observed frequency and E is the expected frequency):

Figure 1. The formula for the chi-square statistic: chi-square = sum over all cells of (O - E)^2 / E

If you compute the chi-square statistic for the beer-popularity poll data, you get a value of 9.80. To test the null hypothesis, you need to know the probability of obtaining a value this extreme through random sampling variability alone. To get that probability, you need to understand the sampling distribution of the chi-square statistic.
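The 9.80 value can be reproduced with a few lines of PHP. This standalone function is a simplified stand-in I wrote for illustration, not the ChiSquare1D class used later in the article:

```php
<?php
// Sketch: chi-square statistic, the sum of (O - E)^2 / E over all
// cells, with equal expected frequencies under the null hypothesis.
function chiSquareStatistic($observed) {
    $n = array_sum($observed);   // total observations (1000 here)
    $k = count($observed);       // number of answer options (4 here)
    $expected = $n / $k;         // 250 per cell under H0
    $chisq = 0.0;
    foreach ($observed as $O) {
        $chisq += pow($O - $expected, 2) / $expected;
    }
    return $chisq;
}

$poll = array("Keiths" => 285, "Olands" => 250,
              "Schooner" => 215, "Other" => 250);
echo chiSquareStatistic($poll);  // prints 9.8
```

Only the two cells that deviate from 250 contribute: (35^2 + 35^2) / 250 = 9.8.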

In each graph, the horizontal axis represents the obtained chi-square value (the range shown is 0 to 10). The vertical axis shows the probability (or relative frequency of occurrence) of each chi-square value. As you examine these chi-square graphs, note that the shape of the probability function changes as the degrees of freedom (df) of the experiment change. For the poll-data example, the degrees of freedom are computed by taking the number of answer options in the poll (k) and subtracting 1 (df = k - 1).

Note that when you increase the number of answer options in an experiment, you increase the number of squared-deviation terms, (observed - expected)^2, being summed. Therefore, as you increase the number of answer options, the probability of obtaining a large chi-square value increases, while the probability of obtaining a small chi-square value decreases. This is why the shape of the sampling distribution of the chi-square statistic varies with df.

Also note that you are usually not interested in the probability of an exact chi-square value, but rather in the tail area of the curve to the right of the obtained value. That tail probability tells you whether obtaining a value as extreme as the one observed is likely (a large tail area) or unlikely (a small tail area). (In practice I do not use graphs to compute the tail probability, because I can implement a mathematical function that returns the tail probability for a given chi-square value. That approach is used in the chi-square program discussed later in this article.)
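As a sketch of how such a function might work: for even degrees of freedom the chi-square tail probability has the closed form exp(-x/2) multiplied by the partial sum of (x/2)^i / i! for i from 0 to df/2 - 1, so no graph reading is needed. This helper is my own illustration, not Meagher's implementation; odd df would require the incomplete gamma function and is omitted here:

```php
<?php
// Sketch: right-tail probability of the chi-square distribution.
// Valid only for even degrees of freedom, where the tail has the
// closed form exp(-x/2) * sum_{i=0}^{df/2-1} (x/2)^i / i!.
function chiSquareTail($x, $df) {
    if ($df % 2 != 0) {
        die("this sketch handles even df only");
    }
    $m = $x / 2;
    $sum  = 0.0;
    $term = 1.0;                 // (x/2)^0 / 0!
    for ($i = 0; $i < $df / 2; $i++) {
        $sum  += $term;
        $term *= $m / ($i + 1);  // build (x/2)^(i+1) / (i+1)!
    }
    return exp(-$m) * $sum;
}

// For df = 2 the formula reduces to exp(-x/2):
echo chiSquareTail(5.99, 2);     // about 0.05
```

A small tail probability (conventionally below 0.05) is grounds for rejecting the null hypothesis.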

To get a better sense of where these graphs come from, consider how you might simulate the graph for df = 2, which corresponds to k = 3 answer options. Imagine putting the numbers 1, 2, and 3 into a hat, shaking it, drawing a number, and recording the number drawn as one trial. Repeat this for 300 trials, then compute the frequencies with which 1, 2, and 3 occurred.

Each time you run this 300-trial experiment, you should expect a slightly different frequency distribution, reflecting sampling variability; the distribution will deviate from equal frequencies only as much as chance allows.

The following Multinomial class implements this idea. You initialize the class with the number of experiments to run, the number of trials per experiment, and the number of options per trial. The outcome of each experiment is recorded in an array named Outcomes.

Listing 1. The Multinomial class

<?php

// multinomial.php
//
// Copyright 2003, Paul Meagher
// Distributed under LGPL

class Multinomial {

    var $NExps;
    var $NTrials;
    var $NOptions;
    var $Outcomes = array();

    function Multinomial($NExps, $NTrials, $NOptions) {
        $this->NExps    = $NExps;
        $this->NTrials  = $NTrials;
        $this->NOptions = $NOptions;
        for ($i = 0; $i < $this->NExps; $i++) {
            $this->Outcomes[$i] = $this->runExperiment();
        }
    }

    function runExperiment() {
        $Outcome = array();
        // Start every option at zero so options that are never
        // chosen still appear in the outcome counts.
        for ($i = 1; $i <= $this->NOptions; $i++) {
            $Outcome[$i] = 0;
        }
        for ($i = 0; $i < $this->NTrials; $i++) {
            $choice = rand(1, $this->NOptions);
            $Outcome[$choice]++;
        }
        return $Outcome;
    }

}
?>

Note that the runExperiment method is the most important part of the script: it guarantees that the choice made on each trial is random, and it keeps track of the choices made so far in the simulated experiment.

To find the sampling distribution of the chi-square statistic, you simply take the outcome of each experiment and compute the chi-square statistic for it. Because of random sampling variability, the chi-square statistic varies from experiment to experiment.

The following script writes the chi-square statistic obtained in each experiment to an output file, for graphing later.

Listing 2. Writing the obtained chi-square statistics to an output file

<?php

// simulate.php
//
// Copyright 2003, Paul Meagher
// Distributed under LGPL

// Set time limit to 0 so the script doesn't time out
set_time_limit(0);

require_once "../init.php";
require PHP_MATH . "chi/multinomial.php";
require PHP_MATH . "chi/chisquare1d.php";

// Initialization parameters
$NExps    = 10000;
$NTrials  = 300;
$NOptions = 3;

$multi = new Multinomial($NExps, $NTrials, $NOptions);

$output = fopen("./data.txt", "w") or die("File won't open");
for ($i = 0; $i < $NExps; $i++) {

    // For each multinomial experiment, do a chi-square analysis
    $chi = new ChiSquare1D($multi->Outcomes[$i]);

    // Load the obtained chi-square value into the sampling distribution array
    $distribution[$i] = $chi->chisqobt;

    // Write the obtained chi-square value to the file
    fputs($output, $distribution[$i] . "\n");
}
fclose($output);

?>

To visualize the results of running this experiment, the easiest approach I know is to load the data.txt file into the open-source statistics package R, run the histogram command, and tidy up the resulting chart in a graphics editor:

x <- scan("data.txt")
hist(x, 50)

As you can see, the histogram of these simulated chi-square values closely resembles the continuous chi-square distribution for df = 2 shown earlier.

Figure 3. Simulated values approximating the continuous chi-square distribution with df = 2
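If R is not available, a crude text histogram of the simulated values can be produced in PHP itself; this binning helper is my own sketch, not part of Meagher's package:

```php
<?php
// Sketch: bin a list of simulated chi-square values so a crude text
// histogram of the sampling distribution can be printed.
function histogramCounts($values, $bins) {
    $max   = max($values);
    $width = $max / $bins;
    $counts = array_fill(0, $bins, 0);
    foreach ($values as $v) {
        // Clamp the maximum value into the last bin.
        $bin = min($bins - 1, (int) floor($v / $width));
        $counts[$bin]++;
    }
    return $counts;
}

// Usage with the simulation output (path from Listing 2):
// $values = array_map('floatval', file("./data.txt"));
// foreach (histogramCounts($values, 50) as $i => $c) {
//     printf("bin %2d | %s\n", $i, str_repeat("*", $c));
// }
```

With 50 bins this reproduces, in ASCII form, the same right-skewed shape that R's hist command draws.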
