Make your data analysis count: go beyond simple raw counting

Effective, multi-layered analysis of Web data is a key factor in the survival of many Web-oriented businesses. Yet the design of (and decisions about) data analysis tests often falls to system administrators and in-house application designers, who may know little about statistics beyond how to tabulate raw counts. In this article, Paul Meagher introduces Web developers to the skills and concepts they need to apply inferential statistics to Web data streams.

Dynamic Web sites generate large amounts of data (access logs, polls and surveys, customer profiles, orders, and more), and Web developers are expected not only to create the applications that generate this data, but also to develop applications and methods for making sense of these data streams.

Typically, Web developers are not keeping pace with the growing data analysis needs that come with managing a site. In general, they have no better way to characterize a data stream than to report various descriptive statistics. Many inferential statistical procedures (methods for estimating population parameters from sample data) could be put to good use, but are not currently being applied.

For example, Web access statistics (as they are currently compiled) are little more than frequency counts grouped in various ways. Poll and survey results are all too often reported only as raw counts and percentages.

Perhaps it is enough for developers to treat the statistical analysis of data streams in this fairly superficial way, and we should not expect more. After all, there are professionals who do more sophisticated data stream analysis: statisticians and trained analysts. An organization can bring them in when it needs more than descriptive statistics.

But another response is to acknowledge that a growing understanding of inferential statistics is becoming part of the Web developer's job description. Dynamic sites generate more and more data, and turning that data into useful knowledge is increasingly the responsibility of Web developers and system administrators.

I advocate the latter response. This article is intended to help Web developers and system administrators learn (or relearn, if the knowledge has faded) the design and analysis skills needed to apply inferential statistics to Web data streams.

Relating Web data to experimental design

Applying inferential statistics to Web data streams requires more than learning the mathematics behind various statistical tests. Just as important is the ability to relate the data collection process to the key distinctions of experimental design: What is the scale of measurement? How representative is the sample? What is the population? What hypothesis is being tested?

To apply inferential statistics to Web data streams, you need to think of the results as having been generated by an experimental design, and then choose the analysis procedure appropriate to that design. Even if treating Web polls and access log data as the results of experiments seems like overkill, it is important to do so. Why?

1. This will help you select the appropriate statistical test method.

2. This will help you draw the appropriate conclusions from the collected data.

An important aspect of experimental design, and one that determines which statistical tests are appropriate, is the choice of measurement scale used to collect the data.

Examples of measurement scales

A measurement scale simply specifies a procedure for assigning symbols, letters, or numbers to a phenomenon of interest. For example, the kilogram scale lets you assign a number to an object that indicates its weight, based on the standardized displacement of a measuring instrument.

There are four important measurement scales:

Ratio scale: The kilogram scale is an example of a ratio scale. The symbols assigned to an object's attributes have numeric meaning. You can perform operations on these symbols, such as computing ratios, that are not valid for values obtained with less powerful measurement scales.

Interval scale: On an interval scale, the distance (or interval) between any two adjacent units of measurement is equal, but the zero point is arbitrary. Examples include measurements of longitude and tidal height, and the start and end of calendar years. Interval-scale values can be added and subtracted, but multiplying or dividing them is not meaningful.

Ordinal scale (rank): An ordinal scale applies to data that can be put in order, that is, data whose values and observations can be ranked or placed on a rating scale. Common examples include like/dislike polls in which numbers are assigned to attitudes (from 1 = strongly dislike to 5 = strongly like). Ordinal data usually has a natural order, but the gaps between adjacent points on the scale need not be equal. You can count and order ordinal data, but you cannot measure it.

Nominal scale (categorical): The nominal scale is the weakest form of measurement and simply assigns items to groups or categories. Such measurements carry no quantitative information and imply no ordering of the items. The main numerical operation performed on nominal data is counting the frequency of items in each category.

The following table compares the characteristics of each measurement scale:

Measurement scale   Do attributes have absolute numeric meaning?     Can most mathematical operations be performed?
Ratio               Yes                                              Yes
Interval            Yes for intervals; the zero point is arbitrary   Addition and subtraction
Ordinal             No                                               Counting and ordering
Nominal             No                                               Counting only

In this article, I focus on data collected using the nominal scale of measurement and on inference techniques appropriate for nominal data.

Using the nominal scale

Almost all Web users (designers, customers, and system administrators) are familiar with the nominal scale. Web polls and access logs are alike in that they often use the nominal scale as their measure. In a Web poll, you often measure people's preferences by asking them to choose among answer options (for example, "Do you prefer brand A, brand B, or brand C?"), and you summarize the data by counting the frequency of each type of answer.

Similarly, a common way to measure site traffic is to assign each hit or visit in a given week to the day of the week on which it occurred, and then count the hits or visits for each day. You can also (and people do) count hits by browser type, operating system, and the visitor's country or region, or by any other classification you like.
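
To make the idea of counting visits by a nominal category concrete, here is a minimal sketch that tallies hits per day of the week. The countHitsByWeekday function and the sample timestamps are hypothetical and exist only for illustration; a real script would take its timestamps from your access log.

```php
<?php
// Hypothetical helper: tally hits per weekday from Unix timestamps (UTC).
function countHitsByWeekday(array $timestamps): array
{
    // Start every category at zero so that days without hits still appear.
    $counts = array_fill_keys(
        array('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'),
        0
    );
    foreach ($timestamps as $ts) {
        // gmdate('l') returns the full English weekday name in UTC.
        $counts[gmdate('l', $ts)]++;
    }
    return $counts;
}

// Illustrative data only: three hits on 2003-03-03 (a Monday), one on 2003-03-04.
$hits = array(
    gmmktime(9, 0, 0, 3, 3, 2003),
    gmmktime(12, 30, 0, 3, 3, 2003),
    gmmktime(18, 45, 0, 3, 3, 2003),
    gmmktime(8, 15, 0, 3, 4, 2003),
);
$weekdayCounts = countHitsByWeekday($hits);
print_r($weekdayCounts);
```

The resulting array of category counts is exactly the kind of frequency data that the rest of this article analyzes.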

Because Web polls and access statistics both involve counting how many times data falls into a particular category, they can be analyzed with similar nonparametric statistical tests, which let you make inferences based on the shape of a distribution rather than on population parameters.

In his Handbook of Parametric and Nonparametric Statistical Procedures (1997, p. 19), David Sheskin distinguishes parametric from nonparametric tests as follows:

The distinction employed in this book for categorizing a procedure as a parametric versus a nonparametric test is primarily based on the level of measurement represented by the data being analyzed. As a general rule, inferential statistical tests that evaluate categorical/nominal data and ordinal/rank-order data are categorized as nonparametric tests, while those that evaluate interval data or ratio data are categorized as parametric tests.

Nonparametric tests are also useful when some of the assumptions underlying a parametric test are in doubt; when parametric assumptions are not met, nonparametric tests can be quite effective at detecting population differences. For the Web poll example, I use a nonparametric analysis procedure because Web polls typically record voter preferences on a nominal scale.

I am not suggesting that Web polls and Web access statistics should always use a nominal scale, or that nonparametric tests are the only way to analyze such data. It is easy to imagine polls and surveys that (for example) ask users to give each option a numeric rating from 1 to 100, for which parametric tests would be more appropriate.

Nonetheless, many Web data streams consist of category counts, and data measured on a more powerful scale can be converted to nominal data by defining an interval (for example, ages 17 to 21) and assigning each data point that falls within it to a nominal category (such as "young people"). The prevalence of frequency data, which is already part of the Web developer's experience, makes nonparametric statistics a good starting point for learning how to apply inference techniques to data streams.
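
As a rough sketch of this conversion, the following code assigns numeric ages to nominal categories and then counts each category. Only the 17-to-21 interval and the "young people" label come from the text; the ageCategory function, the other two labels, and the sample ages are assumptions made for illustration.

```php
<?php
// Hypothetical mapping from a numeric age to a nominal category.
function ageCategory(int $age): string
{
    if ($age < 17) {
        return 'under 17';        // assumed label
    }
    if ($age <= 21) {
        return 'young people';    // the 17-to-21 interval from the text
    }
    return 'over 21';             // assumed label
}

// Illustrative ages only.
$ages = array(16, 17, 19, 21, 22, 35, 18);

// Convert each age to its category, then count how often each category occurs.
$categoryCounts = array_count_values(array_map('ageCategory', $ages));
print_r($categoryCounts);
```

Once the data is in this form, the ages themselves no longer matter; only the category frequencies are analyzed.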

To keep this article to a reasonable length, I confine my discussion of Web data stream analysis to Web polls. Keep in mind, however, that many Web data streams can be represented as category counts, and that the inference techniques I discuss will let you do more than report simple counts.

Starting with sampling

Suppose you run a weekly poll on your site, www.NovaScotiaBeerDrinkers.com, asking members for their opinions on various topics. You have created a poll that asks members to name their favorite beer brand (the Canadian province of Nova Scotia has three well-known brands: Keiths, Olands, and Schooner). To make the survey as inclusive as possible, you include "Other" among the answer options.

You receive 1,000 responses, with the results shown in Table 1. (The results shown in this article are for demonstration purposes only and are not based on any actual survey.)

Table 1. Beer poll

Keiths         Olands       Schooner       Other
285 (28.5%)    250 (25%)    215 (21.5%)    250 (25%)

These figures appear to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. Can you draw this conclusion from the figures? In other words, can you make inferences about the population of Nova Scotia beer drinkers from the results of this sample?

Many factors related to how the sample was collected could make inferences about relative popularity incorrect. Perhaps the sample contains a disproportionate number of Keiths brewery employees; perhaps you did not fully prevent one person from voting multiple times, and that person skewed the results; perhaps the people who chose to vote differ from those who did not; perhaps online voters differ from people who are not online.

Most Web polls are subject to these interpretive difficulties, which arise whenever you try to draw conclusions about population parameters from sample statistics. From an experimental design standpoint, one of the first questions to ask before collecting data is whether you can take steps to help ensure that the sample is representative of the population of interest.

If drawing conclusions about the population of interest is your motivation for running a Web poll (rather than providing a diversion for site visitors), you should implement techniques that ensure one vote per person (for example, by requiring voters to log in with a unique ID) and that the sample of voters is randomly selected (for example, by randomly selecting a subset of members and then emailing them to encourage them to vote).
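
One possible way to select a random subset of members is sketched below. The sampleMembers function, the member IDs, and the sample size of 200 are hypothetical; the sketch also assumes that PHP's built-in shuffle() is random enough for this purpose, which may not hold if you need cryptographic-quality randomness.

```php
<?php
// Hypothetical helper: draw a simple random sample of member IDs without replacement.
function sampleMembers(array $memberIds, int $sampleSize): array
{
    // Keep one entry per member so nobody can be selected twice.
    $pool = array_values(array_unique($memberIds));
    if ($sampleSize < 0 || $sampleSize > count($pool)) {
        throw new InvalidArgumentException('Sample size must be between 0 and the number of members.');
    }
    shuffle($pool);                              // put the members in random order
    return array_slice($pool, 0, $sampleSize);   // take the first $sampleSize of them
}

$memberIds = range(1, 5000);                     // illustrative member IDs
$invited   = sampleMembers($memberIds, 200);
echo count($invited), " members selected for the poll invitation\n";
```

The selected IDs could then be used to look up the email addresses of the members to invite.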

Ultimately, the goal is to eliminate (or at least reduce) the biases that could weaken your ability to draw conclusions about the population of interest.

Testing hypotheses

Assuming that the sample of Nova Scotia beer drinkers is unbiased, can you now conclude that Keiths is the most popular brand?

To answer this question, consider a related one: if you drew another sample of Nova Scotia beer drinkers, would you expect to see exactly the same results? In fact, you would expect some variability in the results from one sample to the next.

Given this expected sampling variability, you might wonder whether the observed brand preferences are better explained by random sampling variability than by real differences in the population of interest. In statistical terminology, this sampling-variability explanation is called the null hypothesis (denoted H0). In this example, it can be stated as follows: the expected number of responses is the same for every answer category.

H0: # Keiths = # Olands = # Schooner = # Other

If you can rule out the null hypothesis, you have made some progress toward answering the original question of whether Keiths is the most popular brand. The alternative hypothesis you would then accept is that the proportions of responses differ in the population of interest.

This "test the null hypothesis first" logic applies at many stages of analyzing poll data. Once you have ruled out the null hypothesis that the cells do not differ at all, you can go on to test more specific null hypotheses, such as that there is no difference between Keiths and Schooner, or between Keiths and all other brands combined.

You test the null hypothesis rather than evaluating the alternative hypothesis directly because it is easier to model statistically what you would expect to observe under the null hypothesis. Next, I show how to model what is expected under the null hypothesis so that the observed results can be compared with it.

Modeling the null hypothesis: the chi-square statistic

So far, you have summarized the results of the Web poll in a table that reports the frequency count (and percentage) for each answer option. To test the null hypothesis (that the cell frequencies do not differ), it is easier to compute a single overall measure of how much each cell departs from the frequency you would expect under the null hypothesis.

In the beer popularity poll, the expected frequency under the null hypothesis is:

Expected frequency = number of observations / number of answer options

Expected frequency = 1000 / 4

Expected frequency = 250

To compute an overall measure of how much the cells differ from the expected frequency, you could simply sum the differences between the observed and expected frequencies: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you will find that the sum is 0, because deviations from the mean always sum to 0. To get around this, square each difference (this is where the "square" in chi-square comes from). Finally, to make the value comparable across samples with different numbers of observations (in other words, to normalize it), divide each squared difference by the expected frequency. The formula for the chi-square statistic is therefore as follows (O stands for "observed frequency" and E for "expected frequency"):

Figure 1. Formula for the chi-square statistic

\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}

If you compute the chi-square statistic for the beer popularity poll data, you get a value of 9.80. To test the null hypothesis, you need to know the probability of obtaining a value this extreme under random sampling variability alone. To find that probability, you need to understand the sampling distribution of the chi-square statistic.
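
The arithmetic above can be checked with a few lines of PHP. This is only a sketch of the calculation for the equal-expected-frequency case; the chiSquareStatistic function is a hypothetical name and is not part of the article's package.

```php
<?php
// Sketch: chi-square statistic when every answer option has the same expected frequency.
function chiSquareStatistic(array $observed): float
{
    $k        = count($observed);
    $expected = array_sum($observed) / $k;    // 1000 / 4 = 250 for the beer poll
    $chiSq    = 0.0;
    foreach ($observed as $o) {
        $chiSq += pow($o - $expected, 2) / $expected;   // (O - E)^2 / E
    }
    return $chiSq;
}

$observed = array(285, 250, 215, 250);

// The raw deviations cancel out, which is why they are squared.
$rawSum = 0;
foreach ($observed as $o) {
    $rawSum += $o - 250;
}

$chiSq = chiSquareStatistic($observed);
printf("Sum of raw deviations: %d\n", $rawSum);
printf("Chi-square: %.2f\n", $chiSq);
```

Only the Keiths and Schooner cells contribute, each adding 35 squared divided by 250, or 4.90, for a total of 9.80.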

Examining the sampling distribution of the chi-square statistic

Figure 2. Plots of the chi-square distribution

In each plot, the horizontal axis represents the size of the chi-square value obtained (the range shown is 0 to 10). The vertical axis shows the probability (or relative frequency of occurrence) of each chi-square value.

As you study these chi-square plots, note that the shape of the probability function changes as you change the degrees of freedom (df) of your experiment. For poll data, the degrees of freedom are calculated by taking the number of answer options (k) in the poll and subtracting 1 (df = k - 1).

In general, as you increase the number of answer options in your experiment, the probability of obtaining a large chi-square value increases. This is because each additional answer option adds another squared-deviation term, (observed - expected)^2 / expected, to the sum. So as you add answer options, the probability of obtaining a large chi-square value goes up and the probability of obtaining a small one goes down. This is why the shape of the sampling distribution of chi-square changes with df.

Note also that what matters is not the height of the curve at the chi-square value you obtained, but the total area under the curve to the right of that value. This tail probability tells you whether obtaining a value as extreme as the one you observed is likely (a large tail area) or unlikely (a small tail area). (In practice, I do not use these plots to find tail probabilities, because I can implement a mathematical function that returns the tail probability for a given chi-square value. The chi-square software discussed later in this article takes this approach.)
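
To give a feel for what such a function might look like, here is a minimal sketch that computes the chi-square tail probability from a series expansion of the incomplete gamma function. It is not the implementation used in the article's package (that code lives in distribution.php and is not shown here); the function names are hypothetical, and the sketch is only meant for moderate chi-square values and degrees of freedom.

```php
<?php
// Gamma($twiceA / 2) for whole or half-integer arguments,
// built up from Gamma(1) = 1 or Gamma(1/2) = sqrt(pi) with Gamma(a + 1) = a * Gamma(a).
function gammaHalf(int $twiceA): float
{
    $a     = ($twiceA % 2 === 0) ? 1.0 : 0.5;
    $value = ($twiceA % 2 === 0) ? 1.0 : sqrt(M_PI);
    while ($a < $twiceA / 2 - 1e-9) {
        $value *= $a;
        $a     += 1.0;
    }
    return $value;
}

// Upper-tail probability P(X >= $chiSq) for a chi-square variable with $df degrees of freedom.
function chiSquareTailProbability(float $chiSq, int $df): float
{
    if ($df < 1) {
        throw new InvalidArgumentException('df must be at least 1.');
    }
    if ($chiSq <= 0) {
        return 1.0;
    }
    $a = $df / 2;
    $x = $chiSq / 2;
    // Series for the lower incomplete gamma function:
    // gamma(a, x) = e^(-x) * x^a * sum over n >= 0 of x^n / (a (a+1) ... (a+n))
    $term = 1 / $a;
    $sum  = $term;
    for ($n = 1; $n < 1000; $n++) {
        $term *= $x / ($a + $n);
        $sum  += $term;
        if ($term < $sum * 1e-15) {
            break;   // further terms no longer change the sum
        }
    }
    $lower = $sum * exp(-$x + $a * log($x)) / gammaHalf($df);
    return max(0.0, 1.0 - $lower);   // area to the right of $chiSq
}

printf("%.2f\n", chiSquareTailProbability(9.80, 3));
printf("%.2f\n", chiSquareTailProbability(1.14, 3));
```

Because the tail is computed as one minus the lower area, this sketch loses precision for very small tail probabilities; a production implementation would typically switch to a continued-fraction expansion in that region.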

To see how these plots are derived, consider how you might simulate the plot for df = 2 (that is, k = 3). Imagine putting the numbers 1, 2, and 3 in a hat, shaking it, drawing a number, and recording the number drawn as one trial (returning it to the hat before the next draw). Run 300 trials and then count how often 1, 2, and 3 were drawn.

Each time you run this experiment, you should expect a slightly different frequency distribution, reflecting sampling variability, but one that does not stray far from the range of likely outcomes.

The following Multinomial class implements this idea. You initialize the class with the number of experiments to run, the number of trials in each experiment, and the number of options in each trial. The outcome of each experiment is recorded in an array named Outcomes.

Listing 1. Contents of the Multinomial class

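<?php
// multinomial.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

class Multinomial {

  var $NExps;               // number of experiments to run
  var $NTrials;             // number of trials in each experiment
  var $NOptions;            // number of options in each trial
  var $Outcomes = array();  // one frequency table per experiment

  // Named __construct so that the class also runs on current PHP versions.
  function __construct($NExps, $NTrials, $NOptions) {
    $this->NExps    = $NExps;
    $this->NTrials  = $NTrials;
    $this->NOptions = $NOptions;
    for ($i = 0; $i < $this->NExps; $i++) {
      $this->Outcomes[$i] = $this->runExperiment();
    }
  }

  function runExperiment() {
    // Start every option at zero so that each one appears in the result.
    $Outcome = array_fill(1, $this->NOptions, 0);
    // Each experiment consists of NTrials random choices.
    for ($i = 0; $i < $this->NTrials; $i++) {
      $choice = rand(1, $this->NOptions);
      $Outcome[$choice]++;
    }
    return $Outcome;
  }

}
?>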

Note that the runExperiment method is the key part of the script: it ensures that the choices made in each experiment are random, and it keeps track of which choices have been made so far in the simulated experiment.

To find the sampling distribution of the chi-square statistic, you simply take the outcome of each experiment and compute its chi-square statistic. Because of random sampling variability, the chi-square statistic differs from experiment to experiment.

The following script writes the chi-square statistic obtained for each experiment to an output file so that it can be plotted later.

Listing 2. Writing the obtained chi-square statistics to an output file

<?php
// simulate.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

// Set time limit to 0 so script doesn't time out
set_time_limit(0);

require_once "./init.php";
require PHP_MATH . "Chi/multinomial.php";
require PHP_MATH . "Chi/chisquare1d.php";

// Initialization parameters
$NExps    = 10000;
$NTrials  = 300;
$NOptions = 3;

$multi = new Multinomial($NExps, $NTrials, $NOptions);

$output = fopen("./data.txt", "w") or die("file won't open");

$distribution = array();
for ($i = 0; $i < $NExps; $i++) {
  // For each multinomial experiment, do chi-square analysis
  $chi = new Chisquare1d($multi->Outcomes[$i]);

  // Load obtained chi-square value into sampling distribution array
  $distribution[$i] = $chi->ChiSqObt;

  // Write obtained chi-square value to file
  fputs($output, $distribution[$i] . "\n");
}

fclose($output);
?>

To visualize the results you can expect from running this experiment, the simplest approach for me was to load the data.txt file into the open source statistics package R, run the histogram command, and then touch up the plot in a graphics editor, as follows:

x = scan("data.txt")
hist(x, 50)

As you can see, the histogram of these chi-square values approximates the continuous chi-square distribution for df = 2 shown above.

Figure 3. Approximation of the continuous chi-square distribution for df = 2

In the sections that follow, I focus on how the chi-square software used in this simulation works. Typically, the chi-square software would be used to analyze actual nominal data (such as Web poll results, weekly traffic reports, or customer brand preference reports) rather than simulated data like that used here. You may also be interested in other output the software generates, such as summary tables and tail probabilities.

Instance variables of the chi-square class

The PHP-based chi-square package I developed consists of classes for analyzing frequency data classified along one or two dimensions (chisquare1d.php and chisquare2d.php). My discussion is limited to explaining how the Chisquare1d class works and how to apply it to one-dimensional Web poll data.

Before proceeding, I should point out that classifying data along two dimensions (for example, beer preference by gender) lets you begin to explain your results by looking for systematic relationships or conditional probabilities among the cells of a contingency table. Although much of the discussion below will help you understand how the chisquare2d.php software works, other experimental, analytic, and visualization issues not covered in this article also need to be addressed before you use that class.

Listing 3 shows a fragment of the Chisquare1d class, which consists of:

1. An included file

2. Class instance variables

Listing 3. Fragment of the chi-square class with included file and instance variables

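<?php
// chisquare1d.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

require_once PHP_MATH . "Dist/distribution.php";

class Chisquare1d {

  var $Total;               // total number of observations
  var $ObsFreq = array();   // Observed frequencies
  var $ExpFreq = array();   // Expected frequencies
  var $ExpProb = array();   // Expected probabilities
  var $NumCells;            // number of cells (answer options)
  var $ChiSqObt;            // obtained chi-square value
  var $DF;                  // degrees of freedom
  var $Alpha;               // alpha cutoff
  var $ChiSqProb;           // tail probability of the obtained value
  var $ChiSqCrit;           // critical chi-square value

}
?>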

The top of the script in Listing 3 includes a file named distribution.php. The include path incorporates the PHP_MATH constant set in the init.php file, which is assumed to have been included already by the calling script.

The included file distribution.php contains methods that generate sampling distribution statistics for several commonly used sampling distributions (the t, F, and chi-square distributions). The Chisquare1d class needs access to the chi-square methods in distribution.php to compute the tail probability of the obtained chi-square value.

The list of instance variables in this class is worth noting because they define the result object generated by the analysis procedure. This result object contains all the important details about the test, including three important chi-square statistics: ChiSqObt, ChiSqProb, and ChiSqCrit. To see how each instance variable is computed, look at the class constructor, where all of these values are derived.

Constructor: the backbone of the chi-square test

Listing 4 shows the constructor code for the chi-square class, which forms the backbone of the chi-square test.

Listing 4. The constructor of the chi-square class

<?php
class Chisquare1d {

  // PHP 4-style constructor (a method named after the class).
  // PHP 8 no longer treats such methods as constructors; there it would be named __construct.
  function Chisquare1d($ObsFreq, $Alpha = 0.05, $ExpProb = false) {
    $this->ObsFreq   = $ObsFreq;
    $this->ExpProb   = $ExpProb;
    $this->Alpha     = $Alpha;
    $this->NumCells  = count($this->ObsFreq);
    $this->DF        = $this->NumCells - 1;
    $this->Total     = $this->getTotal();
    $this->ExpFreq   = $this->getExpFreq();
    $this->ChiSqObt  = $this->getChiSqObt();
    $this->ChiSqCrit = $this->getChiSqCrit();
    $this->ChiSqProb = $this->getChiSqProb();
    return true;
  }

}
?>

Four aspects of the constructor method are worth noting:

1. The constructor accepts an array of observed frequencies, an alpha probability cutoff, and an optional array of expected probabilities.

2. The first six lines involve relatively simple assignments and calculations, which are recorded so that the complete result object is available to the calling script.

3. The last four lines do the bulk of the work of obtaining the chi-square statistics that you are most interested in.

4. The class implements only the chi-square test logic. No output methods are associated with the class.

You can study the class methods included in the code download for this article to learn more about how each value in the result object is computed (see Resources).

Handling output issues

The code in Listing 5 shows how easy it is to perform a chi-square analysis using the Chisquare1d class. It also demonstrates how output issues are handled.

The script invokes a wrapper script named chisquare1d_html.php. The purpose of this wrapper is to separate the logic of the chi-square procedure from its presentation. The _html suffix indicates that the output is intended for a standard Web browser or other device that displays HTML.

Another purpose of the wrapper script is to organize the output in a way that makes the data easier to understand. To this end, the class contains two methods for displaying the results of a chi-square analysis. The showTableSummary method displays the first output table shown after the code (Table 2), and showChiSquareStats displays the second (Table 3).

Listing 5. Organizing data with a wrapper script

<?php
// beer_poll_analysis.php

require_once "./init.php";
require_once PHP_MATH . "Chi/chisquare1d_html.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(285, 250, 215, 250);
$Alpha    = 0.05;

$Chi = new Chisquare1d_html($ObsFreq, $Alpha);

// First output table: observed and expected frequencies with variance
$Chi->showTableSummary($Headings);
echo "\n\n";
// Second output table: chi-square statistics
$Chi->showChiSquareStats();
?>

The script generates the following output:

Table 2. Expected frequencies and variance obtained by running the wrapper script

            Keiths   Olands   Schooner   Other   Total
Observed       285      250        215     250    1000
Expected       250      250        250     250    1000
Variance      4.90     0.00       4.90    0.00    9.80

Table 3. Chi-square statistics obtained by running the wrapper script

             DF   Obtained value   Probability   Critical value
Chi-square    3             9.80          0.02             7.81

Table 2 shows the expected frequency and the variance measure, (O - E)^2 / E, for each cell. The sum of the variance values equals the obtained chi-square value (9.80), which is displayed in the lower-right cell of the summary table.

Table 3 reports the various chi-square statistics. It includes the degrees of freedom used in the analysis and again reports the obtained chi-square value. The obtained chi-square value is then re-expressed as a tail probability, in this case 0.02. This means that, under the null hypothesis, the probability of observing a chi-square value as extreme as 9.80 is about 2% (a fairly low probability).

Most statisticians would not argue if you decided to reject the null hypothesis that these results can be explained by random sampling variability from the null distribution. Your poll results more likely reflect a real difference in brand preference in the population of Nova Scotia beer drinkers.

To confirm this conclusion, you can compare the obtained chi-square value with the critical value.

Why is the critical value important? The critical value is based on the significance level (that is, the alpha cutoff level) set for the analysis. By convention, the alpha cutoff is set to 0.05 (the value used in the analysis above). This setting is used to find the position (the critical value) in the sampling distribution of chi-square at which the tail area equals the alpha cutoff (0.05).
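
One way to see where a critical value comes from is to estimate it from a simulated sampling distribution, in the spirit of the hat-drawing simulation described earlier. The sketch below is a Monte Carlo approximation, not the method used by the article's package; the function name, the seed, and the numbers of experiments and trials are arbitrary assumptions.

```php
<?php
// Simulate one experiment under the null hypothesis (all options equally likely)
// and return its chi-square statistic.
function simulateChiSquare(int $nTrials, int $nOptions): float
{
    $counts = array_fill(1, $nOptions, 0);
    for ($t = 0; $t < $nTrials; $t++) {
        $counts[mt_rand(1, $nOptions)]++;
    }
    $expected = $nTrials / $nOptions;
    $chiSq    = 0.0;
    foreach ($counts as $o) {
        $chiSq += pow($o - $expected, 2) / $expected;
    }
    return $chiSq;
}

mt_srand(2003);      // fixed seed so that repeated runs give the same estimate
$nExps = 4000;       // number of simulated experiments (assumption)
$alpha = 0.05;

$values = array();
for ($e = 0; $e < $nExps; $e++) {
    $values[] = simulateChiSquare(400, 4);   // 4 answer options, so df = 3
}
sort($values);

// The value below which roughly (1 - alpha) of the simulated statistics fall.
$criticalEstimate = $values[(int) floor((1 - $alpha) * $nExps)];
printf("Estimated critical value (df = 3): %.2f\n", $criticalEstimate);
```

With enough simulated experiments, the estimate should land near the tabled critical value of 7.81 for df = 3, although any single run will differ from it somewhat.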

In this analysis, the obtained chi-square value is greater than the critical value. This means that the threshold for retaining the null hypothesis has been exceeded. The alternative hypothesis, that the proportions differ in the population, is statistically more likely to be correct.

In the automated analysis of data streams, the alpha cutoff setting can act as an output filter for knowledge-discovery algorithms such as Chi-square Automatic Interaction Detection (CHAID). On their own, such algorithms cannot provide detailed guidance for discovering truly useful patterns.

Re-polling

Another interesting application of the one-way chi-square test is to re-poll to see whether people's responses have changed.

Suppose that, some time later, you decide to run another Web poll of Nova Scotia beer drinkers. You again ask about their favorite beer brand, and this time you observe the following results:

Table 4. New beer poll

Keiths         Olands       Schooner       Other
385 (27.5%)    350 (25%)    315 (22.5%)    350 (25%)

The old data is as follows:

Table 1. Old beer poll (repeated)

Keiths         Olands       Schooner       Other
285 (28.5%)    250 (25%)    215 (21.5%)    250 (25%)

The obvious difference between the two polls is that the first had 1,000 respondents and the second had 1,400. The main effect of these additional respondents was to increase the frequency count for each answer option by 100.

When you analyze the new poll, you can use the default method of calculating the expected frequencies, or you can initialize the analysis with an expected probability for each outcome (based on the proportions observed in the previous poll). In the second case, you load the previously obtained proportions into an array of expected probabilities ($ExpProb) and use them to calculate the expected frequency for each answer option.
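
The second approach can be sketched in a few lines: each expected frequency is the new total multiplied by the corresponding proportion from the earlier poll. The chiSquareWithExpectedProbs function is a hypothetical stand-in written for this illustration; it is not the code from the article's package.

```php
<?php
// Sketch: chi-square statistic when the expected proportions come from an earlier poll.
function chiSquareWithExpectedProbs(array $observed, array $expProb): array
{
    if (count($observed) !== count($expProb)) {
        throw new InvalidArgumentException('Both arrays must have the same length.');
    }
    $total    = array_sum($observed);
    $expected = array();
    $chiSq    = 0.0;
    foreach ($observed as $i => $o) {
        $expected[$i] = $total * $expProb[$i];                      // E = N * p
        $chiSq       += pow($o - $expected[$i], 2) / $expected[$i]; // (O - E)^2 / E
    }
    return array('expected' => $expected, 'chiSq' => $chiSq);
}

$result = chiSquareWithExpectedProbs(
    array(385, 350, 315, 350),           // new poll
    array(0.285, 0.250, 0.215, 0.250)    // proportions observed in the first poll
);
printf("Expected: %s\n", implode(', ', array_map('round', $result['expected'])));
printf("Chi-square: %.2f\n", $result['chiSq']);
```

For the new poll, the expected frequencies work out to 399, 350, 301, and 350, so only the Keiths and Schooner cells contribute to the statistic.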

Listing 6 shows the beer poll analysis code for detecting a change in preferences:

Listing 6. Detecting a change in preferences

<?php
// beer_repoll_analysis.php

require_once "./init.php";
require PHP_MATH . "Chi/chisquare1d_html.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(385, 350, 315, 350);
$Alpha    = 0.05;
// Expected probabilities: the proportions observed in the first poll
$ExpProb  = array(.285, .250, .215, .250);

$Chi = new Chisquare1d_html($ObsFreq, $Alpha, $ExpProb);

$Chi->showTableSummary($Headings);
echo "\n\n";
$Chi->showChiSquareStats();
?>

Tables 5 and 6 show the HTML output generated by the beer_repoll_analysis.php script:

Table 5. Expected frequencies and variance obtained by running beer_repoll_analysis.php

                  Keiths   Olands   Schooner   Other   Total
Observed value    385      350      315        350     1400
Expected value    399      350      301        350     1400
Variance          0.49     0.00     0.65       0.00    1.14

Table 6. Chi-square statistics obtained by running beer_repoll_analysis.php

              DF   Obtained value   Probability   Critical value
Chi-square    3    1.14             0.77          7.81

Table 6 shows that, under the null hypothesis, the probability of obtaining a chi-square value of 1.14 or larger is 77%. We cannot reject the null hypothesis that the preferences of Nova Scotia beer drinkers have not changed since the last poll. Any difference between the observed and expected frequencies can be explained as the expected sampling variability within the same population of Nova Scotia beer drinkers. Given that the new poll results were produced simply by adding a constant of 100 to each of the previous poll results, this null finding should not be surprising.
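The figures in Tables 5 and 6 can be checked by hand. The following sketch recomputes the obtained chi-square value from the observed and expected frequencies and compares it with the critical value reported in Table 6. The function name chiSquareObtained is an illustrative assumption and is not part of the article's package.

```php
<?php
// One-way chi-square statistic: sum of (O - E)^2 / E over all answer options.
function chiSquareObtained(array $obs, array $exp)
{
    $sum = 0.0;
    foreach ($obs as $i => $o) {
        $sum += pow($o - $exp[$i], 2) / $exp[$i];
    }
    return $sum;
}

$ObsFreq = array(385, 350, 315, 350);  // new poll (Table 4)
$ExpFreq = array(399, 350, 301, 350);  // 1400 x old proportions (Table 5)
$Crit    = 7.81;                       // critical value for DF = 3, alpha = 0.05 (Table 6)

$ChiSqObt = chiSquareObtained($ObsFreq, $ExpFreq);
printf("Obtained chi-square: %.2f\n", $ChiSqObt);
echo $ChiSqObt > $Crit
    ? "Reject the null hypothesis\n"
    : "Cannot reject the null hypothesis\n";
```

Only the Keiths and Schooner cells contribute (196/399 and 196/301), which is why the total stays far below the critical value.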

However, you can imagine that the results had changed, and that the change implied that another brand of beer was becoming more popular (note the size of the variance reported at the bottom of each column in Table 5). You can further imagine that such a finding would have significant financial implications for the breweries concerned, because pub owners tend to stock the beers that sell best.

Results like these would be examined in great detail by the brewery owners, who would question the appropriateness of the analysis procedure and the experimental method, and in particular the representativeness of the sample. If you are planning a Web experiment that might have important practical implications, you need to pay equal attention to the experimental methods used to collect the data and to the analytic techniques used to draw inferences from it.

This article therefore not only lays a foundation for understanding Web data more effectively, it also offers advice on how to defend your choice of statistical test and make the conclusions you draw from the data more defensible.

Applying what you have learned

In this article, you have learned how to apply inferential statistics to the ubiquitous frequency data used to summarize Web data streams, focusing on the analysis of Web poll data. The simple one-way chi-square analysis procedure discussed here can also be applied effectively to other types of data streams (access logs, survey results, customer profiles, and customer orders) to turn raw data into useful knowledge.

In applying inferential statistics to Web data, I also suggested viewing data streams as the results of Web experiments, which makes it more likely that you will take experimental design considerations into account when making inferences. Often you cannot make inferences because you lack sufficient control over the data collection process. You can change this, however, by being more proactive in applying experimental design principles to Web data collection (for example, by randomly selecting voters for your Web poll).

Finally, I showed how to simulate the sampling distribution of the chi-square statistic for different degrees of freedom, rather than just stating where it comes from. In doing so, I also demonstrated a workaround (simulating the sampling distribution of the experiment with a small $NTrials value) for the prohibition against using the chi-square test when the expected frequency of a measured category is less than 5 (in other words, small-N experiments). So rather than using only the DF of a study to calculate the probability of a sample result, for a small number of trials you may also need to use the $NTrials value as a parameter to obtain the probability of the observed chi-square result.
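The small-N workaround can be sketched as follows. This is a minimal stand-alone illustration, not the article's Multinomial class: the function names, the 12-vote poll, and the fixed seed are all assumptions made for the example. It simulates many experiments with a small number of trials and reports the share of simulated chi-square values at least as large as the observed one.

```php
<?php
// Chi-square statistic against equal expected frequencies.
function chiSquareEqual(array $counts)
{
    $expected = array_sum($counts) / count($counts);
    $sum = 0.0;
    foreach ($counts as $o) {
        $sum += pow($o - $expected, 2) / $expected;
    }
    return $sum;
}

// Simulate $NExps experiments of $NTrials random choices among $NOptions
// equally likely options, and return the share of simulated chi-square
// values that are at least as large as the observed one.
function simulatedTailProb($observedChiSq, $NTrials, $NOptions, $NExps = 10000)
{
    $asExtreme = 0;
    for ($e = 0; $e < $NExps; $e++) {
        $counts = array_fill(1, $NOptions, 0);
        for ($t = 0; $t < $NTrials; $t++) {
            $counts[mt_rand(1, $NOptions)]++;
        }
        // Small tolerance so values equal to the observed one are counted.
        if (chiSquareEqual($counts) >= $observedChiSq - 1e-9) {
            $asExtreme++;
        }
    }
    return $asExtreme / $NExps;
}

mt_srand(2003);                                     // fixed seed so runs are repeatable
$observed = array(1 => 6, 2 => 2, 3 => 1, 4 => 3);  // hypothetical 12-vote poll
$p = simulatedTailProb(chiSquareEqual($observed), 12, 4);
printf("Simulated tail probability: %.3f\n", $p);
```

Here the expected frequency per option is 3, below the usual minimum of 5, so the simulated probability stands in for the one that the continuous chi-square distribution would supply.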

It is worth considering how you might analyze small-N experiments, because you will often want to analyze your data before data collection is complete: when each observation is expensive, when observations take a long time to obtain, or simply because you are curious. When attempting this level of Web data analysis, keep these two questions in mind:

* Are you justified in making inferences under small-N conditions?

* Can simulations help you decide what inferences to draw under these conditions?

Effective, multi-layered analysis of Web data is a key factor in the survival of many Web-oriented enterprises, yet the design of (and decisions about) data analysis and validation often falls to system administrators and in-house application designers, who may know little about statistics beyond how to tabulate raw counts. In this article, Paul Meagher teaches Web developers the skills and concepts they need to apply inferential statistics to Web data streams.

Dynamic Web sites generate large amounts of data (access logs, polls and surveys, customer profiles, orders, and more), and Web developers are expected not only to create the applications that generate this data but also to develop applications and methods that make sense of these data streams.

Typically, Web developers' responses fall short of the growing data analysis needs that come with managing a site. In general, Web developers have no better way of characterizing data streams than reporting various descriptive statistics. Many inferential statistical procedures (methods for estimating population parameters from sample data) could be exploited but currently are not.

For example, Web access statistics (as currently compiled) are little more than frequency counts grouped in various ways. The results of polls and surveys are all too often reported only as raw counts and percentages.

Perhaps it is enough for developers to handle the statistical analysis of data streams in this more superficial way, and we should not expect more. After all, there are professionals who do more sophisticated data-stream analysis: statisticians and trained analysts. When an organization needs more than descriptive statistics, it can bring them in.

Another response, however, is to acknowledge that a growing understanding of inferential statistics is becoming part of the Web developer's job description. Dynamic sites are generating more and more data, and turning that data into useful knowledge is proving to be the responsibility of Web developers and system administrators.

I advocate the latter response. This article is intended to help Web developers and system administrators learn (or revisit, if the knowledge has faded) the design and analysis skills needed to apply inferential statistics to Web data streams.

Relating Web data to experimental design

Applying inferential statistics to Web data streams requires more than learning the mathematics underlying various statistical tests. The ability to relate the data collection process to key distinctions in experimental design is just as important: What is the measurement scale? How representative is the sample? What is the population? What hypothesis is being tested?

To apply inferential statistics to Web data streams, you need to think of the results as having been generated by an experimental design, and then choose the analysis procedure that suits that design. Even if treating Web poll results and access-log data as the outcome of an experiment seems like overkill, it is really important to do so. Why?

1. This will help you select the appropriate statistical test method.

2. This will help you draw the appropriate conclusions from the collected data.

In determining which statistical test is appropriate, an important aspect of the experimental design is the choice of measurement scale for data collection.

Examples of measurement scales

A measurement scale simply specifies a procedure for assigning symbols, letters, or numbers to a phenomenon of interest. For example, the kilogram scale lets you assign a number to an object that indicates its weight, based on the standardized displacement of a measuring instrument.

There are four important measurement scales:

Ratio scale: The kilogram scale is an example of a ratio scale. The symbols assigned to an object's properties have numeric meaning. You can perform operations on these symbols, such as computing ratios, that you cannot perform on values obtained with less powerful measurement scales.

Interval scale: The distance (interval) between any two adjacent units of measurement is equal, but the zero point is arbitrary. Examples of interval scales include measurements of longitude and tide height, and the numbering of calendar years. Interval-scale values can be added and subtracted, but multiplying or dividing them is not meaningful.

Ordinal (rank) scale: The ordinal scale applies to ordered data, meaning that values and observations on the scale can be placed in sequence or accompanied by a rating scale. Common examples include "like/dislike" polls in which numbers are assigned to attitudes (from 1 = dislike very much to 5 = like very much). Ordered categories usually have a natural order, but the gaps between adjacent points on the scale need not be equal. With ordinal data, you can count and rank, but not measure.

Nominal (categorical) scale: The nominal scale is the weakest form of measurement and consists mainly of assigning items to groups or categories. The measurement carries no quantitative information and implies no ordering of the items. The main numerical operation performed on nominal data is counting the frequency of items in each category.

The following table compares the characteristics of each measurement scale:

Scale      Do the attributes have absolute numeric meaning?      Can most mathematical operations be performed?
Ratio      Yes.                                                   Yes.
Interval   Yes for distances; the zero point is arbitrary.        Addition and subtraction.
Ordinal    No.                                                    Counting and ranking.
Nominal    No.                                                    Counting only.

In this article, I focus on data collected using the nominal scale and on the inference techniques that apply to nominal data.

Using the nominal scale

Almost all Web users (designers, customers, and system administrators) are familiar with the nominal scale. Web polls resemble access logs in that both often use the nominal scale as their measurement scale. In a Web poll, you often measure people's preferences by asking them to choose an answer option (such as "Do you prefer brand A, brand B, or brand C?"). You then summarize the data by counting the frequency of each type of answer.

Similarly, a common way to measure site traffic is to assign each hit or visit during a week to the day of the week on which it occurred and then count the hits or visits for each day. You can also count hits by browser type, operating system type, and the visitor's country or region, or by any other categorical scale you like (and people do).
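Counting visits by day of the week might look like the following sketch. The timestamps are hypothetical sample data; in practice they would come from an access log, and the parsing step would depend on the log format.

```php
<?php
// Hypothetical visit timestamps (UTC); in practice these would come from an access log.
$visits = array(
    '2003-03-03 09:15:00', // Monday
    '2003-03-03 17:40:00', // Monday
    '2003-03-05 11:05:00', // Wednesday
    '2003-03-08 20:30:00', // Saturday
    '2003-03-09 08:00:00', // Sunday
);

// Start every day at zero so days without visits still appear in the table.
$days   = array('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');
$counts = array_fill_keys($days, 0);

foreach ($visits as $stamp) {
    $day = gmdate('l', strtotime($stamp . ' UTC')); // 'l' = full weekday name
    $counts[$day]++;
}

print_r($counts);
```

Keeping the zero-count days matters for the analysis discussed in this article, because every category counts as a cell when the degrees of freedom are calculated.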

Because Web polls and access statistics both involve counting the number of times data fall into a particular category, you can analyze them with similar nonparametric statistical tests (tests that let you make inferences based on the shape of a distribution rather than on population parameters).

In his Handbook of Parametric and Nonparametric Statistical Procedures (1997, p. 19), David Sheskin distinguishes parametric from nonparametric tests as follows:

The distinction used in this book to classify procedures as parametric or nonparametric tests is based primarily on the level of measurement represented by the data being analyzed. As a general rule, tests that evaluate categorical/nominal data and ordinal/rank-order data are classified as nonparametric tests, while those that evaluate interval data or ratio data are classified as parametric tests.

Nonparametric tests are also useful when the assumptions underlying a parametric test are questionable; when the parametric assumptions are not met, nonparametric tests are more effective at detecting population differences. For the Web poll example, I use a nonparametric analysis procedure because Web polls typically use a nominal scale to record voter preferences.

I am not suggesting that Web polls and Web access statistics should always use a nominal scale, or that nonparametric tests are the only way to analyze such data. It is not hard to imagine polls and surveys that ask users to give a numeric rating (from 1 to 100) for each option, for which parametric statistical tests would be more appropriate.

Nonetheless, many Web data streams consist of compiled category counts, and data measured on more powerful scales can be converted into nominal data by defining intervals (for example, ages 17 to 21) and assigning each data point to a category (such as "young adult"). The prevalence of frequency data (already part of Web developers' experience) makes nonparametric statistics a good starting point for learning how to apply inference techniques to data streams.
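Converting a numeric measurement into nominal data can be as simple as the following sketch. The age bands, the labels, and the sample ages are illustrative assumptions; only the 17-to-21 band comes from the example in the text.

```php
<?php
// Map a numeric age (ratio scale) to a nominal category.
// The age bands and labels are illustrative assumptions.
function ageCategory($age)
{
    if ($age < 17) {
        return 'Under 17';
    }
    if ($age <= 21) {
        return 'Young adult';
    }
    if ($age <= 64) {
        return 'Adult';
    }
    return 'Senior';
}

$ages   = array(16, 18, 21, 22, 35, 70, 19);   // hypothetical member ages
$counts = array_count_values(array_map('ageCategory', $ages));
print_r($counts);
```

The resulting counts are frequency data of exactly the kind that the chi-square procedure in this article analyzes, although information about the distances between ages is lost in the conversion.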

To keep this article to a reasonable length, I confine my discussion of Web data-stream analysis to Web polls. Keep in mind, however, that many Web data streams can be represented as category counts, and the inference techniques I discuss will let you do more than report simple counts.

Start with sampling

Suppose you run a weekly poll on your site, www.NovaScotiaBeerDrinkers.com, asking members for their opinions on various topics. You have created a poll that asks members about their favorite beer brand (Nova Scotia, Canada, has three well-known beer brands: Keiths, Olands, and Schooner). To make the survey as inclusive as possible, you include "Other" among the answers.

You receive 1,000 responses and observe the results in Table 1. (The results shown in this article are for demonstration purposes only and are not based on any actual survey.)

Table 1. Beer poll

Keiths         Olands       Schooner       Other
285 (28.5%)    250 (25%)    215 (21.5%)    250 (25%)

These figures seem to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. Can you draw that conclusion from these figures? In other words, can you generalize from the sample results to the population of Nova Scotia beer drinkers?

Many factors related to how the sample was collected could make the inference about relative popularity incorrect. The sample may contain too many employees of the Keiths brewery; perhaps you did not completely prevent one person from casting multiple votes, and that person skewed the results; perhaps the people who chose to vote differ from those who did not; perhaps voters who are online differ from people who are not.

Most Web polls suffer from these interpretive difficulties, which arise when you try to draw conclusions about population parameters from sample statistics. From an experimental design standpoint, one of the first questions to ask before collecting data is whether steps can be taken to help ensure that the sample is representative of the population being studied.

If drawing conclusions about the population under study is your motivation for running a Web poll (rather than providing a pastime for site visitors), then you should implement techniques to ensure one person, one vote (so that voters must log in with a unique identity to vote) and to ensure that the sample of voters is randomly selected (for example, by randomly selecting a subset of members, then emailing them and encouraging them to vote).
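Randomly selecting a subset of members could be sketched as below. The member IDs and the sample size are hypothetical, and PHP's shuffle() is adequate for sampling but is not a cryptographically secure source of randomness.

```php
<?php
// Draw a simple random sample of $n member IDs without replacement.
function randomSample(array $memberIds, $n)
{
    shuffle($memberIds);                  // random order (operates on a local copy)
    return array_slice($memberIds, 0, $n);
}

$members = range(1, 5000);                // hypothetical member IDs
$invited = randomSample($members, 500);   // these members would receive the email invitation
echo count($invited), " members selected\n";
```

Random selection controls who is invited; members can still decline to vote, so some self-selection bias may remain.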

Ultimately, the goal is to eliminate (or at least reduce) biases that could weaken your ability to draw conclusions about the population under study.

Testing hypotheses

Assuming the sample of Nova Scotia beer drinkers is unbiased, can you now conclude that Keiths is the most popular brand?

To answer this question, consider a related one: if you drew another sample of Nova Scotia beer drinkers, would you expect to see exactly the same results? In fact, you would expect some variation in the results observed across different samples.

Given this expected sampling variability, you may wonder whether the observed brand preferences are better explained by random sampling variability than by a real difference in the population under study. In statistical parlance, this sampling-variability explanation is called the null hypothesis (denoted H0). In this example, it can be stated as a formula: the expected number of responses is the same in every answer category.

H0: # Keiths = # Olands = # Schooner = # Other

If you can reject the null hypothesis, you have made some progress toward answering the original question of whether Keiths is the most popular brand. The alternative hypothesis that you would then accept is that the proportions of responses differ in the population under study.

This "test the null hypothesis first" logic applies at many stages of the analysis of poll data. Once you have rejected this null hypothesis, and so established that the data are not entirely undifferentiated, you can go on to test more specific null hypotheses, such as that there is no difference between Keiths and Schooner, or between Keiths and all other brands combined.

You test the null hypothesis rather than evaluating the alternative hypothesis directly because it is easier to build a statistical model of what you would expect to observe under the null hypothesis. Next, I demonstrate how to model what is expected under the null hypothesis so that the observed results can be compared with it.

Modeling the null hypothesis: the chi-square statistic

So far, you have summarized the results of a Web poll with a table that reports frequency counts (and percentages) for each answer option. To test the null hypothesis (that the table cell frequencies do not differ), it is easier to calculate a single overall measure of how far each table cell deviates from what you would expect under the null hypothesis.

In the beer popularity poll, the expected frequency under the null hypothesis is as follows:

Expected frequency = number of observations / number of answer options

Expected frequency = 1000 / 4

Expected frequency = 250

To calculate an overall measure of how much the content of each cell differs from the expected frequency, you could sum all the differences into one total that reflects how far the observed frequencies depart from the expected frequencies: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you will find that the total is 0, because deviations from the mean always sum to 0. To solve this problem, square each difference (this is where the "square" in chi-square comes from). Finally, to make the values comparable across samples with different numbers of observations (in other words, to normalize them), divide each squared difference by the expected frequency. The formula for the chi-square statistic is therefore as follows ("O" stands for observed frequency, "E" for expected frequency):

Figure 1. Formula for the chi-square statistic: \chi^2 = \sum (O - E)^2 / E

If you calculate the chi-square statistic for the beer popularity poll data, you get a value of 9.80. To test the null hypothesis, you need to know the probability of obtaining a value this extreme through random sampling variability alone. To find this probability, you need to understand what the sampling distribution of chi-square looks like.
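The arithmetic above can be checked with a few lines of plain PHP. This sketch does not use the article's package; it only shows that the raw deviations cancel out while the squared, normalized deviations add up to the obtained chi-square value.

```php
<?php
$ObsFreq  = array(285, 250, 215, 250);              // Table 1
$Expected = array_sum($ObsFreq) / count($ObsFreq);  // 1000 / 4 = 250

$sumDiff = 0;
$chiSq   = 0.0;
foreach ($ObsFreq as $o) {
    $sumDiff += $o - $Expected;                     // raw deviations cancel out
    $chiSq   += pow($o - $Expected, 2) / $Expected; // squared, normalized deviations
}

printf("Sum of raw deviations: %d\n", $sumDiff);    // 0
printf("Chi-square: %.2f\n", $chiSq);               // 9.80
```

Each of the two non-zero cells contributes 35 squared divided by 250, or 4.90.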

Observing the sampling distribution of chi-square

Figure 2. Chi-square distributions

In each graph, the horizontal axis represents the size of the obtained chi-square value (the range shown is 0 to 10). The vertical axis shows the probability (or relative frequency of occurrence) of each chi-square value.

As you study these chi-square graphs, notice that the shape of the probability function changes when you change the degrees of freedom (DF) in your experiment. For poll data, the degrees of freedom are calculated by taking the number of answer options (k) in the poll and subtracting 1 (df = k - 1).

In general, the probability of obtaining a large chi-square value increases as you increase the number of answer options in your experiment. This is because adding answer options increases the number of variance values, (observed - expected)^2, that you are summing over. So as you add answer options, the probability of obtaining a large chi-square value goes up and the probability of obtaining a small chi-square value goes down. This is why the shape of the sampling distribution of chi-square varies with the DF value.

Note also that what matters is not the height of the curve at the obtained chi-square value but the total area under the curve to the right of that value. This tail probability tells you whether obtaining a value as extreme as the one you observed is likely (a large tail area) or unlikely (a small tail area). (In practice, I do not use these graphs to calculate tail probabilities, because I can implement a mathematical function that returns the tail probability for a given chi-square value. This approach is used in the chi-square software discussed later in this article.)
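One way such a tail-probability function could be written is sketched below. It is not the method used in the article's distribution.php, whose internals are not shown here; it is a stand-alone series expansion of the regularized incomplete gamma function, and the function names are assumptions. It is suitable for moderate chi-square values and integer degrees of freedom.

```php
<?php
// Gamma function for arguments that are multiples of 1/2 (argument = $twoA / 2),
// which is all a chi-square distribution with integer DF needs.
function gammaHalf($twoA)
{
    $even = ($twoA % 2 == 0);
    $g    = $even ? 1.0 : sqrt(M_PI);           // Gamma(1) or Gamma(1/2)
    for ($a = $even ? 1.0 : 0.5; $a < $twoA / 2; $a += 1.0) {
        $g *= $a;                               // Gamma(a + 1) = a * Gamma(a)
    }
    return $g;
}

// Upper-tail probability P(X >= $x) for a chi-square distribution with $df
// degrees of freedom. With a = df/2 and z = x/2, the lower tail is
// z^a * e^(-z) * sum over n >= 0 of z^n / Gamma(a + n + 1).
function chiSquareTail($x, $df)
{
    if ($x <= 0) {
        return 1.0;
    }
    $a    = $df / 2;
    $z    = $x / 2;
    $term = pow($z, $a) * exp(-$z) / gammaHalf($df + 2);  // n = 0 term
    $sum  = $term;
    for ($n = 1; $n < 1000 && $term > 1e-15 * $sum; $n++) {
        $term *= $z / ($a + $n);                           // next term of the series
        $sum  += $term;
    }
    return max(0.0, 1.0 - $sum);
}

printf("%.2f\n", chiSquareTail(9.80, 3));  // 0.02
```

For DF = 3, a chi-square value of 9.80 leaves about 2% of the area in the right tail, which is a small tail area in the sense described above.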

To learn more about how these graphs are derived, look at how to simulate the graph corresponding to DF = 2 (which represents k = 3). Imagine putting the numbers 1, 2, and 3 in a hat, shaking it, drawing a number, and recording the number drawn as one trial. Repeat this for 300 trials, and then count the frequencies of 1, 2, and 3.

Each time you run this experiment, you should expect a slightly different frequency distribution, reflecting sampling variability, although the distribution will not stray far from the range of likely outcomes.

The following Multinomial class implements this idea. You initialize the class with the number of experiments to run, the number of trials in each experiment, and the number of options in each trial. The results of each experiment are recorded in an array named Outcomes.

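Listing 1. Contents of the Multinomial class

<?php
// multinomial.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

class Multinomial {

    var $NExps;               // number of experiments
    var $NTrials;             // number of trials per experiment
    var $NOptions;            // number of options per trial
    var $Outcomes = array();  // one frequency table per experiment

    // Run all experiments and record the outcome of each one.
    function __construct($NExps, $NTrials, $NOptions) {
        $this->NExps    = $NExps;
        $this->NTrials  = $NTrials;
        $this->NOptions = $NOptions;
        for ($i = 0; $i < $this->NExps; $i++) {
            $this->Outcomes[$i] = $this->runExperiment();
        }
    }

    // Make NTrials random choices and tally how often each option came up.
    function runExperiment() {
        // Start every option at zero so unchosen options still count as cells.
        $Outcome = array_fill(1, $this->NOptions, 0);
        for ($i = 0; $i < $this->NTrials; $i++) {
            $choice = rand(1, $this->NOptions);
            $Outcome[$choice]++;
        }
        return $Outcome;
    }

}
?>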

Note that the runExperiment method is the essential part of the script: it ensures that the choice made on each trial is random, and it keeps a tally of the choices made so far in the simulated experiment.

To find the sampling distribution of chi-square, you simply take the results of each experiment and calculate the chi-square statistic for them. Because of random sampling variability, the chi-square statistic varies from experiment to experiment.

The following script writes the chi-square statistic obtained for each experiment to an output file so that it can be charted later.

Listing 2. Writing the obtained chi-square statistics to an output file

<?php
// simulate.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

// Set time limit to 0 so script doesn't time out
set_time_limit(0);

require_once "./init.php";
require PHP_MATH . "chi/multinomial.php";
require PHP_MATH . "chi/chisquare1d.php";

// Initialization parameters
$NExps    = 10000;
$NTrials  = 300;
$NOptions = 3;

$multi = new Multinomial($NExps, $NTrials, $NOptions);

$output = fopen("./data.txt", "w") or die("file won't open");

$distribution = array();
for ($i = 0; $i < $NExps; $i++) {
    // For each multinomial experiment, do chi-square analysis
    $chi = new ChiSquare1D($multi->Outcomes[$i]);
    // Load obtained chi-square value into sampling distribution array
    $distribution[$i] = $chi->ChiSqObt;
    // Write obtained chi-square value to file
    fputs($output, $distribution[$i] . "\n");
}

fclose($output);
?>

To visualize the results of running this experiment, the simplest approach for me was to load the data.txt file into the open source statistics package R, run the histogram command, and then edit the chart in a graphics editor:

x = scan("data.txt")
hist(x, 50)

As you can see, the histogram of these chi-square values approximates the continuous chi-square distribution for DF = 2 shown above.

Figure 3. Approximation of the continuous chi-square distribution with df = 2

In the following sections, I focus on how the chi-square software used in this simulation works. Typically, the chi-square software will be used to analyze actual nominal data (such as Web poll results, weekly traffic reports, or customer brand-preference reports) rather than the simulated data used here. You may also be interested in other output that the software generates, such as summary tables and tail probabilities.

Instance variables of the chi-square class

The PHP-based chi-square package I developed consists of classes for analyzing frequency data classified along one dimension or two dimensions (chisquare1d.php and chisquare2d.php). My discussion is limited to explaining how the ChiSquare1D class works and how to apply it to one-dimensional Web poll data.

Before proceeding, note that classifying data along two dimensions (for example, beer preference by gender) lets you begin to explain your results by looking for systematic relationships or conditional probabilities in the contingency table cells. While much of the discussion that follows will help you understand how the chisquare2d.php software works, other experimental, analytic, and visualization issues not discussed in this article also need to be addressed before you use that class.

Listing 3 examines a fragment of the ChiSquare1D class, which consists of the following parts:

1. An included file

2. Class instance variables

Listing 3. Fragment of the chi-square class with included file and instance variables

<?php
// chisquare1d.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

require_once PHP_MATH . "dist/distribution.php";

class ChiSquare1D {

    var $Total;
    var $ObsFreq = array(); // Observed frequencies
    var $ExpFreq = array(); // Expected frequencies
    var $ExpProb = array(); // Expected probabilities
    var $NumCells;
    var $ChiSqObt;
    var $DF;
    var $Alpha;
    var $ChiSqProb;
    var $ChiSqCrit;

}
?>

The top part of the script in Listing 3 includes a file named distribution.php. The include path uses the PHP_MATH constant set in init.php, on the assumption that init.php has already been included by the calling script.

The included file distribution.php contains methods for generating sampling distribution statistics for several commonly used sampling distributions (the t, F, and chi-square distributions). The ChiSquare1D class must be able to access the chi-square methods in distribution.php to calculate the tail probability of the obtained chi-square value.

The list of instance variables in this class is worth noting because they define the result object generated by the analysis. This result object contains all the important details of the test, including three important chi-square statistics: ChiSqObt, ChiSqProb, and ChiSqCrit. To see how each instance variable is calculated, look at the class constructor, where all of these values are derived.

Constructor: The backbone of the chi-square test

Listing 4 shows the constructor code for the chi-square class, which forms the backbone of the chi-square test.

Listing 4. The constructor of the chi-square class

<?php
class ChiSquare1D {

    function __construct($ObsFreq, $Alpha = 0.05, $ExpProb = false) {
        $this->ObsFreq   = $ObsFreq;
        $this->ExpProb   = $ExpProb;
        $this->Alpha     = $Alpha;
        $this->NumCells  = count($this->ObsFreq);
        $this->DF        = $this->NumCells - 1;
        $this->Total     = $this->getTotal();
        $this->ExpFreq   = $this->getExpFreq();
        $this->ChiSqObt  = $this->getChiSqObt();
        $this->ChiSqCrit = $this->getChiSqCrit();
        $this->ChiSqProb = $this->getChiSqProb();
        return true;
    }

}
?>

Four aspects of the constructor are worth noting:

1. The constructor accepts an array of observed frequencies, an alpha probability cutoff, and an optional array of expected probabilities.

2. The first six lines involve relatively simple assignments and computed values, which are recorded so that the complete result object is available to the calling script.

3. The last four lines do most of the work of obtaining the chi-square statistics that you are most interested in.

4. The class implements only the chi-square test logic. No output methods are associated with the class.

You can study the class methods included in the code download for this article to learn more about how each value in the result object is calculated (see Resources).

Handling output issues

The code in Listing 5 shows how easy it is to perform a chi-square analysis using the ChiSquare1D class. It also demonstrates how output is handled.

The script invokes a wrapper script named chisquare1d_html.php. The purpose of this wrapper is to separate the logic of the chi-square procedure from its presentation. The _html suffix indicates that the output is intended for a standard Web browser or another device that displays HTML.

Another purpose of the wrapper is to organize the output in a way that makes the data easier to understand. To this end, the class contains two methods for displaying the results of a chi-square analysis. The showTableSummary method displays the first output table shown after the code (Table 2), and showChiSquareStats displays the second (Table 3).
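To make the idea of separating logic from presentation concrete, here is a small stand-alone sketch of a presentation helper. It is a hypothetical function, not the showTableSummary method from the article's package (whose source is not shown here): it only formats values that an analysis step has already computed.

```php
<?php
// Hypothetical presentation helper: builds an HTML table from values that
// the analysis step has already computed. It performs no statistics itself.
function renderSummaryTable(array $headings, array $obs, array $exp)
{
    $rows = array(
        'Observed' => $obs,
        'Expected' => $exp,
    );
    $html = "<table border='1'>\n<tr><th></th>";
    foreach ($headings as $h) {
        $html .= '<th>' . htmlspecialchars($h) . '</th>';
    }
    $html .= "<th>Total</th></tr>\n";
    foreach ($rows as $label => $values) {
        $html .= "<tr><td>$label</td>";
        foreach ($values as $v) {
            $html .= '<td>' . $v . '</td>';
        }
        $html .= '<td>' . array_sum($values) . "</td></tr>\n";
    }
    return $html . "</table>\n";
}

$html = renderSummaryTable(
    array('Keiths', 'Olands', 'Schooner', 'Other'),
    array(285, 250, 215, 250),
    array(250, 250, 250, 250)
);
echo $html;
```

Because the helper only receives finished numbers, the same analysis could be paired with a different renderer (plain text, for example) without touching the test logic.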

Listing 5. Organizing data with a wrapper script

<?php
// beer_poll_analysis.php

require_once "./init.php";
require_once PHP_MATH . "chi/chisquare1d_html.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(285, 250, 215, 250);   // observed poll counts
$Alpha    = 0.05;

$Chi = new ChiSquare1D_HTML($ObsFreq, $Alpha);
$Chi->showTableSummary($Headings);
echo "\n\n";                             // separator between the two output tables
$Chi->showChiSquareStats();
?>

The script generates the following output:

Table 2. Expected frequencies and variances obtained by running the wrapper script

                   Keiths   Olands   Schooner   Other   Total
Observed value     285      250      215        250     1000
Expected value     250      250      250        250     1000
Variance           4.90     0.00     4.90       0.00    9.80

Table 3. Chi-square statistics obtained by running the wrapper script

               DF   Obtained value   Probability   Critical value
Chi-square     3    9.80             0.02          7.81

Table 2 shows the expected frequency and the variance measure, (O - E)^2 / E, for each cell. The sum of the variance values equals the obtained chi-square value (9.80), which is displayed in the lower-right cell of the summary table.
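The arithmetic behind Table 2 can be sketched in a few lines of PHP. This is a minimal illustration, not the code of the ChiSquare1D class: the function name chiSquareSummary and the keys of the array it returns are assumptions made for this example, and equal expected frequencies are assumed when no expected values are supplied, which is what Table 2 shows.

```php
<?php
// Hypothetical helper for illustration; not part of the article's ChiSquare1D class.
// Returns the per-cell (O - E)^2 / E values, their sum, and the degrees of freedom.
function chiSquareSummary(array $observed, ?array $expected = null): array
{
    $k = count($observed);
    $total = array_sum($observed);

    if ($expected === null) {
        // Assumed default: every category is equally likely.
        $expected = array_fill(0, $k, $total / $k);
    }

    $cells = [];
    foreach ($observed as $i => $o) {
        $e = $expected[$i];
        $cells[] = ($o - $e) ** 2 / $e;   // the "variance" row of Table 2
    }

    return [
        'expected'  => $expected,
        'cells'     => $cells,
        'chiSquare' => array_sum($cells), // obtained chi-square value
        'df'        => $k - 1,            // degrees of freedom for a one-way test
    ];
}

$summary = chiSquareSummary([285, 250, 215, 250]);
printf("chi-square = %.2f, df = %d\n", $summary['chiSquare'], $summary['df']);
// prints: chi-square = 9.80, df = 3
```

The per-cell values are kept in the result so that a display layer can print them separately, in the same spirit as the wrapper script.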

Table 3 reports the various chi-square statistics. It includes the degrees of freedom used in the analysis, and again reports the obtained chi-square value. The obtained chi-square value is re-expressed as a tail probability, in this case 0.02. This means that, under the null hypothesis, the probability of observing a chi-square value as extreme as 9.80 is 2% (a fairly low probability).
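One way to obtain such a tail probability, sketched here under stated assumptions, is to evaluate the regularized incomplete gamma function with a power series. The function name chiSquareTailProbability is hypothetical and this is not necessarily how the article's class computes the value. The series is adequate for the moderate chi-square values seen in this article, but it converges slowly and loses precision for very large values.

```php
<?php
// Hypothetical helper: upper-tail probability of a chi-square value.
// Uses the series P(a, x) = e^(-x) * x^a * sum_{n>=0} x^n / Gamma(a + n + 1)
// with a = df / 2 and x = chiSquare / 2, then returns 1 - P(a, x).
function chiSquareTailProbability(float $chiSquare, int $df): float
{
    if ($df < 1) {
        throw new InvalidArgumentException('df must be at least 1.');
    }
    if ($chiSquare <= 0.0) {
        return 1.0;
    }

    $a = $df / 2;
    $x = $chiSquare / 2;

    // Build Gamma(a + 1) from Gamma(1) = 1 or Gamma(1/2) = sqrt(pi),
    // using the recurrence Gamma(z + 1) = z * Gamma(z).
    $isEven = ($df % 2 === 0);
    $z = $isEven ? 1.0 : 0.5;
    $gamma = $isEven ? 1.0 : sqrt(M_PI);
    while ($z < $a + 1) {
        $gamma *= $z;
        $z += 1.0;
    }

    // Sum the series; each term is the previous one times x / (a + n).
    $term = 1.0 / $gamma;
    $sum = $term;
    for ($n = 1; $n < 10000; $n++) {
        $term *= $x / ($a + $n);
        $sum += $term;
        if ($term < 1e-15 * $sum) {
            break;
        }
    }

    $lowerTail = exp(-$x) * pow($x, $a) * $sum;
    return max(0.0, 1.0 - $lowerTail);
}

printf("p = %.2f\n", chiSquareTailProbability(9.80, 3));
// prints: p = 0.02
```

Evaluating the same function at the critical value reported in Table 3 (7.81 with 3 degrees of freedom) gives approximately 0.05, the alpha cutoff used in the analysis.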

If you decide to reject the null hypothesis (that the results can be accounted for by the random sampling variability of the null distribution), most statisticians would not object. Your poll results more likely reflect a real difference in the brand preferences of the population of Nova Scotia beer drinkers.

To confirm this conclusion, the obtained chi-square value can be compared with the critical value.

Why is the critical value important? The critical value is based on the significance level (that is, the alpha cutoff level) set for the analysis. By convention, the alpha cutoff is set to 0.05 (the value used in the analysis above). This setting is used to find the location (the critical value) in the sampling distribution of chi-square beyond which the tail area equals the alpha cutoff (0.05).

In this article, the obtained chi-square value (9.80) is greater than the critical value (7.81). This means that the threshold for retaining the null hypothesis has been exceeded. The alternative hypothesis, that the proportions differ in the population, is statistically more likely to be correct.

In automated analysis of data streams, the alpha cutoff can serve as an output filter for knowledge-discovery algorithms such as Chi-square Automatic Interaction Detection (CHAID). Such algorithms cannot by themselves provide detailed guidance on discovering truly useful patterns.

A re-poll

Another interesting application of the one-way chi-square test is to re-poll to see whether people's answers have changed.

Suppose that, some time later, you conduct another Web poll of Nova Scotia beer drinkers. Once again you ask about their favorite beer brand, and this time you observe the following results:

Table 4. The new beer poll

Keiths          Olands       Schooner       Other
385 (27.5%)     350 (25%)    315 (22.5%)    350 (25%)

The old data is as follows:

Table 1. The old beer poll (shown again)

Keiths          Olands       Schooner       Other
285 (28.5%)     250 (25%)    215 (21.5%)    250 (25%)

The obvious difference between the two polls is that the first had 1,000 respondents and the second had 1,400. The main effect of these additional respondents was to increase the frequency count of each response option by 100.

When you are ready to analyze the new poll, you can use the default method of calculating the expected frequencies, or you can initialize the analysis with the expected probability of each outcome (based on the proportions observed in the previous poll). In the second case, you load the previously obtained proportions into an array of expected probabilities ($ExpProb) and use them to calculate the expected frequency for each answer option.
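The second approach amounts to multiplying each expected probability by the new sample size. The sketch below is illustrative only: the function name expectedFromProbabilities and the lowercase variable names are assumptions for this example, not part of the article's class.

```php
<?php
// Hypothetical helper: turn expected probabilities into expected frequencies
// for a sample of the same size as the observed data.
function expectedFromProbabilities(array $observed, array $expProb): array
{
    if (count($observed) !== count($expProb)) {
        throw new InvalidArgumentException('Need one probability per category.');
    }
    $total = array_sum($observed);
    $expected = [];
    foreach ($expProb as $p) {
        $expected[] = $p * $total;
    }
    return $expected;
}

$obsFreq = [385, 350, 315, 350];       // new poll
$expProb = [.285, .250, .215, .250];   // proportions from the old poll

$expected = expectedFromProbabilities($obsFreq, $expProb);

// Sum (O - E)^2 / E over the categories to get the obtained chi-square value.
$chiSquare = 0.0;
foreach ($obsFreq as $i => $o) {
    $chiSquare += ($o - $expected[$i]) ** 2 / $expected[$i];
}

printf("expected = %s\n", implode(", ", array_map('round', $expected)));
printf("chi-square = %.2f\n", $chiSquare);
// prints: expected = 399, 350, 301, 350
// prints: chi-square = 1.14
```

Because the old proportions are applied to the new total of 1,400 respondents, the expected frequencies are no longer equal across categories.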

Listing 6 shows the beer poll analysis code for detecting preference changes:

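Listing 6. Detecting changes in preference

<?php
// beer_repoll_analysis.php

require_once "./init.php";
require_once PHP_MATH . "chi/ChiSquare1D_HTML.php";

// Category labels and the observed frequencies from the new poll
$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(385, 350, 315, 350);

// Alpha cutoff (significance level) for the test
$Alpha = 0.05;

// Expected probabilities taken from the proportions in the old poll
$ExpProb = array(.285, .250, .215, .250);

$Chi = new ChiSquare1D_HTML($ObsFreq, $Alpha, $ExpProb);

$Chi->showTableSummary($Headings);   // Table 5
echo "\n\n";                         // blank lines between the two tables
$Chi->showChiSquareStats();          // Table 6
?>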

Tables 5 and 6 show the HTML output generated by the beer_repoll_analysis.php script:

Table 5. Expected frequencies and variances obtained by running beer_repoll_analysis.php

                   Keiths   Olands   Schooner   Other   Total
Observed value     385      350      315        350     1400
Expected value     399      350      301        350     1400
Variance           0.49     0.00     0.65       0.00    1.14

Table 6. Chi-square statistics obtained by running beer_repoll_analysis.php

               DF   Obtained value   Probability   Critical value
Chi-square     3    1.14             0.77          7.81

Table 6 shows that, under the null hypothesis, the probability of obtaining a chi-square value as extreme as 1.14 is 77%. We cannot reject the null hypothesis that the preferences of Nova Scotia beer drinkers have not changed since the last poll. Any difference between the observed and expected frequencies can be explained as the expected sampling variability within the same population of Nova Scotia beer drinkers. Given that the new poll results were produced simply by adding a constant of 100 to each of the previous poll results, this null finding should not be surprising.

However, you can imagine that the results had changed, and that the changed results implied that another brand of beer was becoming more popular (note the size of the variance reported at the bottom of each column in Table 5). You can further imagine that this finding would have significant financial implications for the brewery in question, since pub owners tend to stock the best-selling beers.

Such results would be scrutinized closely by the brewery's owners, who would question the suitability of the analytical procedure and the experimental methods and, in particular, the representativeness of the sample. If you are planning a Web experiment that might have important practical implications, you need to pay equal attention to the experimental methods used to collect the data and to the analytical techniques used to draw inferences from the data.

This article therefore not only gives you a foundation for improving your understanding of Web data, it also offers advice on how to defend your choice of statistical test and how to make the conclusions you draw from the data more defensible.

Applying what you have learned

In this article, you have learned how to apply inferential statistics to the ubiquitous frequency data used to summarize Web data streams, with a focus on the analysis of Web poll data. However, the simple one-way chi-square procedure discussed here can also be applied effectively to other types of data streams (access logs, survey results, customer profiles, and customer orders) to convert raw data into useful knowledge.

In applying inferential statistics to Web data, I also described the benefit of viewing a data stream as a Web experiment, which makes it more likely that you will take experimental design considerations into account when making inferences. Often you cannot make inferences because you lack sufficient control over the data collection process. However, you can change this situation by being more proactive in applying experimental design principles to the Web data collection process (for example, by randomly selecting voters during your Web poll).

Finally, I showed how to simulate the sampling distribution of the chi-square statistic for different degrees of freedom, rather than simply taking it from a source. In doing so, I also demonstrated a workaround for the prohibition against using the chi-square test when the expected frequency of a measurement category is less than 5 (in other words, a small-N experiment): simulate the sampling distribution of the experiment using a small $NTrials value. So, rather than using only the degrees of freedom of the study to calculate the probability of a sample result, for a small number of trials you may also need to use the $NTrials value as a parameter to obtain the probability of the observed chi-square result.
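The idea of simulating the sampling distribution for a small number of trials can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code from the article: the function name, the $nExperiments and $seed parameters, and the example values are all hypothetical. Each simulated experiment draws $nTrials answers according to the expected probabilities, computes its chi-square value, and the function reports the share of simulated values at least as large as the observed one.

```php
<?php
// Hypothetical sketch: estimate the tail probability of an observed chi-square
// value by simulating many experiments of $nTrials observations each.
function simulateTailProbability(
    float $observedChiSquare,
    array $expProb,
    int $nTrials,
    int $nExperiments,
    int $seed = 1
): float {
    mt_srand($seed);   // fixed seed so repeated runs give the same estimate
    $k = count($expProb);
    $atLeastAsExtreme = 0;

    for ($exp = 0; $exp < $nExperiments; $exp++) {
        $counts = array_fill(0, $k, 0);

        for ($t = 0; $t < $nTrials; $t++) {
            // Draw one category according to the expected probabilities.
            $u = mt_rand() / (mt_getrandmax() + 1);
            $cumulative = 0.0;
            $chosen = $k - 1;   // fallback guards against rounding in the running sum
            foreach ($expProb as $i => $p) {
                $cumulative += $p;
                if ($u < $cumulative) {
                    $chosen = $i;
                    break;
                }
            }
            $counts[$chosen]++;
        }

        // Chi-square value of this simulated experiment.
        $chi = 0.0;
        foreach ($expProb as $i => $p) {
            $e = $p * $nTrials;
            $chi += ($counts[$i] - $e) ** 2 / $e;
        }

        if ($chi >= $observedChiSquare) {
            $atLeastAsExtreme++;
        }
    }

    return $atLeastAsExtreme / $nExperiments;
}

// Example: 12 respondents and 4 equally likely answers, so each expected
// frequency is 3, below the usual minimum of 5.
$p = simulateTailProbability(6.0, [.25, .25, .25, .25], 12, 5000);
printf("simulated tail probability = %.3f\n", $p);
```

The estimate depends on both the number of categories and the number of trials, which is why the number of trials acts as a parameter alongside the degrees of freedom in small-N situations.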

It is worthwhile to consider how you might analyze small-N experiments, because you will often want to analyze your data before data collection is complete: when each observation is expensive, when observations take a long time to obtain, or simply because you are curious. When attempting this level of Web data analysis, it is a good idea to keep these two questions in mind:

* Are you justified in making inferences under small-N conditions?

* Can simulations help you decide what inferences to draw in these circumstances?