Web data analysis: Do more with your data than simple raw counts

Effective, multi-level analysis of Web data is a key factor in the survival of many Web-enabled enterprises. The design of (and decisions about) data analysis tests often falls to system administrators and in-house application designers, who may know little more about statistics than how to tabulate raw counts. In this article, Paul Meagher teaches Web developers the skills and concepts needed to apply inferential statistics to Web data streams.

Dynamic Web sites generate a lot of data: access logs, poll and survey results, customer profiles, orders, and more. Web developers work not only to create the applications that generate this data, but also to develop applications and methods that make sense of these data streams.

Typically, Web developers are not adequately equipped to respond to the growing data analysis demands that come with managing a site. In general, Web developers have no better way to characterize data streams than to report various descriptive statistics. There are many inferential statistical procedures (methods for estimating population parameters from sample data) that could be exploited, but currently are not being applied.

For example, Web access statistics, as currently compiled, are little more than frequency counts grouped in various ways. Poll and survey results are commonly reported only as raw counts and percentages.

Perhaps this more superficial statistical treatment of data streams by developers is adequate, and we should not expect more. After all, there are professionals who do more complex data stream analysis: statisticians and trained analysts. When an organization needs more than descriptive statistics, it can bring them in.

But another response is to acknowledge that a growing understanding of inferential statistics is becoming part of the Web developer's job description. Dynamic sites are generating more and more data, and it turns out that turning this data into useful knowledge is falling to Web developers and system administrators.

I advocate the latter response. This article is intended to help Web developers and system administrators learn (or relearn, if the knowledge has faded) the design and analysis skills needed to apply inferential statistics to Web data streams.

Relating Web data to experimental design

Applying inferential statistics to Web data streams requires more than learning the mathematics behind various statistical tests. The ability to relate the data collection process to key distinctions in experimental design is just as important: What is the measurement scale? How representative is the sample? What is the population? What hypothesis is being tested?

To apply inferential statistics to Web data streams, you need to think of the results as having been generated by an experimental design, and then select the analysis procedure appropriate to that design. Even if you think it is superfluous to treat Web poll and access log data as the results of an experiment, it is important to do so. Why?

1. This will help you to select the appropriate statistical test method.

2. This will help you draw appropriate conclusions from the data collected.

One important aspect of experimental design that determines which statistical test is appropriate is the choice of measurement scale for data collection.

Examples of measurement scales

A measurement scale simply specifies a procedure for assigning symbols, letters, or numbers to a phenomenon of interest. For example, a kilogram scale lets you assign a number to an object that indicates its weight, based on the standardized displacement of the measuring instrument.

There are four important measurement scales:

Ratio scale: The kilogram scale is an example of a ratio scale. The symbols assigned to an object's attributes have numerical meaning. You can perform operations on these symbols, such as computing ratios, that you cannot perform on values obtained with less powerful measurement scales.

Interval scale: The distance (or interval) between any two adjacent units of measurement is equal on an interval scale, but the zero point is arbitrary. Examples of interval scales include measurements of longitude, tidal heights, and the beginnings and ends of different years. Values on an interval scale can be added and subtracted, but multiplying and dividing them is not meaningful.

Ordinal (rank) scale: An ordinal scale applies to a set of ordered data, in which the values and observations belonging to the scale can be placed in order or come with a rating scale. Common examples include "like or dislike" polls that assign numbers to attitudes (from 1 = dislike very much to 5 = like very much). Generally, the categories of ordinal data have a natural order, but the differences between adjacent points on the scale need not be equal. With ordinal data you can count and rank, but not measure.

Nominal scale: The nominal scale is the weakest form of measurement and mainly refers to assigning items to groups or categories. This kind of measurement carries no quantitative information and implies no ordering of the items. The primary numerical operation performed on nominal data is the frequency count of items in each category.

The following table compares the characteristics of these measurement scales:

Measurement scale   Do the values have absolute numerical meaning?     Can you perform most mathematical operations?
Ratio               Yes                                                 Yes
Interval            Yes for intervals; the zero point is arbitrary      Add and subtract
Ordinal             No                                                  Count and rank
Nominal             No                                                  Count only

In this article, I focus on data collected using a nominal scale of measurement, and on the inferential techniques that apply to nominal data.

Using a nominal scale

Almost all Web users, whether designers, customers, or system administrators, are familiar with the nominal scale. Web polls and access logs are alike in that they often use a nominal scale as their measure. Web polls often measure people's preferences by asking them to choose an answer option (such as "Do you prefer brand A, brand B, or brand C?"). The data is summarized by counting the frequency of each answer.

Similarly, a common way to measure Web traffic is to assign each hit or visit within a one-week period to a day of the week, and then count the hits or visits that fall on each day. You can also (and many sites do) count hits by browser type, operating system type, and the visitor's country or region, or by any other categorical scale you like.
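As a rough sketch of this kind of tally, the following PHP groups hits by day of the week. The `$hitTimestamps` array is made-up data standing in for timestamps parsed from an access log, and `countHitsByDay` is a hypothetical helper name, not part of the package discussed in this article.

```php
<?php
// Hypothetical helper: tally hits into day-of-week categories (a nominal scale).
function countHitsByDay(array $timestamps): array {
    // Start every category at zero so days with no hits still appear.
    $counts = array_fill_keys(
        array("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"),
        0
    );
    foreach ($timestamps as $ts) {
        // gmdate("l") returns the full English day name, evaluated in UTC.
        $counts[gmdate("l", $ts)]++;
    }
    return $counts;
}

// Made-up Unix timestamps standing in for parsed access-log entries.
$hitTimestamps = array(0, 3600, 86400, 172800, 172900, 173000);
$hitsByDay = countHitsByDay($hitTimestamps);
print_r($hitsByDay);
```

The result is exactly the kind of frequency table that the rest of the article analyzes: one count per category, with no ordering or arithmetic implied between categories.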

Because Web polls and access statistics both involve counting how many times the data falls into a particular category, you can analyze them using similar nonparametric statistical tests (tests that let you make inferences based on the shape of a distribution rather than on population parameters).

In his book Handbook of Parametric and Nonparametric Statistical Procedures (page 19, 1997), David Sheskin distinguishes between parametric and nonparametric tests as follows:

The distinction used in this book to categorize a procedure as a parametric or a nonparametric test is based primarily on the level of measurement represented by the data being analyzed. As a general rule, inferential statistical tests that evaluate categorical/nominal data and ordinal/rank-order data are categorized as nonparametric tests, while those that evaluate interval data or ratio data are categorized as parametric tests.

Nonparametric tests are also useful when some of the assumptions underlying parametric tests are questionable; when the parametric assumptions are not met, nonparametric tests have more power to detect population differences. For the Web poll example, I use a nonparametric analysis procedure because Web polls usually use a nominal scale to record voter preferences.

I am not suggesting that Web polls and Web access statistics should always use a nominal scale, or that nonparametric tests are the only way to analyze such data. It is not hard to imagine, for example, polls and surveys that ask users to give a numerical rating (from 1 to 100) for each option, for which parametric tests would be more appropriate.

However, many Web data streams consist of compiled category counts, and data collected with a more powerful measurement scale can be converted to nominal data by defining intervals (for example, from 17 to 21) and assigning each data point to a category (such as "young people"). The prevalence of frequency data (already part of the Web developer's experience) makes nonparametric statistics a good starting point for learning how to apply inferential techniques to data streams.
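To make the conversion concrete, here is a minimal sketch that bins numeric values into nominal categories and counts them. It assumes the 17-to-21 interval refers to ages; the other cutoffs, the labels, the `ageCategory` function, and the `$ages` data are all invented for illustration.

```php
<?php
// Hypothetical brackets; only the 17-to-21 interval comes from the text.
function ageCategory(int $age): string {
    if ($age < 17) {
        return "Under 17";
    }
    if ($age <= 21) {
        return "Young people";
    }
    return "Over 21";
}

// Made-up ages measured on a more powerful (ratio) scale.
$ages = array(15, 17, 19, 21, 22, 40, 18);

// Map each value to its category, then count the items per category.
$categoryCounts = array_count_values(array_map("ageCategory", $ages));
print_r($categoryCounts);
```

Once the data is in this form, the numeric detail is gone and only category frequencies remain, which is why nonparametric tests become the natural choice.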

To keep this article to a reasonable length, I will confine my discussion of Web data stream analysis to Web polls. Keep in mind, though, that many Web data streams can be represented as nominal count data, and that the inferential techniques I discuss will let you do more than report simple counts.

Start with the sample

Suppose you run a weekly poll on your site, www.NovaScotiaBeerDrinkers.com, asking members for their opinions on a variety of topics. You have created a poll asking members to name their favorite beer brand. Nova Scotia, Canada, has three well-known beer brands: Keiths, Olands, and Schooner. To make the poll as inclusive as possible, you include "Other" among the answers.

You receive 1,000 responses; the results are shown in Table 1. (The results shown in this article are for demonstration purposes only and are not based on any actual survey.)

Table 1. Beer poll

Keiths         Olands       Schooner       Other
285 (28.5%)    250 (25%)    215 (21.5%)    250 (25%)

The data seem to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. Based on these figures, can you draw that conclusion? In other words, can you make inferences about the population of Nova Scotia beer consumers from the results obtained in a sample?

Many factors related to how the sample was collected could make an inference about relative popularity incorrect. The sample may contain a disproportionate number of Keiths brewery employees; you may not have fully prevented one person from voting several times, and that person may have skewed the results; perhaps the people who chose to vote differ from those who did not; perhaps people who are on the Internet differ from those who are not.

Most Web polls are subject to these interpretive difficulties, which arise whenever you try to draw conclusions about population parameters from sample statistics. From an experimental design standpoint, the first question to ask before collecting data is whether steps can be taken to help ensure that the sample is representative of the population of interest.

If drawing conclusions about the population of interest is your motivation for running a Web poll (rather than providing a diversion for site visitors), then you should implement techniques to ensure one person, one vote (so voters must log in with a unique identity to vote) and to ensure that the sample of voters is randomly selected (for example, randomly select a subset of members, then email them and encourage them to vote).
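The two safeguards can be sketched in a few lines of PHP. Everything here is illustrative: the member IDs, the sample size, and the `castVote` helper are assumptions, and a real site would draw IDs from its member database and persist ballots rather than keep them in an array.

```php
<?php
// Hypothetical member IDs; a real site would load these from its database.
$memberIds  = range(1001, 1100);
$sampleSize = 20;

// array_rand() picks distinct random keys, so no member is selected twice.
$sampledKeys = (array) array_rand($memberIds, $sampleSize);
$invited = array();
foreach ($sampledKeys as $key) {
    $invited[] = $memberIds[$key];
}

// One person, one vote: store each ballot under the voter's unique ID
// and refuse any repeat attempt from the same ID.
function castVote(array &$ballots, int $memberId, string $choice): bool {
    if (isset($ballots[$memberId])) {
        return false;   // this member has already voted
    }
    $ballots[$memberId] = $choice;
    return true;
}

$ballots = array();
$first  = castVote($ballots, $invited[0], "Keiths");
$second = castVote($ballots, $invited[0], "Schooner");   // rejected duplicate
```

Random selection addresses who gets asked; it does not by itself fix non-response, since the invited members who actually vote may still differ from those who do not.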

Ultimately, the goal is to eliminate (or at least reduce) biases that might weaken your ability to draw conclusions about the population of interest.

Testing hypotheses

Assuming that the sample of Nova Scotia beer consumers is unbiased, can you now conclude that Keiths is the most popular brand?

To answer this question, consider a related one: if you were to obtain another sample of Nova Scotia beer consumers, would you expect to see exactly the same results? In fact, you would expect the results observed in different samples to vary somewhat.

Given this expected sampling variability, you may wonder whether the observed brand preferences are better explained by random sampling variability than by real differences in the population. In statistical parlance, this sampling-variability explanation is called the null hypothesis (denoted by the symbol H0). In this case, it can be stated as a formula: the expected number of responses is the same across all response categories.

H0: # Keiths = # Olands = # Schooner = # Other

If you can rule out the null hypothesis, you have made some progress toward answering the original question of whether Keiths is the most popular brand. The alternative hypothesis, then, is that the proportions of responses differ in the population.

This "test the null hypothesis first" logic applies at many stages in the analysis of poll data. Once you have ruled out the null hypothesis that the data do not differ at all, you can go on to test a more specific null hypothesis, such as that there is no difference between Keiths and Schooner, or between Keiths and all other brands.

You proceed by testing null hypotheses rather than evaluating the alternative hypothesis directly because it is easier to model what you would expect to see under the null hypothesis. Next, I demonstrate how to model what is expected under the null hypothesis so that the observed results can be compared with the results expected under it.

Modeling the null hypothesis: The chi-square statistic

So far, you have summarized the results of the Web poll using a table that reports the frequency count (and percentage) for each answer option. To test the null hypothesis (no difference between the table cell frequencies), it is easier to compute an overall measure of how much each table cell deviates from the value you would expect under the null hypothesis.

In the beer popularity poll, the expected frequency under the null hypothesis is as follows:

Expected frequency = number of observations / number of answer options

Expected frequency = 1000/4

Expected frequency = 250

To compute an overall measure of how much the responses in each cell differ from the expected frequency, you might sum all the differences into a total measure of how far the observed frequencies are from the expected frequencies: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you will find that the total is 0, because deviations from the mean always sum to 0. To get around this problem, square each difference (this is where the "square" in chi-square comes from). Finally, to make the value comparable across samples with different numbers of observations (in other words, to standardize it), divide each squared difference by the expected frequency. The formula for the chi-square statistic is therefore as follows ("O" means "observed frequency" and "E" means "expected frequency"):

Figure 1. The formula for the chi-square statistic
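The figure itself is not reproduced in this text. The verbal description above corresponds to the standard Pearson chi-square statistic, written here with subscripts added for illustration (O_i and E_i are the observed and expected frequencies in cell i, and k is the number of cells), followed by the arithmetic for the beer poll in Table 1:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}

\chi^2 = \frac{(285-250)^2}{250} + \frac{(250-250)^2}{250}
       + \frac{(215-250)^2}{250} + \frac{(250-250)^2}{250}
       = 4.90 + 0 + 4.90 + 0 = 9.80
```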

If you calculate the chi-square statistic for the beer popularity poll data, you get a value of 9.80. To test the null hypothesis, you need to know the probability of obtaining a value this extreme from random sampling variability alone. To get this probability, you need to understand what the sampling distribution of the chi-square statistic looks like.
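The steps above can be checked with a short, self-contained PHP sketch. It is not the article's ChiSquare1D class, only the same arithmetic spelled out; the variable names are chosen for illustration.

```php
<?php
// Observed counts from Table 1: Keiths, Olands, Schooner, Other.
$observed = array(285, 250, 215, 250);

// Under the null hypothesis every option is expected equally often.
$expected = array_sum($observed) / count($observed);

$rawSum = 0;   // sum of plain deviations; these always cancel out
$chiSq  = 0;   // sum of squared deviations, each scaled by the expected count
foreach ($observed as $o) {
    $rawSum += $o - $expected;
    $chiSq  += pow($o - $expected, 2) / $expected;
}

printf("Expected per cell: %d\n", $expected);
printf("Sum of raw deviations: %d\n", $rawSum);
printf("Chi-square: %.2f\n", $chiSq);
```

The raw deviations cancel, which is why they are squared, and dividing by the expected count keeps the statistic comparable between polls with different numbers of respondents.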

Observing the sampling distribution of the chi-square statistic

Figure 2. Chi-square distribution plots

In each plot, the horizontal axis represents the obtained chi-square value (the range shown is 0 to 10). The vertical axis shows the probability (or relative frequency of occurrence) of each chi-square value.

As you study these chi-square plots, note that the shape of the probability function changes as you change the degrees of freedom (df) in the experiment. For poll data, the degrees of freedom are computed by taking the number of answer options (k) in the poll and subtracting 1 (df = k - 1).

In general, as you increase the number of answer options in an experiment, the probability of obtaining a larger chi-square value increases. This is because adding answer options increases the number of variance values, (observed - expected)^2, that are summed. Therefore, as you add answer options, the probability of obtaining a large chi-square value increases, while the probability of obtaining a small one decreases. This is why the shape of the chi-square sampling distribution changes with df.

Also, note that you are usually not interested in the decimal portion of the chi-square result, but rather in the total area under the curve to the right of the obtained value. This tail probability tells you whether obtaining a value as extreme as the one you observed is likely (a large tail area) or unlikely (a small tail area). (In practice, I do not use these plots to find the tail probability, because I can implement a mathematical function that returns the tail probability for a given chi-square value. This is the approach used in the chi-square software discussed later in this article.)
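As an illustration of what such a function might look like, the sketch below computes the upper-tail probability from the series expansion of the regularized incomplete gamma function. This is one standard numerical approach, not necessarily the one used in the article's distribution.php; the function names `gammaHalfStep` and `chiSquareTailProb` are invented here, and the series is only suitable for the moderate chi-square values seen in this article.

```php
<?php
// Gamma function for arguments that are multiples of 0.5 (all that df/2 can be),
// built from Gamma(1/2) = sqrt(pi), Gamma(1) = 1 and Gamma(a + 1) = a * Gamma(a).
function gammaHalfStep(float $a): float {
    $isHalf = (fmod($a, 1.0) != 0.0);
    $value  = $isHalf ? sqrt(M_PI) : 1.0;
    for ($t = $isHalf ? 0.5 : 1.0; $t < $a - 1e-9; $t += 1.0) {
        $value *= $t;
    }
    return $value;
}

// Upper-tail probability of a chi-square value: 1 - P(df/2, chiSq/2), where P is
// the regularized lower incomplete gamma function summed as a power series.
function chiSquareTailProb(float $chiSq, int $df): float {
    if ($chiSq <= 0) {
        return 1.0;
    }
    $a = $df / 2;
    $x = $chiSq / 2;
    $term = 1.0;
    $sum  = 1.0;
    for ($n = 1; $n < 1000; $n++) {
        $term *= $x / ($a + $n);
        $sum  += $term;
        if ($term < 1e-15 * $sum) {
            break;   // further terms no longer change the sum
        }
    }
    $lower = exp(-$x) * pow($x, $a) / gammaHalfStep($a + 1) * $sum;
    return 1.0 - $lower;
}

printf("%.4f\n", chiSquareTailProb(9.80, 3));
```

For df = 2 the tail area has the closed form exp(-x/2), which gives a convenient sanity check on any implementation.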

To understand how these plots arise, consider how you might simulate the plot corresponding to df = 2 (that is, k = 3). Imagine putting the numbers 1, 2, and 3 in a hat, shaking it, drawing a number, and recording the number drawn as one trial. Run 300 trials, and then compute the frequency with which 1, 2, and 3 occurred.

Each time you run this experiment, you should expect a slightly different frequency distribution of results, reflecting sampling variability, while the distribution stays within the range of what is probable.

The following Multinomial class implements this idea. You initialize the class with the following values: the number of experiments to run, the number of trials in each experiment, and the number of options per trial. The results of each experiment are recorded in an array named Outcomes.

Listing 1. Contents of the Multinomial class

<?php
// multinomial.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

class Multinomial {

  var $NExps;               // number of experiments to run
  var $NTrials;             // number of trials per experiment
  var $NOptions;            // number of options per trial
  var $Outcomes = array();  // frequency counts from each experiment

  function __construct($NExps, $NTrials, $NOptions) {
    $this->NExps    = $NExps;
    $this->NTrials  = $NTrials;
    $this->NOptions = $NOptions;
    for ($i = 0; $i < $this->NExps; $i++) {
      $this->Outcomes[$i] = $this->runExperiment();
    }
  }

  function runExperiment() {
    // Start every option at zero so the increment below is always defined
    $Outcome = array_fill(1, $this->NOptions, 0);
    // Make one random choice per trial
    for ($i = 0; $i < $this->NTrials; $i++) {
      $choice = rand(1, $this->NOptions);
      $Outcome[$choice]++;
    }
    return $Outcome;
  }

}
?>

Note that the runExperiment method is a very important part of the script: it ensures that the choices made in each experiment are random, and it keeps track of which choices have been made so far in the simulated experiment.

To find the sampling distribution of the chi-square statistic, simply take the results of each experiment and compute the chi-square statistic for those results. Because of random sampling variability, the chi-square statistic varies from experiment to experiment.

The following script writes the chi-square statistic obtained in each experiment to an output file so that it can be graphed later.

Listing 2. Writing the obtained chi-square statistics to an output file

<?php
// simulate.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

// Set time limit to 0 so script doesn't time out
set_time_limit(0);

require_once "../init.php";
require PHP_MATH . "chi/multinomial.php";
require PHP_MATH . "chi/chisquare1d.php";

// Initialization parameters
$NExps    = 10000;
$NTrials  = 300;
$NOptions = 3;

$multi  = new Multinomial($NExps, $NTrials, $NOptions);
$output = fopen("./data.txt", "w") or die("File won't open");

$distribution = array();
for ($i = 0; $i < $NExps; $i++) {
  // For each multinomial experiment, do chi-square analysis
  $chi = new ChiSquare1D($multi->Outcomes[$i]);
  // Load obtained chi-square value into sampling distribution array
  $distribution[$i] = $chi->ChiSqObt;
  // Write obtained chi-square value to file
  fputs($output, $distribution[$i] . "\n");
}
fclose($output);
?>

To visualize the results of running this experiment, the easiest approach for me was to load the data.txt file into the open source statistics package R, run the histogram command, and then edit the chart in a graphics editor:

x = scan("data.txt")

hist(x, 50)

As you can see, the histogram of these chi-square values approximates the continuous chi-square distribution for df = 2 shown above.

Figure 3. Simulated values approximating the continuous chi-square distribution with df = 2
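Without the figure, one rough way to check the approximation numerically is to count how often the simulated chi-square value reaches the tabled critical value for df = 2 at alpha = 0.05 (5.99); the fraction should land near 0.05. The sketch below is a scaled-down, self-contained version of the simulation, not the article's scripts: it uses 2,000 experiments instead of 10,000, a fixed seed chosen arbitrarily, and `mt_rand` in place of the Multinomial class.

```php
<?php
mt_srand(42);   // arbitrary fixed seed so the sketch is repeatable

$nExps    = 2000;
$nTrials  = 300;
$nOptions = 3;
$expected = $nTrials / $nOptions;
$critical = 5.99;   // tabled chi-square critical value for df = 2, alpha = 0.05
$exceed   = 0;

for ($e = 0; $e < $nExps; $e++) {
    // One experiment: 300 random draws from options 1..3.
    $counts = array_fill(1, $nOptions, 0);
    for ($t = 0; $t < $nTrials; $t++) {
        $counts[mt_rand(1, $nOptions)]++;
    }
    // Chi-square statistic for this experiment.
    $chiSq = 0;
    foreach ($counts as $c) {
        $chiSq += pow($c - $expected, 2) / $expected;
    }
    if ($chiSq >= $critical) {
        $exceed++;
    }
}

$fraction = $exceed / $nExps;
printf("Fraction at or above %.2f: %.3f\n", $critical, $fraction);
```

The match is approximate because the simulated statistic takes only discrete values and the number of experiments is finite.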

In the following sections, I focus on how the chi-square software used in the simulation works. Typically, the chi-square software would be used to analyze actual nominal data (such as Web poll results, weekly traffic reports, or customer brand preference reports) rather than simulated data. You may also be interested in the other output the software generates, such as summary tables and tail probabilities.

Instance variables of the chi-square class

The PHP-based chi-square package I developed consists of classes for analyzing frequency data classified along one or two dimensions (chisquare1d.php and chisquare2d.php). My discussion is limited to explaining how the chisquare1d.php class works and how to apply it to one-dimensional Web poll data.

Before continuing, I should mention that classifying data along two dimensions (for example, beer preference by sex) lets you begin to explain your results by looking for systematic relationships or conditional probabilities in the contingency table cells. Although much of the discussion below will help you understand how the chisquare2d.php software works, there are other experimental, analytic, and visualization issues, not covered in this article, that you must address before using that class.

Listing 3 examines a fragment of the chisquare1d.php class, which consists of the following parts:

1. An included file

2. The class instance variables

Listing 3. Fragment of the chi-square class showing the included file and instance variables

<?php
// chisquare1d.php
// Copyright 2003, Paul Meagher
// Distributed under LGPL

require_once PHP_MATH . "dist/distribution.php";

class ChiSquare1D {

  var $Total;
  var $ObsFreq = array();  // Observed frequencies
  var $ExpFreq = array();  // Expected frequencies
  var $ExpProb = array();  // Expected probabilities
  var $NumCells;
  var $ChiSqObt;
  var $DF;
  var $Alpha;
  var $ChiSqProb;
  var $ChiSqCrit;

}
?>

The top of the script in Listing 3 includes a file named distribution.php. The include path uses the PHP_MATH constant set in the init.php file, on the assumption that init.php has already been included by the calling script.

The included file distribution.php contains methods for generating sampling distribution statistics for several commonly used sampling distributions (t, F, and chi-square). The chisquare1d.php class must have access to the chi-square methods in distribution.php in order to compute the tail probability of the obtained chi-square value.

The list of instance variables in this class is worth noting because they define the result object produced by the analysis procedure. This result object contains all the important details about the test, including three important chi-square statistics: ChiSqObt, ChiSqProb, and ChiSqCrit. For details on how each instance variable is computed, look at the constructor method of the class, where all of these values originate.

The constructor: Backbone of the chi-square test

Listing 4 shows the constructor code for the chi-square class, which forms the backbone of the chi-square test.

Listing 4. The constructor of the chi-square class

<?php
class ChiSquare1D {

  function __construct($ObsFreq, $Alpha = 0.05, $ExpProb = false) {
    // Record the inputs and the simple derived values
    $this->ObsFreq   = $ObsFreq;
    $this->ExpProb   = $ExpProb;
    $this->Alpha     = $Alpha;
    $this->NumCells  = count($this->ObsFreq);
    $this->DF        = $this->NumCells - 1;
    $this->Total     = $this->getTotal();
    // Compute the expected frequencies and the chi-square statistics
    $this->ExpFreq   = $this->getExpFreq();
    $this->ChiSqObt  = $this->getChiSqObt();
    $this->ChiSqCrit = $this->getChiSqCrit();
    $this->ChiSqProb = $this->getChiSqProb();
    return true;
  }

}
?>

Four aspects of the constructor method are worth noting:

1. The constructor accepts an array of observed frequencies, an alpha probability cutoff, and an optional array of expected probabilities.

2. The first six lines involve relatively simple assignments and computed values that are recorded so that the full result object is available to the calling script.

3. The four lines that follow do most of the work of obtaining the chi-square statistics you are most interested in.

4. The class implements only the chi-square test logic. There are no output methods associated with the class.

You can study the class methods included in the code download for this article to learn more about how each value in the result object is computed (see Resources).

Handling output issues

The code in Listing 5 shows how easy it is to perform a chi-square analysis with the chisquare1d.php class. It also demonstrates how output issues are handled.

The script calls a wrapper script named chisquare1d_html.php. The purpose of this wrapper is to separate the logic of the chi-square procedure from its presentation. The _html suffix indicates that the output is intended for a standard Web browser or other device that displays HTML.

Another purpose of the wrapper is to organize the output in a way that makes the data easier to understand. To this end, the class contains two methods for displaying the results of the chi-square analysis. The showTableSummary method displays the first output table shown after the code (Table 2), and the showChiSquareStats method displays the second output table (Table 3).

Listing 5. Organizing data with the wrapper script

<?php
// beer_poll_analysis.php

require_once "../init.php";
require_once PHP_MATH . "chi/chisquare1d_html.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(285, 250, 215, 250);
$Alpha    = 0.05;

$Chi = new ChiSquare1D_HTML($ObsFreq, $Alpha);
$Chi->showTableSummary($Headings);
echo "<br><br>";
$Chi->showChiSquareStats();
?>

The script produces the following output:

Table 2. Expected frequencies and variances obtained by running the wrapper script

            Keiths   Olands   Schooner   Other   Total
Observed    285      250      215        250     1000
Expected    250      250      250        250     1000
Variance    4.90     0.00     4.90       0.00    9.80

Table 3. Chi-square statistics obtained by running the wrapper script

              DF   Obtained value   Probability   Critical value
Chi-square    3    9.80             0.02          7.81

Table 2 shows the expected frequency and the variance measure (O - E)^2 / E for each cell. The sum of the variance values equals the obtained chi-square value (9.80), which appears in the lower-right cell of the summary table.

Table 3 reports various chi-square statistics. It includes the degrees of freedom used in the analysis and again reports the obtained chi-square value. The obtained chi-square value is then re-expressed as a tail probability, in this case 0.02. This means that, under the null hypothesis, the probability of observing a chi-square value as extreme as 9.80 is 2% (a fairly low probability).

If you decide to reject the null hypothesis (that the results can be attributed to random sampling variability under the null distribution), most statisticians would not argue with you. Your poll results more likely reflect real differences in brand preference in the population of Nova Scotia beer consumers.

To confirm this conclusion, you can compare the obtained chi-square value with the critical value.

Why is the critical value important? The critical value is based on a significance level (the alpha cutoff level) set for the analysis. By convention, the alpha cutoff is set to 0.05 (the value used in the analysis above). This setting is used to find the location (or critical value) in the chi-square sampling distribution that cuts off a tail area equal to the alpha cutoff (0.05).

In this analysis, the obtained chi-square value (9.80) is greater than the critical value (7.81). This means that the threshold for retaining the null hypothesis has been exceeded. The alternative hypothesis, that the proportions differ in the population, is statistically more likely to be true.
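The decision rule can be written out as a small sketch using the values reported in Table 3. The variable names mirror the instance variables of the class, but this snippet is illustrative and is not part of the article's package.

```php
<?php
// Values reported in Table 3 for the beer poll.
$chiSqObt  = 9.80;   // obtained chi-square value
$chiSqCrit = 7.81;   // critical value for df = 3 at alpha = 0.05
$chiSqProb = 0.02;   // tail probability of the obtained value
$alpha     = 0.05;

// Two equivalent ways of applying the same cutoff.
$rejectByCritical = ($chiSqObt > $chiSqCrit);
$rejectByProb     = ($chiSqProb < $alpha);

echo $rejectByCritical && $rejectByProb
    ? "Reject the null hypothesis\n"
    : "Retain the null hypothesis\n";
```

Comparing the obtained value with the critical value and comparing the tail probability with alpha always lead to the same decision, because the critical value is by definition the point whose tail area equals alpha.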

In the automated analysis of data streams, an alpha cutoff setting can act as an output filter for knowledge-discovery algorithms (such as Chi Square Automatic Interaction Detection, CHAID). Such algorithms cannot by themselves provide detailed guidance in discovering truly useful patterns.

A new poll

Another interesting application of the one-way chi-square test is to re-run a poll to see whether people's responses have changed.

Suppose that after a while you conduct another Web poll of Nova Scotia beer consumers. Once again you ask about their favorite beer brand, and this time you observe the following results:

Table 4. The new beer poll

Keiths         Olands       Schooner       Other
385 (27.5%)    350 (25%)    315 (22.5%)    350 (25%)

The old data looked like this:

Table 1. The old beer poll (shown again)

Keiths         Olands       Schooner       Other
285 (28.5%)    250 (25%)    215 (21.5%)    250 (25%)

The obvious difference between the polls is that the first had 1,000 respondents and the second had 1,400. The main effect of these additional respondents is to increase the frequency count of each response option by 100.

When you are ready to analyze the new poll, you can either use the default method of computing expected frequencies, or you can initialize the analysis with the expected probability of each outcome (based on the proportions observed in the previous poll). In the second case, you load the previously obtained proportions into the expected probability array ($ExpProb) and use them to compute the expected frequency for each answer option.
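The second approach amounts to the following arithmetic, sketched here independently of the article's classes (the variable names are illustrative): each expected frequency is the new total multiplied by the old proportion, and the chi-square statistic is then computed as before.

```php
<?php
$observed = array(385, 350, 315, 350);           // new poll (Table 4)
$expProb  = array(0.285, 0.250, 0.215, 0.250);   // proportions from the first poll

$total   = array_sum($observed);
$expFreq = array();
$chiSq   = 0;
foreach ($observed as $i => $o) {
    // Expected count if preferences had not changed since the first poll.
    $expFreq[$i] = $total * $expProb[$i];
    $chiSq += pow($o - $expFreq[$i], 2) / $expFreq[$i];
}

printf("Chi-square: %.2f\n", $chiSq);
```

Note that the expected frequencies are no longer equal across cells; the null hypothesis being tested is now "same proportions as last time" rather than "all options equally popular".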

Listing 6 shows the beer poll analysis code for detecting a change in preferences:

Listing 6. Detecting a change in preferences

<?php
// beer_repoll_analysis.php

require_once "../init.php";
require PHP_MATH . "chi/chisquare1d_html.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(385, 350, 315, 350);
$Alpha    = 0.05;
// Expected probabilities taken from the proportions in the first poll
$ExpProb  = array(.285, .250, .215, .250);

$Chi = new ChiSquare1D_HTML($ObsFreq, $Alpha, $ExpProb);
$Chi->showTableSummary($Headings);
echo "<br><br>";
$Chi->showChiSquareStats();
?>

Tables 5 and 6 show the HTML output generated by the beer_repoll_analysis.php script:

Table 5. Expected frequencies and variances obtained by running beer_repoll_analysis.php

            Keiths   Olands   Schooner   Other   Total
Observed    385      350      315        350     1400
Expected    399      350      301        350     1400
Variance    0.49     0.00     0.65       0.00    1.14

Table 6. Chi-square statistics obtained by running beer_repoll_analysis.php

              DF   Obtained value   Probability   Critical value
Chi-square    3    1.14             0.77          7.81

Table 6 shows that, under the null hypothesis, the probability of obtaining a chi-square value of 1.14 or greater is 77%. We cannot reject the null hypothesis that the preferences of Nova Scotia beer drinkers have not changed since the last poll. Any difference between the observed and expected frequencies can be explained as the expected sampling variability from the same population of Nova Scotia beer drinkers. Given that the new poll results were produced simply by adding a constant 100 to each count from the previous poll, this null finding should come as no surprise.

However, you can imagine that the results had come out differently and suggested that another brand of beer was becoming more popular (note the variance reported at the bottom of each column in Table 5). You can further imagine that such a finding would have significant financial implications for the brewery in question, since bar owners tend to stock the best-selling beers.

Such results would be examined in great detail by the brewery's owners, who would question the suitability of the analysis procedure and the experimental method, and in particular the representativeness of the sample. If you plan to run a Web experiment that might have important practical implications, you need to pay as much attention to the experimental method used to collect the data as to the analysis technique used to draw inferences from it.

So this article not only lays a foundation for improving your understanding of Web data; it also offers advice on how to defend your choice of statistical test and how to make the conclusions you draw from the data more defensible.

Applying what you've learned

In this article, you've learned how to apply inferential statistics to the ubiquitous frequency data used to summarize Web data streams, focusing on the analysis of Web poll data. However, the simple one-way chi-square analysis procedure can be applied just as effectively to other types of data streams (access logs, survey results, customer profiles, and customer orders) to turn raw data into useful knowledge.

In applying inferential statistics to Web data, I also introduced the idea of viewing data streams as the results of Web experiments, which makes it more likely that you will take experimental design considerations into account when drawing inferences. Often you cannot draw an inference because you lack sufficient control over the data collection process. However, you can change this situation if you are more proactive in applying experimental design principles to your Web data collection (for example, by randomly selecting the voters in your Web poll).

Finally, I showed how to simulate the sampling distribution of the chi-square statistic for different degrees of freedom, instead of merely explaining its origin. In doing so, I also demonstrated a workaround for the prohibition against using the chi-square test when the expected frequency of a measurement category is less than 5 (in other words, in small N experiments): simulate the sampling distribution of the experiment using a small $NTrials value. So, rather than using only the df of the study to compute the probability of a sample result, for a small number of trials you may also need to use the $NTrials value as a parameter to find the probability of the observed chi-square result.
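The small N workaround can be sketched roughly as follows: simulate many experiments with the same small number of trials, and report the fraction of simulated chi-square values that are at least as large as the observed one. The function names and the 12-vote poll are made up for illustration, and equal expected frequencies are assumed; this is not the code from the article's package.

```php
<?php
// Chi-square statistic against equal expected frequencies.
function chiSquareEqual(array $obsFreq): float
{
    $expected = array_sum($obsFreq) / count($obsFreq);
    $chiSq = 0.0;
    foreach ($obsFreq as $o) {
        $chiSq += ($o - $expected) ** 2 / $expected;
    }
    return $chiSq;
}

// Hypothetical helper: estimates the probability of a chi-square value at
// least as large as the observed one by simulating $nExps experiments, each
// with as many trials as there were observations.
function simulatedTailProb(array $obsFreq, int $nExps = 5000): float
{
    $nOptions = count($obsFreq);
    $nTrials  = array_sum($obsFreq);      // the small N of the real experiment
    $observed = chiSquareEqual($obsFreq);
    $asExtreme = 0;
    for ($e = 0; $e < $nExps; $e++) {
        $outcome = array_fill(0, $nOptions, 0);
        for ($t = 0; $t < $nTrials; $t++) {
            $outcome[mt_rand(0, $nOptions - 1)]++;   // one random vote
        }
        // Small tolerance so that ties with the observed value are counted
        if (chiSquareEqual($outcome) >= $observed - 1e-12) {
            $asExtreme++;
        }
    }
    return $asExtreme / $nExps;
}

// A made-up small poll: 12 votes over 4 options (expected frequency 3 per cell)
$smallPoll = [6, 3, 2, 1];
printf("Obtained chi square: %.2f\n", chiSquareEqual($smallPoll));
printf("Simulated tail probability: %.3f\n", simulatedTailProb($smallPoll));
```

The estimate varies from run to run; increasing $nExps narrows that variation at the cost of run time.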

It is worth considering how you might analyze small N experiments, because you will often want to analyze your data before data collection is complete: when each observation is expensive, when observations take a long time to obtain, or simply because you are curious. When you attempt this level of Web data analysis, it is a good idea to keep the following two questions in mind:

* Are you justified in making inferences under small N conditions?

* Do simulations help you decide what inferences to draw under these conditions?

I advocate that Web developers and system administrators learn inferential statistics themselves rather than leaving it to specialists; this article is intended to help them learn (or relearn, if the knowledge has been forgotten) the design and analysis skills needed to apply inferential statistics to Web data streams.

Relating Web data to experimental designs

Applying inferential statistics to Web data streams requires more than learning the mathematics that underlies the various statistical tests. It is equally important to be able to relate your data collection process to the key distinctions of experimental design: What is the scale of measurement? How representative is the sample? What is the population? What hypothesis is being tested?

To apply inferential statistics to Web data streams, you first need to think of the results as having been generated by an experimental design, and then select the analysis procedure appropriate to that design. Even if you think it is superfluous to regard Web polls and access log data as the results of experiments, it is important to do so. Why?

1. It will help you select the appropriate statistical test.

2. It will help you draw appropriate conclusions from the data you collect.

One aspect of experimental design that is important in determining which statistical test is appropriate is the choice of measurement scale for collecting the data.

Examples of measurement scales

A scale of measurement simply specifies a procedure for assigning symbols, letters, or numbers to a phenomenon of interest. For example, the kilogram scale lets you assign a number to an object that indicates its weight, based on the standardized displacement of a measuring instrument.

There are four important scales of measurement:

Ratio scale: The kilogram scale is an example of a ratio scale. The symbols assigned to an object's attributes have numerical meaning. You can perform operations on these symbols, such as computing ratios, that you cannot perform on values obtained with less powerful scales.

Interval scale: On an interval scale, the distance (or interval) between any two adjacent units of measurement is equal, but the zero point is arbitrary. Examples of interval scales include measurements of longitude and tidal height, and the measurement of the beginning and end of different years. Interval scale values can be added and subtracted, but multiplying and dividing them is not meaningful.

Ordinal (rank) scale: An ordinal scale applies to data in which the values and observations can be put in rank order or have an attached rating scale. Common examples include like/dislike polls, in which numbers are assigned to each attitude (from 1 = dislike very much to 5 = like very much). Generally, the categories of ordinal data have a natural order, but the differences between adjacent points on the scale need not be equal. With ordinal data you can count and order, but not measure.

Nominal scale: A nominal scale is the weakest form of measurement and mainly refers to assigning items to groups or categories. The measurement carries no quantitative information and implies no ordering of the items. The primary numerical operation performed on nominal data is counting the frequency of items in each category.

The following table compares the characteristics of each scale of measurement:

Scale of measurement    Attributes have absolute numerical meaning?    Can perform most mathematical operations?
Ratio                   Yes                                            Yes
Interval                Yes for intervals; zero point is arbitrary     Addition and subtraction
Ordinal                 No                                             Counting and ranking
Nominal                 No                                             Counting only

In this article, I will focus on data collected using a nominal scale of measurement, and on the inferential techniques appropriate to nominal data.

Using a nominal scale

Almost all Web users (designers, customers, and system administrators) are familiar with nominal scales. Web polls and access logs are similar in that both often use a nominal scale of measurement. In Web polls, people's preferences are often measured by asking them to choose among answer options (for example, "Do you prefer brand A, brand B, or brand C?"). The data are then summarized by counting the frequency of each answer.

Similarly, a common way to measure Web traffic is to assign each hit or visit within a one-week period to the day on which it occurred, and then count the number of hits or visits for each day. You can also (and commonly do) count hits by browser type, operating system type, and the visitor's country or region, or by any other category scale you want.

Because Web polls and access statistics both involve counting the number of times data fall into a particular category, they can be analyzed using similar nonparametric statistical tests (tests that let you make inferences based on the shape of a distribution rather than on population parameters).
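To make the idea of nominal frequency counts concrete, here is a minimal sketch that tallies a handful of made-up access-log records by day and by browser. The record layout, the field names, and the countByCategory helper are assumptions for illustration only.

```php
<?php
// Made-up access-log records; the field names are illustrative only.
$hits = [
    ['day' => 'Mon', 'browser' => 'Mozilla'],
    ['day' => 'Mon', 'browser' => 'IE'],
    ['day' => 'Tue', 'browser' => 'IE'],
    ['day' => 'Tue', 'browser' => 'IE'],
    ['day' => 'Tue', 'browser' => 'Opera'],
    ['day' => 'Wed', 'browser' => 'Mozilla'],
];

// Hypothetical helper: counts how many records fall into each category
// of the chosen nominal variable.
function countByCategory(array $records, string $field): array
{
    return array_count_values(array_column($records, $field));
}

print_r(countByCategory($hits, 'day'));      // frequency of hits per day
print_r(countByCategory($hits, 'browser'));  // frequency of hits per browser
```

Each resulting array is a one-way frequency table of the same kind as the poll counts analyzed in this article.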

In his book Handbook of Parametric and Nonparametric Statistical Procedures (1997, page 19), David Sheskin distinguishes between parametric and nonparametric tests as follows:

The distinction used in this book for classifying a procedure as a parametric or a nonparametric test is based primarily on the level of measurement represented by the data being analyzed. As a general rule, inferential statistical tests that evaluate categorical/nominal data and ordinal/rank-order data are classified as nonparametric tests, while tests that evaluate interval data or ratio data are classified as parametric tests.

Nonparametric tests are also useful when some of the assumptions underlying parametric tests are questionable; when parametric assumptions are not met, nonparametric tests are more powerful at detecting population differences. For the Web poll example, I use a nonparametric analysis procedure because Web polls usually record voter preferences on a nominal scale.

I am not suggesting that Web polls and Web access statistics should always use a nominal scale, or that nonparametric tests are the only way to analyze such data. It is not hard to imagine, for example, polls and surveys that ask users to give a numerical rating (from 1 to 100) for each option, for which parametric tests would be more appropriate.

However, many Web data streams involve compiling category count data, and data measured on a more powerful scale can be converted to nominal data by defining intervals (for example, ages 17 to 21) and assigning each data point to the category for its interval (such as "young adult"). The prevalence of frequency data (already part of the Web developer's experience) makes nonparametric statistics a good starting point for learning how to apply inferential techniques to data streams.

To keep this article to a reasonable length, I will confine my discussion of Web data stream analysis to Web polls. Keep in mind, though, that many Web data streams can be represented as nominal count data, and that the inferential techniques I discuss will let you do more than report simple counts.

Start with sampling

Suppose you run a weekly poll on your site, www.NovaScotiaBeerDrinkers.com, asking members for their opinions on a variety of topics. You have created a poll asking members about their favorite brand of beer. There are three well-known beer brands in Nova Scotia, Canada: Keiths, Olands, and Schooner. To make the survey as inclusive as possible, you include "Other" among the answer options.

You receive 1,000 responses and observe the results in Table 1. (The results shown in this article are for demonstration purposes only and are not based on any actual survey.)

Table 1. Beer poll

Keiths         Olands       Schooner       Other
285 (28.5%)    250 (25%)    215 (21.5%)    250 (25%)

The data seem to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. Can you draw this conclusion from these figures? In other words, can you make an inference about the population of Nova Scotia beer drinkers from the results obtained from a sample?

Many factors relating to how the sample was collected could make the inference about relative popularity incorrect. The sample may contain a disproportionate number of Keiths brewery employees; you may not have completely prevented one person from voting several times, and that person may have skewed the results; the people who chose to vote may differ from the people who did not; and voters who are on the Internet may differ from people who are not.

Most Web polls are open to these objections, which make it difficult to draw conclusions about population parameters from sample statistics. From an experimental design standpoint, the first question to ask before collecting data is whether steps can be taken to help ensure that the sample is representative of the population under study.

If drawing conclusions about the population under study is your motivation for running a Web poll (rather than providing a diversion for site visitors), then you should implement techniques that ensure one person, one vote (for example, by requiring voters to log on with a unique ID) and that ensure the voter sample is randomly selected (for example, by randomly selecting a subset of members, then emailing them and encouraging them to vote).
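Random selection of voters might look something like the sketch below, which draws a fixed number of distinct member IDs so that every member has the same chance of being invited. The drawVoterSample helper and the member ID range are hypothetical; a real site would read its IDs from its own member database.

```php
<?php
// Hypothetical helper: draws $sampleSize distinct members at random,
// so that every member has the same chance of being invited to vote.
function drawVoterSample(array $memberIds, int $sampleSize): array
{
    $memberIds  = array_values(array_unique($memberIds)); // one entry per member
    $sampleSize = min($sampleSize, count($memberIds));
    shuffle($memberIds);                                   // put IDs in random order
    return array_slice($memberIds, 0, $sampleSize);        // keep the first $sampleSize
}

$members = range(1, 5000);                  // made-up member IDs
$invited = drawVoterSample($members, 1000);
echo count($invited), " members selected for an email invitation\n";
```

Deduplicating the ID list first supports the one person, one vote requirement, since no member can be invited twice.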

Ultimately, the goal is to eliminate (or at least reduce) the biases that could weaken your ability to draw conclusions about the population under study.

Testing hypotheses

Assuming the sample of Nova Scotia beer drinkers is not biased, can you now conclude that Keiths is the most popular brand?

To answer this question, consider a related one: if you were to obtain another sample of Nova Scotia beer drinkers, would you expect to see exactly the same results? In fact, you would expect the results observed in different samples to vary somewhat.

Given this expected sampling variability, you may wonder whether the observed brand preferences are better explained by random sampling variability than by actual differences in the population under study. In statistical terminology, this sampling variability explanation is called the null hypothesis (denoted by the symbol Ho). In this case, it can be stated as a formula: the expected number of responses is the same across all answer categories.

Ho: # Keiths = # Olands = # Schooner = # Other

If you can reject the null hypothesis, you have made some progress toward answering the original question of whether Keiths is the most popular brand. An acceptable alternative hypothesis is then that the proportions of responses differ in the population under study.

This logic of testing the null hypothesis first applies at several stages of the analysis of poll data. Having rejected the null hypothesis that there are no differences at all in the data, you can go on to test a more specific null hypothesis, such as that there is no difference between Keiths and Schooner, or between Keiths and all other brands.

You test the null hypothesis rather than evaluating the alternative hypothesis directly because it is easier to model what you would expect to see under the null hypothesis. Next, I'll show how to model what is expected under the null hypothesis, so that the observed results can be compared with the results expected under it.

Modeling the null hypothesis: The chi-square statistic

So far, you have summarized the results of the Web poll with a table that reports the frequency count (and percentage) for each answer option. To test the null hypothesis (that there is no difference among the table cell frequencies), it is easier to compute an overall measure of how much each table cell deviates from what you would expect under the null hypothesis.

In this sample beer popularity poll, the expected frequency under the null hypothesis is as follows:

Expected frequency = number of observations / number of answer options

Expected frequency = 1000 / 4

Expected frequency = 250

To compute an overall measure of how much the responses in each cell differ from the expected frequency, you could sum the differences between the observed and expected frequencies: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you will find that the sum is 0, because the deviations from the mean always sum to 0. To get around this problem, you square each difference (this is where the "square" in chi square comes from). Finally, to make the value comparable across samples with different numbers of observations (in other words, to standardize it), you divide each squared difference by the expected frequency. The formula for the chi-square statistic is therefore as follows ("O" stands for "observed frequency" and "E" for "expected frequency"):

Figure 1. The formula for the chi-square statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

If you compute the chi-square statistic for the beer popularity poll data, you get a value of 9.80. To test the null hypothesis, you need to know the probability of obtaining a value this extreme given random sampling variability alone. To find this probability, you need to understand what the sampling distribution of the chi-square statistic looks like.
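The value of 9.80 can be checked with a few lines of PHP. This is only a minimal sketch of the calculation described above, with a hypothetical function name; it assumes equal expected frequencies and is not the ChiSquare1D class discussed later.

```php
<?php
// Hypothetical helper: one-way chi-square statistic under the null
// hypothesis that every answer option is equally likely.
function chiSquareStatistic(array $obsFreq): float
{
    $expected = array_sum($obsFreq) / count($obsFreq);       // 1000 / 4 = 250
    $chiSq = 0.0;
    foreach ($obsFreq as $observed) {
        $chiSq += ($observed - $expected) ** 2 / $expected;  // (O - E)^2 / E
    }
    return $chiSq;
}

$beerPoll = [285, 250, 215, 250];   // Keiths, Olands, Schooner, Other (Table 1)
printf("%.2f\n", chiSquareStatistic($beerPoll));   // prints 9.80
```

Only the Keiths and Schooner cells contribute, each adding 35^2 / 250 = 4.90 to the total.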

Observing the sampling distribution of chi square

Figure 2. Chi-square distribution plots

In each plot, the horizontal axis represents the obtained chi-square value (the range shown is 0 to 10). The vertical axis shows the probability (or relative frequency of occurrence) of each chi-square value.

When you study these chi-square plots, note that the shape of the probability function changes as you change the degrees of freedom (df) in the experiment. For poll data, the degrees of freedom are computed by taking the number of answer options (k) in the poll and subtracting 1 (df = k - 1).

Typically, as you increase the number of answer options in an experiment, the probability of obtaining a larger chi-square value increases. This is because adding answer options increases the number of variance values, (observed - expected)^2 / expected, that are summed. So as you add answer options, the probability of obtaining a large chi-square value goes up, while the probability of obtaining a small one goes down. This is why the shape of the sampling distribution of chi square changes with the df value.

Also, note that you are usually not interested in the height of the curve at the exact chi-square value obtained, but rather in the total area under the curve to the right of the obtained value. This tail probability tells you whether obtaining a value as extreme as the one you observed is likely (a large tail area) or unlikely (a small tail area). (In practice, I don't use these plots to compute tail probabilities, because I can implement a mathematical function that returns the tail probability for a given chi-square value. This is the approach taken in the chi-square software discussed later in this article.)

To learn more about how these plots come about, consider how you might simulate the plot corresponding to df = 2 (that is, k = 3). Imagine putting the numbers 1, 2, and 3 in a hat, shaking it, drawing a number, and recording the drawn number as one trial. Repeat this for 300 trials, and then count how often 1, 2, and 3 each came up.

Each time you run this experiment, you should expect the results to have a slightly different frequency distribution, reflecting sampling variability, without deviating far from the range of probable outcomes.

The following Multinomial class implements this idea. You initialize the class with the number of experiments to run, the number of trials in each experiment, and the number of options per trial. The results of each experiment are recorded in an array named Outcomes.

Listing 1. Contents of the Multinomial class

<?php

// multinomial.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL

class Multinomial {

  var $NExps;
  var $NTrials;
  var $NOptions;
  var $Outcomes = array();

  function __construct($NExps, $NTrials, $NOptions) {
    $this->NExps    = $NExps;
    $this->NTrials  = $NTrials;
    $this->NOptions = $NOptions;
    // Run each experiment and record its frequency counts
    for ($i = 0; $i < $this->NExps; $i++) {
      $this->Outcomes[$i] = $this->runExperiment();
    }
  }

  function runExperiment() {
    // Start every option at zero so that unchosen options are still counted
    $Outcome = array_fill(1, $this->NOptions, 0);
    // One random choice per trial
    for ($i = 0; $i < $this->NTrials; $i++) {
      $choice = rand(1, $this->NOptions);
      $Outcome[$choice]++;
    }
    return $Outcome;
  }

}

?>

Note that the runExperiment method is the key part of the script: it makes a random choice on each trial of an experiment and keeps track of which choices have been made so far in the simulated experiment.

To find the sampling distribution of the chi-square statistic, you simply take the results of each experiment and compute the chi-square statistic for them. Because of random sampling variability, the chi-square statistic varies from experiment to experiment.

The following script writes the chi-square statistic obtained in each experiment to an output file so that it can be plotted later.

Listing 2. Writing the obtained chi-square statistics to an output file

<?php

// simulate.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL

// Set time limit to 0 so script doesn't time out
set_time_limit(0);

require_once "../init.php";
require PHP_MATH . "chi/multinomial.php";
require PHP_MATH . "chi/chisquare1d.php";

// Initialization parameters
$NExps    = 10000;
$NTrials  = 300;
$NOptions = 3;

$multi = new Multinomial($NExps, $NTrials, $NOptions);

$output = fopen("./data.txt", "w") or die("File won't open");

$distribution = array();

for ($i = 0; $i < $NExps; $i++) {
  // For each multinomial experiment, do chi-square analysis
  $chi = new ChiSquare1D($multi->Outcomes[$i]);

  // Load obtained chi-square value into sampling distribution array
  $distribution[$i] = $chi->ChiSqObt;

  // Write obtained chi-square value to file
  fputs($output, $distribution[$i] . "\n");
}

fclose($output);

?>

To visualize the results of running this experiment, the easiest approach for me was to load the data.txt file into the open source statistics package R, run the histogram command, and edit the resulting chart in a graphics editor:

x = scan("data.txt")

hist(x, 50)

As you can see, the histogram of these chi-square values approximates the continuous chi-square distribution for df = 2 shown above.

Figure 3. Simulated values approximating the continuous distribution for df = 2

In the following sections, I focus on how the chi-square software used in the simulation works. Normally, you would use the chi-square software to analyze actual nominal data (such as Web poll results, weekly traffic reports, or customer brand preference reports) rather than simulated data. You will probably also be interested in the other output the software generates, such as summary tables and tail probabilities.

Instance variables of the chi-square class

The PHP-based chi-square package I developed consists of classes for analyzing frequency data classified along one or two dimensions (chisquare1d.php and chisquare2d.php). I will limit my discussion to explaining how the ChiSquare1D class works and how to apply it to one-dimensional Web poll data.

Before continuing, I should mention that classifying data along two dimensions (for example, beer preference by gender) lets you begin to explain your results by looking for systematic relationships or conditional probabilities in the contingency table cells. Although much of the following discussion will help you understand how the chisquare2d.php software works, there are other experimental, analytical, and visualization issues, not discussed in this article, that you need to address before using that class.

Listing 3 shows a fragment of the ChiSquare1D class, consisting of:

1. An included file

2. The class instance variables

Listing 3. Fragment of the chi-square class showing the included file and instance variables

<?php

// chisquare1d.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL

require_once PHP_MATH . "dist/distribution.php";

class ChiSquare1D {

  var $Total;
  var $ObsFreq = array(); // Observed frequencies
  var $ExpFreq = array(); // Expected frequencies
  var $ExpProb = array(); // Expected probabilities
  var $NumCells;
  var $ChiSqObt;
  var $DF;
  var $Alpha;
  var $ChiSqProb;
  var $ChiSqCrit;

}

?>

The top of the script in Listing 3 includes a file named distribution.php. The include path uses the PHP_MATH constant, which is set in the init.php file; it is assumed that init.php has already been included by the calling script.

The included file, distribution.php, contains methods that generate sampling distribution statistics for several commonly used sampling distributions (the t, F, and chi-square distributions). The ChiSquare1D class needs access to the chi-square methods in distribution.php to compute the tail probability of the obtained chi-square value.

The list of instance variables in this class is worth noting because it defines the result object that the analysis procedure generates. This result object contains all the important details about the test, including the three key chi-square statistics: ChiSqObt, ChiSqProb, and ChiSqCrit. To see how each instance variable is computed, look at the class's constructor method, where all of these values originate.

The constructor: Backbone of the chi-square test

Listing 4 shows the constructor code for the chi-square class, which forms the backbone of the chi-square test.

Listing 4. The constructor of the chi-square class

<?php

class ChiSquare1D {

  function __construct($ObsFreq, $Alpha = 0.05, $ExpProb = false) {
    $this->ObsFreq   = $ObsFreq;
    $this->ExpProb   = $ExpProb;
    $this->Alpha     = $Alpha;
    $this->NumCells  = count($this->ObsFreq);
    $this->DF        = $this->NumCells - 1;
    $this->Total     = $this->getTotal();
    $this->ExpFreq   = $this->getExpFreq();
    $this->ChiSqObt  = $this->getChiSqObt();
    $this->ChiSqCrit = $this->getChiSqCrit();
    $this->ChiSqProb = $this->getChiSqProb();
  }

}

?>

Four aspects of the constructor method are worth noting:

1. The constructor accepts an array of observed frequencies, an alpha probability cutoff, and an optional array of expected probabilities.

2. The first six lines involve relatively simple assignments and computations that are recorded so that a complete result object is available to the calling script.

3. The last four lines do most of the work of obtaining the chi-square statistics that you are most interested in.

4. The class implements only the chi-square test logic. No output methods are associated with the class.

You can study the class methods included in the code download for this article to learn more about how each value in the result object is computed (see Resources).

Handling output issues

The code in Listing 5 shows how easy it is to perform a chi-square analysis using the ChiSquare1D class. It also demonstrates how output issues are handled.

The script invokes a wrapper script named chisquare1d_html.php. The purpose of this wrapper is to separate the logic of the chi-square procedure from its presentation. The _html suffix indicates that the output is intended for a standard Web browser or another device that displays HTML.

Another purpose of the wrapper is to organize the output in a way that makes the data easier to understand. To this end, the class contains two methods for displaying the results of the chi-square analysis. The showTableSummary method displays the first output table shown after the code (Table 2), while the showChiSquareStats method displays the second output table (Table 3).

Listing 5. Organizing data with the wrapper script

<?php

// beer_poll_analysis.php

require_once "../init.php";
require_once PHP_MATH . "chi/chisquare1d_html.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(285, 250, 215, 250);
$Alpha    = 0.05;

$Chi = new ChiSquare1D_HTML($ObsFreq, $Alpha);

$Chi->showTableSummary($Headings);
echo "<br><br>";
$Chi->showChiSquareStats();

?>

The script produces the following output:

Table 2. Expected frequencies and variances from running the wrapper script

            Keiths    Olands    Schooner    Other    Total
Observed    285       250       215         250      1000
Expected    250       250       250         250      1000
Variance    4.90      0.00      4.90        0.00     9.80

Table 3. Chi-square statistics from running the wrapper script

              DF    Obtained    Probability    Critical
Chi square    3     9.80        0.02           7.81

Table 2 shows the expected frequency and the variance measure (O - E)^2 / E for each cell, where O is the observed frequency and E is the expected frequency. The variance values sum to the obtained chi-square value (9.80), which appears in the lower-right cell of the summary table.
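The per-cell calculation behind Table 2 can be sketched as follows. This is a minimal illustration, not the code from the article's download: chiSquareCells() is a hypothetical helper, and it assumes equal expected frequencies when none are supplied.

```php
<?php
// Sketch only: chiSquareCells() is a hypothetical helper, not part of
// the article's Chisquare1d class.
function chiSquareCells(array $obsFreq, ?array $expFreq = null): array
{
    $total = array_sum($obsFreq);
    $k     = count($obsFreq);

    // Assumed default null hypothesis: every category is equally likely.
    if ($expFreq === null) {
        $expFreq = array_fill(0, $k, $total / $k);
    }

    $cells = [];
    foreach (array_values($obsFreq) as $i => $o) {
        $e = $expFreq[$i];
        $cells[] = ($o - $e) ** 2 / $e;   // (O - E)^2 / E for this cell
    }
    return $cells;
}

$cells = chiSquareCells([285, 250, 215, 250]);
$chiSq = array_sum($cells);   // obtained chi-square value

foreach ($cells as $c) {
    printf("%.2f ", $c);
}
printf("| total %.2f\n", $chiSq);
// prints: 4.90 0.00 4.90 0.00 | total 9.80
```

The printed values match the Variance row of Table 2.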

Table 3 reports the chi-square statistics. It gives the degrees of freedom used in the analysis and again reports the obtained chi-square value. The obtained value is then re-expressed as a tail probability, in this case 0.02. This means that, under the null hypothesis, the probability of observing a chi-square value of 9.80 or greater is 2% (a fairly low probability).
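One way to obtain such a tail probability without a statistics extension is the series expansion of the regularized lower incomplete gamma function. The sketch below is an assumption about how this could be done, not the method used by the article's class. The names gammaOfHalf() and chiSquareTailProb() are hypothetical.

```php
<?php
// Sketch only: both functions are hypothetical helpers.

// Gamma(n / 2) for a positive integer n, using Gamma(z + 1) = z * Gamma(z).
function gammaOfHalf(int $n): float
{
    $isEven = ($n % 2 === 0);
    $g = $isEven ? 1.0 : sqrt(M_PI);   // Gamma(1) = 1, Gamma(1/2) = sqrt(pi)
    $t = $isEven ? 1.0 : 0.5;
    $z = $n / 2.0;
    while ($t < $z - 1e-9) {
        $g *= $t;
        $t += 1.0;
    }
    return $g;
}

// Probability of a chi-square value >= $chiSq with $df degrees of freedom.
function chiSquareTailProb(float $chiSq, int $df): float
{
    if ($chiSq <= 0.0) {
        return 1.0;
    }
    $a = $df / 2.0;
    $x = $chiSq / 2.0;

    // P(a, x) = e^-x * x^a / Gamma(a) * sum_{n>=0} x^n / (a (a+1) ... (a+n))
    $term = 1.0 / $a;
    $sum  = $term;
    for ($n = 1; $n < 1000; $n++) {
        $term *= $x / ($a + $n);
        $sum  += $term;
        if ($term < $sum * 1e-15) {
            break;   // series has converged
        }
    }
    $lower = $sum * exp(-$x) * pow($x, $a) / gammaOfHalf($df);
    return 1.0 - $lower;
}

printf("%.2f\n", chiSquareTailProb(9.80, 3));   // prints: 0.02
```

The series is adequate for the moderate chi-square values seen here. A production implementation would normally switch to a continued-fraction form for large values.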

If you decide to reject the null hypothesis (that the results could have been obtained by random sampling from the null distribution), most statisticians would not argue. Your poll results more likely reflect a real difference in beer brand preference among the population of Nova Scotia beer consumers.

To confirm this conclusion, you can compare the obtained chi-square value with the critical value.

Why is the critical value important? It is derived from the significance level (that is, the alpha cutoff level) set for the analysis. By convention the alpha cutoff is set to 0.05, which is the value used in the analysis above. This setting is used to find the point (the critical value) in the sampling distribution of chi-square that cuts off a tail area equal to the alpha cutoff (0.05).

In this analysis, the obtained chi-square value is greater than the critical value. This means that the threshold for retaining the null hypothesis has been exceeded. The alternative hypothesis, that the proportions differ in the population, is statistically more likely to be correct.

In automated analysis of data streams, the alpha cutoff setting can serve as an output filter for knowledge-discovery algorithms such as Chi-square Automatic Interaction Detection (CHAID). Such algorithms cannot by themselves provide detailed guidance on discovering truly useful patterns.
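As a rough illustration of the alpha cutoff acting as an output filter, the sketch below keeps only the analyses whose probability falls below alpha. The filterSignificant() function and the array layout are hypothetical. The two probabilities are the ones reported in this article, for the poll above and for the re-poll analyzed later (Table 6).

```php
<?php
// Sketch only: filterSignificant() and the result layout are hypothetical.
function filterSignificant(array $results, float $alpha = 0.05): array
{
    // Keep results whose tail probability is below the alpha cutoff.
    $kept = array_filter($results, function (array $r) use ($alpha): bool {
        return $r['prob'] < $alpha;
    });
    return array_values($kept);
}

$results = [
    ['name' => 'beer_poll',   'chisq' => 9.80, 'prob' => 0.02],  // Table 3
    ['name' => 'beer_repoll', 'chisq' => 1.14, 'prob' => 0.77],  // Table 6
];

foreach (filterSignificant($results) as $r) {
    printf("%s: chi-square %.2f, p = %.2f\n", $r['name'], $r['chisq'], $r['prob']);
}
// prints: beer_poll: chi-square 9.80, p = 0.02
```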

A new poll

Another interesting application of the one-way chi-square test is to repeat a poll to see whether people's responses have changed.

Suppose that after some time you decide to conduct another Web poll of Nova Scotia beer consumers. Once again you ask about their favorite beer brand, and this time you observe the following results:

Table 4. The new beer poll

Keiths        Olands      Schooner      Other
385 (27.5%)   350 (25%)   315 (22.5%)   350 (25%)

The old data looks like this:

Table 1. The old beer poll (shown again)

Keiths        Olands      Schooner      Other
285 (28.5%)   250 (25%)   215 (21.5%)   250 (25%)

The obvious difference between the polls is that the first had 1,000 respondents and the second had 1,400. The main effect of the additional respondents was to increase the frequency count of each response by 100.

When you analyze the new poll, you can either use the default method of calculating expected frequencies, or initialize the analysis with the expected probability of each outcome (based on the proportions observed in the previous poll). In the second case, you load the previously obtained proportions into an expected-probability array ($ExpProb), and they are used to compute the expected frequency for each response option.
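The conversion from expected probabilities to expected frequencies amounts to multiplying each probability by the number of respondents. The following is a minimal sketch. expectedFromProb() is a hypothetical helper, not a method of the article's class.

```php
<?php
// Sketch only: expectedFromProb() is a hypothetical helper.
function expectedFromProb(array $obsFreq, array $expProb): array
{
    $total = array_sum($obsFreq);   // number of respondents in the new poll
    return array_map(
        function (float $p) use ($total): float {
            return $p * $total;     // expected frequency for this option
        },
        $expProb
    );
}

$expFreq = expectedFromProb(
    [385, 350, 315, 350],           // new observed counts (Table 4)
    [.285, .250, .215, .250]        // proportions from the old poll (Table 1)
);

foreach ($expFreq as $e) {
    printf("%.0f ", $e);
}
// prints: 399 350 301 350
```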

Listing 6 shows the beer poll analysis code for detecting preference changes:

Listing 6. Detecting changes in preference

<?php
// beer_repoll_analysis.php

require_once "./init.php";
// Php_math is a path constant, presumably defined by init.php
require_once Php_math . "Chi/chisquare1d_html.php";

$Headings = array("Keiths", "Olands", "Schooner", "Other");
$ObsFreq  = array(385, 350, 315, 350);
$Alpha    = 0.05;

// Expected probabilities: the proportions observed in the old poll
$ExpProb  = array(.285, .250, .215, .250);

$Chi = new Chisquare1d_html($ObsFreq, $Alpha, $ExpProb);

$Chi->showtablesummary($Headings);   // produces Table 5
echo "<br><br>";
$Chi->showchisquarestats();          // produces Table 6
?>

Tables 5 and 6 show the HTML output generated by the beer_repoll_analysis.php script:

Table 5. Expected frequencies and variance obtained by running beer_repoll_analysis.php

            Keiths   Olands   Schooner   Other   Total
Observed    385      350      315        350     1400
Expected    399      350      301        350     1400
Variance    0.49     0.00     0.65       0.00    1.14

Table 6. Chi-square statistics obtained by running beer_repoll_analysis.php

              DF   Obtained value   Probability   Critical value
Chi-square    3    1.14             0.77          7.81

Table 6 shows that, under the null hypothesis, the probability of obtaining a chi-square value of 1.14 or greater is 77%. We cannot reject the null hypothesis that the preferences of Nova Scotia beer consumers have not changed since the last poll. Any difference between the observed and expected frequencies can be explained as expected sampling variability from the same population of Nova Scotia beer consumers. Given that the new poll results were produced simply by adding a constant 100 to each count from the previous poll, this null finding should come as no surprise.

However, you can imagine that the results had changed, and that they suggested another brand of beer was becoming more popular (note the variance reported at the bottom of each column in Table 5). You can further imagine that this finding had significant financial implications for the brewery in question, since bar owners tend to stock the best-selling beers.

Such results would be scrutinized closely by the brewery owners, who would question the appropriateness of the analysis procedure and the experimental method and, in particular, the representativeness of the sample. If you plan to run a Web experiment that might have important practical implications, you need to pay equal attention to the experimental method used to collect the data and to the analytical techniques used to draw inferences from it.

This article therefore not only lays a foundation for a better understanding of Web data, it also offers advice on how to defend your choice of statistical test and make the conclusions you draw from the data more defensible.

Applying what you have learned

In this article, you have learned how to apply inferential statistics to the ubiquitous frequency data used to summarize Web data streams, with a focus on the analysis of Web poll data. The simple one-way chi-square analysis procedure can, however, be applied just as effectively to other types of data streams (access logs, survey results, customer profiles, and customer orders) to turn raw data into useful knowledge.

In applying inferential statistics to Web data, I also introduced the idea of viewing data streams as the results of Web experiments, so that you are more likely to take experimental design considerations into account when making inferences. Often you cannot draw inferences because you do not have enough control over the data collection process. You can change this, however, by being more proactive in applying the principles of experimental design to your Web data collection (for example, by randomly selecting the voters in your Web poll).
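One simple way to randomize who is invited to a Web poll is to show the poll to a fixed random fraction of visitors. The sketch below is only an illustration under that assumption. inviteToPoll() and the 10% rate are hypothetical, and random invitation alone does not make the resulting sample representative.

```php
<?php
// Sketch only: inviteToPoll() is a hypothetical helper.
// Returns true for roughly a fraction $rate of calls (0.0 to 1.0).
function inviteToPoll(float $rate): bool
{
    // Uniform draw on [0, 1): the upper bound is excluded so that a
    // rate of 1.0 always invites and a rate of 0.0 never does.
    $u = mt_rand(0, mt_getrandmax() - 1) / mt_getrandmax();
    return $u < $rate;
}

// Example: invite about 10% of page requests (rate chosen arbitrarily).
if (inviteToPoll(0.10)) {
    echo "Show poll form\n";
}
```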

Finally, I showed how to simulate the sampling distribution of the chi-square statistic for different degrees of freedom, rather than simply taking the distribution as given. In doing so, I also demonstrated a workaround (simulating the sampling distribution of an experiment with a small $NTrials value) for the prohibition against using the chi-square test when the expected frequency of a measurement category is less than 5 (in other words, in small-N experiments). So, instead of using only the df of the study to compute the probability of a sample result, for a small number of trials you may also need to use the $NTrials value as a parameter to obtain the probability of the observed chi-square result.
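The simulation idea can be sketched as a small Monte Carlo routine. This is an assumption-laden illustration, not the simulation code from the article: the function name, its signature, the number of simulated experiments, and the example inputs (12 respondents, an obtained value of 6.0) are all hypothetical.

```php
<?php
// Sketch only: estimates the probability of a chi-square value at least
// as large as $obtained, for experiments with $nTrials observations.
function simulateChiSquareProb(array $expProb, int $nTrials, float $obtained, int $nSims = 2000): float
{
    $expProb = array_values($expProb);
    $k       = count($expProb);
    $hits    = 0;

    for ($s = 0; $s < $nSims; $s++) {
        // Simulate one experiment under the null hypothesis.
        $counts = array_fill(0, $k, 0);
        for ($t = 0; $t < $nTrials; $t++) {
            $u   = mt_rand(0, mt_getrandmax() - 1) / mt_getrandmax();
            $cum = 0.0;
            $cat = $k - 1;   // fallback guards against rounding in the sum
            foreach ($expProb as $i => $p) {
                $cum += $p;
                if ($u < $cum) {
                    $cat = $i;
                    break;
                }
            }
            $counts[$cat]++;
        }

        // Chi-square value of the simulated experiment.
        $chiSq = 0.0;
        foreach ($counts as $i => $o) {
            $e = $expProb[$i] * $nTrials;
            $chiSq += ($o - $e) ** 2 / $e;
        }
        if ($chiSq >= $obtained - 1e-9) {
            $hits++;
        }
    }
    return $hits / $nSims;
}

mt_srand(42);   // fixed seed so repeated runs give the same estimate
// 12 respondents, 4 equally likely answers: expected frequency is 3 per cell
$prob = simulateChiSquareProb([.25, .25, .25, .25], 12, 6.0);
printf("Estimated probability: %.3f\n", $prob);
```

Because the estimate comes from simulated experiments of the same size as the real one, it does not rely on the large-sample approximation that the expected-frequency rule is meant to protect.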

It is worth considering how you might analyze small-N experiments, because you will often want to analyze your data before collection is complete: when each observation is expensive, when observations take a long time to obtain, or simply because you are curious. When you attempt this level of Web data analysis, it is a good idea to keep the following two questions in mind:

* Are you justified in making inferences under small-N conditions?

* Do simulations help you decide what inferences to draw in these circumstances?
