Use PHP to make Web data analysis more advanced

Effective, multi-level analysis of Web data is a key survival factor for many Web-oriented enterprises. The design of (and decisions about) data analysis tests is usually the work of system administrators and in-house application designers, who may know little about statistics beyond tabulating raw counts. In this article, Paul Meagher teaches Web developers the skills and concepts required to apply inferential statistics to Web data streams.

Dynamic websites constantly generate large amounts of data: access logs, poll and survey results, customer profile information, orders, and more. Web developers not only create the applications that generate this data; they also need to develop applications and methods that make sense of these data streams.

Generally, Web developers are ill-equipped to cope with the growing data analysis requirements that come with managing a site. Beyond reporting various descriptive statistics, they typically have no better way to characterize a data stream. Many inferential statistical procedures (methods for estimating population parameters from sample data) could be put to good use, but currently are not.

For example, Web access statistics (as currently compiled) are little more than frequency counts grouped in various ways. Poll and survey results are likewise commonly reported as raw counts and percentages.

Perhaps it is enough for developers to handle the statistical analysis of data streams in this simple way, and we should not expect more. After all, there are professionals who do more complex data stream analysis: statisticians and trained analysts. When an organization needs more than descriptive statistics, they can be brought in.

Another response, however, is to acknowledge that a deeper understanding of inferential statistics is becoming part of the Web developer's job description. Dynamic sites generate more and more data, and it is turning out to be the responsibility of Web developers and system administrators to try to turn that data into useful knowledge.

I advocate the latter response. This article aims to help Web developers and system administrators learn (or relearn, if forgotten) the design and analysis skills required to apply inferential statistics to Web data streams.


Relating Web data to experimental design

Applying inferential statistics to Web data streams requires more than learning the mathematics behind various statistical tests. Just as important is the ability to relate the data collection process to key distinctions in experimental design: What is the measurement scale? How representative is the sample? What is the population? What hypothesis is being tested?

To apply inferential statistics to Web data streams, you first need to think of the results as having been generated by an experimental design, and then choose the analysis procedure suited to that design. It is important to do this even if it seems a stretch to think of Web polls and access log data as the results of an experiment. Why?

1. It will help you select an appropriate statistical test.
2. It will help you draw appropriate conclusions from the collected data.

In deciding which statistical test is appropriate, an important aspect of experimental design is the choice of measurement scale used for data collection.

Examples of measurement scales

A measurement scale simply specifies a procedure for assigning symbols, letters, or numbers to a phenomenon of interest. For example, the kilogram scale lets you assign a number to an object to indicate its weight, based on the standardized displacement of a measuring instrument.

There are four important measurement scales:

Ratio: The kilogram scale is an example of a ratio scale. The symbols assigned to object attributes have numerical meaning, and you can perform operations on them (such as computing ratios) that you cannot perform on values obtained with less powerful scales.

Interval: The distance (or interval) between any two adjacent units of measurement is equal, but the zero point is arbitrary. Examples include measurements of longitude and tidal height, and the measurement of the beginnings and ends of different years. Values on an interval scale can be added and subtracted, but multiplying or dividing them is not meaningful.

Ordinal: An ordinal scale can be applied to rank-ordered data, where the values and observations belonging to the scale can be put in order or have a rating scale attached. Common examples include like/dislike polls, in which numbers are assigned to attitudes (from 1 = strongly dislike to 5 = strongly like). Generally, the categories of ordinal data have a natural order, but the differences between adjacent points on the scale need not be the same. You can count and order ordinal data, but you cannot measure it.

Nominal: The nominal scale is the weakest form of measurement. It mainly refers to assigning items to groups or categories. The measurement carries no quantitative information and implies no ordering of the items. The main numerical operation performed on nominal data is counting the frequency of items in each category.

The following table compares the properties of each measurement scale:

Measurement scale   Attributes have absolute numerical meaning?   Can most mathematical operations be performed?
Ratio               Yes                                           Yes
Interval            Yes, but the zero point is arbitrary          Add and subtract
Ordinal             No                                            Count and rank
Nominal             No                                            Count only

In this article, I focus mainly on data collected using a nominal scale of measurement and on the inferential techniques appropriate to nominal data.


Using a nominal scale

Almost all Web users (designers, customers, and system administrators) are familiar with the nominal scale. Web polls and access logs are similar in that they often use a nominal scale as the measurement scale. In Web polls, people's preferences are often measured by asking them to choose among answer options (for example, "Do you prefer brand A, brand B, or brand C?"). The data are summarized by counting the frequency of each type of answer.

Similarly, a common way to measure site traffic is to assign each hit or visit to the day of the week on which it occurred, and then count the number of hits or visits that fall on each day. In addition, you can (and probably do) count hits by browser type, operating system type, and the visitor's country or region, or by any other classification you like.
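
As a rough illustration of this kind of tallying, the sketch below assigns each hit to the weekday on which it occurred and counts the hits per day. The function name countHitsByWeekday and the sample timestamps are hypothetical and are not part of the article's code download.

```php
<?php

// Count hits per weekday. $timestamps is a list of date-time strings.
function countHitsByWeekday(array $timestamps): array
{
    // Start every weekday at zero so days without hits still appear.
    $counts = array_fill_keys(
        ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
        0
    );
    foreach ($timestamps as $ts) {
        // 'l' formats a date as the full English weekday name.
        $day = (new DateTimeImmutable($ts))->format('l');
        $counts[$day]++;
    }
    return $counts;
}

// Hypothetical sample of hit timestamps (not real log data).
$hits = [
    '2003-01-06 09:15:00',
    '2003-01-06 17:40:00',
    '2003-01-07 11:05:00',
    '2003-01-11 20:30:00',
];

print_r(countHitsByWeekday($hits));
```

The resulting array is nominal count data of the same shape as a poll tally: one frequency per category.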

Because both Web polls and access statistics involve counting the number of times data fall into particular categories, they can be analyzed with similar nonparametric statistical tests (tests that let you make inferences based on the shape of a distribution rather than on population parameters).

In his book Handbook of Parametric and Nonparametric Statistical Procedures (1997, page 19), David Sheskin distinguishes between parametric and nonparametric tests as follows:

The distinction used in this book for classifying a procedure as a parametric or a nonparametric test is based primarily on the level of measurement represented by the data being analyzed. As a general rule, inferential statistical tests that evaluate categorical/nominal data and ordinal/rank-order data are classified as nonparametric tests, while tests that evaluate interval or ratio data are classified as parametric tests.

Nonparametric tests are also useful when some of the assumptions underlying parametric tests are questionable; when parametric assumptions are not met, nonparametric tests are very useful for detecting population differences. For the Web poll example, I use a nonparametric analysis procedure because Web polls typically use a nominal scale to record voters' preferences.

I am not suggesting that Web polls and Web access statistics should always use nominal scale measurement, or that nonparametric statistical tests are the only methods available for analyzing such data. It is not hard to imagine polls and surveys that ask users to give a numerical rating (from 1 to 100) for each option, for which parametric statistical tests would be more appropriate.

Nevertheless, many Web data streams consist of compiled category counts, and data measured on more powerful scales can be converted to nominal scale data by defining intervals (for example, ages 17 to 21) and assigning each data point to a category (such as "young adult"). The prevalence of frequency data (already part of Web developers' experience) makes nonparametric statistics a good starting point for learning how to apply inferential techniques to data streams.
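
A minimal sketch of that conversion follows. Only the 17-to-21 interval and its "young adult" label come from the text above; the other two category labels, the function names, and the sample ages are assumptions made for illustration.

```php
<?php

// Map an age (ratio scale) to a category label (nominal scale).
function ageCategory(int $age): string
{
    if ($age < 17) {
        return 'under 17';      // assumed label
    }
    if ($age <= 21) {
        return 'young adult';   // the 17-21 interval from the text
    }
    return 'over 21';           // assumed label
}

// Reduce a list of ages to frequency counts per category.
function countByCategory(array $ages): array
{
    return array_count_values(array_map('ageCategory', $ages));
}

$ages = [16, 17, 19, 21, 22, 35]; // hypothetical data
print_r(countByCategory($ages));
```

Note that the conversion only goes one way: once ages are reduced to category counts, the original values cannot be recovered.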

To keep this article to a reasonable length, I limit the discussion of Web data stream analysis to Web polls. Keep in mind, however, that many Web data streams can be represented as nominal count data, and that the inferential techniques I discuss let you do more than simply report counts.


Starting with sampling

Suppose you run a weekly poll on your site, www.NovaScotiaBeerDrinkers.com, asking members for their opinions on various topics. You have created a poll asking members for their favorite beer brand (there are three well-known beer brands in Nova Scotia, Canada: Keiths, Olands, and Schoner). To make the survey as broad as possible, you include "Other" among the answers.

You receive 1,000 responses; the results appear in Table 1. (The results shown in this article are for demonstration only and are not based on any actual survey.)

Table 1. Beer poll results

Keiths          Olands          Schoner         Other
285 (28.50%)    250 (25.00%)    215 (21.50%)    250 (25.00%)

The data seem to support the conclusion that Keiths is the most popular brand among Nova Scotia residents. Can you draw that conclusion from these numbers? In other words, can you make inferences about the population of Nova Scotia beer consumers from the results obtained from the sample?

Many factors related to how the sample was collected could make inferences about relative popularity incorrect. The sample may contain a disproportionate number of Keiths brewery employees; you may not have fully prevented one person from voting multiple times, and that person may have biased the results; the people who chose to vote may differ from those who did not; and the people who vote online may differ from those who are not online.

Most Web polls are subject to these interpretive difficulties, which arise whenever you try to draw conclusions about population parameters from sample statistics. From the standpoint of experimental design, one of the first questions to ask before collecting data is: can steps be taken to help ensure that the sample is representative of the population under study?

If drawing conclusions about the population is what motivates you to run a Web poll (as opposed to providing entertainment for site visitors), you should implement techniques to ensure one person, one vote (so voters must log in with a unique identifier to vote) and to ensure that the sample of voters is randomly selected (for example, randomly select a subset of members and send them an email encouraging them to vote).
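
The two safeguards might be sketched as follows. This is only an illustration under stated assumptions: votes are taken to arrive as [memberId, choice] pairs, and the function names and member IDs are hypothetical. PHP's shuffle() is adequate for a demonstration but is not a cryptographically secure source of randomness.

```php
<?php

// Keep only the first vote cast under each member ID, then tally the choices.
// $votes is a list of [memberId, choice] pairs.
function tallyOneVotePerMember(array $votes): array
{
    $firstVote = [];
    foreach ($votes as [$memberId, $choice]) {
        if (!array_key_exists($memberId, $firstVote)) {
            $firstVote[$memberId] = $choice;
        }
    }
    return array_count_values($firstVote);
}

// Draw a simple random sample of $n members without replacement.
function sampleMembers(array $memberIds, int $n): array
{
    shuffle($memberIds);
    return array_slice($memberIds, 0, $n);
}

// Hypothetical votes: member 101 votes twice and is counted once.
$votes = [[101, 'Keiths'], [102, 'Olands'], [101, 'Keiths'], [103, 'Keiths']];
print_r(tallyOneVotePerMember($votes));

// Members who would receive the email invitation to vote.
print_r(sampleMembers([101, 102, 103, 104, 105], 2));
```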

Ultimately, the goal is to eliminate (or at least reduce) the various biases that could weaken your ability to draw conclusions about the population under study.


Testing hypotheses

Assuming the sample of Nova Scotia beer consumers is unbiased, can you now conclude that Keiths is the most popular brand?

To answer this question, consider a related one: if you were to obtain another sample of Nova Scotia beer consumers, would you expect to see exactly the same results? In fact, you would expect the results observed in different samples to vary somewhat.

Given this expected sampling variability, you might wonder whether the observed brand preferences are better explained by random sampling variability than by real differences in the population under study. In statistical terms, this sampling-variability explanation is called the null hypothesis (written Ho). In this example, it can be stated as follows: the expected number of responses is the same across all answer categories.

Ho: # Keiths = # Olands = # Schoner = # Other

If you can rule out the null hypothesis, you have made some progress toward answering the original question of whether Keiths is the most popular brand. You would then accept the alternative hypothesis: that the proportions of the various responses differ in the population under study.

This "test the null hypothesis first" logic applies at multiple stages of poll data analysis. Once you have ruled out the null hypothesis that the data show no differences at all, you can go on to test more specific null hypotheses, such as that there is no difference between Keiths and Schoner, or between Keiths and all the other brands.

You proceed by testing null hypotheses rather than evaluating the alternative hypothesis directly because it is easier to statistically model what you would expect to observe under the null hypothesis. Next, I demonstrate how to model what is expected under the null hypothesis, so that the observed results can be compared with the results expected under it.


Modeling the null hypothesis: the chi-square statistic

So far, you have summarized the Web poll results in a table that reports the frequency count (and percentage) of each answer option. To test the null hypothesis (that there is no difference among the table cell frequencies), it is much easier to work with a single overall measure of how far each table cell deviates from the value you would expect under the null hypothesis.

In the beer popularity poll, the expected frequency under the null hypothesis is as follows:

Expected frequency = number of observations / number of answer options
Expected frequency = 1000 / 4
Expected frequency = 250

To calculate an overall measure of how much the responses in each cell differ from the expected frequency, you could sum all the differences into one measure that reflects how much the observed frequencies differ from the expected frequencies: (285 - 250) + (250 - 250) + (215 - 250) + (250 - 250).

If you do this, you will find that the sum is 0, because the sum of deviations from the mean is always 0. To solve this problem, square each of the differences (this is where the "square" in chi-square comes from). Finally, to make the value comparable across samples with different numbers of observations (in other words, to standardize it), divide each squared difference by its expected frequency. The formula for the chi-square statistic is therefore as follows ("O" means "observed frequency" and "E" means "expected frequency"):

Figure 1. Formula for the chi-square statistic
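
Written out from the description above, with the k answer options indexed by i (the subscript notation is introduced here for clarity), the statistic and its value for the beer poll are:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}

\chi^2 = \frac{(285-250)^2}{250} + \frac{(250-250)^2}{250}
       + \frac{(215-250)^2}{250} + \frac{(250-250)^2}{250}
       = 4.90 + 0 + 4.90 + 0 = 9.80
```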


If you calculate the chi-square statistic for the beer popularity poll, you get a value of 9.80. To test the null hypothesis, you need to know the probability of obtaining a value this extreme under the assumption of random sampling variability. To obtain this probability, you need to understand the sampling distribution of chi-square.
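
The calculation can be sketched in a few lines of PHP. This is a standalone illustration, not the article's ChiSquare1D class; the function name chiSquareEqualExpected is hypothetical, and it assumes the null hypothesis of equal expected frequencies in every cell.

```php
<?php

// Chi-square statistic for one-dimensional frequency data, assuming the
// null hypothesis that every answer option is equally likely.
function chiSquareEqualExpected(array $observed): float
{
    // Expected frequency = number of observations / number of answer options
    $expected = array_sum($observed) / count($observed);

    $chi = 0.0;
    foreach ($observed as $o) {
        // Squared deviation, standardized by the expected frequency
        $chi += ($o - $expected) ** 2 / $expected;
    }
    return $chi;
}

// Beer poll counts: Keiths, Olands, Schoner, Other
printf("%.2f\n", chiSquareEqualExpected([285, 250, 215, 250])); // prints 9.80
```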


Viewing the sampling distribution of chi-square

Figure 2. Chi-square plots


In each plot, the horizontal axis represents the size of the obtained chi-square value (the range shown is 0 to 10). The vertical axis shows the probability (or relative frequency of occurrence) of each chi-square value.

As you study these chi-square plots, notice that the shape of the probability function changes as you change the degrees of freedom (df) in an experiment. For poll data, the degrees of freedom are calculated by taking the number of answer options (k) in the poll and subtracting 1 (df = k - 1).

In general, the probability of obtaining a larger chi-square value goes up as you increase the number of answer options in an experiment. This is because adding answer options increases the number of variance values, (observed - expected)^2, that you sum over. So as you add answer options, the probability of obtaining a large chi-square value should go up, and the probability of obtaining a small chi-square value goes down. This is why the shape of the chi-square sampling distribution changes with the value of df.

Also note that you are generally not interested in the probability of the exact chi-square value obtained, but in the total area under the curve to the right of that value. This tail probability tells you whether obtaining a value this extreme is likely (a large tail area) or unlikely (a small tail area). (In practice, I do not use these plots to calculate tail probabilities, because I can implement a mathematical function that returns the tail probability for a given chi-square value. I take that approach in the chi-square software discussed later in this article.)
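
One way such a tail-probability function could be written is sketched below. It is not the routine in the article's Distribution.php, whose implementation is not shown here; this version simply integrates the chi-square density numerically with Simpson's rule and subtracts the result from 1. The function names are hypothetical, and the sketch assumes df is at least 2 so that the density is finite at zero.

```php
<?php

// Chi-square probability density at $x for $df degrees of freedom.
function chiSquarePdf(float $x, int $df): float
{
    $half = $df / 2;
    // Build Gamma(df/2) from Gamma(1) = 1 or Gamma(1/2) = sqrt(pi),
    // using the recurrence Gamma(g + 1) = g * Gamma(g).
    $g = ($df % 2 === 0) ? 1.0 : 0.5;
    $gamma = ($df % 2 === 0) ? 1.0 : sqrt(M_PI);
    for (; $g < $half; $g += 1.0) {
        $gamma *= $g;
    }
    return pow($x, $half - 1) * exp(-$x / 2) / (pow(2, $half) * $gamma);
}

// Area under the density to the right of $x (the tail probability),
// computed as 1 minus a composite Simpson's rule integral over [0, $x].
// Assumes $df >= 2; $steps must be even.
function chiSquareTail(float $x, int $df, int $steps = 20000): float
{
    $h = $x / $steps;
    $sum = chiSquarePdf(0.0, $df) + chiSquarePdf($x, $df);
    for ($i = 1; $i < $steps; $i++) {
        $sum += (($i % 2 === 1) ? 4 : 2) * chiSquarePdf($i * $h, $df);
    }
    return 1.0 - $sum * $h / 3;
}

// Tail area for a chi-square value of 9.80 with df = 3
printf("%.2f\n", chiSquareTail(9.80, 3)); // prints 0.02
```

Numerical integration is slower than a dedicated incomplete-gamma routine, but it keeps the link to "area under the curve to the right of the obtained value" explicit.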

To see how these plots arise, consider how you might simulate the plot corresponding to df = 2 (which represents k = 3). Imagine putting the numbers 1, 2, and 3 in a hat, shaking it, drawing a number, and recording the number drawn as one trial. Run 300 trials of this experiment and calculate the frequencies with which 1, 2, and 3 occur.

Each time you run this experiment, you should expect a slightly different frequency distribution of results, reflecting sampling variability, but not one that departs from the range of probable outcomes.

The following Multinomial class implements this idea. You initialize the class with the number of experiments to run, the number of trials in each experiment, and the number of options per trial. The results of each experiment are recorded in an array named Outcomes.

Listing 1. Contents of the Multinomial class



<?php

// Multinomial.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL

class Multinomial {

  public $NExps;              // Number of experiments to run
  public $NTrials;            // Number of trials in each experiment
  public $NOptions;           // Number of options per trial
  public $Outcomes = array(); // Frequency counts, one array per experiment

  public function __construct($NExps, $NTrials, $NOptions) {
    $this->NExps    = $NExps;
    $this->NTrials  = $NTrials;
    $this->NOptions = $NOptions;
    for ($i = 0; $i < $this->NExps; $i++) {
      $this->Outcomes[$i] = $this->runExperiment();
    }
  }

  public function runExperiment() {
    // Start every option at zero so options that are never drawn still appear
    $Outcome = array_fill(1, $this->NOptions, 0);
    // Make one random choice per trial and tally it
    for ($i = 0; $i < $this->NTrials; $i++) {
      $choice = rand(1, $this->NOptions);
      $Outcome[$choice]++;
    }
    return $Outcome;
  }

}
?>



Note that the runExperiment method is the heart of the script: it ensures that the choices made in each experiment are random, and it keeps track of the choices made so far in the simulated experiment.

To find the sampling distribution of the chi-square statistic, you only need to take the outcome of each experiment and calculate the chi-square statistic for it. Because of random sampling variability, the chi-square statistic varies from experiment to experiment.

The following script writes the chi-square statistic obtained from each experiment to an output file so that it can be plotted later.

Listing 2. Writing the obtained chi-square statistics to an output file


<?php

// Simulate.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL

// Set time limit to 0 so script doesn't time out
set_time_limit(0);

require_once "../init.php";
require PHP_MATH . "chi/Multinomial.php";
require PHP_MATH . "chi/ChiSquare1D.php";

// Initialization parameters
$NExps    = 10000;
$NTrials  = 300;
$NOptions = 3;

$multi = new Multinomial($NExps, $NTrials, $NOptions);

$distribution = array();
$output = fopen("./data.txt", "w") or die("file won't open");
for ($i = 0; $i < $NExps; $i++) {
  // For each multinomial experiment, do chi square analysis
  $chi = new ChiSquare1D($multi->Outcomes[$i]);

  // Load obtained chi square value into sampling distribution array
  $distribution[$i] = $chi->ChiSqObt;

  // Write obtained chi square value to file
  fputs($output, $distribution[$i] . "\n");
}
fclose($output);

?>



To visualize the results of running this experiment, the simplest approach for me was to load the data.txt file into the open source statistics package R, run the histogram command, and then edit the plot in a graphics editor. The R commands are as follows:

x = scan("data.txt")
hist(x, 50)

As you can see, the histogram of these chi-square values approximates the continuous chi-square distribution for df = 2 shown above.

Figure 3. Values approximating the continuous distribution for df = 2


In the following sections, I focus on how the chi-square software used in this simulation works. Ordinarily, you would use the chi-square software to analyze actual nominal scale data (such as Web poll results, weekly traffic reports, or customer brand preference reports) rather than simulated data. You may also be interested in other output that the software generates, such as summary tables and tail probabilities.

Chi-square instance variables

The PHP-based chi-square package I developed consists of classes for analyzing frequency data classified along one or two dimensions (ChiSquare1D.php and ChiSquare2D.php). My discussion is limited to explaining how the ChiSquare1D.php class works and how to apply it to one-dimensional Web poll data.

Before proceeding, note that when data are classified along two dimensions (for example, beer preference by gender), you can begin to explain your results by looking for systematic relationships or conditional probabilities among the cells of the contingency table. Although much of the discussion below will help you understand how the ChiSquare2D.php software works, there are other experimental, analytical, and visualization issues, not discussed in this article, that must be addressed before using that class.

Listing 3 shows a fragment of the ChiSquare1D.php class, which consists of the following parts:

1. An included file
2. Class instance variables

Listing 3. Fragment of the chi-square class with the included file and instance variables


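<?php

// ChiSquare1D.php

// Copyright 2003, Paul Meagher
// Distributed under LGPL

require_once PHP_MATH . "dist/Distribution.php";

class ChiSquare1D {

  public $Total;             // Total number of observations
  public $ObsFreq = array(); // Observed frequencies
  public $ExpFreq = array(); // Expected frequencies
  public $ExpProb = array(); // Expected probabilities
  public $NumCells;          // Number of cells (answer options)
  public $ChiSqObt;          // Obtained chi-square value
  public $DF;                // Degrees of freedom
  public $Alpha;             // Alpha cutoff level
  public $ChiSqProb;         // Tail probability of the obtained value
  public $ChiSqCrit;         // Critical chi-square value at the alpha level

}

?>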

At the top of the script in Listing 3, a file named Distribution.php is included. The include path incorporates the PHP_MATH constant set in the init.php file, which is assumed to have been included by the calling script.

The included file Distribution.php contains methods that generate sampling distribution statistics for several commonly used sampling distributions (the t distribution, the F distribution, and the chi-square distribution). The ChiSquare1D.php class needs access to the chi-square methods in Distribution.php to calculate the tail probability of the obtained chi-square value.

The list of instance variables in this class is worth noting because they define the result object produced by the analysis procedure. This result object contains all the important details of the test, including three important chi-square statistics: ChiSqObt, ChiSqProb, and ChiSqCrit. To see how each instance variable is calculated, look at the class's constructor method, where all of these values originate.


Constructor: backbone of the chi-square test

Listing 4 shows the chi-square constructor code, which forms the backbone of the chi-square test.

Listing 4. The chi-square constructor


<?php

class ChiSquare1D {

  public function __construct($ObsFreq, $Alpha = 0.05, $ExpProb = false) {
    $this->ObsFreq   = $ObsFreq;
    $this->ExpProb   = $ExpProb;
    $this->Alpha     = $Alpha;
    $this->NumCells  = count($this->ObsFreq);
    $this->DF        = $this->NumCells - 1;
    $this->Total     = $this->getTotal();
    $this->ExpFreq   = $this->getExpFreq();
    $this->ChiSqObt  = $this->getChiSqObt();
    $this->ChiSqCrit = $this->getChiSqCrit();
    $this->ChiSqProb = $this->getChiSqProb();
  }

}

?>

Note four aspects of the constructor method:

1. The constructor accepts an array of observed frequencies, an alpha probability cutoff, and an optional array of expected probabilities.
2. The first six lines involve relatively simple assignments and recorded values, so that a complete result object is available to the calling script.
3. The last four lines do the heavy lifting of obtaining the chi-square statistics that interest you most.
4. The class implements only the chi-square test logic. No output methods are associated with it.

You can study the class methods included in the code download for this article to learn more about how each result object value is calculated (see references).

Handling output issues

The code in Listing 5 shows how easy it is to perform a chi-square analysis using the ChiSquare1D.php class. It also demonstrates how output issues are handled.

The script invokes a wrapper script named ChiSquare1D_HTML.php. The purpose of this wrapper script is to separate the logic of the chi-square procedure from its presentation. The _HTML suffix indicates that the output is intended for a standard Web browser or another device that displays HTML.

Another purpose of the wrapper script is to organize the output in a way that makes the data easy to understand. To that end, the class contains two methods for displaying the results of the chi-square analysis. The showTableSummary method displays the first output table shown after the code (Table 2), and the showChiSquareStats method displays the second output table (Table 3).

Listing 5. Using the wrapper script to organize data


<?php

// beer_poll_analysis.php

require_once "../init.php";

require_once PHP_MATH . "chi/ChiSquare1D_HTML.php";

$Headings = array("Keiths", "Olands", "Schoner", "Other");

$ObsFreq = array(285, 250, 215, 250);
$Alpha   = 0.05;
$Chi     = new ChiSquare1D_HTML($ObsFreq, $Alpha);

$Chi->showTableSummary($Headings);
// Blank space between the two output tables
echo "<br><br>";
$Chi->showChiSquareStats();

?>

The script generates the following output:

Table 2. Expected frequencies and variance obtained by running the wrapper script

            Keiths   Olands   Schoner   Other   Total
Observed    285      250      215       250
Expected    250      250      250       250
Variance    4.90     0.00     4.90      0.00    9.80

Table 3. Chi-square statistics obtained by running the wrapper script

              DF   Obtained   Probability   Critical
Chi-square    3    9.80       0.02          7.81

Table 2 shows the expected frequency and the variance measure, (O - E)^2 / E, for each cell. The sum of the variance values equals the obtained chi-square value (9.80), which appears in the lower right cell of the summary table.

Table 3 reports the various chi-square statistics. It includes the degrees of freedom used in the analysis and again reports the obtained chi-square value. The obtained chi-square value is re-expressed as a tail probability, in this case 0.02. This means that, under the null hypothesis, the probability of observing a chi-square value as extreme as 9.80 is 2% (a fairly low probability).

If you decide to reject the null hypothesis (that the results could have arisen from random sampling variability under the null distribution), most statisticians would not argue with you. Your poll results more likely reflect real differences in brand preference among Nova Scotia beer consumers.

To confirm this conclusion, you can compare the obtained chi-square value with the critical value.

Why is the critical value important? The critical value is based on a significance level (the alpha cutoff level) set for the analysis. By convention, the alpha cutoff is set to 0.05 (the value used in the analysis above). This setting is used to find the location (the critical value) in the chi-square sampling distribution at which the tail area equals the alpha cutoff (0.05).
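
To make that search concrete, the sketch below finds the critical value by bisection: it narrows an interval until the tail area at its midpoint matches alpha. It is an assumption-laden illustration, not the getChiSqCrit method from the code download. The helper chiSquareTail is a hypothetical stand-in that integrates the chi-square density numerically, it requires df of at least 2, and the upper search bound of 100 is an arbitrary choice.

```php
<?php

// Tail area to the right of $x: 1 minus a Simpson's rule integral of the
// chi-square density over [0, $x]. Assumes $df >= 2; $steps must be even.
function chiSquareTail(float $x, int $df, int $steps = 20000): float
{
    $half = $df / 2;
    // Gamma(df/2) from Gamma(1) = 1 or Gamma(1/2) = sqrt(pi)
    $g = ($df % 2 === 0) ? 1.0 : 0.5;
    $gamma = ($df % 2 === 0) ? 1.0 : sqrt(M_PI);
    for (; $g < $half; $g += 1.0) {
        $gamma *= $g;
    }
    $pdf = function (float $t) use ($half, $gamma): float {
        return pow($t, $half - 1) * exp(-$t / 2) / (pow(2, $half) * $gamma);
    };

    $h = $x / $steps;
    $sum = $pdf(0.0) + $pdf($x);
    for ($i = 1; $i < $steps; $i++) {
        $sum += (($i % 2 === 1) ? 4 : 2) * $pdf($i * $h);
    }
    return 1.0 - $sum * $h / 3;
}

// Find the chi-square value whose tail area equals $alpha.
// The tail area shrinks as x grows, so bisection applies.
function chiSquareCritical(float $alpha, int $df): float
{
    $lo = 0.0;
    $hi = 100.0; // arbitrary upper bound for the search
    for ($iter = 0; $iter < 40; $iter++) {
        $mid = ($lo + $hi) / 2;
        if (chiSquareTail($mid, $df) > $alpha) {
            $lo = $mid; // tail still too large: move right
        } else {
            $hi = $mid; // tail too small: move left
        }
    }
    return ($lo + $hi) / 2;
}

$crit = chiSquareCritical(0.05, 3);
var_dump(9.80 > $crit); // is the obtained value beyond the critical value?
```

For df = 3 and alpha = 0.05 the result should land close to the critical value of 7.81 reported in Table 3, so the comparison with the obtained value of 9.80 evaluates to true.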

In this analysis, the obtained chi-square value is greater than the critical value. This means that the threshold for retaining the null hypothesis has been exceeded. The alternative hypothesis, that the proportions differ in the population, is statistically more likely to be correct.

In automated analysis of data streams, the alpha cutoff setting can serve as an output filter for knowledge-discovery algorithms (for example, Chi-square Automatic Interaction Detection, or CHAID), which by themselves cannot give people detailed guidance in discovering real and useful patterns.


Conducting a re-poll
Another interesting application of the one-way chi-square test is to repeat a poll to see whether people's responses have changed.

Suppose that, after some time, you decide to run another Web poll of Nova Scotia beer consumers. Once again you ask about their preferred beer brand, and this time you observe the following results:

Table 4. New beer poll

Keiths          Olands          Schoner         Other
385 (27.50%)    350 (25.00%)    315 (22.50%)    350 (25.00%)


The old data are as follows:

Table 1. Old beer poll (shown again)

Keiths          Olands          Schoner         Other
285 (28.50%)    250 (25.00%)    215 (21.50%)    250 (25.00%)


The obvious difference between the two polls is that the first had 1,000 respondents and the second had 1,400. The main effect of the additional respondents is that the frequency count for each answer option increased by 100.

When you analyze the new poll, you can use the default method of calculating expected frequencies, or you can initialize the analysis with the expected probability of each outcome (based on the proportions observed in the previous poll). In the second case, you load the previously obtained proportions into the expected probability array ($ExpProb) and use them to calculate the expected frequency for each answer option.
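
The second approach can be sketched as follows. This is a standalone illustration rather than the getExpFreq method from the code download; the function names are hypothetical. Each expected frequency is the new poll's total multiplied by the proportion observed in the old poll.

```php
<?php

// Expected frequencies = total observations * expected probability per cell.
function expectedFrequencies(array $observed, array $expectedProb): array
{
    $total = array_sum($observed);
    return array_map(fn($p) => $total * $p, $expectedProb);
}

// Chi-square statistic from paired observed and expected frequencies.
function chiSquareFromExpected(array $observed, array $expected): float
{
    $chi = 0.0;
    foreach ($observed as $i => $o) {
        $chi += ($o - $expected[$i]) ** 2 / $expected[$i];
    }
    return $chi;
}

$observed = [385, 350, 315, 350];         // new poll counts
$expProb  = [0.285, 0.250, 0.215, 0.250]; // proportions from the old poll

// Approximately 399, 350, 301, and 350
$expected = expectedFrequencies($observed, $expProb);

printf("%.2f\n", chiSquareFromExpected($observed, $expected)); // prints 1.14
```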

Listing 6 shows the beer poll analysis code used to detect a change in preferences:


Listing 6. Detecting a change in preferences


<?php

// beer_repoll_analysis.php

require_once "../init.php";

require PHP_MATH . "chi/ChiSquare1D_HTML.php";

$Headings = array("Keiths", "Olands", "Schoner", "Other");

$ObsFreq = array(385, 350, 315, 350);
$Alpha   = 0.05;
// Expected probabilities: the proportions observed in the first poll
$ExpProb = array(.285, .250, .215, .250);

$Chi = new ChiSquare1D_HTML($ObsFreq, $Alpha, $ExpProb);

$Chi->showTableSummary($Headings);
// Blank space between the two output tables
echo "<br><br>";
$Chi->showChiSquareStats();

?>




Tables 5 and 6 show the HTML output generated by the beer_repoll_analysis.php script:

Table 5. Expected frequencies and variance from running beer_repoll_analysis.php
            Keiths   Olands   Schoner   Other   Total
Observed    385      350      315       350     1400
Expected    399      350      301       350     1400
Variance    0.49     0.00     0.65      0.00    1.14


Table 6. Chi-square statistics from running beer_repoll_analysis.php
              DF   Obtained   Probability   Critical
Chi-square    3    1.14       0.77          7.81

Table 6 shows that there is a 77% probability of obtaining a chi-square value of 1.14 under the null hypothesis. We cannot reject the null hypothesis that the preferences of Nova Scotia beer drinkers have not changed since the last poll. Any difference between the observed and expected frequencies can be accounted for as expected sampling variability from the same population of Nova Scotia beer drinkers. Given that the new poll results were produced simply by adding a constant of 100 to each of the previous poll results, this null finding should not be surprising.
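The obtained value in Table 6 can be checked by hand. The sketch below is a minimal illustration and does not use the ChiSquare1D class; the helper name chiSquareStatistic() is an assumption made for this example. It sums (observed - expected)^2 / expected over the four categories and compares the result with the critical value reported in Table 6.

```php
<?php
// Hypothetical helper: one-way chi-square statistic computed from
// observed and expected frequencies.
function chiSquareStatistic(array $obsFreq, array $expFreq): float
{
    $chiSq = 0.0;
    foreach ($obsFreq as $i => $obs) {
        // Each term is the per-category "variance" shown in Table 5.
        $chiSq += pow($obs - $expFreq[$i], 2) / $expFreq[$i];
    }
    return $chiSq;
}

$ObsFreq       = array(385, 350, 315, 350);   // Table 5, observed row
$ExpFreq       = array(399, 350, 301, 350);   // Table 5, expected row
$CriticalValue = 7.81;                        // df = 3, alpha = 0.05 (Table 6)

$ChiSqObt   = chiSquareStatistic($ObsFreq, $ExpFreq);
$RejectNull = $ChiSqObt > $CriticalValue;

printf("Chi-square obtained: %.2f\n", $ChiSqObt);   // prints 1.14
echo $RejectNull
    ? "Reject the null hypothesis\n"
    : "Fail to reject the null hypothesis\n";
```

Because 1.14 is well below 7.81, the script reports that the null hypothesis cannot be rejected, in line with Table 6.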

However, you can imagine that the results had changed, and that the change implied that another brand of beer was becoming more popular (note the size of the variance reported at the bottom of each column in Table 5). You can further imagine that such a finding would have significant financial implications for the breweries in question, because bar owners tend to stock the best-selling beers.

Results like these would be scrutinized heavily by the brewery owners, who would question the analytical procedure and the soundness of the experimental method; in particular, they would question the representativeness of the sample. If you plan to conduct a Web experiment that may have important practical implications, you need to pay as much attention to the experimental method used to collect the data as to the analytic technique used to draw inferences from it.

So this article has not only given you a basis for improving your understanding of Web data; it has also offered some advice on how to defend your choice of statistical test and lend more legitimacy to the conclusions you draw from the data.


Applying what you have learned

In this article, you have learned how to apply inferential statistics to the frequency data commonly used to summarize Web data streams, with a focus on the analysis of Web poll data. However, the simple one-way chi-square analysis procedure discussed here can also be applied to other types of data streams (access logs, survey results, customer profiles, and customer orders) to turn raw data into useful knowledge.

In applying inferential statistics to Web data, I also introduced the idea of viewing data streams as the results of Web experiments, which makes it easier to bring experimental design considerations to bear when drawing inferences. Often you cannot draw inferences because you lack sufficient control over the data collection process. However, this can change if you are more proactive in applying experimental design principles to your Web data collection (for example, by randomly selecting voters for your Web poll).
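One simple way to introduce random selection is to invite each visitor to the poll with a fixed probability instead of polling everyone who chooses to respond. The sketch below is a hypothetical illustration only: the function name inviteToPoll() and the 10% invitation rate are assumptions, and random invitation does not by itself guarantee a representative sample.

```php
<?php
// Hypothetical helper: decide at random whether to invite the current
// visitor to the poll. $inviteRate is the probability (0.0 to 1.0)
// that any given visitor is invited.
function inviteToPoll(float $inviteRate): bool
{
    // Uniform random number in [0, 1).
    $u = mt_rand() / (mt_getrandmax() + 1);
    return $u < $inviteRate;
}

// Assumed rate: invite roughly one visitor in ten.
if (inviteToPoll(0.10)) {
    echo "Show the poll form to this visitor\n";
}
```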

Finally, I demonstrated how to simulate the sampling distribution of the chi-square statistic for different degrees of freedom, rather than simply appealing to its origin. In doing so, I also demonstrated a workaround (using a small $NTrials value to simulate the sampling distribution of an experiment) for the prohibition against using the chi-square test when the expected frequencies of the measured categories are less than 5 (in other words, small-N experiments). So rather than using only the df of a study to compute the probability of the sample result, for small numbers of trials you may also need to use the $NTrials value as a parameter when finding the probability of an observed chi-square value.
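The small-N workaround can be sketched as a simple Monte Carlo simulation. This is a minimal illustration under stated assumptions, not the simulation code used earlier in the article: apart from $NTrials (taken here to mean the number of observations in each simulated experiment), the function names, the $NExps parameter, and the example values are all hypothetical.

```php
<?php
// Hypothetical sketch: estimate the probability of an observed chi-square
// value by simulating many small-N experiments under the null hypothesis.

// Draw one category index according to the expected probabilities.
function drawCategory(array $expProb): int
{
    $u   = mt_rand() / (mt_getrandmax() + 1);   // uniform in [0, 1)
    $cum = 0.0;
    foreach ($expProb as $i => $p) {
        $cum += $p;
        if ($u < $cum) {
            return $i;
        }
    }
    return count($expProb) - 1;   // guard against floating-point round-off
}

// Simulate $NExps experiments of $NTrials observations each and return the
// proportion whose chi-square is at least as large as $chiSqObt.
function simulatedChiSquareProb(array $expProb, int $NTrials, float $chiSqObt, int $NExps = 10000): float
{
    $k         = count($expProb);
    $asExtreme = 0;
    for ($e = 0; $e < $NExps; $e++) {
        // Tally $NTrials random responses into $k categories.
        $obs = array_fill(0, $k, 0);
        for ($t = 0; $t < $NTrials; $t++) {
            $obs[drawCategory($expProb)]++;
        }
        // Chi-square for this simulated experiment.
        $chiSq = 0.0;
        foreach ($expProb as $i => $p) {
            $exp    = $NTrials * $p;
            $chiSq += pow($obs[$i] - $exp, 2) / $exp;
        }
        if ($chiSq >= $chiSqObt) {
            $asExtreme++;
        }
    }
    return $asExtreme / $NExps;
}

mt_srand(42);   // fixed seed so repeated runs give the same estimate
$ExpProb = array(.285, .250, .215, .250);

// Assumed example: a 12-response poll with a hypothetical obtained value of 4.0.
$prob = simulatedChiSquareProb($ExpProb, 12, 4.0);
printf("Simulated probability for N = 12: %.3f\n", $prob);
```

With 12 responses, every expected frequency is below 5, so the estimate comes from the simulated distribution for that specific $NTrials value rather than from the theoretical chi-square distribution for df = 3.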

It is worth thinking about how you might analyze small-N experiments, because you will often want to analyze your data before data collection is complete: when each observation is expensive, when observations take a long time to obtain, or simply because you are curious. When attempting this level of Web data analysis, keep the following two questions in mind:

* Are you ever justified in making inferences under small-N conditions?
* Can simulation help you decide what inferences to draw in these circumstances?
