Big data analysis: an old hand's lessons for the rookies

Source: Internet
Author: User
Keywords: variables, effects, regressions

The author of this article: Wuyuchuan

The following are my deepest impressions from three years of doing all sorts of econometric and statistical analysis; perhaps they will be helpful to everyone. Of course, this is not an ABC tutorial, nor a detailed introduction to data analysis methods; it is only a "summary" and "experience". Because the work I have done is very miscellaneous, and I do not come from a statistics or mathematics background, this article has no main thread, only fragments, and the content reflects only my personal views. Many assertions come with no mathematical proof, and I hope the statistics and econometrics experts will correct me.

About Software

For me personally, the data analysis software I use includes Excel, SPSS, Stata, and EViews. In the early stage of analysis, Excel can be used for data cleaning, restructuring the data, and computing complex new variables (including logical calculations); in the later stage, its charting and tabulation functions are an irreplaceable tool for presenting attractive graphs. What needs to be said, though, is that Excel is just office software: its role is mostly confined to manipulating the data itself rather than complex statistical and econometric analysis, and when the sample size reaches the hundreds of thousands, Excel's speed can sometimes drive you crazy.

SPSS is a fool-proof, point-and-click statistics package that is good at handling cross-sectional data.

First, it is a professional statistics package and can cope with datasets at the hundred-thousand or even million observation level;

Second, it is a statistics package rather than a dedicated econometrics package. Its strength therefore lies in data cleaning, descriptive statistics, hypothesis testing (t, F, chi-square, homogeneity-of-variance, normality, reliability tests, etc.), multivariate statistical analysis (factor, cluster, discriminant, partial correlation analysis, etc.), and some commonly used econometric analyses (basically everything covered in introductory and intermediate econometrics textbooks can be done), but it is powerless for complicated, cutting-edge econometric analysis;

Third, SPSS is mainly used to analyze cross-sectional data and is weak at handling time-series and panel data;

Finally, SPSS supports both menu-driven and programmed operation and is a veritable fool-proof package.

Stata and EViews are my preferred econometric software. The former is fully programmable, while the latter supports both menus and programming. Although both can do simple descriptive statistics, they fall well short of SPSS in that respect. Both are econometrics packages, so advanced econometric analysis can be carried out in either. Stata's extensibility is excellent: you can find the command files (.ado files) you need on the Internet and keep extending its capabilities, whereas with EViews you can only wait for software upgrades. In addition, EViews is stronger at handling time-series data.

In summary, each piece of software has its own strengths and weaknesses, and which one to use depends on the data itself and on the analysis method. Excel is suitable for small samples, while SPSS, Stata, and EViews can handle larger samples. Excel and SPSS are suited to preparatory work such as data cleaning and computing new variables, where Stata and EViews are weaker. Use Excel for charts and tables; use SPSS for statistical analysis of cross-sectional data; simple econometric analysis can be done in SPSS, Stata, or EViews; use Stata or EViews for advanced econometric analysis; and use EViews for time-series analysis.

About causality

In doing statistics or econometrics, I think the hardest and most headache-inducing task is making causal judgments. If you have data on two variables, A and B, how do you know which variable is the cause (the independent variable) and which is the effect (the dependent variable)?

In the early days, causal inferences were made by observing surface-level links between causes and outcomes, such as constant conjunction and temporal order. However, it was gradually recognized that many phenomena that appear together or disappear together may not be causally related at all; they may instead be driven by a common cause or by other factors. From the perspective of induction, if B appears when A is present and does not appear when A is absent, then A is probably the cause of B, but it may also be that other unobserved factors are at work, so many cases should be compared in order to improve the reliability of the judgment.

There are two routes to the causality problem: the statistical solution and the scientific solution. The statistical solution mainly refers to using statistical and econometric regression methods on micro data, comparing samples that received an intervention with samples that did not on an effect indicator (the dependent variable). It must be emphasized that for cross-sectional data, the results of statistical analysis, whether by mean comparison, frequency analysis, analysis of variance, or correlation analysis, are only necessary, not sufficient, conditions for a causal relationship between intervention and effect. Likewise, a regression on cross-sectional data can at most establish a quantitative relationship between variables; which variable in the model is the dependent variable and which the independent variable is decided entirely by the analyst on other grounds and has nothing to do with the regression output. In short, regression does not imply that a causal relationship holds; judgments or inferences about causality must rest on relevant theory and be tested in practice. Although making causal judgments from cross-sectional data is a stretch, if the researcher has time-series data, causal judgment is still available; the most classical method is the Granger causality test. However, the conclusion of a Granger test is causality only in a statistical sense, not necessarily a genuine causal relationship, and the test makes heavier demands on the data (time series with many periods), so it cannot be used on cross-sectional data. To sum up, the results of statistical and econometric analysis can lend support to a genuine causal relationship, but they cannot serve as the final basis for affirming or denying one.
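To make the Granger test concrete, here is a minimal sketch in Python using statsmodels; the column names, the lag order, and the simulated series are illustrative assumptions, not a prescription.

```python
# A minimal sketch of a Granger causality test with statsmodels.
# The column names "y" and "x" and the lag order are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
x = rng.normal(size=200).cumsum()                # a simulated driver series
y = 0.5 * np.roll(x, 2) + rng.normal(size=200)   # y responds to x with a 2-period lag

df = pd.DataFrame({"y": y, "x": x}).iloc[5:]     # drop the wrap-around edge from np.roll

# Tests whether the SECOND column ("x") Granger-causes the FIRST ("y"),
# reporting F and chi-square statistics for every lag up to maxlag.
results = grangercausalitytests(df[["y", "x"]], maxlag=4)
```

A small p-value here only says that past values of x help predict y; as noted above, that is causality in a statistical sense, not proof of a real causal mechanism.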

The scientific solution mainly refers to experimental methods, including randomized experiments and quasi-experiments. When the effect of an intervention is evaluated experimentally, factors other than the intervention are controlled so that differences in the effect can be attributed to the intervention itself, which solves the problem of causal identification.

About experiments

In a randomized experiment, the sample is randomly split into two groups: one group receives the treatment condition (it enters the intervention group) and the other receives the control condition (it enters the control group); the average outcomes of the two groups are then compared. Random assignment makes the two groups of samples "homogeneous"; that is, group assignment, the intervention, and all of the samples' own attributes are mutually independent, so that the difference between the two groups on the effect indicator at the end of the study can be attributed to the net effect of the treatment. A randomized experimental design guarantees the similarity of the intervention and control groups to the greatest possible extent, and the conclusions obtained are more reliable and persuasive. But this approach is also controversial:

First, it is relatively difficult to implement and costly;

Second, in impact evaluations of an intervention, receipt of the intervention usually does not occur at random;

Third, in social science research, the practice of allocating subjects completely at random raises ethical and moral issues.

For the above reasons, quasi-experimental designs using non-random data are an alternative. The criterion that distinguishes a quasi-experiment from a randomized experiment is that the former does not randomly assign the sample.

When a quasi-experiment is used to evaluate the effect of an intervention, receipt of the intervention is not random but the result of deliberate selection, so with non-random data the difference in the effect indicator cannot simply be attributed to the intervention. Setting the intervention aside, the intervention and control groups may still differ in other factors that influence the effect indicator, and these can be confounded with the effect of the intervention on that indicator. To solve this problem, one can use statistical or econometric methods to control for possible influences other than the intervention, or use matching methods to adjust for the imbalance in sample attributes: for each sample in the intervention group, find a control-group sample that is identical in the other influencing factors and differs only in the intervention, thereby ensuring that those influencing factors are independent of group assignment.
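As an illustration of the matching idea, here is a minimal nearest-neighbour propensity score matching sketch in Python with scikit-learn; the variable names, the simulated data, and the one-to-one matching without a caliper are simplifying assumptions.

```python
# A minimal sketch of propensity score matching with scikit-learn.
# Column names ("treated", covariates x1/x2, outcome y) are illustrative assumptions;
# real applications would add a caliper, common-support checks, and balance diagnostics.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
# Treatment assignment depends on the covariates (i.e., it is NOT random).
p_treat = 1 / (1 + np.exp(-(0.8 * df["x1"] - 0.5 * df["x2"])))
df["treated"] = rng.binomial(1, p_treat)
df["y"] = 2.0 * df["treated"] + df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

# Step 1: estimate the propensity score P(treated | x1, x2).
ps_model = LogisticRegression().fit(df[["x1", "x2"]], df["treated"])
df["pscore"] = ps_model.predict_proba(df[["x1", "x2"]])[:, 1]

treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]

# Step 2: for each treated unit, find the control unit with the closest propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_control = control.iloc[idx.ravel()]

# Step 3: the average treatment effect on the treated is the mean outcome gap
# between treated units and their matched controls.
att = treated["y"].mean() - matched_control["y"].mean()
print(f"Estimated ATT: {att:.3f}")
```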

Randomized experiments require at least two periods of panel data and require that the sample be randomly assigned to the intervention and control groups; the analysis method is DID (difference-in-differences, or the double-difference method). Quasi-experimental analysis can also be done with cross-sectional data and does not require random assignment to the intervention and control groups; the analysis methods include DID (requires two periods of panel data), PSM (propensity score matching, requires one period of cross-sectional data), and PSM-DID (requires two periods of panel data). In terms of accuracy, randomized experiments are more accurate than quasi-experimental and non-experimental analyses.
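A minimal DID sketch in Python with statsmodels follows; the two-period panel, the variable names, and the simulated effect size are assumptions made for illustration only.

```python
# A minimal sketch of a difference-in-differences (DID) regression with statsmodels.
# The two-period panel, variable names, and true effect of 3.0 are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500  # units, each observed in period 0 (pre) and period 1 (post)
unit = np.repeat(np.arange(n), 2)
post = np.tile([0, 1], n)
treat = np.repeat(rng.binomial(1, 0.5, size=n), 2)   # assignment is fixed per unit
unit_effect = np.repeat(rng.normal(size=n), 2)       # unobserved unit heterogeneity

y = 1.0 + 2.0 * post + 0.5 * treat + 3.0 * treat * post + unit_effect + rng.normal(size=2 * n)
df = pd.DataFrame({"y": y, "treat": treat, "post": post, "unit": unit})

# The coefficient on treat:post is the DID estimate of the treatment effect.
model = smf.ols("y ~ treat * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]}  # cluster SEs by unit
)
print(model.summary())
```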

About the choice of analysis tools

If the causal relationship between variables has already been presupposed on the basis of theory or logic, then experimental methods are not needed. My principles for choosing analysis tools for non-experimental data are as follows.

① If the dependent variable is continuous and at least one independent variable is continuous, use multivariate linear regression;

② If the dependent variable is continuous and the independent variables are all categorical, use analysis of variance;

③ If the dependent variable is categorical and at least one independent variable is continuous, use a Logit or Probit model (see the sketch after this list);

④ If the dependent variable is categorical and the independent variables are all categorical, use cross-tabulation and the chi-square test;

⑤ If the dependent variable is distributed over a closed interval and many samples fall on the boundary of that interval, use a Tobit model;

⑥ If the dependent variable is not unique, as in multi-output problems, use data envelopment analysis (DEA);

⑦ If the dependent variable is an integer, takes small values, and contains many zeros, use a count model;

⑧ If the data have a hierarchical (nested) structure, use a hierarchical linear model (HLM).
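As a small illustration of rule ③, here is a Logit fit in Python with statsmodels; the variable names and the simulated data are assumptions.

```python
# A minimal sketch of a Logit model (rule ③: categorical dependent variable,
# at least one continuous independent variable). Variable names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 800
df = pd.DataFrame({
    "income": rng.normal(50, 10, size=n),      # continuous regressor
    "urban": rng.binomial(1, 0.4, size=n),     # categorical regressor
})
logit_p = -5 + 0.08 * df["income"] + 0.6 * df["urban"]
df["buys"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))  # binary dependent variable

logit_res = smf.logit("buys ~ income + urban", data=df).fit()
print(logit_res.summary())
# Marginal effects are usually easier to interpret than raw Logit coefficients.
print(logit_res.get_margeff().summary())
```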

With the development of statistics and econometrics, all sorts of cutting-edge analysis tools keep emerging, but in my view the most reliable tools are the following four: DID (for randomized experiments), multivariate linear regression, the fixed-effects variable-intercept model (FE, for panel data), and the Logit or Probit model (for categorical dependent variables).

Other methods either have harsh application conditions, or involve a torturous analysis process, or are themselves unreliable (cluster analysis and discriminant analysis, in particular, are extremely unreliable). So if the above four methods can answer the question, there is no need to "show off methods" and tinker blindly.
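For the fixed-effects variable-intercept model, here is a minimal sketch using the least-squares-dummy-variable form in statsmodels, which is equivalent to the within estimator when the number of units is modest; the panel structure, variable names, and simulated data are assumptions.

```python
# A minimal sketch of a fixed-effects (variable-intercept) panel regression.
# FE is implemented here as least-squares dummy variables (C(firm)), which is
# equivalent to the within estimator; names and simulated data are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_firms, n_years = 100, 5
firm = np.repeat(np.arange(n_firms), n_years)
firm_intercept = np.repeat(rng.normal(size=n_firms), n_years)  # time-invariant heterogeneity

x = rng.normal(size=n_firms * n_years) + 0.5 * firm_intercept  # x correlated with the firm effect
y = 1.5 * x + firm_intercept + rng.normal(size=n_firms * n_years)
df = pd.DataFrame({"y": y, "x": x, "firm": firm})

# Pooled OLS is biased here because x is correlated with the omitted firm effect;
# adding firm dummies (fixed effects) absorbs that heterogeneity.
pooled = smf.ols("y ~ x", data=df).fit()
fe = smf.ols("y ~ x + C(firm)", data=df).fit()
print("pooled slope:", round(pooled.params["x"], 3), " FE slope:", round(fe.params["x"], 3))
```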

On goodness of fit, the principles of variable selection, and the significance of the absolute size of coefficient estimates

In everyone's "data analysis" station, a classmate put forward such a question: "Multivariate regression analysis, how to select the independent variable and the dependent variable, you can make R side to 80% or more?" ”

Obviously, the student who asked this either did not learn econometrics well, or is making a utilitarian mistake, or both. Goodness of fit depends largely on the nature of the data itself. With time-series data, regressing any variables that are even slightly correlated can push R-squared above 80%, but such a high R-squared says nothing; it is quite likely to lead the analyst into the trap of spurious regression, and rigorous practice of course calls for stationarity and cointegration tests. With cross-sectional data, there is no need to push R-squared toward 80%; in general, 20% or 30% is already quite good.
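Here is a minimal sketch of the stationarity and cointegration checks just mentioned, using statsmodels (an ADF test and an Engle-Granger cointegration test); the simulated series and names are assumptions.

```python
# A minimal sketch of a stationarity (ADF) test and an Engle-Granger cointegration
# test with statsmodels -- the checks that guard against spurious regression.
# The simulated series and names are illustrative assumptions.
import numpy as np
from statsmodels.tsa.stattools import adfuller, coint

rng = np.random.default_rng(5)
n = 300
x = rng.normal(size=n).cumsum()            # a random walk: non-stationary
y_spurious = rng.normal(size=n).cumsum()   # an unrelated random walk
y_coint = 2.0 * x + rng.normal(size=n)     # shares a stochastic trend with x

adf_stat, adf_p, *_ = adfuller(x)
print(f"ADF p-value for x: {adf_p:.3f}")   # large p-value: cannot reject a unit root

# Engle-Granger test: a small p-value suggests the two series are cointegrated,
# so regressing one on the other is not spurious.
for name, series in [("unrelated walk", y_spurious), ("cointegrated series", y_coint)]:
    _, p_value, _ = coint(series, x)
    print(f"cointegration p-value ({name}): {p_value:.3f}")
```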

If you must increase R-squared, the most important thing is to choose well which variables enter the model. I think there are three rules for selecting variables for inclusion.

First, any variable that theoretically or logically can affect the dependent variable must be included in the model, even if its regression coefficient turns out to be insignificant.

Second, Occam's razor: do not multiply entities beyond necessity. That is, variables that theoretically or logically cannot affect the dependent variable should not be included in the model, even if their regression coefficients turn out to be significant.

Third, avoid including independent variables that are multicollinear with one another.
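To check the third rule, a minimal variance inflation factor (VIF) sketch with statsmodels; the covariate names are assumptions, and the VIF > 10 threshold in the comment is only a common rule of thumb.

```python
# A minimal sketch of a multicollinearity check using variance inflation factors (VIF).
# Column names are illustrative; a common rule of thumb treats VIF > 10 as problematic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1: strong collinearity
x3 = rng.normal(size=n)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    if col == "const":
        continue   # the constant's VIF is not informative
    print(f"VIF({col}) = {variance_inflation_factor(X.values, i):.1f}")
```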

As I said before, for cross-sectional data it is already an achievement for an econometric analysis to reach an R-squared of 20% or 30%. However, if goodness of fit (or a similar index) is only around 20% or 30%, or lower, the regression coefficients have only qualitative or ordinal meaning; stressing the absolute size of their values has little significance. For example, suppose the regression ln Y = a·ln A + b·ln B + … + z·ln Z + c has an R-squared of 20%, with a estimated at 0.375 and b at 0.224, and both pass the t-test. Then we can say that A and B both affect Y, and that a one-percent change in A has a larger effect on Y than a one-percent change in B (other factors controlled), but it is meaningless to say that the effect on Y of a one-percent change in A exceeds that of a one-percent change in B by 0.151 percentage points.

Some other suggestions and advice

Think carefully about the causal relationship between variables: does A affect B, or does B affect A? Is there really a causal relationship between A and B at all? Is there some C that affects both A and B, while A and B themselves are not directly related?

Choose the independent variables carefully and do not omit important variables; otherwise you will create endogeneity problems. If you run into an endogeneity problem, do not rush to look for instrumental variables or to use 2SLS; finding the omitted variable is the most important thing. If, despite every effort, the omitted variable cannot be found and included in the analysis, and you suddenly think of an excellent instrumental variable, then congratulations: you can publish in a core journal!
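For the endogeneity point, here is a minimal two-stage least squares sketch written out stage by stage with statsmodels; the instrument z, the variable names, and the simulated data are assumptions, and the hand-computed second-stage standard errors are not valid 2SLS standard errors, so a dedicated IV routine should be used for real inference.

```python
# A minimal sketch of two-stage least squares (2SLS), written out stage by stage.
# The instrument "z" and the variable names are illustrative assumptions.
# NOTE: the second-stage standard errors below are NOT correct 2SLS standard errors;
# use a dedicated IV routine for inference in real work.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1000
u = rng.normal(size=n)                      # omitted factor
z = rng.normal(size=n)                      # instrument: affects x but not y directly
x = 0.9 * z + 0.8 * u + rng.normal(size=n)  # endogenous regressor (correlated with u)
y = 1.0 + 2.0 * x + 1.5 * u + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "z": z})

# OLS is biased because x is correlated with the omitted factor u.
print("OLS slope:", round(smf.ols("y ~ x", data=df).fit().params["x"], 3))

# Stage 1: regress the endogenous x on the instrument z.
df["x_hat"] = smf.ols("x ~ z", data=df).fit().fittedvalues
# Stage 2: regress y on the fitted values from stage 1.
print("2SLS slope:", round(smf.ols("y ~ x_hat", data=df).fit().params["x_hat"], 3))
```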

It is important to control for other factors that may affect the dependent variable, and to recognize that the interpretation of regression coefficients and of partial correlation analysis rests on "other things being equal".

When you see a large R-squared, do not be too quick to celebrate: if the F-test is significant but the t-tests are not, there may be multicollinearity. When you see a large t-value, do not be too quick to celebrate either, because it may be the product of a spurious regression; and if the DW value is very small (below 0.5), the likelihood of spurious regression becomes larger.
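A minimal sketch of the DW check against spurious regression, using the Durbin-Watson statistic from statsmodels on the residuals; the two simulated random walks are assumptions.

```python
# A minimal sketch of using the Durbin-Watson statistic to flag a spurious regression:
# two independent random walks regressed on each other typically give a "significant"
# t-value together with a DW value far below 2. The simulated series are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
n = 300
df = pd.DataFrame({
    "y": rng.normal(size=n).cumsum(),   # random walk, unrelated to x
    "x": rng.normal(size=n).cumsum(),   # another independent random walk
})

res = smf.ols("y ~ x", data=df).fit()
print("t-value on x:", round(res.tvalues["x"], 2))
print("Durbin-Watson:", round(durbin_watson(res.resid), 2))  # values near 0 are a red flag
```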

Mean comparison, though simple, tests the analyst's rigor. Do two seemingly different means, medians, or proportions really differ? Were the samples drawn from independent populations or from related populations? Are the variances "equal" or "unequal"? Are you comparing differences in means, in medians, or in proportions?
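The checks in the previous paragraph can be made concrete with scipy; a minimal sketch follows, where the two samples and the assumption that they are independent are illustrative.

```python
# A minimal sketch of the decisions behind a mean comparison with scipy:
# check variance equality first, then pick the matching two-sample t-test.
# The simulated samples and the independence assumption are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
a = rng.normal(loc=10.0, scale=1.0, size=60)
b = rng.normal(loc=10.4, scale=2.0, size=60)

# Levene's test for equal variances.
lev_stat, lev_p = stats.levene(a, b)
equal_var = lev_p > 0.05

# Independent samples: Student's t if variances look equal, Welch's t otherwise.
t_stat, t_p = stats.ttest_ind(a, b, equal_var=equal_var)
print(f"equal variances assumed: {equal_var}, t = {t_stat:.2f}, p = {t_p:.3f}")

# If the samples were paired (related populations), the paired test would apply instead,
# e.g. stats.ttest_rel(before, after) with hypothetical paired arrays "before"/"after".
```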

Sample size limits what analysis can be done, so cherish degrees of freedom when the sample is small. Do not use data with fewer than 30 observations for econometric analysis (especially time-series analysis) or complex statistical analysis. Do not assume you can see a "development trend" from five or fewer data points. Do not use complex models and analysis methods without justification, and do not deliberately complicate questions that are simple at a glance.

Most importantly, do not fake it! Do not fake the data themselves, and do not fake the analysis results! You can clean the data and remove outliers before analysis, and you can try to discuss and interpret unexpected analysis results, but if you alter the data in order to change the results, what is the point of doing data analysis at all? You might as well just make up the report. Some "weird" and counterintuitive analysis results may well turn out to be the most important findings.
