The impact of big data characteristics on statistical applications

Abstract: Following the earlier installments of this series, Medical big data (part one) and Medical big data (part two), this article discusses how the unique characteristics of big data affect statistical applications: they challenge traditional statistical methods and stimulate the development of new statistical methods suited to big data analysis.

This article follows the earlier installments, Medical big data (part one) and Medical big data (part two).

The impact of big data characteristics on statistical applications

The uniqueness of big data challenges traditional statistical methods and stimulates the development of new statistical methods suited to big data analysis. Some of the issues and problems discussed in this article reflect the author's own views; others are drawn from other articles (Fan, Han, & Liu, 2014; Wang & Wang, 2014).

Compared with the formal style of statistical papers, the author tries to introduce these problems in a more accessible way, so that general readers can gain some understanding of and interest in the topic. In traditional data, the number of records is generally much larger than the number of factors of interest. For example, a dataset may have 200 records on whether individuals have cardiovascular disease, which may be related to factors such as sex, age, and blood pressure. There are only a handful of factors, say 4, but the sample size is 200, and 200 >> 4.

Big data, however, has both a huge number of samples and a large number of factors. Take the cardiovascular example again: we may now have tens of thousands of records, but at the same time hundreds of factors, including many that could not be collected before, such as whether and how much the person exercises, the type of exercise, diet and its content, whether the person drinks, what kind of alcohol, drinking habits, and so on. This gives the study and application of statistics on such data new opportunities, but also new challenges.

Data heterogeneity

Data heterogeneity can be understood simply as a large sample that contains many small sub-samples, each with its own data characteristics: the averages of the small samples may be higher or lower, their dispersion denser or sparser, much as the ocean contains currents of different temperatures and densities. We cannot simply do statistical analysis at the level of the large sample; doing so produces deviations when the results are used to estimate or predict individuals in a particular sub-sample, because each sub-sample may have unique characteristics of its own.

When the overall dataset is small, the sub-samples are even smaller. A sub-sample may then contain only one or two records, which can only be treated as outliers and cannot be analyzed. In big data, records sharing such unique characteristics accumulate in sufficient numbers to support statistical analysis, allowing us to better explore the relevance of specific factors and understand the heterogeneity in the data. For example, some extremely rare diseases occur only in certain groups of people; big data allows us to study their causes and risk factors, and to understand why a therapy works well for some people while the same approach harms others, and so on.
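
As a rough illustration with made-up numbers, the Python sketch below pools a large majority group with a small subgroup whose characteristics are very different; the pooled average describes the majority well but badly misrepresents the small subgroup, which is exactly the heterogeneity problem described above:

    # Hypothetical example: a pooled estimate hides a small, very different subgroup.
    import numpy as np

    rng = np.random.default_rng(0)

    majority = rng.normal(loc=120, scale=10, size=9900)   # e.g. blood pressure of the majority
    minority = rng.normal(loc=160, scale=10, size=100)    # a rare subgroup with its own profile
    pooled = np.concatenate([majority, minority])

    print(f"pooled mean of all 10,000 records: {pooled.mean():.1f}")   # ~120, driven by the majority
    print(f"mean of the small subgroup:        {minority.mean():.1f}") # ~160, far from the pooled estimate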

Similarly, because big data contains a huge number of samples and a huge number of factors, the complexity of the information increases greatly, and this complexity can lead to statistical overfitting. Overfitting means that we have built a complex statistical model that describes the existing data very well, but performs rather poorly when we apply it to predict new data. As shown in Figure 9:

Figure 9

The curve on the left of Figure 9 is a model for the blue dots (the existing data); it roughly describes their distribution, and the fit between the curve and the blue dots is good. When this curve is used to describe the yellow dots (the new data), the fit is also good. The curve on the right of Figure 9 passes exactly through every blue point, achieving an extremely close fit and fully capturing the complex characteristics of the blue dots. However, when it is used to describe the yellow points, the fit is much worse and the deviation is much larger than for the curve on the left. Put simply, the more complex the data and the more factors that must be considered, the harder it is to build a statistical model that is both general and effective.
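
The same contrast can be sketched with synthetic data: a simple curve and an overly flexible curve are fitted to the same "existing" points and then evaluated on "new" points drawn from the same underlying pattern. The polynomial degrees and noise level below are arbitrary illustrative choices; typically the flexible curve matches the old points almost perfectly yet predicts the new points worse:

    # Overfitting sketch: compare a simple and an overly flexible polynomial fit.
    import numpy as np

    rng = np.random.default_rng(1)

    def make_points(n):
        x = np.sort(rng.uniform(0, 3, n))
        y = np.sin(x) + rng.normal(scale=0.2, size=n)   # underlying pattern plus noise
        return x, y

    x_old, y_old = make_points(15)   # the "blue" existing data
    x_new, y_new = make_points(15)   # the "yellow" new data

    for degree in (2, 10):           # simple curve vs overly flexible curve
        coeffs = np.polyfit(x_old, y_old, degree)
        err_old = np.mean((np.polyval(coeffs, x_old) - y_old) ** 2)
        err_new = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
        print(f"degree {degree:2d}: error on existing data {err_old:.3f}, on new data {err_new:.3f}")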

Bias accumulation

When analyzing data, we need to estimate or test many parameters in order to build a reliable statistical model. Deviations in these estimates are inevitable, and they are largely influenced by the size of the data and the number of parameters. With small data this problem may not be significant, but with big data it becomes quite noteworthy. We use a simplified example to illustrate.

Let's say we have two sets of data, Group A and Group B. The Group A data are collected without bias: all sample values are exactly 1000. The true value of every Group B sample is also 1000, but the collection process introduces a deviation, and the deviation grows exponentially with the number of samples collected (exponential growth is an extreme example chosen purely for illustration). For the n-th record, the deviation factor is 1.001^n, so the n-th value recorded in Group B is 1000 × 1.001^n.

So the first record in Group B carries a deviation factor of 1.001^1 = 1.001, and its value is 1000 × 1.001 = 1001. The second record carries a deviation factor of 1.001^2 = 1.002001, and its value is 1000 × 1.002001 = 1002.001. The tenth value of Group B is 1000 × 1.001^10 = 1000 × 1.01004512 = 1010.045. So with small data, n = 10, Group A and Group B do not actually differ much: the deviation of each value in Group B is not enough to attract attention if deviations within 2% are acceptable.

However, when we have collected 10,000 data records, the situation changes a great deal. The deviations of the last 10 records are already quite substantial.

Over the full data sample, the difference between Group A and Group B has already reached 108,000. Figure 10 shows how the deviation grows as the sample size increases. Up to a sample size of around 4,236, the growth of the deviation is not obvious; beyond 4,236, the deviation increases dramatically.

Figure 10

Judging from this, when the sample size is around 4,000 or less, a comparison of Group A and Group B may show only a small difference, but when the sample size is well above 4,000, the comparison may show a very large difference. This example illustrates that big data makes accumulated deviations far easier to spot than small data does, helping us discover problems in the data collection process and improve it.
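
The calculation above can be sketched directly using the stated 0.1% per-record compounding rate. The exact crossover figures quoted above depend on the rate and sample size assumed, so this is meant only to show the exponential pattern:

    # Bias accumulation sketch: Group A is unbiased, Group B compounds a 0.1% deviation per record.
    import numpy as np

    n = np.arange(1, 10001)               # record index 1..10,000
    group_a = np.full(n.shape, 1000.0)    # unbiased: every value is exactly 1000
    group_b = 1000.0 * 1.001 ** n         # n-th value carries a deviation factor of 1.001**n

    for k in (1, 2, 10, 100, 1000, 10000):
        print(f"record {k:>5}: A = {group_a[k - 1]:.1f}, B = {group_b[k - 1]:.3f}")
    # With 10 records the deviation stays around 1%, but over thousands of records it explodes.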

Spurious correlation

We explain spurious correlation with an example, again using the cardiovascular data mentioned above. Suppose only 200 records have been collected, but each record contains information on 100 factors. We want to know whether each of these 100 factors is related to "whether the person has cardiovascular disease". So we carry out 100 hypothesis tests: test whether cardiovascular disease is associated with factor 1, whether it is associated with factor 2, ... , and whether it is associated with factor 100.

Each test result has only two possible outcomes: statistically significant or statistically insignificant.

Statistically significant means, simply speaking, that cardiovascular disease is considered to be associated with that factor; statistically insignificant means it is considered unrelated. In this process you may find that about 5 of the factors judged statistically to be associated with cardiovascular disease have, by common sense and in reality, no connection with it at all; in other words, the statistical judgment is wrong. This is spurious correlation.

To understand what it is and why it happens, we need to define "statistically significant". In general, when testing, we set a value called the type I error rate. This error rate is usually set to 5%; that is, out of every 100 tests we allow 5 cases in which a truly insignificant result is wrongly judged to be statistically significant. (If no statistical error rate were allowed, the correct rate would have to be 100%, which means there would be no uncertainty at all; with such data there would be no need for any statistical hypothesis test.)

In other words, even if no true correlation exists, we allow about 5 errors in 100 hypothesis tests. This is the source of the spurious correlations in the example above. Faced with huge amounts of data and very high-dimensional factors, running a large number of tests on the same data will inevitably produce spurious correlations. How to deal with this problem is still an active topic of statistical research.
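
A quick way to see the effect is to simulate it. The sketch below uses a two-sample t-test per factor purely as a stand-in for whichever test one might actually choose: it generates 100 factors that have nothing to do with a randomly assigned "disease" label, tests each at the 5% level, and on average about 5 of them still come out "statistically significant":

    # Spurious correlation sketch: unrelated factors still get flagged at the 5% error rate.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    n_records, n_factors = 200, 100
    disease = rng.integers(0, 2, size=n_records)        # cardiovascular disease yes/no, assigned at random
    factors = rng.normal(size=(n_records, n_factors))   # 100 factors, none truly related to disease

    flagged = 0
    for j in range(n_factors):
        # compare factor j between the disease and no-disease groups
        _, p = stats.ttest_ind(factors[disease == 1, j], factors[disease == 0, j])
        if p < 0.05:
            flagged += 1

    print(f"factors flagged as significant purely by chance: {flagged} / {n_factors}")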

Meaningless significance

There is another situation, which we call meaningless significance (Lin, Lucas, & Shmueli, 2013). Suppose we compare two sets of data. If Group A and Group B each have only 1,000 records and we test whether the two group means are the same, the result may tell us it is statistically insignificant; in other words, there is no statistically significant difference between the two means. But when each group has tens of millions of records, the test tells us the result is statistically significant.

What is going on here? Let us go back to the source: why do we need a statistical test to compare two sets of data at all? Could we not simply compare the two averages directly? No, because what we really want is for the comparison to reflect the full population, that is, 100% of the data. A simple, isolated comparison of the averages of the 1,000 records in each group only tells us which sample average is larger; that conclusion cannot be extended to the full population.

But are these two sets of data equivalent to the full population? Of course not; even a huge amount of data is not 100% of the population. Consequently, the statistical indices we compute from the two samples will deviate from the corresponding indices of the full population. This deviation generally has a lower and an upper limit, which we call a confidence interval: the true population index falls within a certain range (the confidence interval) to the left or right of the sample index.

What we actually want to know is whether the population of Group A and the population of Group B have the same mean; in other words, whether the population mean of Group A minus the population mean of Group B equals zero: μ_A − μ_B = 0. But we only have the sample mean of Group A and the sample mean of Group B, denoted x̄_A and x̄_B, so what we can examine is whether the difference of the sample means equals zero: x̄_A − x̄_B = 0. We already know, however, that because of sampling variability the difference between the sample means is not necessarily zero, and that this difference has a certain confidence interval.

So more precisely, what we actually check is whether 0 falls within the confidence interval of the sample difference (the upper and lower bounds of the confidence interval are tied to the type I error rate mentioned above, and the same 5% concept is involved; we will not go into detail here). Whether 0 falls inside the confidence interval determines whether the result is statistically significant or not: if it falls inside, we say the result is statistically insignificant and the two group means are the same; if it falls outside, we say the result is statistically significant and the two group means are different. As shown in Figure 11:

Figure 11

So why would the result differ when the sample size is 1,000 versus tens of millions? The answer lies in the relationship between sample size and the confidence interval. As the sample size increases, the sample difference approaches the true population difference (which is not necessarily 0), the uncertainty decreases, and the confidence interval shortens; in effect, the estimate of the difference becomes more and more precise. In that case, even if the sample difference is a number very close to 0 (so that we would all consider the two group means the same), the shrinking confidence interval means that 0 can still fall outside it (the lower panel, 2, of Figure 11).

As a result, the test will report statistical significance: the means of the two groups are different. Applying existing statistical methods to big data can produce this kind of misleading message, because traditional statistical methods were designed for small data, at a time when no one faced, or imagined, data volumes this large. How to solve the problems caused by such data characteristics is a question we are still working on.
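
The effect can be reproduced from fixed, hypothetical summary statistics rather than real data: the same practically negligible difference of 0.02 between the group means is nowhere near significant with 1,000 records per group, yet highly significant with 10 million, because the standard error, and with it the confidence interval, shrinks as the sample grows:

    # Meaningless significance sketch: identical tiny mean difference, very different p-values.
    from scipy import stats

    mean_a, mean_b, sd = 100.00, 100.02, 10.0   # hypothetical summary statistics

    for n in (1_000, 10_000_000):
        t, p = stats.ttest_ind_from_stats(mean_a, sd, n, mean_b, sd, n)
        print(f"n = {n:>10,} per group: t = {t:+.3f}, p-value = {p:.2g}")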

Herding effect

In the big data age, our society increasingly digitizes and aggregates individual opinions and relies on them to make decisions (for example, recommending products or services based on collected ratings). This phenomenon is becoming more and more common in the medical field. Many assistive medical applications on mobile platforms carry user ratings, and people decide whether to use them based on those ratings. Services offered by some medical network platforms, such as online consultations, can also be rated by the users who receive the service, and those ratings in turn influence the decision of whether to choose a particular doctor for consultation.

One of the key requirements for using this "wisdom of the crowd" is the independence of individual opinions. In the real world, however, the collective opinion we gather is rarely composed of truly independent individual opinions. Recent experimental studies have shown that previously posted opinions distort subsequent individuals' decisions and their perception of quality and value. This highlights a fundamental difference: the gap between the value we perceive from the collective opinion and the intrinsic value of the product itself.

The reason for this gap is the "herding effect". Simply described, the herding effect is the individual's tendency to conform in mentality and behavior. A flock of sheep is a very loose organization; left alone, the sheep bump around aimlessly, but once one sheep moves, the others rush after it without hesitation, paying no attention to the possibility of a wolf ahead or better grass not far away. The "herding effect" is therefore a metaphor for conformity: it easily leads to blind following, and blind following tends to produce cognitive and decision-making biases.

Researchers at the IBM Watson Research Center (Wang & Wang, 2014) used a large vertical customer-rating dataset (Amazon) and built statistical models to demonstrate that ratings and opinions are not generated by independent, homogeneous processes; instead, existing ratings create an environment that influences subsequent ratings and opinions. The "herd effect" embodied in this socialized customer-rating system is characterized by high scores tending to produce new high scores while suppressing the appearance of low scores.
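
A toy simulation (not the IBM Watson model) conveys the mechanism: suppose each new rater blends a private assessment of the product with the average of the ratings already posted (the blending weight below is an arbitrary assumption). Seeding the sequence with a batch of early 5-star ratings then drags the whole rating trajectory above the product's intrinsic quality:

    # Toy herding sketch: later ratings are pulled toward the ratings already visible.
    import numpy as np

    rng = np.random.default_rng(4)

    def average_rating(intrinsic=3.5, n_raters=2000, weight=0.4, seed_scores=()):
        scores = list(seed_scores)                    # e.g. early promotional 5-star ratings
        for _ in range(n_raters):
            private = rng.normal(intrinsic, 1.0)      # the rater's own perception of quality
            social = np.mean(scores) if scores else private
            score = (1 - weight) * private + weight * social   # herding pull toward prior ratings
            scores.append(np.clip(score, 1, 5))
        return np.mean(scores)

    print(f"intrinsic quality:                3.5")
    print(f"average rating, no seeding:       {average_rating():.2f}")
    print(f"average after fifty 5-star seeds: {average_rating(seed_scores=[5.0] * 50):.2f}")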

The next question is: if we could strip away the herd effect, what would the product's real quality rating be? The statistical model built at the IBM Watson Research Center can partly answer this question. For four categories of Amazon product data (books, electronics, film and television, and music), they estimated each product's intrinsic score (with the "herd effect" removed) and compared it with the observed score (with the "herd effect" present). In all four categories, more than 50% of the products had scores that differed by more than 0.5. This difference shows a significant gap between the perception we obtain from the collective score and the real value of the product.

Going one step further: given a product's current rating, if we apply some artificial manipulation, how will the "herd effect" shape future ratings? Such predictive analysis is valuable in a number of areas, including market revenue estimation, advertising budgeting, and fraud detection. For example, a market analyst may want to estimate the long-term impact of a short-term burst of high scores before deciding whether to promote a product.

The research center simulated, for two types of products (film and television, and music), the insertion of fifty 5-star ratings. The prediction was that, although both products would experience a similar short-term boost in popularity, the promotion would have a more lasting effect on the film and television product in the long run (its elevated score would decay more slowly). This provides valuable information for marketing decisions.

The "herd effect" in such large data can be eliminated by appropriate statistical methods and used to produce more valuable information for decision analysis.

The examples above fully illustrate that in the big data age the participation of statistical professionals is essential, even though database operations require the contribution of computer professionals. Data management and analysis is not just extraction, retrieval, and simple summarization. The complexity of the data itself makes the analysis process full of traps and pitfalls. Without a foundation of statistical theory, analyses will be biased or the data will be used inefficiently. Learning to understand the statistical nature of data on top of computer algorithms, and combining algorithms with statistical analysis, is a major direction for future big data analysis.

Conclusions and Prospects

This series has given a brief description of what big data is, selectively described some of its characteristics, and discussed medical data and its status in the North American medical system, showing that big data analysis will have a great impact on the health care sector. Big data management and analysis of clinical and other data repositories can yield unprecedented insights and support smarter decisions.

In the near future, applications of big data analysis will emerge rapidly and widely throughout the health care sector. The data management framework and statistical analysis issues described in this series show that the effective application of big data is a systematic undertaking, requiring a range of professional skills to ensure the success of big data analysis, including processing, integrating, and analyzing complex data and helping clients fully understand the results. These call for a broad set of professional skills and qualities, including:

  • Computer science / data engineering expertise: a solid computer science foundation and the ability to use big data infrastructure.

  • Analysis and modeling capability: the ability to quickly analyze data and build effective statistical models based on an understanding of the data. This requires not only solid statistics but also sharp thinking and insight.

  • Curiosity and creative thinking: a passion for data and a talent for thinking broadly and sensitively to uncover problems. Some organizations look for talent by seeing who can brainstorm with data.

  • Outstanding communication skills: the ability to integrate data and analysis results into reports and explain them clearly in non-technical language, helping clients or the public understand the results and make decisions.

Of course, it is hard to find one person with all of the skills above, but working together to build an efficient big data team is the right direction. In an era when big data analysis is becoming mainstream, seizing this opportunity is how to stand out or move a step ahead.
