Big data in the eyes of a liberal arts professor: many, fast, rough, and resource-hungry

Source: Internet
Author: User

In today's internet slang, I am a "liberal arts man." Mo Yan said recently, in accepting the Nobel Prize, that literature is not science and that literature is useless. Let me clarify that literature is not the same thing as the liberal arts: the liberal arts are broader and can be divided further into the humanities and the social sciences. Social science research has always dealt with data. Admittedly it used to be small data: few in number, slow to collect, and time-consuming, but of good quality and economical with resources, very much in line with today's green ideals. Drawing on my years of experience with small data, I would like to offer some views on big data, views that are fairly widely shared in the social sciences. Readers may come to agree that the liberal arts (or at least the social sciences), like the sciences, can do something useful.

Big data is all the rage right now. I did a quick count: journals indexed in SCI and SSCI have published some 270 research papers on big data, most of them in the last year or two. The largest share comes from computer science and engineering (27%), followed by medicine and biochemistry (20%) and basic research in mathematics and physics (11%); the smallest shares belong to business administration (8%) and the social sciences (7%). My own work falls within that last 15%.

I was fortunate to be invited to join the recently established Task Force on Big Data of the China Computer Federation, and I took part in the committee's selection of hot issues and development trends in big data research. As I understand it, the committee's recent list of eight hot issues and ten trends in big data research is among the most systematic statements of their kind anywhere so far. Institutions, companies, and academic groups in the United States and Europe have of course offered many insightful and brilliant views, but in terms of comprehensiveness this may be the first document of its kind.

The concept of big data has attracted plenty of criticism along with the attention. Many readers may have seen the story in which Sybase's chief technology officer, Irfan Khan, calls big data "a big lie." Sybase is a database company that has long sold BI (business intelligence) tools. Having worked with data for years, they feel that everything in big data already existed and that nothing about it is new. In my view this goes a bit too far and is somewhat exaggerated. Of course, those of us who do empirical research do not fully agree with him either: parts of big data are hyped and somewhat overblown, but not to the point of being an outright lie.

What is big data? The most popular definition is the four Vs: Volume, Velocity, Variety, and Value. These four Vs correspond roughly to the Chinese slogan "many, fast, good, and economical" (duo, kuai, hao, sheng). Of the four, some are achievable in theory, some are already realized in practice, and some remain more aspiration than reality. Is big data really many, fast, good, and economical? Let me take them in turn.

"Many" of large data

Big data is, first of all, about the quantity of data. That much seems obvious and uncontroversial. Not quite. The key question is whether we are using population (complete) data, sample data, or partial data. What is population data? The most intuitive example is the census: every ten years, China and many other countries count their entire populations, and the result is population data. China's most recent census, in 2010, counted 1.38 billion people. Sample data is also easy to understand: it is obtained by sampling. For population statistics, for example, in addition to the decennial census, the national population statistics agency conducts a sample survey of about two per thousand of the population each year and uses the sample data to estimate how the Chinese population changes between censuses. Partial (convenience) data is also a subset of the population, but it is not drawn from the population by random methods; it is whatever happens to be available or convenient to collect. Partial data is often far larger than sample data, but the two are strictly different things.

So much for common sense; now consider big data. In theory, big data should mean population data. In practice, for technical, commercial, confidentiality, and other reasons, only a handful of data owners (Taobao, Sina Weibo, the State Grid, the education network, and so on) may actually hold population data. For the vast majority of third parties, what passes for big data today is basically not population data but partial data. Note that partial data, even when it covers a large share of the population (say 70% or 80%), is neither population data nor sample data, because even with only 10% or 20% of cases missing, partial data can differ greatly from the population.

Of the three kinds of data, if we consider only quality and ignore cost, efficiency, and other factors, population data is the most reliable, sample data comes next, and partial data is the least reliable. I suspect many engineers will disagree with that last claim. In our view, sample data, although much smaller in scale, is in many cases more valuable and more reliable than partial data. I once ran a simple simulation: I randomly generated 10,000 values (blue in the original figure) and treated them as the population. I then randomly drew 500 values (red); they look sparse (that is, the error is large and the estimate is imprecise), but they represent the population well: their means on the X and Y axes coincide with the population means at the origin. I also drew 8,000 values (80% of the population) as partial data (green), with artificial restrictions that made positive values more likely to be selected. The result is much more tightly clustered (that is, the error is small), but the mean shifts toward the upper right: precise, but inaccurate. If we trust partial data and let its sheer size fool us, it can be genuinely misleading.
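The paragraph above describes the simulation only in words. Below is a minimal sketch of that kind of experiment, assuming a two-dimensional standard normal population and a selection rule that favours positive values; both the distribution and the exact selection rule are my assumptions, not the author's.

```python
# A sketch (my reconstruction, not the author's code) of the simulation described above.
import numpy as np

rng = np.random.default_rng(seed=42)

# Population: 10,000 points centred at the origin (the blue cloud).
population = rng.normal(loc=0.0, scale=1.0, size=(10_000, 2))

# Random sample: 500 points drawn without replacement (red).
# Small, hence "imprecise", but unbiased, so its mean stays near the origin.
sample = population[rng.choice(len(population), size=500, replace=False)]

# Partial (convenience) data: 8,000 points, selected so that positive values are
# favoured (green). Large, hence "precise", but its mean drifts to the upper right.
scores = population.sum(axis=1)            # prefer points with larger x + y
biased_idx = np.argsort(scores)[-8_000:]   # keep the top 80% by that score
partial = population[biased_idx]

print("population mean:", population.mean(axis=0).round(3))
print("sample mean:    ", sample.mean(axis=0).round(3))
print("partial mean:   ", partial.mean(axis=0).round(3))
```

Running this prints a sample mean close to (0, 0) and a partial-data mean shifted well into the positive quadrant, which is the "precise but inaccurate" pattern the author describes.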

History offers many cases in which partial data proved worthless. Textbooks on social research methods usually cite the 1936 United States presidential election. Two organizations were forecasting the result. One was the magazine Literary Digest, which mailed questionnaires to its readers and got back 2.5 million of them. The American electorate then numbered about 100 million, so 2.5 million was already a sizable slice of it. Their analysis predicted that the Republican candidate, Alf Landon, would lead the Democrat Roosevelt by 14 percentage points and win by a wide margin. The other was Gallup, then a small start-up polling firm, which surveyed 50,000 people chosen by random sampling and predicted that Roosevelt would be elected with 56% of the vote. In the end Roosevelt defeated Landon: Gallup's small sample beat the Literary Digest's near-complete data. The reason was that the magazine's subscribers were relatively wealthy and leaned Republican. When data is large but unrepresentative, the consequences are all the worse.

How much information a dataset carries is determined on the one hand by the number of cases and on the other by the number of variables (that is, the attributes of each case). Social scientists typically work with few cases and many variables. The ideal of big data is to have not only many cases but also many variables. Yet most of the big data I see in practice has a great many cases and very few variables, exactly the opposite of the small data we social scientists use. Many cases and few variables: that is the basic reality of the big data we face. One reason is that each data owner holds only a handful of variables, the so-called data islands. Only through sharing and integration can we produce truly big data with both many cases and many variables, as the toy sketch below illustrates.
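As a toy illustration of the "data islands" point, here is a sketch of my own (the datasets and column names are hypothetical) showing how two owners, each holding a different variable about overlapping users, could be integrated into a dataset with more variables per case.

```python
# Hypothetical example: joining two "data islands" on a shared user ID.
import pandas as pd

# Island A: a telecom operator knows call minutes per user.
telecom = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "call_minutes": [120, 45, 300, 80],
})

# Island B: an e-commerce site knows monthly spending per user.
ecommerce = pd.DataFrame({
    "user_id": [2, 3, 4, 5],
    "monthly_spend": [250.0, 60.5, 410.0, 95.0],
})

# Integration: only the shared users remain, but each case now carries
# two variables instead of one.
merged = pd.merge(telecom, ecommerce, on="user_id", how="inner")
print(merged)
```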

Big Data "Fast"

Measured per unit of data, today's big data methods are certainly fast. But speed without validity is uninteresting. Let me again take social science as the slow baseline and compare it with some of the basic methods used for big data today. Our content analysis is done by hand annotation; big data relies mainly on automatic classification. There is no comparing the two on scale: our samples typically contain only a few thousand cases, while in the big data world millions of cases count as small data and hundreds of millions or billions are routine. On accuracy, however, the machine has yet to beat the human. From what others have tallied and what I have observed, machine learning accuracy averages around 80%. Of course, someone doing natural language processing or artificial intelligence will say a particular project reaches 90%, but averaged across all studies, 80% is probably an optimistic figure. Human coders can generally reach 90% or 95%, and mainstream social science journals do not accept papers whose coding accuracy falls below 95%.

The next question is: how do we know the accuracy rate at all? Our usual method is to have two or more people code the same content independently (back to back, without seeing each other's work) and to estimate accuracy from their agreement. With big data, if the processing is fully automatic, unsupervised learning, the accuracy of the results is in fact unknown. Everyone is now scraping online content to make predictions, but whether those predictions are accurate may remain forever unknowable. From the standpoint of error, human judgments do err, but those errors are individual and random: if several people code the same material, their errors tend to cancel out. Machine errors are systematic: if you know which way the bias points you can easily correct it, but usually the direction of the bias is unknown. This is the same problem I raised about partial data: the systematic error is there, but whether it pushes the results left or right, high or low, nobody knows. So, by our standards, hand-coded results are accurate but not precise and not very stable, while machine methods are the reverse: with so much data, they are extremely precise. Indeed, the English word "precision" means only precise, not accurate or correct. Precision is not what big data lacks today. The natural conclusion is to combine manual annotation with automatic classification, that is, to do supervised machine learning. The quality of machine learning is determined by the quality of the training set, the size of the training set, and the learning algorithm, in that order of importance: the most important is the quality of the training set, in other words the quality of the manual annotation.
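The "back to back" reliability check described above is usually quantified with an agreement statistic. The author does not name a particular one; the sketch below uses raw percent agreement and Cohen's kappa, with made-up labels for illustration.

```python
# A minimal sketch of estimating inter-coder agreement between two annotators.
import numpy as np

coder_a = np.array(["pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos"])
coder_b = np.array(["pos", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neg", "pos"])

# Raw percent agreement: how often the two coders assign the same label.
observed = np.mean(coder_a == coder_b)

# Expected chance agreement, from each coder's marginal label frequencies.
labels = np.unique(np.concatenate([coder_a, coder_b]))
p_a = np.array([np.mean(coder_a == lab) for lab in labels])
p_b = np.array([np.mean(coder_b == lab) for lab in labels])
expected = np.sum(p_a * p_b)

# Cohen's kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)

print(f"percent agreement: {observed:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```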

"Province" of large data

The question here is what exactly is being saved: labor or energy. Big data certainly saves labor, but at the same time it consumes energy. This is a serious environmental issue that I will not go into at length; suffice it to say that big data is astonishingly power-hungry. If we do not start planning now, big data may within a few years become a new kind of polluting heavy industry. I have heard that some large data centers house millions of servers; one can imagine how much energy they consume and how much radiation they give off. In fact, data is now growing far faster than our capacity to store it. Unless there is a breakthrough in storage materials, we have to ask: can we really keep the complete data? China Unicom, for example, can keep its data for only about four months before it must be deleted to make room for new data. I believe the way out is still sampling: making big data smaller.
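The author does not say how such sampling should be done. One standard technique for keeping a fixed-size uniform random sample of a data stream that is too large to store is reservoir sampling; the sketch below is my illustration, not a method named in the article.

```python
# Reservoir sampling: keep k items, chosen uniformly at random, from a stream
# whose total length is unknown and which is too large to store in full.
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # replace an existing item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: keep a 5-item sample of a million-record "stream" without storing it all.
print(reservoir_sample(range(1_000_000), k=5))
```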

Big Data "good"

Is big data really better than small data? This question lies at the heart of everything, and there is no answer yet. The following questions seem worth considering. First, big data may be good, but where do we get it? If we cannot get it, it is a cake in a shop window that we can only admire from outside. We can roughly divide big data into small-scale, medium-scale, and mega-scale. Small-scale data is plentiful and can be obtained free of charge. Medium-scale data is in most cases also free or cheap. Truly mega-scale big data is neither. Whether you want to build applications or provide tools and services, you must face this reality.

Second, do we really have the capacity to process and analyze big data? In my view, analysis tools designed for big data barely exist yet; most of the tools in use were built to solve small data problems and assume well-behaved, normally distributed data. Statistical tools for heterogeneous data are still largely missing. A recent article in the journal Science reported a new method for bivariate correlation analysis on large datasets. Anyone familiar with the history of statistics knows that bivariate correlation analysis for small data is more than a hundred years old. In other words, our ability to analyze big data is still in its infancy, roughly at the level that small data analysis had reached in the 1880s. We certainly will not need another 120 years to bring big data analysis up to the level of today's small data methods, but we do need an objective, sober view of where big data analytics stands today.
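The century-old bivariate tool alluded to here is presumably Pearson's correlation coefficient (late nineteenth century); a minimal sketch with made-up data:

```python
# Pearson's r on synthetic data; it captures only linear association.
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.normal(size=1_000)
y = 0.6 * x + rng.normal(scale=0.8, size=1_000)   # y is partly driven by x

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r(x, y).
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```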

In short, my view of big data is neither entirely optimistic nor entirely pessimistic. Big data does mark the arrival of a new era, and its potential value is real. But in sharing and applying data there are still many problems to solve, and data storage and analysis are only the beginning. Commercial and social applications are currently running far ahead of scientific research, and scientists and social scientists interested in big data should try hard to catch up.
