Cloud Big Data Forum: Exploring the Social Ecology of Big Data

Source: Internet
Author: User
Keywords: big data

Moderator: I am honored to introduce our first speaker, Professor Zhu Jianhua of the City University of Hong Kong. Professor Zhu has long working experience in both the United States and Hong Kong and deep attainments in the sociology of communication. The topic of his speech today is big data in the eyes of a liberal arts professor. Please welcome him.

Zhu Jianhua: Thank you, Professor Zhou, for the introduction. In today's internet slang, I am a "liberal arts man," or rather an old liberal arts man. The thing that left the deepest impression on me recently was Mo Yan's acceptance of the Nobel Prize, where he said that literature is not science and that literature is useless. I would like to clarify that literature is not the same as the liberal arts; the liberal arts are much broader and, abroad, are divided into the humanities and the social sciences. We actually belong to the social sciences, while most of you here are probably science or engineering people. I am very grateful to the conference for scheduling my speech first, though I also feel a little sorry for the organizers: some liberal arts people, or social science researchers, hold views on big data that may not be so optimistic and carry some reservations. I offer these questions for your reference.

Several speakers this morning mentioned how hot big data is right now. I made a small statistic (as pictured): on the left are Google searches for the term "big data," which are very numerous. On the right are research papers on big data in SCI/SSCI journals; by comparison, researchers are even more enthusiastic about big data than society as a whole. Papers on big data have increased sharply in recent months, while attention among ordinary users has grown more smoothly.

I took a closer look at which disciplines the scholars studying big data come from. First is information science and computer science, accounting for 27%, with an equivalent share from engineering and technology; of course, in many universities engineering is intertwined with computer and information science. These two parts together account for more than half. Then come two larger application areas: medicine and biochemistry form one class, followed by basic research in mathematics and physics. Further down are two smaller groups, business schools and the social sciences, which together account for roughly 15%. The kind of research we do falls within that 15%.

Although I am a liberal arts man, I have always worked with data, previously small data. I was fortunate to be invited to join the Big Data Expert Committee recently established by the China Computer Federation, and I also took part in the committee's work of selecting the hot issues and development trends in big data research. As I understand it, the recently released "eight hot issues and ten trends" in big data research is, on a global scale, among the most systematic statements of views so far. Of course, institutions in the United States and Europe and various business and academic groups have offered very insightful and exciting views, but taken as a comprehensive document this may be the first.

The concept of big data has attracted a lot of attention recently, and also a lot of criticism. Perhaps you have all seen this story: the chief technology officer of Sybase said that big data is "a big lie." You are probably very familiar with Sybase, a database company that has long built BI application tools for enterprises. They have been working with data for a long time, so they feel that everything now being said already existed and is nothing new. From that standpoint, he feels there is a bit of a bubble, some exaggeration. Those of us who do empirical research do not fully agree: some claims are indeed somewhat exaggerated, but not so exaggerated as to be a false bubble.

Back to the four Vs: what is big data? From the four Vs we can derive the Chinese phrase "more, faster, better, more economical." Which of these should hold in theory, which are already showing up in practice, and where the gaps remain, where it is not yet economical, is worth examining. First, "more." We usually mean many cases, but data analysis rests on a pair of basic concepts: we care not only about cases but also about variables, the characteristic values. If we work with network traffic data, we care about the various characteristics of each user; for products, each product likewise has many characteristic values. The intersection of the two is a two-dimensional table in which each case is a row and each variable is a column. Within this frame, what I think we should care about is the choice between the population and the sample: in the end, should we work with a sample or with the whole population?
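
As a purely illustrative picture of the case-by-variable table described above, the sketch below builds a tiny two-dimensional table in which each row is a case (a hypothetical internet user) and each column is a variable; the column names and values are assumptions for illustration, not figures from the talk.

```python
# A minimal sketch of a case-by-variable table: rows are cases,
# columns are variables (characteristic values). Names are made up.
import pandas as pd

traffic = pd.DataFrame(
    {
        "user_id": [101, 102, 103],           # one row per case
        "daily_minutes": [42.5, 180.0, 15.2],
        "pages_viewed": [12, 95, 4],
        "made_purchase": [False, True, False],
    }
)

print(traffic.shape)   # (3 cases, 4 variables)
print(traffic)
```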

Let me introduce another pair of concepts. One is the population, which is easiest to explain: every ten years China, like many other countries, conducts a national population census; the last census, in 2010, counted about 1.38 billion people. The other concept we often use is the sample, a random survey: every year the National Bureau of Statistics conducts a sample survey of around two per thousand of the population, so that we know how China's population changes in the ten years between two censuses.

In fact, the big data we talk about should, in theory, mean population data. But in many cases, for technical reasons, for human reasons, for other reasons, only a few original owners of the data, such as Taobao, Sina Weibo, or the State Grid and the education network mentioned this morning, may truly hold the population. For the vast majority of third parties, what we get is not the population but only partial data. That partial share may be very high, 70% or 80%, but even if only 10% or 20% is missing, the partial data may differ greatly from the population.

From the perspective of those of us who do empirical research in the social sciences, sample data, although far smaller in scale, is in many cases actually more valuable and more reliable than partial data. I ran a very small simulation: I randomly generated 10,000 cases, then randomly drew 500 points from the 10,000. The 500 points look sparse, but they represent the whole well: the estimates calculated from the 500-case sample landed essentially on the population value. In another dataset I kept 80% of the cases but artificially imposed a boundary, so the data shifted to the right and its estimates deviated.
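
The simulation itself is not published with the talk, so the sketch below only reproduces the idea under assumed details: 10,000 cases drawn from a standard normal distribution, a simple random sample of 500, and a "partial" dataset built by keeping 80% of the cases after cutting off the lower tail.

```python
# A minimal sketch: random sample vs. biased partial data (assumed setup).
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Simple random sample of 500 cases: small, but representative.
random_sample = rng.choice(population, size=500, replace=False)

# "Partial" data: 80% of the cases, selected by an artificial boundary
# (drop the lowest 20%), so what remains is shifted to the right.
cutoff = np.quantile(population, 0.20)
partial_data = population[population > cutoff]

print(f"population mean:    {population.mean():+.3f}")
print(f"random sample mean: {random_sample.mean():+.3f}")   # close to the population
print(f"partial data mean:  {partial_data.mean():+.3f}")    # biased upward
```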

Data is used to describe objective events, which can generally be divided into two kinds: regular events and anomalous events. The most extreme example, much discussed a few years ago, is the black swan event: it exists in theory but rarely occurs, and when it does occur its impact on society is disastrous. If we have the population, there is no problem: any event will be reflected in the data. But even if we are lucky enough to get "population" data, take Taobao: Taobao's population data covers only Taobao; it is not all of China's e-commerce, which also includes Jingdong and others. If someone says their research object is Chinese e-commerce but uses Taobao data, then what looks like a population is in fact partial data, not the population.

In any case, if you happen to hold population data, you need not worry about whether the events you want to describe are regular or anomalous; nothing is missed, and the results are both accurate and precise. If you work with a sample, then as long as your sampling method follows the principle of random probability, your sample gives an unbiased estimate of the population: it is accurate, though not necessarily precise. As long as the sample is large enough, the gap from the population is within roughly 1%-3%. But if you want to study anomalous events such as black swans, samples often miss exactly those anomalous cases. As for partial data, even if it covers 80% or even 90% of the population, its representativeness may be poor even for regular events. Partial data, no matter how big, is the worst data. Why? Because cases are missing, and what is missing is governed by a self-selection mechanism that is everywhere and constantly affects data quality. In many cases we do not even know why certain things are missing; behind any missing data there may be social, economic, legal, or natural reasons. Our view is: get the population if you can; otherwise take a proper sample, master the method, and you can estimate your error. If you rely on partial data and are lulled by its sheer size, it can be genuinely fatal.
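
The 1%-3% figure quoted above applies only to proper random samples; a rough way to see where it comes from is the standard margin-of-error formula for a proportion, sketched below (the sample sizes are illustrative, not from the talk).

```python
# Approximate 95% margin of error for a proportion near 50%,
# using the textbook formula 1.96 * sqrt(p * (1 - p) / n).
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 2_500, 10_000):
    print(f"n = {n:>6}: +/- {margin_of_error(n):.1%}")
# n =   1000: +/- 3.1%
# n =   2500: +/- 2.0%
# n =  10000: +/- 1.0%
```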

There are many such cases in history; in our teaching we always bring up the 1936 U.S. presidential election. The 2012 election has just ended, and one of the most watched questions is election forecasting, which began in 1932 and is now 80 years old. The 1936 case has become a classic, with two organizations making predictions. The first was a magazine, the Literary Digest, which mailed out questionnaires and got 2.5 million back; with roughly 100 million voters nationwide at the time, 2.5 million was already a substantial number. From the returns they found Landon leading Roosevelt by 14 points and predicted he would win by an absolute margin. The other was a small, newly founded polling center that later became one of the world's best-known brands, Gallup; it surveyed 5,000 people using random sampling and predicted Roosevelt would be elected with 56%. In the end Roosevelt beat Landon by an overwhelming vote. The people who subscribed to the magazine were somewhat better educated and wealthier, and therefore more supportive of the Republican Party. The random sample, though small at 5,000 people, was off by only a little more than a percentage point; it differed somewhat from the population, but was broadly accurate. Data that is merely large, with no attention to representativeness, leads to the more serious consequences.

Now let's talk about big data in these terms. What we social scientists used to do were small samples: our data structure had n cases and a limited set of variables. Ideal big data would have not only more variables but also more cases, not just n, perhaps approaching the entire population. Perhaps my exposure is limited, but my observation is that the big data we actually use today has many cases yet not many variables. In this morning's talk someone used 70,000 variables to predict loans, which would certainly be a revolution; we used to work with only hundreds of variables, a few thousand at most, so where 70,000 variables come from is truly remarkable. A structure with many cases but few variables is the basic environment we face in big data. One reason is that everyone holds only a small piece of it, the so-called data islands. The real way to obtain data with both many cases and many variables, like the Big Data Alliance announced this morning, is through sharing and consolidation.
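
To make "sharing and consolidation" concrete, the sketch below joins two hypothetical data islands on a shared key; each island alone has many cases but few variables, and the merge widens the case-by-variable table. All names and values are invented for illustration.

```python
# A minimal sketch of consolidating two data islands (hypothetical data).
import pandas as pd

telecom = pd.DataFrame(
    {"user_id": [1, 2, 3], "monthly_traffic_gb": [2.1, 15.4, 0.7]}
)
ecommerce = pd.DataFrame(
    {"user_id": [1, 2, 3], "orders_per_month": [0, 6, 1]}
)

# Joining on the shared key adds variables for the same cases.
merged = telecom.merge(ecommerce, on="user_id", how="inner")
print(merged)
```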

"Faster" is really a choice between effectiveness and efficiency. Let me again use social scientists' slow methods as the baseline and compare them with today's basic big data methods. Everything we did was hand-annotated; now everything is classified automatically. In scale there is no comparison: our samples were usually only a few thousand, whereas now millions count as small and far larger scales are the norm. In accuracy, however, the machine still cannot match the human. Some people have counted, and I have observed the same: the average accuracy of machine learning is around 80%. Of course, some who do natural language processing or artificial intelligence will argue with me and say that on a certain project they can reach 90%; but averaged over all studies, 80% is already an optimistic figure. Manual coding can basically reach 90% or 95%, and academic journals generally will not accept an accuracy below 95%. The question is: how do you know your accuracy rate? Our usual approach is to have two or more people label the same content independently, back to back, without seeing each other's work; each then explains their reasoning, and finally we calculate the similarity between their codings.
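
The talk does not name a particular agreement statistic, but a common way to implement the back-to-back check it describes is to compute percentage agreement and a chance-corrected measure such as Cohen's kappa, as in the sketch below (the labels are made up).

```python
# Two coders label the same items independently; we measure how often
# they agree and how far that agreement exceeds chance (Cohen's kappa).
from sklearn.metrics import cohen_kappa_score

coder_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg"]
coder_b = ["pos", "neg", "pos", "pos", "neg", "pos", "neu", "neu"]

agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
kappa = cohen_kappa_score(coder_a, coder_b)

print(f"percentage agreement: {agreement:.0%}")
print(f"Cohen's kappa:        {kappa:.2f}")
```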

Much of the time, with fully automatic methods, even the accuracy rate is unknown; a great deal of prediction is now done by crawling data online, and whether those predictions are right remains an open question. As for error, manual judgments have errors too, but those are individual errors; if several people code at the same time, the error can be made very small. Machine learning error is systematic: if you know which way it is biased you can easily correct it, but the trouble is when you do not know where the error lies. This is what I said just now: if all you have is partial data, you do not know whether the systematic error leans left or right, high or low. So in our view, manual results are accurate but not precise, where "precise" means fine-grained and stable; machine methods are the reverse: because you have a huge amount of data, millions or tens of millions of cases, they are very precise. In fact, the English word "precision" speaks only of fineness, not of correctness; being precise but not accurate is an existing problem of big data. The natural conclusion is to combine manual tagging with automatic classification, that is, supervised machine learning. Machine learning depends on the quality of the training set, the size of the training set, and the algorithm, and their importance follows exactly that order.
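
As a minimal sketch of what "manual tagging plus automatic classification" can look like in practice (not the pipeline used by the speaker), the example below trains a simple text classifier on a tiny hand-labeled set and estimates accuracy on a held-out split; the texts, labels, and model choice are all assumptions.

```python
# Supervised learning on hand-coded labels, with a held-out accuracy check.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "great product fast delivery", "terrible service never again",
    "love it works perfectly", "broken on arrival very angry",
    "excellent quality highly recommend", "waste of money do not buy",
    "happy with the purchase", "awful experience poor support",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]  # hand-coded

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)            # learn from the hand-labeled training set
print("held-out accuracy:", model.score(X_test, y_test))
```

In this framing, improving the quality and size of the hand-labeled training set usually pays off before tuning the algorithm, which matches the ranking given above.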

"More economical": in the end, is it about saving labor or saving energy? Big data certainly saves labor, but while it saves labor it consumes energy. This is also an environmental issue I will not dwell on, but it is actually staggering. If we start planning now and pay no attention, perhaps in a few years big data will become our new polluting industry, a polluting heavy industry. Indeed, I heard this morning about plans to build a data center somewhere with millions of servers; you can imagine the energy it consumes and the radiation it produces. I do not study physics, but I believe it is formidable. In fact, the annual growth rate of data volume far outpaces the growth of our storage capacity. In that case, unless there is a breakthrough in storage materials, we have to ask: should we keep the population data at all? China Unicom, for instance, can keep its data for only four months. Or should we sample, and make big data smaller?

This question is at the heart of everything, and there is no definite answer; I just want to raise a few points. First, big data is good, but where is the big data? If we cannot get it, it is a cake in the shop window that we can only look at from outside. In my view, big data can be divided into several scales: small, medium, and very large. The small datasets are extremely numerous and in most cases free, or at least low-cost, to obtain; the truly big data is, in fact, out of reach. This is an important problem to consider when applying big data or treating it as a tool. Second, the analysis tools: tools for analyzing big data are in fact not yet developed. Most of the tools we use today were built to solve small-data problems, statistical tools for regular data applied to heterogeneous data. Only recently has someone published important ideas on correlation for big data, which puts us roughly where small-data correlation stood when Gauss proposed it more than a hundred years ago. In other words, our ability to handle big data is still in its infancy; at present we can only analyze bivariate relationships, not multivariate ones. I have not read the 70,000-variable model; perhaps we academics are more conservative, but I do not quite believe it can be done, because what is used today are still the traditional, classical tools.
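
The "traditional, classical tools" referred to above are essentially pairwise: a correlation coefficient relates exactly two variables at a time. A small illustration with made-up variables:

```python
# Pairwise (bivariate) Pearson correlations; each entry in the matrix
# relates only two variables, which is the limitation discussed above.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)
y = 0.6 * x + rng.normal(scale=0.8, size=1_000)   # y partly depends on x
z = rng.normal(size=1_000)                         # unrelated variable

corr = np.corrcoef(np.vstack([x, y, z]))           # rows = variables
print(np.round(corr, 2))
```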

So what do I think of big data? My view is neither entirely optimistic nor entirely pessimistic. The arrival of big data certainly marks a new era, and the potential value of the data is real, but in application and in sharing there are still many problems. The storage of data and the analysis and study of data are really just beginning; applications are now running far ahead of research. I am very happy to discuss and exchange ideas with all of you. Thank you!
