On big data, I have a few views. First, big data is only just becoming popular, so we should not rush to conclusions. When information technology was still new, Professor Robert Solow, the authority on growth theory, proposed the "Solow paradox": "You can see the computer age everywhere but in the productivity statistics." It was not until about 15 years later, around 2002, that he openly acknowledged that computers may indeed contribute to productivity. The study of big data may likewise take a long time before its value is confirmed.
Second, possibility is not the same as feasibility. One view now being voiced asks, "Is big data real or just a bluff?", since everything is being labeled big data. In fact, what is being discussed or imagined today is mostly "possibility" rather than "feasibility". When will feasibility arrive? I cannot see it yet. It requires reasonable institutional arrangements, continued business practice and trial and error by enterprises and companies, and continuous improvement of big data analysis techniques by researchers.
Third, current research is mainly at the stage of raising questions rather than solving problems. Of course, if we can put forward good questions, that is already a good research outcome.
Finally, this article tries as far as possible to present more facts and less argument: to provide a bit more material and a bit less opinion.
The emergence, connotation, and controversy of big data
First of all, where does big data come from? In fact, data has always existed in different places. For example, everyone carries a lot of data: height, weight and so on, including ideas and opinions. But before the Internet it was hard to collect such data and put it to use. Data analysis itself appeared very early: during the Spring and Autumn and Warring States period, Sun Bin judged the size of the opposing army from the number of cooking stoves in its camp and directed the war accordingly. At that time, however, such data were very scarce, and only those with talent could turn them into the wisdom of the age.
The situation is different now. Data began growing rapidly around 2005, after the rise of the Internet, and the growth has been essentially exponential since 2010, exceeding 4 ZB in 2013 with an annual growth rate of more than 50%. This is a process of quantitative change turning into qualitative change.
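For a sense of how quickly this compounds, here is a back-of-the-envelope sketch using only the figures cited above (roughly 4 ZB in 2013, about 50% annual growth); the projection is illustrative, not a forecast.

```python
# Illustrative compounding of the figures cited above: roughly 4 ZB in 2013
# growing at about 50% per year. The figures come from the article; the
# projection itself is only a back-of-the-envelope sketch.
base_year, base_volume_zb, annual_growth = 2013, 4.0, 0.50

for year in range(base_year, base_year + 6):
    volume = base_volume_zb * (1 + annual_growth) ** (year - base_year)
    print(f"{year}: ~{volume:.1f} ZB")
```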
Why was "big data" not talked about before? It is a relative concept: the notion of "big" emerges when the growth rate suddenly becomes very fast. So it is not really a strict academic concept; it arose because, in the course of this quantitative change, people sensed a qualitative change, or sensed that the data had value.
Major sources of data
In general, data comes from two main sources:
First, data from objects.
A representative example is the Internet of Things built from sensors, a business model proposed by IBM in 2009 under the name "Smarter Planet". The idea is to attach sensors to different objects and then collect their various readings, such as temperature, humidity, pressure and so on. The Internet of Things has grown quickly in recent years, at a rate of 20%-30%, and the data it produces keeps increasing.
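As a rough illustration of the kind of data such sensors emit, here is a minimal sketch of a single reading serialized for transmission; the device identifier and field names are assumptions made for the example, not any particular platform's schema.

```python
# A minimal sketch of the kind of record a networked sensor might emit.
# The identifier and field names are illustrative assumptions only.
import json
import time

reading = {
    "device_id": "sensor-001",      # hypothetical identifier
    "timestamp": time.time(),        # when the measurement was taken
    "temperature_c": 21.4,
    "humidity_pct": 48.0,
    "pressure_hpa": 1013.2,
}

print(json.dumps(reading))           # the payload sent upstream for analysis
```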
Second, data from people.
The most typical example is the development of the mobile Internet. In recent years, mobile traffic has accounted for an ever larger share of total Internet traffic, and the amount of data sent by mobile users has risen greatly, which makes it a very important source of big data. With mobile data, one can infer a person's occupation, interests, preferences, and location at every moment of the day; that is, the data can pinpoint each individual's situation very accurately.
Why has data suddenly increased so massively? The first reason is the decline in IT costs, which is closely related to the rise of cloud computing over the past two years. This is visible in the growth of Amazon's elastic cloud storage: the increase from 2006 to 2013 was very significant, and by the second quarter of 2013 roughly 2 trillion objects were already stored there.
So why does cloud computing reduce IT costs? Based on our research on industry practice in previous years: first, from the demand side, buying hardware such as servers and computers used to be expensive, whereas a cloud computing system that centralizes IT resources and rents them out is much cheaper than buying. From the supply side, when IT resources are concentrated there are very significant economies of scale, because many servers operate at the same time (enabled by the underlying technology, of course) at significantly lower cost.
There is also the concept of economies of scope: when IT resources are concentrated, you gain not only economies of scale but also the ability to manage a variety of resources together. For example, a search workload may consume a lot of CPU but relatively little disk, while another workload may be the reverse; when resources are used centrally, both kinds of efficiency can be obtained at the same time, as the sketch below illustrates. This is how cloud computing contributes to the decline in IT costs.
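The sketch below is a toy illustration of that economies-of-scope argument: a CPU-heavy workload and a disk-heavy workload each leave capacity idle on dedicated machines but pack together well on shared infrastructure. The workload numbers are invented purely for illustration.

```python
# Toy illustration: complementary workloads pack better on shared servers.
# (cpu_share, disk_share) each workload needs, as a fraction of one server.
search  = (0.8, 0.2)   # CPU-heavy, light on disk
archive = (0.2, 0.8)   # disk-heavy, light on CPU

# Dedicated: one server per workload -> two servers, much capacity idle.
dedicated_servers = 2
dedicated_cpu_util  = (search[0] + archive[0]) / dedicated_servers
dedicated_disk_util = (search[1] + archive[1]) / dedicated_servers

# Shared: both workloads fit on one server because their peaks
# fall on different resources.
shared_cpu_util  = search[0] + archive[0]
shared_disk_util = search[1] + archive[1]

print(f"dedicated: 2 servers, CPU {dedicated_cpu_util:.0%}, disk {dedicated_disk_util:.0%}")
print(f"shared:    1 server,  CPU {shared_cpu_util:.0%}, disk {shared_disk_util:.0%}")
```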
The four "V"s of big data
When it comes to defining big data, the most common formulation is the so-called four "V"s (some speak of five or even six). The first "V", which IDC (International Data Corporation) identifies, is sheer volume: from KB to TB, and later to PB and EB, the amount of data keeps increasing, though this is only a surface phenomenon.
The second "V" is variety of data types, especially the large amount of unstructured data. What is unstructured data? For example, a WeChat message sent online cannot itself be fed into statistical or econometric analysis, but structured data can be extracted from it and then analyzed. Such data makes up an important part of the overall data volume.
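As a hedged illustration of what "extracting structure" can mean in practice, the sketch below turns an invented message into a few structured fields that a statistical model could use; the message, the price pattern, and the keyword list are all assumptions made for the example.

```python
# A minimal sketch of turning an unstructured message into structured
# fields. The message and the extraction rules are illustrative only.
import re

message = "Ordered a new phone for 3999 yuan today, delivery is so slow!"

price_match = re.search(r"(\d+)\s*yuan", message)

record = {
    "length_chars": len(message),
    "price": int(price_match.group(1)) if price_match else None,
    "negative_tone": any(w in message.lower() for w in ("slow", "bad", "broken")),
}

print(record)  # e.g. {'length_chars': ..., 'price': 3999, 'negative_tone': True}
```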
The third "V" is value, and there are two points here. One is that the value is large: big data brings all kinds of possibilities. The other is that although the data is vast and valuable, its value density is very low: of 1 GB of data crawled from the Internet, perhaps only one part in a thousand or one part in ten thousand is useful, so mining and analysis are harder than before.
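The arithmetic behind that point is simple; the sketch below just applies the one-in-a-thousand and one-in-ten-thousand ratios from the text to 1 GB of crawled data.

```python
# Back-of-the-envelope illustration of low value density. The ratios come
# from the text; the rest is plain arithmetic.
crawled_bytes = 1 * 1024**3          # 1 GB

for density in (1 / 1_000, 1 / 10_000):
    useful = crawled_bytes * density
    print(f"density {density:.4%}: ~{useful / 1024**2:.2f} MB useful out of 1 GB")
```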
The fourth "V" is velocity, the fast processing of dynamic data, where cloud computing makes a relatively large contribution. Here lie the core issues, and also the two elements that keep big data from moving from "possible" to "feasible", namely unstructured data and low value density. The two are interrelated: if technology can solve the problems of how to analyze unstructured data and how to extract value from low-density data, applications of big data can grow by leaps and bounds. So I think unstructured data and low value density are the core of big data.
So what is big data? At a glance: first, the "big" in big data is definitely a relative concept, not an absolute one. It is also not an academic concept. What deserves attention is that unstructured data may account for the main part of big data, and interactive data from netizens in particular may become one of the main subjects of future big data.
In terms of analysis methods, data and statistics used to be obtained by sampling, and mathematical tools such as probability theory and stochastic processes were then used to infer properties of the whole. Now, if costs fall low enough, it becomes possible to obtain all the data.
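Here is a toy comparison of the two approaches, under an invented synthetic population: estimate the mean from a small random sample, then compute it directly over all the data.

```python
# Sampling-based inference versus full-data computation on a synthetic
# population. The population itself is an assumption made for illustration.
import random

random.seed(0)
population = [random.gauss(100, 15) for _ in range(1_000_000)]  # "all the data"

sample = random.sample(population, 1_000)                       # classical sampling
sample_mean = sum(sample) / len(sample)

full_mean = sum(population) / len(population)                   # full-data computation

print(f"sample estimate: {sample_mean:.2f}")
print(f"full-data value: {full_mean:.2f}")
```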
Questioning Big Data
There are, of course, many questions about big data. The first is the proposed "big data trap": is more data always better? In fact, for any enterprise or individual, more data is certainly not always better; there must be an optimal amount. Is a method even available to analyze such a large volume of data? How high is the cost of analysis? How much value does this volume of data actually contain? So each enterprise has an optimal amount of data, roughly the amount at which the additional value obtained from more data and the additional analysis cost of obtaining it become about equal, as the sketch below illustrates.
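One hedged way to make that trade-off concrete: assume value grows with diminishing returns while analysis cost grows roughly linearly, and look for the amount of data beyond which adding more no longer pays. Both curves below are invented for illustration only.

```python
# Toy "optimal amount of data" search: net value peaks where the marginal
# value of more data falls to the marginal analysis cost. Both curves are
# assumptions made purely for illustration.
import math

def value(gb):            # assumed diminishing-returns value curve
    return 100 * math.log1p(gb)

def cost(gb):             # assumed roughly linear analysis cost
    return 2.5 * gb

best_gb, best_net = 0, float("-inf")
for gb in range(1, 201):
    net = value(gb) - cost(gb)
    if net > best_net:
        best_gb, best_net = gb, net

print(f"net value peaks around {best_gb} GB "
      f"(value {value(best_gb):.1f}, cost {cost(best_gb):.1f})")
```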
Second, Professor Kate Crawford of MIT has argued that "there are biases and blind spots in big data": data are not created or collected equally, and large datasets suffer from a "signal problem", meaning that some people and communities are neglected or under-represented. A typical example: China now has more than 600 million netizens, yet one sometimes cannot use the data of those 600 million to judge the situation of 1.3 billion people, because the data were not collected by random sampling.
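The sketch below is a toy demonstration of that "signal problem": when the observed group differs systematically from the whole population, its average says little about everyone. All numbers are fabricated solely to illustrate the bias, not to describe any real population.

```python
# Toy "signal problem": averaging only the observed (online) group gives a
# different answer than the whole population. All numbers are fabricated.
from statistics import mean
import random

random.seed(1)

online  = [random.gauss(70, 10) for _ in range(600)]   # some indicator, higher online
offline = [random.gauss(40, 10) for _ in range(700)]   # the group big data never sees
population = online + offline

print(f"online-only average:  {mean(online):.1f}")      # what the big dataset 'sees'
print(f"whole-population avg: {mean(population):.1f}")  # what a random sample targets
```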
The third problem is that "the exposure of personal privacy is a growing concern". It is alarming that our data can be taken from us without our knowledge.