To understand the concept of big data, start with the word "big", which refers to the scale of the data: big data generally means data volumes above 10 TB (1 TB = 1024 GB). Big data differs from the massive data of the past, and its basic characteristics can be summed up as the four Vs (Volume, Variety, Value, and Velocity): huge volume, great diversity, low value density, and high speed.
Big Data Characteristics
First, the volume of data is huge, jumping from the TB level to the PB level.
Second, the data types are diverse, including the blogs, videos, pictures, and geographic information mentioned above.
Third, the value density is low. Take video as an example: in a continuous, uninterrupted surveillance recording, perhaps only a second or two of footage is actually useful.
Fourth, the processing speed is fast, following the "one-second rule". This last point is fundamentally different from traditional data mining technology. The Internet of Things, cloud computing, mobile networks, connected cars, mobile phones, tablets, PCs, and the wide variety of sensors spread across the globe are all sources of data or ways of carrying it.
Big data technology refers to rapidly extracting valuable information from a wide variety of data types. The core of solving big data problems is big data technology. Today, "big data" refers not only to the scale of the data itself, but also to the tools, platforms, and analysis systems used to collect and process it. The purpose of big data research and development is to advance big data technology, apply it in related fields, and drive breakthroughs by solving enormous data-processing problems. The challenge of the big data era therefore lies not only in how to handle huge volumes of data to extract valuable information, but also in how to strengthen the development of big data technology and stay at the forefront of the times.
The Function of Big Data
With the arrival of the big data era, more and more people accept this judgment. What does big data mean, and what does it change? A purely technical answer is not enough. Big data is only an object; detached from people as its subject, it has no meaning. We need to place big data in the context of human life and understand it as a force for change in our times.
The power to change value
Over the next ten years, whether China has wisdom at its core (whether it becomes a "thinker") will be decided by national well-being. One aspect shows in people's livelihood: through big data, meaningful things become clear, and we can see whether we have created more meaning in our relationships with people than before. The other shows in ecology, where big data likewise makes meaningful things clear. In short, big data lets us move from the past decade, an era in which meaning was muddled, into the next decade, an era in which meaning is clear.
The power to transform the economy
Producers create value, and consumers give value its meaning. Only what is meaningful has value: if consumers do not recognize it, it cannot be sold and its value cannot be realized; only when consumers recognize it and it sells does its value become real. Big data helps us identify meaning from the consumer side and helps producers realize value. This is also the principle behind stimulating domestic demand.
The power of the change organization
With the development of data infrastructure and data resources with Semantic Web characteristics, organizational change becomes more and more unavoidable. Big data will push network structures to produce the power of non-structured organization. The first applications to reflect this structural characteristic are the various decentralized Web 2.0 applications, such as RSS, wikis, and blogs. Big data is a force for change in our times because it obtains wisdom by following meaning.
Big Data Processing
Dr. Zhou said that big data processing requires three major shifts in thinking compared with traditional databases: use all the data, not samples; pursue efficiency, not absolute precision; look for correlation, not causation.
The Big Data Processing Workflow
There are many specific methods for processing big data, but based on long practice the author has summed up a generally applicable workflow that should help straighten out how big data is handled. The whole process can be summarized in four steps: acquisition, import and preprocessing, statistics and analysis, and finally data mining.
Big Data Processing Step One: Acquisition
Big data acquisition means using multiple databases to receive data from clients (web, app, sensors, and so on); users can run simple queries and processing against these databases. For example, the ICC uses traditional relational databases such as MySQL and Oracle to store every transaction record, and NoSQL databases such as Redis and MongoDB are also commonly used for data collection.
The main characteristic and challenge of the acquisition stage is high concurrency, because tens of thousands of users may be accessing and operating on the system at the same time. Train-ticketing websites and Taobao, for example, see peak concurrent visits in the millions, so a large number of databases must be deployed on the acquisition side to cope. How to balance load and shard data across these databases requires careful thought and design, as sketched below.
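As a minimal illustration of the sharding question, the sketch below routes incoming records to one of several acquisition databases by hashing a key. The shard count, record fields, and routing function are illustrative assumptions, not details from the article.

```python
# Minimal sketch of hash-based sharding at the acquisition layer.
# SHARD_COUNT and the record structure are assumptions for illustration.
import hashlib

SHARD_COUNT = 4  # number of acquisition databases (assumed)

def pick_shard(user_id: str) -> int:
    """Map a user ID to one of the acquisition databases.

    Hashing keeps the mapping stable and spreads load roughly evenly,
    which is one simple answer to the load-balancing question above.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % SHARD_COUNT

def route(record: dict) -> int:
    # In a real deployment this would issue a write against the chosen
    # MySQL/Redis/MongoDB instance; here we only return the shard index.
    return pick_shard(record["user_id"])

if __name__ == "__main__":
    sample = [{"user_id": f"user-{i}", "amount": i * 10} for i in range(8)]
    for rec in sample:
        print(rec["user_id"], "-> shard", route(rec))
```

In practice the same idea is usually delegated to middleware or the database's own partitioning, but the principle, routing by a stable hash of a key, is the same.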
Big Data Processing Step Two: Import/Preprocessing
Although the acquisition side itself has many databases, to analyze this massive data effectively it should be imported from the front end into a centralized large-scale distributed database or distributed storage cluster, and some simple cleaning and preprocessing can be done during the import. Some users also use Storm from Twitter to process the data as a stream during import, to meet the real-time computing needs of certain businesses.
The main characteristic and challenge of the import and preprocessing stage is the sheer volume of imported data, which often reaches hundreds of megabytes, or even gigabytes, per second.
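To make the "simple cleaning and preprocessing" concrete, here is a small sketch that validates and normalizes records before a bulk import. The field names, formats, and validation rules are assumptions for illustration; the article does not specify them.

```python
# Minimal sketch of the cleaning step that precedes bulk import.
# Field names and validation rules are illustrative assumptions.
import csv
import io
from datetime import datetime

RAW = """user_id,ts,amount
u1,2024-01-01 10:00:00,19.90
u2,not-a-date,5.00
u3,2024-01-01 10:05:00,
"""

def clean(row: dict):
    """Return a normalized row, or None if it should be discarded."""
    try:
        ts = datetime.strptime(row["ts"], "%Y-%m-%d %H:%M:%S")
        amount = float(row["amount"])
    except (ValueError, KeyError):
        return None  # drop malformed records instead of importing them
    return {"user_id": row["user_id"], "ts": ts.isoformat(), "amount": amount}

cleaned = [c for c in (clean(r) for r in csv.DictReader(io.StringIO(RAW))) if c]
# `cleaned` would then be written in bulk to a distributed store; in a
# streaming setup the same logic would live inside a Storm bolt.
print(cleaned)
```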
Big Data Processing Step Three: Statistics/Analysis
Statistics and analysis mainly use distributed databases or distributed computing clusters to run ordinary analyses and classification summaries over the massive data stored in them, in order to meet the most common analysis needs. For real-time requirements, EMC Greenplum, Oracle Exadata, and the MySQL-based columnar store Infobright are often used, while batch workloads or semi-structured data can be handled with Hadoop.
The main characteristic and challenge of the statistics and analysis stage is that the analysis involves a huge amount of data, which places heavy demands on system resources, especially I/O.
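The "classification summary" mentioned above is essentially a group-by aggregation. The toy sketch below shows the shape of that computation in plain Python; at real scale it would run as SQL on Greenplum/Exadata/Infobright or as a MapReduce job on Hadoop, and the record fields here are assumed.

```python
# Minimal sketch of a classification summary: group records by a key and
# aggregate. The categories and amounts are made-up example data.
from collections import defaultdict

records = [
    {"category": "books", "amount": 12.0},
    {"category": "books", "amount": 8.5},
    {"category": "toys",  "amount": 30.0},
]

totals = defaultdict(float)
counts = defaultdict(int)
for rec in records:            # "map" side: emit (category, amount) pairs
    totals[rec["category"]] += rec["amount"]
    counts[rec["category"]] += 1

for cat in sorted(totals):     # "reduce" side: aggregate per category
    print(cat, "total:", totals[cat], "avg:", round(totals[cat] / counts[cat], 2))
```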
Big Data Processing Step Four: Mining
Unlike the earlier statistics and analysis steps, data mining generally has no preset theme. It mainly runs various algorithms over the existing data in order to make predictions and thereby satisfy higher-level analysis needs. Typical algorithms include K-Means for clustering, SVM for statistical learning, and Naive Bayes for classification, and the main tool used is Mahout on Hadoop. The characteristic and challenge of this stage is that the mining algorithms are complex, the data volume and computation involved are large, and common data mining algorithms are mostly single-threaded.
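For a sense of what K-Means clustering does, here is a tiny sketch. The article points to Mahout on Hadoop for cluster-scale mining; scikit-learn is used here only to keep the illustration small, and the data points are invented.

```python
# Minimal K-Means clustering sketch (toy data, not from the article).
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (e.g. customers by spend and visit count).
points = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
    [8.0, 8.5], [8.3, 8.0], [7.8, 9.0],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:", model.labels_)            # cluster assignment of each point
print("centers:", model.cluster_centers_)  # learned cluster centroids
```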
Only a workflow that covers at least these four steps can count as relatively complete big data processing.