The three hottest keywords of the big data age are: cloud, big data, and analysis. The hype around cloud computing needs no repeating: whether you are reading Weibo or browsing the web, if you can get through three pages without seeing the word "cloud", you are probably not in the IT industry.
However, people often hear about cloud computing without knowing what to actually do with it. If cloud computing is not used for analysis, then the cloud stays just a cloud: it gathers, but it never turns into rain.
What is big data? Where did it come from?
Let's take a look at the history of the term "big data".
In the 1960s, when people talked about this subject, they simply said "data". The 1970s coined the term "database"; in going from data to a database, the word already suggested something large. In 1975 the term VLDB (very large database) appeared, and the 1980s gave us the "data warehouse", which is larger still than a database. By the 1990s, people had begun to work on the data sitting in those warehouses, an activity called data mining. From the late 1990s through the mid-2000s, as the Internet industry, social media, and cloud computing kept developing, people felt these words were no longer enough, and so "big data" was born. Abroad, some have even coined "extreme data", as if big data were not big enough. Why?
In my opinion, big data is simply the same understanding of data we had 30 or 40 years ago, combined with steadily improving means of managing and using it. So rather than quibble over the words, let's first look at what big data can do.
Consider the SMS business data of a telecom operator in China: more than 700 billion text messages per year, which adds up to over 3 trillion messages in 5 years, and there is still a great deal of value waiting to be mined from them. Is that hard to do? With more than 3 trillion rows of data, never mind how wide the table is, even basic statistical analysis is already a very difficult undertaking.
The 4V theory of big data
The industry has by now summarized several characteristics of big data; here I use the 4V theory to describe them.
The first V, volume: the sheer amount of data is the foundation of big data;
The second V, velocity: even at big-data scale, companies still demand the ability to run analyses quickly.
The third V, variety: the range of data types. In the past, the financial and telecom industries worked with very simple data sets, such as numbers, names, ages, calling and called numbers, call times, and so on: very structured, very tidy data. Much of today's data is unstructured or semi-structured, such as free text or microblog posts, and figuring out how to analyze it is exactly the problem big data must solve (a short sketch follows the fourth V below);
The fourth V, variability: however people picture this data, it keeps changing, and the faster and bigger the change, the greater the challenge to our processing capability. We now have not only microblogs but also WeChat, which has added voice messages, images, and even video. How this information gets sent out as a text or multimedia message, and how we then process it, is a problem we have to face.
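To make the variety problem concrete, here is a minimal Python sketch (the field names and the sample post are hypothetical): a tidy structured record can be queried directly, while a semi-structured microblog post must first have its useful attributes extracted from free text before any analysis can happen.

```python
import re

# A structured telecom record: every field has a fixed name and type.
call_record = {"caller": "13800000000", "callee": "13900000000", "seconds": 42}
print(call_record["seconds"])  # trivially queryable

# A semi-structured microblog post: the useful attributes (topic tags,
# mentions) are buried in free text and must be extracted before analysis.
post = "Trying out the new #bigdata platform with @alice, results look promising"
hashtags = re.findall(r"#(\w+)", post)
mentions = re.findall(r"@(\w+)", post)
print(hashtags, mentions)  # ['bigdata'] ['alice']
```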
Big data needs an analysis cloud platform
In big data and cloud computing, "analysis" is a word that must be understood at a strategic level. If your cloud computing platform does not consider how to analyze the data it stores, why store it at all? If you cannot mine the value out of it, how do you tell a gold mine from a garbage heap? Is there any use in hoarding piles of garbage? Of course not.
One of the problems of big data is how to collect data quickly. Collection is genuinely hard: comparing current database capabilities with the trend of data growth, it is clear that data is growing much faster than our databases can process it.
Here you will find some well-known keywords: Hadoop and MapReduce; column-oriented databases, represented by Sybase IQ; and event stream processors such as the Sybase Event Stream Processor, which handle streaming data in real time. These are technologies that companies now need to master.
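To make the MapReduce idea concrete, here is a minimal sketch in plain Python. It mimics the map, shuffle, and reduce phases in a single process; Hadoop's actual framework distributes these same steps across a cluster.

```python
from collections import defaultdict

documents = ["big data needs analysis", "cloud needs big data"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into a final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 2, 'needs': 2, ...}
```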
Big data analysis also has a number of peripheral and extension tools, such as MATLAB, SAS, SPSS, and the currently popular Revolution R. On the open source side there are Hive, SciPy, Mahout, AMPL, and so on; in each of these areas many people are studying and analyzing with these technologies.
There are many methods and means for mining the value of information: social media analysis, behavioral analysis, sentiment analysis, and, in business scenarios, personalized services, personalized analysis, personalized recommendations, and so on.
Facing data this large, data types this complex, and change this rapid, the database market is no longer a one-size-fits-all landscape: there is no single database product that can completely solve the big data problem on its own. So what might the future pattern look like?
In an enterprise or an IDC-style architecture, you face the situation of a toolbox: a variety of tools, each with its own strength that is almost irreplaceable. The database market now shows the same pattern. For OLTP you generally use a row-oriented database; for heavy data analysis you use a column-oriented database, because it can bring a tenfold or even hundredfold speedup.
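Why columnar storage helps analytics can be shown with a minimal Python sketch (the table and its values are hypothetical): an aggregation over one column only has to touch that column's values in a column layout, instead of dragging every full row through memory as a row layout would.

```python
# Row layout: each record stored together, as an OLTP database would.
rows = [("alice", 30, 120.0), ("bob", 25, 80.0), ("carol", 41, 200.0)]

# Column layout: each attribute stored contiguously, as a column store would.
columns = {
    "name":  ["alice", "bob", "carol"],
    "age":   [30, 25, 41],
    "spend": [120.0, 80.0, 200.0],
}

# Analytical query: total spend. The row layout must walk every full row;
# the column layout scans one contiguous array. On wide tables with many
# columns, skipping the untouched columns is where the big speedups come from.
total_from_rows = sum(row[2] for row in rows)
total_from_columns = sum(columns["spend"])
assert total_from_rows == total_from_columns
```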
For real-time processing of big data, we turn to stream-analysis databases and in-memory databases; for small applications on phones and other mobile devices, we need embedded databases, as well as object-oriented databases, and so on. In the big data processing landscape, we must accept the view that specialized databases solve specialized problems.
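As an illustration of the stream-processing style mentioned above, here is a minimal Python sketch (the event values are made up) of a sliding-window average: the kind of continuous computation a stream-analysis engine keeps up to date as events arrive, rather than querying data at rest.

```python
from collections import deque

def sliding_average(events, window_size=3):
    """Yield the running average over the last `window_size` events."""
    window = deque(maxlen=window_size)
    for value in events:
        window.append(value)
        yield sum(window) / len(window)

# Simulated stream of per-second message counts (hypothetical values).
stream = [10, 12, 9, 30, 28, 11]
for avg in sliding_average(stream):
    print(f"windowed average: {avg:.1f}")
```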