The confusion of big data

It has been almost two years since I first encountered big data and began discussing it with customers outside the Internet industry. It is time to sort out my impressions and share some of the confusion I have seen around domestic big data applications. Cloud and big data must be the two hottest topics in IT circles in recent years. In my view, the difference between them is this: the cloud makes a new bottle and fills it with old wine, while big data finds the right bottle and brews new wine.

The cloud is, in the final analysis, a revolution in the underlying architecture. Applications that used to run on physical servers are delivered in the cloud as all kinds of virtual servers, so that compute, storage, and network resources can be used more efficiently. As a result, heavy drinkers can gulp erguotou from a big bowl, while those who only want a taste of tipsiness can sip Daughter Red from a small cup.

Big data is different: it picks up the data that people used to throw away, then analyzes and uses it to create new value. In other words, where 20 jin of grain used to yield only 2 jin of distiller's grains, those same 20 jin can now mostly be turned into distiller's grains. Of course, the new distiller's grains will differ from the old ones, so the wine will not be the same as before, and the brewing and storage methods will naturally differ as well.

So, compared with the cloud, people are far more confused about how to use big data. Next, let me talk about the biggest puzzles I have seen and where things stand today.

Puzzle one: What can big data do?

To continue the brewing analogy: before you can drink good wine, you have to know what wine to brew. There is no need to rehash what data counts as big data here. Below is a chart from a Gartner survey of big data needs across industries, classified by the 3 Vs commonly attributed to big data plus the demand for previously unused data. As it shows, almost every industry has broad requirements for big data.

Image source: Gartner

The reason these requirements exist is that, for technical and cost reasons, these kinds of data simply were not collected before. Now that there is a cost-effective way to collect and process them, how could anyone say no? To return to the brewing analogy: brewing 2 jin of distiller's grains used to waste 18 jin of grain; now at least 10 jin out of 20 can become distiller's grains. The new grains may differ from the old, but at least 8 jin of grain are no longer wasted.

The new problem is that there are now more distiller's grains, of more kinds, and nobody knows how to brew wine from them. Sorry, that is exactly the problem, and every winery probably faces the same one, so nobody can teach you; you can only explore on your own. This is the biggest big data confusion that industries now face: masses of data have been collected, and no one knows how to use them!

Why was there no such confusion in the traditional data warehousing field? The following picture illustrates the difference between then and now:

Image source: Sogeti

As the process in the picture shows, the root cause of the confusion is that the hard-working IT people are now walking ahead of the business decision makers (sob). In the traditional era, business people wanted a certain kind of statistical report or analytical forecast, and IT staff found solutions and wrote algorithms to meet that demand, producing all kinds of data warehouses and solutions. Now, with the rise of the Internet, IT staff have found new ways to store and process masses of data that could not be handled before, but the business people are not ready. So when you tell them, "Hey, buddy, I've got a pile of data here that can help you," they just look confused, not knowing what the data could do for them.

How can this problem be solved? First look at how the traditional vendors, Oracle and IBM, do it. The details differ slightly, but their thinking is basically as follows:

Image source: Greg Battas, HP chief technical expert, at the ABDS 2012 conference

In simple terms, this approach introduces Hadoop and other NewSQL and NoSQL solutions into the existing data analysis architecture as an ETL layer or as external tables. Because the data warehouse on top is not significantly changed, customers can keep using their original algorithms and report structures; that is, the old application scenarios and analysis methods carry on atop the new data platform. The advantage is that the newly introduced big data technology can handle a variety of data sources while reducing the cost of ETL over the original mass of data.
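As a rough illustration of this "Hadoop as ETL" pattern, here is a minimal PySpark sketch of my own; the paths, the "ts,user_id,url" log format, and the job itself are hypothetical examples, not anything from the vendors' actual solutions:

```python
# Sketch of the "Hadoop as ETL" pattern: raw data is cleaned and
# aggregated on the cheap distributed tier, and only a compact result
# is handed to the traditional data warehouse. Paths and the
# "ts,user_id,url" log format are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload-sketch").getOrCreate()

raw = spark.read.csv("hdfs:///raw/clicks/",
                     schema="ts STRING, user_id STRING, url STRING")

daily_hits = (raw
    .filter(F.col("url").isNotNull())             # basic cleaning
    .withColumn("day", F.substring("ts", 1, 10))  # "YYYY-MM-DD" prefix
    .groupBy("day", "url")
    .count())                                     # heavy lifting happens here

# Export the small summary for bulk load into the existing warehouse,
# so the old reports and SQL keep working unchanged.
daily_hits.write.mode("overwrite").csv("hdfs:///export/daily_hits")
```

Only the small, summarized result crosses over to the warehouse, which is exactly why the old reports can stay unchanged.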

But this approach still has problems. Problem one: the performance bottleneck remains. Look at the various NewSQL and NoSQL solutions: distribution is their most notable shared feature. Everyone adopts distributed architectures because traditional scale-up solutions cannot scale performance linearly as data volume grows, or cost too much when processing massive data. In the scheme above, Hadoop does remove the ETL bottleneck, but the BI layer, the traditional data warehouse, now has to digest the much larger volume of data that the ETL feeds it. Unless a great deal is spent on upgrading the original warehouse, analyses will run slower than before. So users still have to upgrade the expensive upper-layer warehouse, or accept degraded performance from algorithms designed around the old efficiency.

Problem two: the big data investment is wasted. In the old analysis scenarios, the algorithms were built on relational databases, and their logic differs from big data solutions in two main ways.

The difference between sifting sand and polishing jade. I once used spicy chicken to describe Hadoop: the heap of chilies is big data, and Hadoop is the method for picking the chicken out of the chilies. In essence, big data processing is a process of panning for gold. There used to be no suitable "sieve," so the dream of finding gold in the sand had to be given up; now, with the right "sieve," you can pick the "shiny" things out of the sand quickly and efficiently. Traditional data processing, by contrast, has already done a great deal of the sieving through manual and semi-manual means. Much data was discarded, but what remained was already a piece of rough jade, and all that was left was to carve and polish it into a valuable finished piece. So handling big data with traditional data processing methods is like slaughtering an ox with a fruit knife: even with someone carrying away the pieces for you, you will be exhausted before the ox is dead.

The difference between an EMU and a locomotive-hauled train. The core idea of distributed big data architecture is the same as the old military principle of "building the branch into the company": push effective strength down into every combat unit, which greatly improves execution of the center's strategy and raises each unit's mobility and fighting power. It is also why the EMU is faster than the ordinary train: every car has power. No single car is stronger than a locomotive, but the more cars there are, the faster the train runs, while a locomotive, however strong, is dragged down by the carriages behind it. Existing analysis algorithms are often of the "locomotive" type: many of them simply cannot be split into small operations distributed across the nodes. So if you stick to the old algorithms, you have to add extra software to re-"concentrate" the already distributed data, and that extra link is bound to be slow and laborious, with results unlikely to be good.
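To make the "every car has power" idea concrete, here is a small pure-Python sketch of my own (not from the article's sources) of an algorithm that decomposes well: a global mean built from tiny per-partition summaries. An exact median has no such cheap decomposition, which is what makes it a "locomotive"-style algorithm:

```python
# Minimal sketch: a mean decomposes into per-node partials ("every car
# has power"); each partition returns only (count, sum), and the driver
# combines them. The data and partitioning here are toy illustrations.

def partial_stats(partition):
    """Runs on each node: reduce a partition to a tiny summary."""
    values = list(partition)
    return (len(values), sum(values))

def global_mean(partitions):
    """Runs on the driver: combine the tiny summaries."""
    total_count = total_sum = 0
    for count, s in map(partial_stats, partitions):  # map() stands in for the cluster
        total_count += count
        total_sum += s
    return total_sum / total_count

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # toy "nodes"
print(global_mean(partitions))                    # 5.0
# An exact median, by contrast, cannot be rebuilt from such small
# per-partition summaries -- a "locomotive"-style algorithm.
```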

In my opinion, the traditional vendors' solutions do not really resolve the confusion around big data applications. What would? It is actually very simple: develop new analysis applications that fit the characteristics of the new data sets and database architectures, and run those applications directly on the big data architecture, instead of doing alterations and grafting the new NewSQL and NoSQL pieces onto traditional solutions.

The advantage of doing so is self-evident; the key is how to achieve it. These things cannot be dictated to the business side by the people who build the systems. For big data applications to truly take root and blossom in the enterprise, we need data scientists to do the demand-generation work. With their help, we need to flip the big data path in the picture above back around, so that, as with traditional data processing, it is the business people who tell us what they want to do!

I have met many customers, and what I hear from most of them is a desire to learn about Hadoop or in-memory databases. But they then discover that they do not really know what Hadoop or an in-memory database could do for them, and they hope we can tell them. Frankly, that is not the job of those of us who work on IT infrastructure. We have stocked up these technical means "ahead of time"; how to use them is something the people who really understand the business should be thinking about, not us.

So here I want to appeal to those at the top of the IT pyramid, the professional consultants, data analysts, and data scientists: now is the time to step out of the original framework and look at the opportunities the new technologies and new architectures bring. Do not stay shackled to traditional ideas and methods, forcing the new big data thinking into "alteration" work. I sincerely hope you will use your professional knowledge and industry experience to help those industry users who "thirst for big data" correctly locate the real value of their new applications, design more meaningful distributed algorithms and machine learning models, and truly resolve their confusion about big data applications.

Puzzle two: How do the big data solutions differ, and which should I use?

First, the customer must think the previous question through: what do they want to do, and what functions must be achieved? Then the requirement can be broken down into smaller questions:

How many data types do you want to work with?

How large is the amount of data to be processed?

How quickly must the data be processed?

These three questions have fairly definite answers. The chart below classifies traditional RDBMSs together with Hadoop, MPP, in-memory databases, and other big data solutions along two dimensions, data structure and processing timeliness. The classification reflects the typical scenario in each category; in reality, especially for MPP and Hadoop, different distributions have different features, so the scenarios each handles stretch out in every direction.

In the big data era, a single architecture covering every situation is unlikely. The future enterprise big data landscape will certainly be a coexistence of multiple database architectures. Enterprise data will interoperate across the different architectures, and each analysis scenario will run on the database architecture that suits it, using the appropriate analysis tools.

Image source: Nomura Cato

Since the future enterprise will certainly contain multiple data sources and multiple database architectures, can we build an intermediate data service layer that decouples the applications from the underlying database architectures? It is like being too busy at work to shop for groceries, so you write a shopping list for the hourly maid and hand her the money; you do not care whether she goes to the roadside market or the supermarket. The idea looks lovely, but I think implementing it in the enterprise would be quite difficult and is not very realistic. Why do I say so? Here are some of my thoughts.

Look at the Internet companies, the most skillful users of big data: simplicity and directness. Whatever the data, store it in the most efficient way and process it in the fastest way. Whatever can be done directly on the file system is not put into a database. Analysis is the same: the less structure the better, and the more direct the data access the better. If a problem can be solved directly in a programming language, no data warehouse SQL is used; if a problem is solved with SQL, it is not wrapped in a unified Java or Python interface. Everything takes efficiency and directness as the premise, fully implementing the "build the branch into the company" idea and playing to the advantage of being small, fast, and flexible. In the Hadoop world, many Internet companies and distributions have begun trying to give up MapReduce and manipulate HDFS directly, and the thinking keeps getting more direct and concise. The "set up a data service layer" idea, by contrast, is the old way of thinking of traditional enterprises, hoping that a middle tier will reduce the difficulty of development and porting; in practice, the result is that the big data architecture's own performance and scale advantages are never brought into play, and the technology's own room to grow is constrained. I raise this topic mainly to lead into the industry's next big data confusion.
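Before moving on, here is a tiny sketch of the "simple and direct" style described above: one question answered by streaming a flat file on disk, with no database or service layer in between. The file name and the "timestamp,user_id,url" line format are my own assumptions:

```python
# Minimal sketch of the "direct" style: answer one question by streaming
# a flat file -- no database, no service layer. File name and format are
# assumed: each line is "timestamp,user_id,url".
from collections import Counter

hits = Counter()
with open("access.log") as f:            # hypothetical log file
    for line in f:
        parts = line.rstrip("\n").split(",")
        if len(parts) == 3:              # skip malformed lines
            hits[parts[2]] += 1          # count hits per URL

for url, count in hits.most_common(10):
    print(count, url)
```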

Puzzle three: How do we migrate from a traditional relational data architecture to a big data architecture?

On this question, I think nobody can give a perfect answer. The newer enterprises, such as the Internet companies, were born into a mixed big data environment and have no migration problem to speak of; and because the data types and application scenarios they handle are unlike a traditional enterprise's, they offer only limited reference value, and copying them outright would be unwise. As for traditional large enterprises, most of the foreign ones are themselves still crossing the river by feeling for the stones, and domestic ones have only just begun. We are all groping forward; there is basically no beacon ahead, only a few scattered sparks to learn from.

Who can help you? I think it is the business consultants. At least they have seen many successful and unsuccessful cases among similar companies abroad. But the premise is that they truly stand in a neutral position and help you start planning and analyzing the new application scenarios.

On this question, let me also share some personal views, for reference only.

Step one: save the big data first, then start using it. I have seen many traditional enterprises invite all kinds of consultants to draw up big data strategic plans. I am not qualified to evaluate the feasibility or problems of those plans, but I believe that the first thing to do when accepting something new is to have a taste of it, not to divine its future. If a small trial turns out badly, the cost of adjusting and starting over is small. So my suggestion: first find a solution that stores the data you intend to analyze in a new way, then try some simple queries and comparisons and see whether the results are good; try before you buy. If the results are good, then try implementing a new business scenario on top of it, addressing some real needs of the business people; if that also works, try a second application, a third analysis... and slowly let more and more people see the value of the new data and new applications.
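A trivial sketch of this "save it first, query it later" spirit, with invented event fields and file layout: raw events are landed as JSON lines with no schema designed up front, and a throwaway query scans them afterwards:

```python
# "Save it first, query it later": dump raw events as JSON lines with
# no upfront schema, then ask questions with throwaway scans.
# The event fields and file layout are invented for illustration.
import glob
import json
import os

os.makedirs("landing", exist_ok=True)

# 1) Landing: append events verbatim; deciding how to use them can wait.
with open("landing/events-2013-01-01.jsonl", "a") as f:
    f.write(json.dumps({"user": "u1", "action": "view", "sku": "A7"}) + "\n")

# 2) A throwaway query: how many "view" events per SKU?
counts = {}
for path in glob.glob("landing/events-*.jsonl"):
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("action") == "view":
                sku = event.get("sku", "?")
                counts[sku] = counts.get(sku, 0) + 1
print(counts)
```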

Step two: consider interoperability and joint operation between the new big data platform and the existing data platform. There are two aspects:

Run the old analysis applications on the new big data platform. Extract the data from the original RDBMS sources onto the new big data platform and re-implement the traditional business analysis logic with the new big data analysis methods. Analyzing more data may produce better results; it may also prove less efficient than the original RDBMS solution.

Extract data from the big data platform into the old data warehouse for analysis and presentation. This direction mainly preserves the old users' SQL habits; the difference is that what enters the old warehouse is not raw external tables but cleaned, collated, valuable data.
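Here is a compressed sketch of both directions, with sqlite3 standing in for the RDBMS and warehouse and a flat file standing in for the big data platform; every table, file, and value is illustrative:

```python
# Both interop directions in miniature. sqlite3 stands in for the
# RDBMS/warehouse; a flat file stands in for the big data platform.
# Tables, files, and values are all illustrative.
import os
import sqlite3

os.makedirs("platform", exist_ok=True)
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
db.execute("CREATE TABLE IF NOT EXISTS daily_summary (day TEXT, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Direction 1: RDBMS -> big data platform (export rows as flat records).
with open("platform/orders.csv", "w") as f:
    for order_id, amount in db.execute("SELECT id, amount FROM orders"):
        f.write(f"{order_id},{amount}\n")

# ... heavy cleaning and aggregation would happen on the platform ...

# Direction 2: platform -> warehouse. Only the cleaned, summarized
# result is loaded back, so the old SQL reports keep working.
db.execute("INSERT INTO daily_summary VALUES (?, ?)", ("2013-01-01", 29.5))
db.commit()
print(db.execute("SELECT * FROM daily_summary").fetchall())
```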

Through these two attempts, you can basically work out which applications can be migrated and which cannot, laying a solid foundation for the next step.

Step three: integrate the data sources and customize the analysis application scenarios. With the groundwork of the first two steps, you know exactly which data types you can handle and what business value they can bring. Then the all-out attack can begin.

First integrate the data sources: classify all the data involved and organize each class with the storage method best suited to it. Then couple the applications and presentation tools to the storage architectures holding the data sources they involve, customizing each application scenario so that every application can fully exploit the performance and scalability of its underlying architecture. For the scenarios that must span data sources, choose an intermediate processing layer, but keep that layer customized to the scenario so that it neither drags down the underlying architectures' performance nor hinders the upper-layer analysis applications.
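As a toy illustration of "each class of data in the storage best suited to it," an ingestion dispatcher might route records by type; the types and sink names below are invented labels for what would really be an RDBMS, HDFS, a document store, and so on:

```python
# Toy sketch: classify incoming records and send each class to the
# storage best suited to it. The types and sink names are invented
# labels; real sinks would be an RDBMS, HDFS, a document store, etc.
def route(record):
    if record.get("type") == "transaction":
        return "rdbms"           # structured, needs transactions
    if record.get("type") == "clickstream":
        return "hdfs"            # huge, append-only, scan-oriented
    return "document_store"      # semi-structured everything else

records = [
    {"type": "transaction", "id": 1, "amount": 9.5},
    {"type": "clickstream", "url": "/a"},
    {"type": "profile", "user": "u1"},
]
for r in records:
    print(route(r), "<-", r)
```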

Going step by step like this cannot give enterprise leaders a "grand blueprint of the IT architecture for the next 10 years," but it is highly operable, and because you never bet everything on one move, there is plenty of room to adjust. This is the "run in small steps" mode of thinking of the Internet and emerging industries: take a few steps, then look around; even failure yields valuable lessons at a cost that is not very large.

Overall, as far as I can tell, industry users' confusion about big data lies in the three aspects above. The reason for the confusion, in the final analysis, is that big data is handled in ways too different from the traditional ways people are used to.

Big data processing systems, with Hadoop as their representative, in fact take an extensive, brute-force approach to massive data, and their machine learning principles often rely on huge numbers of samples rather than precise logic. For example, we often say "it always rains around the Qingming Festival," yet no logical or scientific formula derives that conclusion. It exists because countless working people, observing a "massive" sample of Qingming weather over the years, found that in those few days it does always rain more. Why it rains around Qingming, nobody has carefully analyzed. Big data processing is similar: it relies on predecessors' experience and historical data, summed up, rather than on some elaborate formula or calculus. What it relies on is having many samples, plus the technical means to analyze and organize those massive samples quickly and efficiently. In the past, when there was no way to handle so many samples, one could only fall back on advanced, sophisticated mathematical models. So, to use big data well: first, adjust your thinking and process large amounts of data in the simplest ways possible; second, in some cases, consider obtaining the data through repeated rounds of sampling.
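To make the samples-over-formulas point concrete, here is a tiny example of my own with fabricated records: estimating the chance of rain around Qingming purely by counting historical observations, with no meteorological model at all:

```python
# "Samples over formulas": estimate P(rain around Qingming) by simply
# counting historical observations -- no meteorological model involved.
# The observation records below are fabricated for illustration.
observations = [
    ("2008-04-04", True), ("2009-04-04", True), ("2010-04-05", False),
    ("2011-04-05", True), ("2012-04-04", True), ("2013-04-04", False),
]

rainy = sum(1 for _, rained in observations if rained)
print(f"P(rain) ~= {rainy / len(observations):.2f}")  # 0.67

# With more samples the estimate sharpens, which is exactly what a
# big data platform buys you: the means to count a massive sample fast.
```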

So, for an enterprise that wants to use big data well and pan gold out of the sea of sand, the right move is to boldly set aside the mature existing architectures and solutions, start from scratch, and genuinely think about what so much data and these new methods could mean for the enterprise and what value they could bring. Then put the ideas into practice on Hadoop, MPP, and the other architectures, get them to land, and the moment problems appear, adjust and start again. Do not, as before, first look at what everyone else is doing, then produce a dozens-of-pages "looks beautiful" PPT that paints a gorgeous pie for the next ten years. Learn from the Internet and the emerging industries: change your thinking, stay linked to the business, live in the present, and run in small steps.
