The confusion around big data

Why do traditional companies show more confusion when it comes to big data? The reason is that business decision-makers are not aware of the value big data can bring to their business, nor do they know how to learn and use big data analysis tools. Yet the tools are already there, and whoever learns to use them first will seize the opportunity.

It has been almost two years since big data came into the spotlight and customers outside the Internet industry started talking about it. It is time to sort out some of my impressions and share some of the puzzles I have seen in domestic big data applications.

Cloud and big data are probably the two hottest topics in IT in recent years. In my view, the difference between them is this: the cloud builds a new bottle and fills it with old wine, while big data finds the right bottle and brews new wine.

The cloud is, in the final analysis, a fundamental architectural revolution. Applications that used to run on dedicated physical servers are delivered in the cloud as various kinds of virtual servers, so that compute, storage, and network resources can be used more efficiently. As a result, heavy drinkers can gulp erguotou from a big bowl, while those with a small capacity who just want a gentle buzz can sip Daughter Red from a small cup.

Big data is different: it picks up the data people used to throw away, then analyzes and exploits it to create new value. In other words, where 20 jin of grain used to yield only 2 jin of distiller's grains, the same 20 jin now yields mostly distiller's grains. Of course, these new distiller's grains will differ from the old ones, so the wine will certainly not be the same as before; the alcohol content, the flavor, and the way the wine is stored will naturally all be different.

So, compared with the cloud, people are more confused about how to use big data. Next, let me talk about some of the biggest puzzles I have seen and what we have so far.

Puzzle one: what can big data actually do?

To stay with the brewing analogy, the goal is to brew good wine from the newly recovered distiller's grains. There is no need to debate again what counts as big data. Below is a chart from a Gartner survey of big data requirements across industries, classified by the 3 Vs common to big data and by the demand for previously unused data. It shows that almost every industry has broad requirements for big data.

Chart from Gartner

The reason for these requirements is that, for technical and cost reasons, these kinds of data simply were not collected before. Now that there is a cost-effective way to collect and process them, how can you say no? To return to the brewing analogy: previously, brewing 2 jin of distiller's grains wasted 18 jin of grain; now at least 10 of those 20 jin become distiller's grains. These grains may be different from before, but at least 8 jin less grain goes to waste.

The problem now is that there are more distiller's grains, and of different kinds, so how do you brew wine from them? Sorry, that is exactly the problem. And since every distillery faces the same problem, no one can teach you; you have to work it out yourself. This is the biggest big data confusion facing every industry today: massive amounts of data have been collected, and nobody knows how to use them.

Why is there no such confusion in the traditional data warehousing field? The following picture illustrates the difference between then and now quite well:

Diagram from Sogeti

The process shown above reveals the root cause of the confusion: the hard-working IT people are now walking ahead of the business decision-makers. In the traditional era, business people wanted a particular kind of statistical report or analytical forecast, and IT staff found solutions and wrote algorithms to meet those needs, which produced the various data warehouses and analytics solutions we know. Now, with the help of the Internet, IT staff have found new ways to store and process huge volumes of data that could not be handled before, but the business people are not ready. So when you tell them, "Hey buddy, I've got a pile of data here that can help you," they are confused and have no idea what the data is for.

How do we solve this problem? First, look at how the traditional vendors Oracle and IBM do it. The details differ slightly, but their thinking is basically as follows:

Slide from Greg Battas, HP chief technical expert, at the ABDS2012 conference

In simple terms, this approach introduces Hadoop and other NewSQL and NoSQL technologies into the existing data analysis architecture as an ETL layer or as external tables. Because the data warehouse at the top is not significantly changed, customers can keep their original algorithms and report structures; in other words, the old application scenarios and analysis methods continue to run on top of the new data platform. The advantage is that the big data technology lets you handle a wider variety of data sources while cutting the cost of ETL over massive data. A rough sketch of this offload pattern is given below; even so, this approach still has several problems.
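
To make the pattern concrete, here is a minimal, hypothetical PySpark sketch of the "Hadoop/Spark as ETL offload" idea. The HDFS path, column names, JDBC URL, table name, and credentials are all invented for illustration; they are not from the original article.

```python
# Hypothetical sketch of the "Hadoop/Spark as ETL offload" pattern.
# Paths, columns, the JDBC URL, table, and credentials are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload-sketch").getOrCreate()

# Heavy lifting happens on the distributed layer: parse, filter, aggregate.
raw = spark.read.json("hdfs:///raw/clickstream/")
daily = (raw
         .filter(F.col("status") == 200)
         .groupBy("user_id", F.to_date("ts").alias("day"))
         .agg(F.count("*").alias("page_views")))

# Only the condensed result is pushed into the existing warehouse, so the
# reports and algorithms on top of it can keep running unchanged.
daily.write.jdbc(url="jdbc:oracle:thin:@//warehouse:1521/dw",
                 table="DAILY_PAGE_VIEWS",
                 mode="append",
                 properties={"user": "etl", "password": "..."})
```

The heavy parsing and aggregation run on the distributed layer, and only the condensed result lands in the traditional warehouse, which is exactly why the warehouse on top can stay unchanged.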

Problem one: the performance bottleneck remains. Looking at the various NewSQL and NoSQL schemes, distribution is their most notable feature. Everyone adopts distributed architectures because, with traditional vertical scaling, performance cannot grow linearly with data volume, or the cost of processing massive data becomes prohibitive. In the scheme above, Hadoop removes the ETL performance bottleneck, but the heavy ETL output means the BI layer and the traditional data warehouse now have to handle a larger volume of data, so you must spend heavily to upgrade the original warehouse, or the analysis will run even slower than before. Users therefore still have to upgrade the expensive upper-layer data warehouse, or accept degraded performance for algorithms that used to run efficiently.

Problem two: the big data investment is wasted. In the old analysis scenarios, the algorithms were built on relational databases, and their logic differs from big data schemes in two main ways.

The difference between panning gold from sand and carving rough jade. I once used the example of spicy chicken to describe Hadoop: the plate piled with chili peppers is big data, and Hadoop is the way to pick out the chicken from among the peppers. Processing big data is, in effect, panning for gold. There used to be no suitable "sieve", so people had to give up the dream of finding gold in the sand; now, with the right "sieve", you can pull those "shiny" things out of the sand far more efficiently and quickly. Traditional data processing, by contrast, has already done a great deal of sieving by manual or semi-manual means. A lot of data was discarded, but what remained was already a piece of "rough jade", and the job was simply to carve and polish that jade until it became something valuable. So using traditional data processing methods on big data is like butchering an ox with a small knife: even if someone helps you carry away the cuts, you will be exhausted long before the ox is done.

The difference between an EMU and a locomotive-hauled train. The core idea of a distributed big data architecture is much like the military principle of "building the branch at the company level": distribute effective forces down to each combat unit, which greatly improves the execution of central strategy as well as the mobility and fighting strength of every unit. It is also why an EMU runs faster than a conventional train: every carriage has its own power, and although no single carriage is stronger than a locomotive, the more carriages there are, the faster the train goes, whereas a locomotive, however powerful, can only drag so many carriages. Existing analysis algorithms are usually of the "locomotive" type: in many cases there is no way to split them into many small operations distributed across the nodes. So if you keep the old algorithms, you must add extra software to re-"centralize" data that has already been distributed, and that extra step is bound to be time-consuming, laborious, and unlikely to work well.
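
To make the "EMU versus locomotive" point concrete, here is a small, purely illustrative Python sketch (not from the original article) that simulates data already spread across nodes: a mean decomposes into tiny per-node partial results that merge cheaply, while an exact median forces every row back to one place.

```python
# Illustrative sketch (plain Python, simulated partitions) of algorithms that
# do and do not decompose into per-node partial results.
partitions = [[3, 9, 4], [7, 1], [5, 8, 2, 6]]   # data already spread over "nodes"

# "EMU-style" mean: each node returns a small (sum, count) pair,
# and the driver only merges the partials.
partials = [(sum(p), len(p)) for p in partitions]    # runs node-locally
total, count = map(sum, zip(*partials))
mean = total / count                                  # 45 / 9 = 5.0

# "Locomotive-style" exact median: there is no tiny partial result to merge,
# so all rows must be shipped back and sorted in one place.
all_rows = sorted(x for p in partitions for x in p)
median = all_rows[len(all_rows) // 2]                 # 5
```

Algorithms of the first kind map naturally onto a distributed platform; algorithms of the second kind are the "locomotives" that force the extra re-centralization step described above.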

In my opinion, the traditional vendors' solution does not really resolve the confusion around big data applications and is not the best answer. What is? It is actually very simple: develop new analysis applications designed around the characteristics of the new data sets and database structures, and run those applications directly on the big data architecture, instead of grafting NewSQL and NoSQL onto the traditional scheme just to make it fit.
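
As a contrast with the earlier ETL-offload sketch, here is a minimal, equally hypothetical example of what "running the analysis directly on the big data architecture" might look like: a user-segmentation model trained where the data lives, with nothing shipped back to a traditional warehouse. The input path and column names are invented.

```python
# Hypothetical sketch: the analysis itself runs on the distributed platform,
# instead of condensing data and shipping it off to a traditional warehouse.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("native-analysis-sketch").getOrCreate()

# Per-user behaviour features, stored as Parquet on the cluster.
users = spark.read.parquet("hdfs:///features/user_behavior/")

# Distributed k-means segments users without the data ever leaving the cluster.
features = VectorAssembler(
    inputCols=["page_views", "session_seconds", "purchases"],
    outputCol="features").transform(users)
segments = KMeans(k=5, seed=42).fit(features).transform(features)

segments.groupBy("prediction").count().show()   # size of each user segment
```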

The advantage of doing so is self-evident; the question is how to achieve it. This is not something IT people can work out on their own and then hand to the business. For big data applications to truly take root and flower in the enterprise, we need data scientists to do the demand-generation work. With their help, the big data path in the picture above needs to be flipped back to the way traditional data processing worked: business people telling us what they want to do.

I have talked with many customers, and all I got at first was a request to learn about Hadoop or in-memory databases. They then discovered that they did not really know what Hadoop or an in-memory database could do for them, and hoped we could tell them. Frankly, that is not the job of those of us who work on IT infrastructure. We have run "ahead" in stockpiling these technical means; how to use them is something the people who understand the business should be thinking about, not us.

So here I want to appeal to those at the top of the IT pyramid, the professional consultants, data analysts, and data scientists: now is the time to step outside the old framework and look at the opportunities that the new technologies and architectures bring. Do not stay shackled to traditional ideas and methods, forcing the new big data mindset to "fit" the old mold. I sincerely hope you will use your professional knowledge and industry experience to help those industry users who thirst for big data properly identify the real value of their new applications, design more meaningful distributed algorithms and machine-learning models, and truly help them resolve their confusion about big data applications.

(Responsible editor: Fumingli)
