Learn about the Big Data technology ecosystem: Hadoop, Hive, Spark (reprint)

Source: Internet
Author: User


Big data itself is a very broad concept, and the Hadoop ecosystem (or, more broadly, the pan-Hadoop ecosystem) exists essentially to handle data processing at a scale a single machine cannot. You can compare it to a kitchen: you need a variety of tools.

Pots, pans, and knives each have their own use, yet their uses overlap. You can drink soup straight out of the soup pot, and you can peel with a knife instead of a peeler.

But every tool has its own character, and although odd combinations can work, they are not necessarily the best choice.

With big data, the first problem is simply being able to store it.

A traditional file system lives on a single machine and cannot span multiple machines.

HDFS (Hadoop Distributed File System) is designed so that a huge amount of data can span hundreds or thousands of machines while you still see one file system rather than many. For example, when you ask for the data at /hdfs/tmp/file1, you refer to a single file path, but the actual bytes are stored on many different machines. As a user you don't need to know this, just as on a single machine you don't care which track and sector a file is scattered across.

HDFS manages this data for you.
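As a small, hedged illustration of that single-namespace illusion (assuming a machine with the Hadoop client configured; the file names are made up for the example), you address one logical path and let HDFS worry about where the blocks physically live:

```python
# A minimal sketch: interact with HDFS through one logical path.
# Paths and file names here are hypothetical.
import subprocess

# Copy a local file into HDFS under a single logical path.
subprocess.run(["hdfs", "dfs", "-put", "words.txt", "/hdfs/tmp/file1"], check=True)

# Read it back as if it were one ordinary file; HDFS fetches the blocks
# from whichever machines actually hold them.
text = subprocess.run(["hdfs", "dfs", "-cat", "/hdfs/tmp/file1"],
                      check=True, capture_output=True, text=True).stdout
print(text[:200])
```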

Once the data is stored, you start thinking about how to process it. Although HDFS can manage data spread across many machines as if it were one whole, the data is still enormous.

A single machine reading terabytes or even petabytes of data (and that is very big data; think of something larger than every high-definition movie ever released combined) might take days or even weeks to churn through it. For many companies, single-machine processing is intolerable: Weibo, for instance, has to refresh its 24-hour trending list, so the processing must finish within 24 hours.

So suppose I use many machines to do the job. I now have to worry about how to divide the work, how to restart a task when a machine dies, how the machines talk to each other and exchange data to complete complex computations, and so on. This is exactly what MapReduce, Tez, and Spark provide.

MapReduce was the first-generation computing engine; Tez and Spark are the second generation. MapReduce adopts a radically simplified computational model: only two stages, Map and Reduce, with a Shuffle connecting them in the middle. Yet this model is enough to handle a very large share of problems in the big data world.

So what is Map, and what is Reduce?

Suppose you have a huge text file stored on something like HDFS, and you want to know how often each word appears in it. You launch a MapReduce program. In the Map stage, hundreds of machines read their own portion of the file at the same time, and each counts the word frequencies in what it read, producing pairs like (hello, 12100 occurrences), (world, 15214 occurrences), and so on (here I'm lumping Map and Combine together to keep things simple). Each of these hundreds of machines produces such a collection, and then hundreds of machines start the Reduce stage. Reducer machine A receives from every Mapper all the counts for words beginning with A, machine B receives the counts for words beginning with B (in practice it won't really partition by first letter; a hash function is used instead, because words beginning with X are surely far rarer than others and you don't want the machines' workloads to differ wildly). Each Reducer then aggregates its pairs: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Every Reducer does the same, and you get the word-frequency result for the whole file.
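To make the two stages concrete, here is a hedged, minimal sketch of that word count written as Python scripts for Hadoop Streaming (one common way to write MapReduce jobs without Java); the file layout and submission command mentioned in the comments are assumptions for illustration, not the exact setup of any particular cluster.

```python
# A minimal sketch of word count as two Hadoop Streaming scripts in Python.
# In a real job, mapper() and reducer() would live in separate files and be
# wired up roughly via: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
import sys

def mapper():
    # Map stage: this process sees only its slice of the file on stdin
    # and emits one (word, 1) pair per occurrence, tab-separated, on stdout.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce stage: the framework has shuffled and sorted the pairs, so all
    # counts for the same word arrive together; sum them up per word.
    current, total = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")   # e.g. hello    370292
            current, total = word, 0
        total += int(n)
    if current is not None:
        print(f"{current}\t{total}")
```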

This may look like a very simple model, but a great many algorithms can be expressed in it.

The simple Map+Reduce model is crude but brutally effective: it works, yet it is clumsy to use.

The second-generation engines, Tez and Spark, add new features such as in-memory caching, but in essence they make the Map/Reduce model more general, blur the boundary between Map and Reduce, make data exchange more flexible, and reduce disk reads and writes, so that complex algorithms are easier to describe and throughput is higher.
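As a rough sketch of how the newer engines feel in practice (using PySpark; the input path is a made-up example), the whole word count becomes one chain of transformations, and an intermediate result can be cached in memory and reused instead of being written back to disk:

```python
# A minimal PySpark sketch of the same word count (path and app name are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///hdfs/tmp/file1")

# One chained pipeline instead of rigid Map -> Shuffle -> Reduce stages.
words = lines.flatMap(lambda line: line.split())
counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())          # keep the result in memory for reuse

print(counts.take(5))             # e.g. [('hello', 370292), ...]
print(counts.count())             # reuses the cached result instead of recomputing from disk
```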

Once MapReduce, Tez, and Spark existed, programmers found that MapReduce programs were genuinely tedious to write, and they wanted to simplify the process.

It's like having only assembly language: you can do almost anything with it, but it still feels tedious. You want a higher-level, more abstract language in which to describe your algorithms and data-processing flows. So Pig and Hive appeared. Pig describes MapReduce jobs in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and hand them to the compute engine, and you are freed from writing tedious MapReduce code: you can type out the same logic in a simpler, more intuitive language.

Once Hive existed, people found that SQL had a huge advantage over Java: it is simply far easier to write.

Take the word-frequency job again: it is only a line or two of SQL, versus dozens or a hundred lines of MapReduce code. And, more importantly, users without a computer-science background finally felt the love: I can write SQL too! Data analysts were freed from the misery of begging engineers for help, and engineers were freed from writing strange, one-off processing programs for them.
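For contrast, here is roughly what that line or two of SQL looks like. This sketch runs the query through Spark SQL so the example is self-contained Python, but the same statement could just as well be submitted to Hive; the table and column names are invented for the example.

```python
# A minimal sketch: the word-count job expressed as SQL (names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("sql-wordcount").getOrCreate()
lines = spark.read.text("hdfs:///hdfs/tmp/file1")          # one column named "value"

# Turn each line into one row per word, then expose it as a table.
words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))
words.createOrReplaceTempView("words")

# The part an analyst actually writes -- versus dozens of lines of MapReduce:
freqs = spark.sql("SELECT word, COUNT(*) AS freq FROM words GROUP BY word ORDER BY freq DESC")
freqs.show(5)
```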

Everyone was happy. Hive gradually became the core component of the big data warehouse. Even many companies' pipeline jobs are written entirely in SQL, because it is easy to write, easy to change, readable at a glance, and easy to maintain.

Once data analysts started using Hive to analyze data, they discovered that Hive runs on MapReduce, and it is just too slow! For pipeline jobs that may not matter (say, recommendations refreshed every 24 hours: as long as the run finishes within 24 hours, fine), but for interactive analysis people always want results faster. For example, I want to know how many people stopped on the wearable-bracelet page in the last hour and how long they stayed; on a giant site with massive data, that query might take tens of minutes or even hours. And it may be only the first step of your Long March: you also want to know how many people browsed electronics and how many looked at Rachmaninoff CDs, so you can report to the boss whether our users are more geeky guys and quiet girls or more artsy young men and women. You can't stand the wait, so you pester the engineers: faster, come on, make it a little faster.

So Impala, Presto, and Drill were born (along with countless less famous interactive SQL engines not listed here).

The core idea behind all three is that the MapReduce engine is too slow because it is too general-purpose, too robust, too conservative. SQL workloads need to grab resources more lightly and more aggressively and need optimizations aimed specifically at SQL, and they don't need such heavy fault-tolerance guarantees (if something fails, just rerun the query; that is acceptable as long as the whole thing finishes within, say, a few minutes). These systems let users handle SQL tasks much faster, at the cost of generality and stability. If MapReduce is a machete that can hack through anything without fear, these three are boning knives: nimble and sharp, but not meant for anything too big or too hard.

These systems, to be honest, never became as popular as people expected.

Because by that time, two other oddballs had been created.

They are Hive on Tez/Spark and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run the SQL on a new-generation general-purpose computing engine such as Tez or Spark, it runs faster, and the user does not have to maintain two separate systems.

It's like having a small kitchen, being a lazy cook, and not being too picky about how refined the food is: you can buy a rice cooker that can steam, boil, and stew, and save yourself a pile of kitchenware.

Everything introduced above is, basically, the architecture of a data warehouse.

HDFS at the bottom; MapReduce/Tez/Spark running on top of it; Hive and Pig running on top of those; or Impala, Drill, and Presto running directly on HDFS. This covers the requirements of medium- and low-speed data processing.

What if I need to process data even faster?

Suppose I'm a company like Weibo, and instead of a 24-hour trending list I want a constantly changing leaderboard with updates delayed by no more than a minute; none of the means above will do. So another computational model was developed: streaming computation.

Storm is the most popular streaming computing platform. The idea behind streaming is: if you want more real-time updates, why not process the data the moment it flows in? Take word-frequency counting again: if my data arrives as a stream of words, I just let the words flow past and keep counting as they go.
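As a toy illustration of the idea (plain Python, not actual Storm code), the tally is updated the moment each word flows past, so the current leaderboard can be read off at any instant:

```python
# Toy sketch of streaming word count: update the running tally as each
# word arrives, rather than scanning data at rest.
from collections import Counter

counts = Counter()

def on_word(word):
    """Called once for every word flowing through the stream."""
    counts[word] += 1

# Simulate an endless stream with a small sample.
for w in ["hello", "world", "hello", "hello", "spark"]:
    on_word(w)

print(counts.most_common(2))   # current leaderboard at this instant, e.g. [('hello', 3), ('world', 1)]
```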

Streaming computation is great, with essentially no latency, but its weakness is inflexibility: you must know in advance what you want to count, because once the stream has flowed past, the data is gone, and whatever you didn't count then can't be made up later. So it's a fine thing, but it cannot replace the data warehouse and batch systems above.

Another fairly independent module is the KV Store, for example Cassandra, HBase, MongoDB, and many, many others (more than you can imagine). A KV Store means: I have a pile of key-value pairs, and given a key I can fetch the data bound to it very fast. For example, given an ID number, fetch that person's identity data. MapReduce could do this too, but it would likely scan the whole data set; a KV Store is dedicated to this one operation, with every read and write optimized specifically for it. Finding one ID number among petabytes of data might take only a fraction of a second. This has hugely sped up some specialized operations at big data companies. For example, if I have a page that looks up order details by order number, and the whole site's orders can't fit in a single-machine database, I'll consider a KV Store. The KV Store's philosophy is to basically give up on complex computation: most cannot JOIN, some cannot aggregate, and there are no strong consistency guarantees (the data lives on different machines, so you may read different results at different times), which means it cannot handle things like bank transfers that require strong consistency.

But boy, is it fast. Extremely fast.
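To make the access pattern concrete, here is a toy, hedged sketch of the KV Store idea in plain Python (not a real HBase or Cassandra client): hash the key to decide which machine holds it, then do a direct lookup there, with no scanning and no JOINs.

```python
# Toy sketch of a hash-partitioned key-value store (keys and values are hypothetical).
NUM_NODES = 4
nodes = [dict() for _ in range(NUM_NODES)]   # each dict stands in for one machine

def node_for(key):
    return nodes[hash(key) % NUM_NODES]      # the key alone tells us which machine to ask

def put(key, value):
    node_for(key)[key] = value

def get(key):
    return node_for(key).get(key)            # direct lookup on one node, no full scan

put("order-20240101-0001", {"item": "bracelet", "qty": 1})
print(get("order-20240101-0001"))            # fast point read by key
```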

Each KV Store design makes different trade-offs: some are faster, some hold more data, some support more complex operations. There is bound to be one that suits you.
Besides these, there are some more specialized systems and components: Mahout is a distributed machine-learning library, Protobuf is an encoding scheme and library for data interchange, ZooKeeper is a highly consistent distributed coordination system, and so on.

With so many assorted tools all running on the same cluster, they need to share it and cooperate in an orderly way.

So another important component is the scheduling system. The most popular one right now is YARN.

You can think of it as the central manager, like your mom supervising the kitchen: hey, your sister has finished chopping the vegetables, you can take the knife and go deal with the chicken.

As long as everyone obeys mom's assignments, everyone gets to cook happily.

So you can think of the big data ecosystem as an ecosystem of kitchen tools. To make different dishes, Chinese, Japanese, French, you need a variety of different tools.

And the diners' demands keep getting more complicated, new kitchen utensils keep being invented, and no single one can handle everything, so the whole thing grows more and more complex.

Copyright notice: this is an original blog article and may not be reproduced without the author's consent.
