"Big data" itself is a very broad concept, and the Hadoop ecosystem (or, more loosely, the pan-Hadoop ecosystem) is basically designed for processing data at scales a single machine cannot handle. You can compare it to a kitchen, so you need a variety of tools. Pots and pans each have their own use, and their uses overlap: you can drink soup straight out of the stock pot, and you can peel with a chef's knife instead of a peeler. But every tool has its own character, and although odd combinations can work, they are not necessarily the best choice.
Big data: first of all, you have to be able to store the data.

Traditional file systems are single-machine and cannot span different machines. HDFS (Hadoop Distributed File System) is essentially designed so that large amounts of data can span hundreds or thousands of machines, while what you see is still one file system rather than many. For example, when you say you want the data at /hdfs/tmp/file1, you refer to one file path, but the actual data is stored on many different machines. As a user, you do not need to know any of this, just as on a single machine you do not care which tracks and sectors a file is scattered across. HDFS manages this data for you.
After you have stored the data, you start thinking about how to process it. Although HDFS can manage data spread across different machines for you as a whole, the data is still too large. If a single machine has to read terabytes or petabytes of data (really big data, for example the size of every high-definition movie Tokyo Hot has ever released, or even larger), it might take days or even weeks to grind through it. For many companies, single-machine processing is intolerable: Weibo, for example, has to update its 24-hour trending list, so it must finish that processing within 24 hours. So if I use many machines to do the work, I face the problems of how to divide the work among them, how to restart the corresponding task when a machine fails, how the machines communicate and exchange data to complete complex computations, and so on. This is the job of MapReduce / Tez / Spark.
MapReduce is the first-generation computation engine; Tez and Spark are the second generation. MapReduce adopted a radically simplified computational model, with only two processing phases, Map and Reduce (connected by Shuffle in the middle), and with this model alone it can already handle a very large share of problems in the big data world.
So what is Map, and what is Reduce?

Consider the case where you want to count words: a huge text file is stored on something like HDFS, and you want to know how often each word appears in it. You launch a MapReduce program. In the Map phase, hundreds of machines read different parts of the file at the same time, each counting word frequencies in the part it read and producing pairs like (hello, 12100), (world, 15214), and so on (here I am lumping Map and Combine together to keep things simple). Each of these hundreds of machines produces a collection like this, and then hundreds of machines kick off the Reduce phase. Reducer machine A receives, from all the Mapper machines, the counts for words beginning with A; machine B receives the counts for words beginning with B (of course it is not really split by leading letter; a hash function generates the partition value, which avoids data skew: words starting with letters like X are much rarer than others, and you do not want huge differences in how much data each machine has to process). These Reducers then aggregate again: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Each Reducer does the same, and you end up with the word-frequency result for the entire file.
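To make the Map, Shuffle, and Reduce phases concrete, here is a toy single-process sketch in Python. Real MapReduce spreads the map and reduce functions across hundreds of machines; the chunk data, the number of reducers, and the function names here are my own illustration, not anything from the original article.

```python
# A toy, single-process sketch of the Map / Shuffle / Reduce flow described
# above. The per-reducer partitioning by hash (rather than by first letter)
# is exactly the trick mentioned in the text for avoiding skew.
from collections import Counter, defaultdict

NUM_REDUCERS = 4  # stand-in for "hundreds of machines"

def map_phase(chunk: str) -> Counter:
    """Each mapper counts the words in its own chunk of the file (Map + Combine)."""
    return Counter(chunk.split())

def shuffle(mapper_outputs, num_reducers):
    """Route each word to a reducer by hash, so no reducer is overloaded."""
    partitions = defaultdict(list)
    for counts in mapper_outputs:
        for word, n in counts.items():
            partitions[hash(word) % num_reducers].append((word, n))
    return partitions

def reduce_phase(pairs):
    """Each reducer sums the partial counts it received."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals

if __name__ == "__main__":
    chunks = ["hello world hello", "world says hello", "hello again"]
    mapped = [map_phase(c) for c in chunks]        # Map (+ Combine)
    partitions = shuffle(mapped, NUM_REDUCERS)     # Shuffle
    result = Counter()
    for pairs in partitions.values():              # Reduce
        result.update(reduce_phase(pairs))
    print(result.most_common())
```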
This may look like a very simple model, but a great many algorithms can be described with it.

The simple Map+Reduce model is crude but effective; it gets the job done, but it is very clumsy to work with.
The second-generation Tez and Spark, beyond new features such as in-memory caching, essentially make the Map/Reduce model more general, blur the boundary between Map and Reduce, make data exchange more flexible, and read and write disk less, so that complex algorithms are easier to describe and higher throughput can be achieved.
Once MapReduce, Tez, and Spark existed, programmers discovered that MapReduce programs were really troublesome to write. They wanted to simplify the process. It is like having only assembly language: you can do almost anything with it, but you still find it tedious. You want a higher-level, more abstract language layer to describe your algorithms and data-processing flows. So Pig and Hive appeared. Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and throw them at the compute engine, and you are freed from writing tedious MapReduce programs, tapping out code in a simpler, more intuitive language.
After Hive appeared, people found that SQL had a huge advantage over Java. One is that it is so easy to write. The word-frequency job that takes dozens or hundreds of lines of MapReduce code can be described in just one or two lines of SQL. And more importantly, users without a computer-science background finally felt the love: I can write SQL too! Data analysts were finally freed from the predicament of begging engineers for help, and engineers were freed from writing weird one-off handler programs.
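To make the "one or two lines of SQL" claim concrete, here is a hedged sketch of what the word-count job might look like as a HiveQL statement, submitted here through the Hive CLI's -e flag from Python. The table name docs and its single string column line are placeholders I invented for illustration, and the sketch assumes a Hive CLI is installed on the machine.

```python
# A sketch of the same word count expressed as one HiveQL statement.
# "docs(line STRING)" is an assumed table holding the raw text.
import subprocess

WORD_COUNT_SQL = """
SELECT word, COUNT(*) AS freq
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) t
GROUP BY word
"""

# Hive translates this SQL into MapReduce jobs and runs them on the cluster.
subprocess.run(["hive", "-e", WORD_COUNT_SQL], check=True)
```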
Everyone was happy. Hive gradually became a core component of the big data warehouse. Even many companies' pipeline jobs are written entirely in SQL, because it is easy to write, easy to change, understandable at a glance, and easy to maintain.
Once data analysts started analyzing data with Hive, they discovered that Hive runs on MapReduce, and it is really slow! For pipeline job sets this may not matter: a recommendation list refreshed every 24 hours only needs to finish within those 24 hours. But for data analysis, people always want results faster. For example, I want to know how many people stopped at the wearable-bracelet page in the last hour and how long they stayed, and for a giant site with massive data this could take tens of minutes or even many hours. And this analysis may only be the first step of your long march: you also want to see how many people browsed electronics and how many looked at Rachmaninoff CDs, so you can report to the boss whether our users are mostly ordinary geeks and quiet girls or mostly artsy young men and women. You cannot stand the torture of waiting, so you can only tell the engineer: faster, come on, make it a little faster!
So Impala, Presto, and Drill were born (along with countless less famous interactive SQL engines not listed here).
The core idea of these three systems is that the MapReduce engine is too slow because it is too general-purpose, too sturdy, too conservative. SQL needs to grab resources more lightly and more aggressively, to be optimized specifically for SQL, and to skip the heavy fault-tolerance guarantees (if a system error occurs, it is no big deal to restart the task, provided the whole run is short, say within a few minutes). These systems let users handle SQL tasks much faster, at the cost of general-purpose features and stability. If MapReduce is a machete that will cut anything without fear, the three above are boning knives: nimble and sharp, but unable to take on things that are too big or too hard.
These systems, to tell the truth, have never reached the popularity people expected. Because around that time, two other oddballs were created.
They are Hive on Tez / Spark and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run the SQL with a new-generation general-purpose compute engine, Tez or Spark, I can run faster, and the user does not need to maintain two separate systems.

It is like this: if your kitchen is small, you are lazy, and your standards for fine dining are limited, you can buy a rice cooker that can steam, boil, and stew, and save yourself a pile of kitchenware.
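A minimal PySpark sketch of the "same SQL, newer engine" idea: a SparkSession with Hive support enabled can run the very same HiveQL statement over the warehouse tables, so nothing about the query changes, only the engine underneath. The table name docs is again my own placeholder, and this assumes a working Spark installation with access to the Hive metastore.

```python
# Minimal PySpark sketch: the same SQL, now executed by Spark instead of
# MapReduce. enableHiveSupport() lets Spark see the existing Hive tables,
# so analysts keep their SQL and only the engine underneath changes.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("word-count-sql")
         .enableHiveSupport()   # reuse the Hive metastore / warehouse tables
         .getOrCreate())

result = spark.sql("""
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) t
    GROUP BY word
    ORDER BY freq DESC
""")
result.show(20)
```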
Everything introduced above is basically the framework of a data warehouse: HDFS at the bottom, MapReduce / Tez / Spark running on top of it, and Hive and Pig running on top of those; or Impala, Drill, and Presto running directly on HDFS. Together these cover the requirements of medium- and low-speed data processing.
What if I need to process data even faster?

Suppose I am a company like Weibo and I want to show not a 24-hour trending list but one that changes constantly, with an update delay of under a minute. The means above will not be up to the task. So another computational model was developed: Streaming (stream) computation.
Storm is the most popular stream-computing platform. The idea of stream computation is: if you want more real-time updates, why not process the data the moment it flows in? Take word-frequency counting again: my data stream is words arriving one by one, and I let them flow past me and count as they go.

Stream computation is great: it has basically no latency. Its shortcoming is that it is not flexible. The things you want to count must be known in advance, because once the data stream has flowed by, it is gone, and whatever you did not count cannot be made up afterwards. So it is a very good thing, but it cannot replace the data warehouse and batch-processing systems above.
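To make the "count it as it flows past" idea concrete, here is a toy single-process sketch (not Storm's actual API): the tally is updated the instant each word arrives, so an up-to-the-second hot list is always available, but only for the metric you decided to track in advance. The fake word stream and vocabulary are invented for illustration.

```python
# Toy sketch of streaming word counting: update the tally the moment each
# word flows by, instead of batch-scanning a stored file afterwards.
# This illustrates the idea only; it is not Storm's (or any engine's) API.
from collections import Counter
import itertools
import random

def word_stream():
    """Pretend this is an endless firehose of words (e.g. from Weibo posts)."""
    vocab = ["hello", "world", "spark", "storm", "hive"]
    while True:
        yield random.choice(vocab)

counts = Counter()
for i, word in enumerate(itertools.islice(word_stream(), 10_000), start=1):
    counts[word] += 1              # the count is updated as the data flows in
    if i % 2_500 == 0:             # a "hot list" can be published at any moment
        print(f"after {i} words:", counts.most_common(3))
```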
Another relatively independent module is the KV Store, for example Cassandra, HBase, MongoDB, and many, many others (more than you can imagine). A KV Store means: I have a pile of key-value pairs, and I can very quickly fetch the data bound to a given Key. For example, I can fetch your identity data using your social security number. This operation could also be done with MapReduce, but it would very likely have to scan the whole dataset, whereas a KV Store is dedicated to this operation, with all storing and fetching optimized specifically for it. Looking up one social security number in petabytes of data might take only a fraction of a second. This hugely optimizes some of the specialized operations of big data companies. For example, if I have a page that looks up the contents of an order by its order number, and the whole site's orders cannot fit in a single-machine database, I will consider using a KV Store to hold them. The philosophy of a KV Store is that it basically cannot handle complex computation: most of them cannot JOIN, many cannot aggregate, and there is no strong consistency guarantee (different data is distributed across different machines, so you may read different results each time; it cannot handle operations with strong consistency requirements, like a bank transfer). But hey, it is fast. Extremely fast.
Every KV Store design makes different trade-offs: some are faster, some hold more, some support richer operations. There is always one for you.
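Here is a toy contrast between the two access patterns described above: a batch scan over all records versus a direct key lookup against a KV-style index. The order data is invented for illustration; real stores like HBase or Cassandra add persistence, distribution across machines, and their own client APIs on top of this basic idea.

```python
# Toy contrast: finding one order by scanning everything (batch-style)
# versus a direct key lookup (KV-store-style). The data is made up.
orders = [{"order_id": f"A{i:07d}", "amount": i % 100} for i in range(1_000_000)]

def find_by_scan(order_id):
    """Batch style: touch every record just to find one of them."""
    return next((o for o in orders if o["order_id"] == order_id), None)

# KV style: index the records by order_id once, then every lookup is O(1).
kv_index = {o["order_id"]: o for o in orders}

def find_by_key(order_id):
    return kv_index.get(order_id)

print(find_by_scan("A0123456"))   # walks up to a million records
print(find_by_key("A0123456"))    # one hash lookup
```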
Besides these, there are other, more specialized systems and components: Mahout is a distributed machine-learning library, Protobuf is an encoding format and library for data interchange, ZooKeeper is a highly consistent distributed coordination system, and so on.
With so many messy tools all running on the same cluster, everyone needs to respect each other and work in an orderly way. So another important component is the scheduling system, and the most popular one now is YARN. You can think of it as the central manager, like your mom supervising the kitchen: hey, your sister has finished chopping the vegetables, so you can take the knife and go kill the chicken. As long as everyone obeys mom's assignments, everyone can cook happily.
So you can see that the big data ecosystem is just like a kitchen-tool ecosystem. To make different dishes, Chinese food, Japanese cuisine, French cuisine, you need a variety of different tools. And the guests' needs keep getting more complicated, so your kitchen utensils keep being invented, and since no single one can handle every situation, the whole thing gets more and more complex.