Big data itself is a very broad concept, and the Hadoop ecosystem (or the broader ecosystem around it) is essentially a collection of tools for processing data beyond single-machine scale. You can compare it to a kitchen: you need a variety of tools. Pots, pans, and knives each have their own use, and their uses overlap. You can eat dinner straight out of a soup pot, and you can peel vegetables with a cleaver. Each tool has its own strengths, though, and while odd combinations can work, they are rarely the best choice.
With big data, the first problem is simply storing it.
Traditional file systems are single-machine and cannot span machines. HDFS (Hadoop Distributed File System) is designed to let a huge amount of data span hundreds or thousands of machines while presenting you with one file system rather than many. When you say "give me the data at /hdfs/tmp/file1", you refer to a single file path, but the actual data is stored on many different machines. As a user you don't need to know any of this, just as on a single machine you don't care which track or sector a file sits on. HDFS manages the data for you.
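A minimal sketch of that transparency, assuming a working Hadoop client on the machine; the path /hdfs/tmp/file1 is just the illustrative path from the text, not a real file. The caller addresses one logical path and never sees which machines hold the blocks:

```python
import subprocess

# Read a file by its logical HDFS path. Which datanodes actually store the
# blocks is invisible to the caller; `hdfs dfs -cat` is the standard Hadoop CLI.
result = subprocess.run(
    ["hdfs", "dfs", "-cat", "/hdfs/tmp/file1"],   # illustrative path
    capture_output=True, text=True, check=True,
)
print(result.stdout[:200])  # first 200 characters, wherever they physically live
```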
Once you can store the data, you start thinking about how to process it. Although HDFS can manage data spread across machines as a whole, the data is still huge. A single machine reading data measured in terabytes or petabytes (really big data, say bigger than the complete high-definition catalogue of Tokyo Hot) might take days or even weeks to churn through it. For many companies, single-machine processing is intolerable: if Weibo wants to update its 24-hour trending posts, it has to finish the computation within 24 hours. So if I use many machines to process the data, I now have to solve how to divide the work among them, how to restart a task when a machine dies, how the machines communicate and exchange data for complex computations, and so on. This is what MapReduce/Tez/Spark are for. MapReduce is the first-generation compute engine; Tez and Spark are the second generation. MapReduce adopted a radically simplified computation model with only two phases, Map and Reduce (joined in the middle by a Shuffle), yet with this model it can already handle a large share of problems in the big data domain.
So what is Map, and what is Reduce?
Consider a huge text file stored on something like HDFS, and suppose you want to know how often each word appears in it. You launch a MapReduce job. In the Map phase, hundreds of machines read different parts of the file at the same time, each counting word frequencies for its own part and producing pairs like (hello, 12100) and (world, 15214) (I'm folding map and combine together here to simplify). Each of the hundreds of machines produces such a set, and then hundreds of machines start the Reduce phase. Reducer machine A receives from the mapper machines all the counts for words starting with A, machine B receives the counts for words starting with B (in practice you wouldn't really partition by first letter; you'd use a hash of the word, to avoid data skew, because words starting with X are surely far rarer than others and you don't want some machines to get a wildly disproportionate amount of work). The reducers then aggregate again: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Each reducer does the same, and you get the word-frequency result for the entire file.
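Here is a single-process sketch of that flow in plain Python (no Hadoop involved), just to make the Map → Shuffle → Reduce steps concrete; the chunking, hash partitioning, and merging all happen in one process here, whereas MapReduce spreads them across many machines:

```python
from collections import Counter, defaultdict

def map_phase(chunk):
    """Each 'mapper' counts words in its own chunk (map + combine folded together)."""
    return Counter(chunk.split())

def shuffle(partial_counts, num_reducers):
    """Route each word to a reducer by hash, so no reducer gets a lopsided share."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for counts in partial_counts:
        for word, n in counts.items():
            buckets[hash(word) % num_reducers][word].append(n)
    return buckets

def reduce_phase(bucket):
    """Each 'reducer' sums the partial counts it received."""
    return {word: sum(ns) for word, ns in bucket.items()}

chunks = ["hello world hello", "hello big data world", "world of data"]  # stand-ins for file splits
partials = [map_phase(c) for c in chunks]
totals = {}
for bucket in shuffle(partials, num_reducers=2):
    totals.update(reduce_phase(bucket))
print(totals)  # e.g. {'hello': 3, 'world': 3, 'data': 2, ...}
```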
This may look like a very simple model, but many algorithms can be described with it. The Map+Reduce model is crude but effective; it works, yet it is clumsy. The second generation, Tez and Spark, besides adding new features such as in-memory caching, essentially makes the Map/Reduce model more general: it blurs the boundary between Map and Reduce, makes data exchange more flexible, and reduces disk reads and writes, so that complex algorithms are easier to describe and higher throughput is achievable.
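To illustrate that more flexible, chained style, here is the same word count as a sketch in PySpark (assuming a local Spark installation; the input path is made up): the job is a pipeline of transformations rather than one rigid Map→Reduce pair, and intermediate results can be cached in memory.

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# A chain of transformations; Spark builds a DAG and only runs it on an action.
lines = sc.textFile("hdfs:///tmp/file1")          # illustrative path, not a real file
words = lines.flatMap(lambda line: line.split())  # map-like step: line -> words
pairs = words.map(lambda w: (w, 1)).cache()       # keep in memory for possible reuse
counts = pairs.reduceByKey(add)                   # reduce-like step, shuffled by key

print(counts.take(10))
sc.stop()
```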
With MapReduce, Tez and Spark in hand, programmers found that writing MapReduce programs was still tedious. They wanted to simplify the process. It's like having assembly language: you can do almost anything with it, but it still feels cumbersome. You want a higher-level, more abstract layer to describe algorithms and data-processing flows. So Pig and Hive appeared. Pig describes MapReduce jobs in something close to a scripting language; Hive uses SQL. They translate the scripts or SQL into MapReduce programs and hand them to the compute engine, and you are freed from tedious MapReduce code to write in a simpler, more intuitive language.
With Hive, SQL showed a huge advantage over Java. One is that it is far easier to write: the word-count job is one or two lines of SQL, versus dozens or a hundred lines of MapReduce. More importantly, users without a computer-science background finally felt the love: I can write SQL! Data analysts were freed from the misery of begging engineers for help, and engineers were freed from writing strange one-off handlers. Everyone was happy. Hive grew into the core component of the big data warehouse. Many companies even describe their entire pipeline jobs in SQL, because it is easy to write, easy to change, easy to understand, and easy to maintain.
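For instance, the word count above really does collapse to a couple of lines of HiveQL. A hedged sketch, assuming a Hive server reachable through PyHive and a hypothetical table docs with a single string column line (the host and port are assumptions too):

```python
from pyhive import hive  # PyHive's Hive client

cursor = hive.connect(host="localhost", port=10000).cursor()
# Split each line into words and count them -- the whole job in one statement.
cursor.execute("""
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words
    GROUP BY word
""")
print(cursor.fetchall())
```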
Once data analysts started using Hive to analyze data, they found that Hive running on MapReduce was slow. Too slow! For pipeline job sets it may not matter: a recommendation list updated every 24 hours just has to finish within 24 hours. But for interactive analysis, people always want results faster. Say I want to know how many people visited the wearable-bracelet page in the last hour and how long they stayed; on a giant site with huge amounts of data, this can take tens of minutes or even hours. And that query may be only the first step of your Long March: you also want to know how many people browsed electronics and how many looked at Rachmaninoff CDs, so you can report to the boss whether our users are mostly geeky guys and quiet girls or mostly artsy young men and women. You can't stand the torment of waiting, so you tell the engineers: faster, faster, hurry up!
So Impala, Presto and Drill were born (along with countless less famous interactive SQL engines not listed here). The core idea of all three is that the MapReduce engine is slow because it is too general-purpose, too robust, too conservative; SQL needs something more lightweight, more aggressive about grabbing resources, more specialized for SQL optimization, and without so many fault-tolerance guarantees (if something fails, just restart the query; since the whole run is short, say a few minutes, that's acceptable). These systems let users handle SQL tasks much faster, at the cost of generality and stability. If MapReduce is a machete that can hack through anything without fear, these three are boning knives: nimble and sharp, but not for anything too big or too hard.
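A sketch of what "interactive" looks like from the analyst's side, assuming a Presto coordinator reachable through PyHive (the host, port, and page_views table are all assumptions): the same kind of SQL as before, but expected back in seconds rather than after a long batch job.

```python
from pyhive import presto  # PyHive's Presto client; connection details are assumptions

cursor = presto.connect(host="presto-coordinator", port=8080).cursor()
cursor.execute("""
    SELECT page, COUNT(DISTINCT user_id) AS visitors
    FROM page_views                                  -- hypothetical table
    WHERE view_time > now() - interval '1' hour
    GROUP BY page
""")
print(cursor.fetchall())
```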
These systems, to tell the truth, never became as popular as people expected, because by then two other oddballs had appeared: Hive on Tez/Spark, and SparkSQL. Their idea is: MapReduce is slow, but if I run SQL on a new-generation general-purpose compute engine like Tez or Spark, it runs fast enough, and users don't need to maintain two separate systems. It's like having a small kitchen, being lazy, and not being too picky about the food: you buy a rice cooker that can steam and boil, and save yourself a pile of kitchenware.
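A minimal SparkSQL sketch of that "one engine, same SQL" idea (assuming a local PySpark install; the HDFS path and table name are made up), where the query from the Hive example runs unchanged on Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-sketch").getOrCreate()

# Register the text file as a one-column table, then reuse the same Hive-style SQL.
spark.read.text("hdfs:///tmp/file1").withColumnRenamed("value", "line") \
     .createOrReplaceTempView("docs")
spark.sql("""
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words
    GROUP BY word
""").show(10)
spark.stop()
```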
What's described above is basically the skeleton of a data warehouse: HDFS at the bottom, MapReduce/Tez/Spark running on top of it, and Hive and Pig running on those; or Impala, Drill and Presto running directly against HDFS. This covers the requirements of medium- and low-speed data processing.
Lao Li shares: the big data ecosystem, part 1