Ecosystem diagram of Big Data engineering


Big data, first of all, means you have to be able to store big data. Traditional file systems are single-machine and cannot span different machines. HDFS (Hadoop Distributed File System) is designed precisely so that a huge amount of data can span hundreds or thousands of machines while you still see a single file system rather than many. For example, when you ask for the data at /hdfs/tmp/file1, you refer to one file path, but the actual data is stored on many different machines. As a user you do not need to know any of this, just as on a single machine you do not care which track or sector a file is scattered across. HDFS manages the data for you.
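To make that concrete, here is a minimal sketch (not from the original article) of how a client addresses data by one logical HDFS path while the system worries about which machines hold the blocks. It assumes a reachable HDFS cluster and uses the pyarrow client; the hostname, port, and path are placeholders.

```python
# Hedged sketch: reading one logical HDFS path with pyarrow.
# Assumes an HDFS cluster is reachable; host, port, and path are placeholders.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# To the caller this is just one file path; HDFS decides which machines
# and which blocks actually back it.
with hdfs.open_input_stream("/hdfs/tmp/file1") as f:
    head = f.read(1024)

print(head[:80])
```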

Once you can store the data, you start thinking about how to process it. Although HDFS can manage data spread over many machines as a whole, that data can be enormous. A single machine slowly reading terabytes or petabytes might take days or even weeks, and for many companies single-machine processing is intolerable: if Weibo wants to update its 24-hour trending list, it must finish the computation within 24 hours. So if I use many machines to process the data, I have to solve how to divide the work, how to restart a task when a machine fails, how the machines communicate and exchange data for complex computations, and so on. This is the job of MapReduce, Tez, and Spark. MapReduce is the first-generation compute engine; Tez and Spark are the second generation. MapReduce adopts a radically simplified computation model with only two phases, Map and Reduce (joined in the middle by a shuffle), and with this model alone it can already handle a large fraction of problems in the big-data domain.
So what exactly are Map and Reduce?


     Consider counting words in a huge text file stored on something like HDFS: you want to know how often each word appears. You launch a MapReduce job. In the Map phase, hundreds of machines read different parts of the file at the same time, each counts word frequencies in its own part, and each produces pairs like (hello, 12100) and (world, 15214) (I am folding the map and combine steps together to keep things simple). Each of those hundreds of machines produces such a collection, and then hundreds of machines start the Reduce phase. Reducer machine A receives from all the mapper machines the counts for words starting with A, machine B receives the counts for words starting with B (in practice the routing is not really by first letter but by a hash function, to spread the data evenly: words starting with X are certainly far rarer than others, and you do not want some machines to get a disproportionate share of the work). Each reducer then rolls its counts up: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Every reducer does the same, and you get the word-frequency result for the entire file. This may look like a very simple model, yet many algorithms can be described with it.
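As a toy illustration of that flow (my own sketch, not the article's code), the whole map → shuffle → reduce pipeline can be mimicked in a few lines of ordinary Python; in a real cluster each chunk and each reducer would live on a different machine.

```python
# Toy single-process sketch of MapReduce word count: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(chunk):
    # Each "mapper" emits (word, 1) pairs for its own chunk of the file.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # The framework routes all pairs with the same key to the same reducer.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    # Each "reducer" sums the counts for the words routed to it.
    return {word: sum(counts) for word, counts in grouped.items()}

chunks = ["hello world hello", "world hello"]   # pretend each chunk sits on a different machine
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(mapped)))            # {'hello': 3, 'world': 2}
```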
    The simple Map+Reduce model is crude but effective: it works, yet it is clumsy. The second generation, Tez and Spark, besides adding new features such as in-memory caching, essentially makes the Map/Reduce model more general. It blurs the boundary between Map and Reduce, makes data exchange more flexible, and reduces disk reads and writes, so that complex algorithms are easier to express and throughput is higher.

    The biggest difference between Spark and Hadoop is that Spark makes architectural improvements on top of Hadoop: Hadoop MapReduce keeps intermediate data on disk, while Spark keeps it in memory, so Spark can be up to about 100 times faster than Hadoop for some workloads. However, because data in memory is lost when power is cut, Spark by itself is not suited to storing data that must be retained long term.
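For comparison, the same word count in Spark is a short chain of transformations, with the intermediate result cached in memory. This is a minimal PySpark sketch assuming a working Spark installation; the HDFS path is a placeholder.

```python
# Minimal PySpark word count; assumes Spark is installed, the path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///tmp/file1")
counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)      # reduce: sum counts per word
               .cache())                             # keep the result in memory for reuse

print(counts.take(5))
spark.stop()
```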

Storm is a distributed computing system promoted by Twitter, developed by the BackType team, and contributed to the Apache Foundation. It adds real-time computation on top of the Hadoop ecosystem and can process big data streams as they arrive. Unlike Hadoop and Spark, Storm does not collect and store the data first: it accepts data directly over the network, processes it in real time, and returns results directly over the network.
Hadoop, Spark, and Storm are currently the three most important distributed computing systems: Hadoop is often used for offline processing of complex big data, Spark for fast offline processing of big data, and Storm for online, real-time processing of big data.

With MapReduce, Tez, and Spark in hand, programmers found that MapReduce programs are really cumbersome to write, and they wanted to simplify the process. It is like having assembly language: you can do almost anything with it, but it still feels tedious. You want a higher-level, more abstract language layer to describe algorithms and data-processing flows. So Pig and Hive appeared. Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and hand them to the compute engine, so you are freed from tedious MapReduce code and can write in simpler, more intuitive languages. Shark, roughly "SQL on Spark", was an implementation of a data warehouse on Spark that, while staying compatible with Hive, could run up to 100 times faster than Hive.
With Hive, SQL has a huge advantage over Java. One is that it is far easier to write: the word-count job takes only one or two lines of SQL, whereas the MapReduce version runs to dozens or hundreds of lines. More importantly, users without a computer-science background finally felt the love: I can write SQL! Data analysts freed themselves from the dilemma of begging engineers for help, and engineers were freed from writing strange one-off handlers. Everyone was happy. Hive gradually grew into the core component of the big-data warehouse. In many companies even the pipeline jobs are described entirely in SQL, because it is easy to write, easy to change, readable by anyone, and easy to maintain.
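To show the "one or two lines of SQL" point, here is a hedged sketch of the word-count query run through Spark SQL (Hive accepts essentially the same statement); the tiny in-memory `docs` table with a single `line` column is an assumption made up for the example.

```python
# Hedged sketch: word count as SQL. The `docs` table and its `line` column are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-wordcount").getOrCreate()
spark.createDataFrame([("hello world hello",), ("world hello",)], ["line"]) \
     .createOrReplaceTempView("docs")

spark.sql("""
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(line, ' ')) AS word FROM docs) t
    GROUP BY word
    ORDER BY freq DESC
""").show()
spark.stop()
```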
Once data analysts began analyzing data with Hive, they found that Hive running on MapReduce is painfully slow! For pipeline jobs that may not matter much: a recommendation list updated every 24 hours just has to finish within 24 hours. But for data analysis, people always want results faster. For example, I want to know how many people stopped on the inflatable-doll page in the last hour and how long they stayed, and for a huge website this query may take tens of minutes or even hours. And that analysis may only be the first step of your long march: you also want to know how many people browsed the vibrating eggs and how many looked at the Rachmaninoff CDs, so you can report to the boss whether our users are mostly seedy men and repressed women or mostly literary young men and women. You cannot stand the torment of waiting, so you can only tell the handsome engineer: faster, faster, just a little faster!


So Impala, Presto, and Drill were born (along with countless less famous interactive SQL engines that I will not list). The core idea of all three is that the MapReduce engine is too slow because it is too general, too sturdy, too conservative; we SQL users need something more lightweight, more aggressive in grabbing resources, more specialized in SQL optimization, and without so many fault-tolerance guarantees (if the system hits trouble, just restart the task, since the whole job only takes a few minutes anyway). These systems let users handle SQL tasks much faster, at the cost of generality and stability. If MapReduce is a machete that can chop anything without fear, these three are boning knives: nimble and sharp, but not for anything too big or too hard.

To tell the truth, these systems have never become as popular as people expected, because by then two contenders of a different breed had appeared: Hive on Tez/Spark and SparkSQL. Their design idea is that MapReduce is slow, but if I run SQL on a new-generation general-purpose compute engine such as Tez or Spark, it will run faster, and the user does not need to maintain two separate systems. It is like a small kitchen with a lazy cook and limited demands on refinement: a rice cooker that can both steam and boil saves a lot of utensils.

What has been introduced so far is basically the data-warehouse stack: HDFS at the bottom, MapReduce/Tez/Spark running on top of it, and Hive and Pig running on top of those; or Impala, Drill, and Presto running directly on HDFS. This covers the requirements of medium- and low-speed data processing.

What if I need to process data at higher speed?
If I run a microblogging-like company and want to show not a 24-hour trending list but a constantly changing one, updated with a delay of about a minute, the methods above are not up to the job. So another computation model was developed: streaming computation. Storm is the most popular streaming platform. The idea of streaming is: if you want updates that are closer to real time, why not process the data the moment it flows in? Take the word-count example again: if my data stream is words arriving one by one, I just let them flow past me and keep a running count. Streaming is great, with essentially no delay, but its weakness is inflexibility: you must know in advance what you want to count, because once the stream has flowed past, anything you did not count cannot be recomputed. So it is a fine thing, but it cannot replace the data warehouse and batch-processing systems.
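The streaming idea fits in a few lines of plain Python (my own toy sketch, not Storm code): the count is updated the instant each word arrives, which is why the result is always fresh, but only for the questions you chose to track up front.

```python
# Toy sketch of streaming word count: update state as each element flows past.
from collections import Counter

def word_stream():
    # Stand-in for an endless stream, e.g. what a Storm spout would emit.
    yield from ["hello", "world", "hello", "hello"]

running_counts = Counter()
for word in word_stream():
    running_counts[word] += 1            # updated the moment the word arrives
    print(dict(running_counts))          # the "hit list" is always up to date
```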

There is another, somewhat independent class of modules: KV stores, such as Cassandra, HBase, MongoDB, and many, many others (more than you can imagine). A KV store means: I have a pile of key-value pairs, and I can quickly fetch the data bound to a given key. For example, given an ID number, I can fetch your identity data. The same lookup could be done with MapReduce, but it might scan the entire data set; a KV store is dedicated to this operation, and every read and write is optimized for it. Finding one ID number in petabytes of data may take only a fraction of a second. This has hugely optimized some of the specialized operations of big-data companies. For example, if a page on my website looks up order contents by order number, and the whole site's orders cannot fit in a single-machine database, I will consider using a KV store. The KV store's philosophy is: it basically cannot handle complex computation, mostly cannot do joins, perhaps cannot do aggregation, and gives no strong-consistency guarantee (data is distributed across machines, and you may read different results at different times, so you cannot handle operations like bank transfers that require strong consistency). But it is fast. Extremely fast. Each KV store makes different trade-offs in its design: some are faster, some hold more data, some support more complex operations. There is certainly one for you.
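As a concrete illustration of the key-lookup pattern (a minimal sketch, not the article's code), here is the order-number example using MongoDB through pymongo; the connection string, database, and field names are placeholders, and a running MongoDB server is assumed.

```python
# Hedged sketch of the KV access pattern with MongoDB; names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]
orders.create_index("order_no")   # index the key so lookups stay fast

orders.insert_one({"order_no": "20240101-0001", "items": ["cd-rachmaninoff"], "total": 99})

# Point lookup by key: fetch exactly what the key maps to, no scan of the whole data set.
print(orders.find_one({"order_no": "20240101-0001"}))
```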

In addition, there are some more specialized systems and components: for example, Mahout is a distributed machine-learning library, Protobuf is an encoding format and library for data interchange, ZooKeeper is a highly consistent distributed coordination service, and so on.

With so many assorted tools running on the same cluster, everyone needs to work together respectfully and in good order, so another important component is the scheduling system. YARN is now the most popular. You can think of it as the central manager, like your mother supervising the kitchen: hey, your sister has finished cutting the vegetables, so you can take the knife and kill the chicken. As long as everyone obeys your mother's assignments, everyone can cook happily.

You can think of the big-data ecosystem as a kitchen-tool ecosystem. To make different dishes, Chinese cuisine, Japanese cuisine, French cuisine, you need different tools. And as the guests' needs grow more complicated, your kitchen utensils keep being invented; no single one can handle every situation, so the ecosystem becomes more and more complex.

HBase can be thought of as the MySQL of this world, and HDFS as the hard disk. HBase is simply a NoSQL database whose data lives on HDFS, in the same way that a MySQL installation on Linux stores its data on the Linux file system.

    • The Apache Hadoop system really is like an operating system. It mainly includes HDFS, which corresponds to ext3/ext4 under Linux, and YARN, which corresponds to Linux's process scheduling and memory allocation modules.

The main characteristics of big data are large data volume (Volume), complex data variety (Variety), fast processing speed (Velocity), and high data veracity (Veracity), known together as the 4 Vs.
