Understanding the big data technology ecosystem

Source: Internet
Author: User
Tags: abstract language, hadoop ecosystem

Big data itself is a very broad concept, and the Hadoop ecosystem (or, more broadly, the ecosystem around it) exists essentially to process data at a scale beyond what a single machine can handle. You can compare it to a kitchen: you need a variety of tools. Pots, pans and knives each have their own use, and their uses overlap. You can eat dinner straight out of a soup pot, and you can peel vegetables with a chef's knife instead of a peeler. But every tool has its own strengths; odd combinations may work, but they are rarely the best choice.

To do anything with big data, first you have to be able to store it. Traditional file systems live on a single machine and cannot span machines. HDFS (Hadoop Distributed File System) is designed to spread huge amounts of data across hundreds of machines while presenting you with one file system rather than many. When you ask for the data at /hdfs/tmp/file1, you refer to a single file path, but the actual data is stored on many different machines. As a user you do not need to know any of this, just as on a single machine you do not care which track or sector a file is scattered across. HDFS manages the data for you.

Once you can store the data, you start thinking about how to process it. Although HDFS can manage data spread over many machines as a single whole, that data can be far too big for one machine. Reading terabytes or petabytes on a single machine (very big data, say bigger than every high-definition movie ever released put together) could take days or even weeks. For many companies single-machine processing is intolerable: Weibo, for example, must finish computing its 24-hour trending list within 24 hours. So if I want many machines to share the processing, I have to work out how to split the work, how to restart a task when a machine dies, how the machines communicate and exchange data for complex computations, and so on. This is what MapReduce, Tez and Spark are for. MapReduce is the first-generation compute engine; Tez and Spark belong to the second generation. The MapReduce design uses a radically simplified computation model with only two stages, map and reduce (connected in the middle by a shuffle), and with just this model a large share of problems in the big data domain can already be handled.

So what is a map and what is a reduce? Suppose a huge text file is stored on something like HDFS and you want to know how often each word appears in it. You launch a MapReduce job. In the map stage, hundreds of machines read different parts of the file at the same time, each counts the word frequencies in its own part, and each produces pairs like (hello, 12100), (world, 15214) and so on (I am folding the map and combine steps together here to keep it simple). Each of the hundreds of machines produces such a set, and then hundreds of machines begin the reduce stage. Reducer machine A receives from every mapper the counts for words starting with A, machine B receives the counts for words starting with B (in reality the keys are not split by first letter; a hash function is used to scatter them, because words starting with a letter like X are far rarer than others and you do not want the machines to end up with wildly uneven workloads). Each reducer then aggregates what it received: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Every reducer does the same, and together they give you the word frequencies of the whole file.

This may look like a very simple model, yet many algorithms can be described with it. The map + reduce model is crude and brute-force; it works, but it is clumsy.
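The flow just described can be condensed into a few lines. Below is a toy, single-process Python sketch of map, shuffle and reduce (the real thing runs across hundreds of machines with the shuffle handled by the framework, and the file chunks here are made up for illustration):

from collections import defaultdict

# "Map" stage: each mapper counts the words in its own chunk of the file.
def map_chunk(chunk):
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word.lower()] += 1
    return counts  # e.g. {"hello": 12100, "world": 15214}

# "Shuffle": route every (word, count) pair to the reducer that owns that word.
# A hash of the word, not its first letter, is used so the work spreads evenly.
def shuffle(mapper_outputs, num_reducers):
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for counts in mapper_outputs:
        for word, n in counts.items():
            buckets[hash(word) % num_reducers][word].append(n)
    return buckets

# "Reduce" stage: each reducer sums the partial counts it received.
def reduce_bucket(bucket):
    return {word: sum(parts) for word, parts in bucket.items()}

if __name__ == "__main__":
    # Pretend these strings are the pieces of one huge file spread over HDFS.
    chunks = ["hello world hello", "hello big data", "world of big data"]
    mapped = [map_chunk(c) for c in chunks]  # in real life these run in parallel
    reduced = [reduce_bucket(b) for b in shuffle(mapped, num_reducers=2)]
    print(reduced)  # put together, this is the word frequency of the whole "file"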
The second generation, Tez and Spark, adds new features such as in-memory caching, but essentially it makes the map/reduce model more general: the boundary between map and reduce becomes blurrier, data exchange is more flexible, and there are fewer disk reads and writes, all of which makes it easier to describe complex algorithms and to achieve higher throughput.

With MapReduce, Tez and Spark in hand, programmers found that MapReduce programs are genuinely cumbersome to write, and they wanted to simplify the process. It is like having only assembly language: you can do almost anything with it, but it still feels tedious. You want a higher-level, more abstract language layer to describe algorithms and data-processing flows. So Pig and Hive appeared. Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and hand them to the compute engine, and you are freed from tedious MapReduce code to write in a simpler, more intuitive language.

With Hive, people found that SQL has a huge advantage over Java. One is that it is so easy to write: the word-count job above is one or two lines of SQL versus dozens or a hundred lines of MapReduce. More importantly, users without a computer-science background finally felt the love: I can write SQL! Data analysts were freed from begging engineers for help, and engineers were freed from writing strange one-off processing programs. Everyone was happy. Hive grew into the core component of the big data warehouse. At many companies even the production pipelines are described entirely in SQL, because it is easy to write, easy to change, easy for anyone to read, and easy to maintain.

Once data analysts started using Hive to analyze data, they found that Hive running on MapReduce is really, painfully slow. Pipeline job sets may not care, say a recommendation list updated every 24 hours, as long as the run finishes within those 24 hours. But for analysis, people always want things to run faster. Say I want to see how many people stopped on the inflatable-doll page in the last hour and how long each of them stayed; on a huge website that query might take tens of minutes or even many hours. And it may only be the first step of your long march: you also want to know how many people browsed the adult-toy pages and how many looked at the Rachmaninoff CDs, so you can report to the boss whether our users are more lechers and repressed types or more artsy young men and women. You cannot stand the waiting, so you can only nag the engineers: faster, faster, a little faster!

So Impala, Presto and Drill were born (along with countless less famous interactive SQL engines not listed here). The core idea of all three is that the MapReduce engine is too slow because it is too general, too heavyweight and too conservative; SQL needs something lighter, more aggressive about grabbing resources, more specialized for SQL optimization, and without so many fault-tolerance guarantees (because if something goes wrong you can simply rerun the query, as long as the whole run is short, say within a few minutes). These systems let users handle SQL tasks much faster, at the cost of generality and stability. If MapReduce is a machete that can chop anything without fear, the three above are boning knives: nimble and sharp, but not for anything too big or too hard.

These systems, to tell the truth, have never quite lived up to the popular expectations.
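To make the contrast concrete, here is a minimal sketch of the same word count on a second-generation engine and in SQL. It assumes PySpark is installed and that hdfs:///tmp/file1 is a hypothetical text file; both versions shrink the dozens of lines of raw MapReduce down to a handful:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Spark's RDD API: the whole map/shuffle/reduce pipeline in one chained expression.
counts = (spark.sparkContext.textFile("hdfs:///tmp/file1")
          .flatMap(lambda line: line.lower().split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

# The same job in Hive-style SQL, the way Hive or Spark SQL users would write it.
spark.read.text("hdfs:///tmp/file1").createOrReplaceTempView("docs")
spark.sql("""
    SELECT word, COUNT(*) AS freq
    FROM (SELECT explode(split(lower(value), ' ')) AS word FROM docs) AS words
    GROUP BY word
""").show(10)

spark.stop()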
Because at about this time, two more odd creatures appeared: Hive on Tez/Spark, and SparkSQL. Their idea is that MapReduce is indeed slow, but if I run SQL on a new-generation general-purpose compute engine such as Tez or Spark, it runs faster, and users do not have to maintain two separate systems. It is like having a small kitchen, being lazy, and having limited demands on refinement: you buy a rice cooker that can steam and stew, and save yourself a pile of utensils.

The above is, roughly, the internal structure of a big data platform: HDFS at the bottom; MapReduce, Tez or Spark running on top of it; Hive and Pig running on those; or Impala, Drill and Presto running directly on HDFS. This covers the requirements for medium- and low-speed data processing.

What if I need to process data faster still? If I ran a Weibo-like company and wanted to show not a 24-hour trending list but one that changes continuously, with an update delay of about a minute, none of the means above would be up to the job. So yet another computational model was developed: streaming (stream) computation, with Storm the most popular streaming platform. The idea of stream computing is: if you want more real-time updates, why not process the data the moment it flows in? Take word counting again: if the data stream is the words arriving one after another, I let them flow past me and count them as they go (a toy sketch of this appears a little further down). Stream computing is great and has essentially no delay, but its weakness is inflexibility: you must know in advance what you want to count, because once the stream has flowed past it is gone, and anything you did not count cannot be made up afterwards. So it is a fine thing, but it cannot replace the data warehouse and batch systems.

There is also a fairly independent class of modules, the KV stores, such as Cassandra, HBase, MongoDB and many, many others (really too many to imagine). A KV store means: I have a pile of keys, and I can very quickly fetch the data bound to any given key. For example, given an ID number I can fetch your identity data. This could be done with MapReduce too, but it might scan the entire data set; a KV store is dedicated to this one operation, with all storing and fetching optimized for it. Looking up one ID number in several petabytes of data may take only a fraction of a second, which has vastly sped up some specialized operations at big data companies. For example, if I have a page that looks up an order's contents by order number, and the whole site's orders cannot fit in a single-machine database, I will consider keeping them in a KV store (a small MongoDB sketch of this also follows below). The KV store's philosophy is: I basically cannot handle complex computation, mostly cannot do joins, perhaps cannot aggregate, and offer no strong-consistency guarantee (the data is spread across different machines, you may read different results each time, and you cannot handle operations like a bank transfer that need strong consistency). But it is fast. Extremely fast. Every KV store makes different trade-offs: some are faster, some hold more, some support more complex operations. There is one for you.

Beyond these there are some more specialized systems and components: Mahout is a distributed machine learning library, Protobuf is an encoding format and library for data interchange, ZooKeeper is a highly consistent distributed coordination system, and so on.
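The streaming idea can be shown with a toy, single-process sketch. This is only an illustration of the concept, not actual Storm code, and the stream of words is made up: counts are updated the instant each word flows past, so the result is always fresh, but only for the things you decided to count in advance.

from collections import Counter

def word_stream():
    # Stand-in for a real stream (e.g. incoming posts split into words).
    for word in ["hello", "world", "hello", "big", "data", "hello"]:
        yield word

counts = Counter()
for word in word_stream():
    counts[word] += 1       # updated as the data flows past, no batch job needed
    print(dict(counts))     # the "live" state after each event

And the order-number lookup mentioned under KV stores might look roughly like this with MongoDB. A minimal sketch, assuming a local MongoDB instance and the pymongo client; the database, collection and field names are invented for illustration.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB is running
orders = client["shop"]["orders"]                  # hypothetical database and collection
orders.create_index("order_id")                    # index the key we will fetch by

orders.insert_one({"order_id": "20240101-0001", "items": ["rice cooker"], "total": 199})

# The point of a KV-style store: fetch by key directly, no full scan, no joins.
print(orders.find_one({"order_id": "20240101-0001"}))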
With this messy pile of tools all running on the same cluster, everyone has to share resources and work in an orderly way. So another important component is the scheduling system, and Yarn is now the most popular one. You can think of it as the central manager, like your mother supervising the kitchen: hey, your sister has finished chopping the vegetables, so you can take the knife and go deal with the chicken. As long as everyone obeys your mother's assignments, everyone gets to cook happily.

You can think of the big data ecosystem as a kitchen-tool ecosystem. To make different dishes, Chinese cuisine, Japanese cuisine, French cuisine, you need different tools, and as the guests' demands grow more complicated, new kitchen utensils keep being invented. No single tool can handle every situation, so the whole thing keeps getting more complex.
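As one last concrete touch, this is roughly what it looks like to hand the earlier word count over to a Yarn-managed cluster instead of running it locally. A minimal sketch, assuming PySpark is installed, HADOOP_CONF_DIR points at the cluster's configuration, and hdfs:///tmp/file1 is the same hypothetical file as before:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wordcount-on-yarn")
         .master("yarn")                           # let Yarn decide which machines run the job
         .config("spark.executor.instances", "4")  # ask the "kitchen supervisor" for four workers
         .getOrCreate())

counts = (spark.sparkContext.textFile("hdfs:///tmp/file1")
          .flatMap(lambda line: line.lower().split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
spark.stop()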
