MapReduce and Batch Processing ------ "Designing Data-Intensive Applications" Reading Notes 14

Source: Internet
Author: User
Tags: shuffle
> Most of the previous articles were about distributed storage; the next few chapters move into the field of distributed computing. Frankly, my professional focus has been on storage, so my understanding of the computation side may be less complete or precise. If anything in this article is inaccurate, corrections are welcome. This article covers one subset of distributed computing: batch processing.

Batch systems, often called offline systems, take a large amount of input data, run a job to process it, and produce some output data. A job usually takes a long time (from a few minutes to several days), and batch jobs are typically run periodically (for example, once a day). The primary performance metric of a batch job is therefore throughput.

1. MapReduce

Batch processing is an important part of building reliable, scalable, and maintainable applications. MapReduce, the batch-processing model Google published in 2004, remains an important model for processing large-scale datasets, even though it is a fairly low-level programming model compared with the parallel processing systems built specifically for data warehouses. It is still a great help for understanding batch processing, so MapReduce is where we start our journey into batch computing.

Distributed storage systems and MapReduce

MapReduce is a rather blunt, brute-force tool, but a very effective one. A single MapReduce job can take one or more inputs and generate one or more outputs. The job follows a functional programming style: it does not modify its input and produces no side effects other than generating its output. Output files are written once, sequentially (no existing portion of a file is modified after it has been written).

A MapReduce job reads and writes its files on a distributed file system such as HDFS, GFS, GlusterFS, or Amazon S3. We will use HDFS as the running environment here, but the principles apply to any distributed storage system. HDFS is based on a shared-nothing cluster of machines, whereas shared-disk storage is implemented by a centralized storage appliance, often using custom hardware and a special network infrastructure such as Fibre Channel. HDFS therefore requires no special hardware, only computers connected by a conventional datacenter network. A daemon running on each machine exposes the data stored on that machine to the other nodes, while a central server, the NameNode, keeps track of which file blocks are stored on which machine. Conceptually, then, a large file created on HDFS can use the disks of all the machines in the cluster.

To tolerate machine and disk failures, file blocks are replicated on multiple machines in the cluster, so several copies of the same data live on different machines. Alternatively, an erasure coding scheme can be used, which tolerates lost data at a lower storage cost than full replication. Erasure coding is similar to RAID, which provides redundancy across several disks attached to the same machine; the difference is that reading and writing erasure-coded data requires extra encoding and decoding computation.
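To make the NameNode-style bookkeeping described above concrete, here is a deliberately simplified sketch, not HDFS's actual placement policy or API: a file is split into fixed-size blocks, each block is placed on a few distinct machines, and a single table, standing in for the NameNode's metadata, records where every block lives. The block size, replication factor, and all names are invented for the illustration.

```python
import random

# Conceptual sketch only: illustrates block placement and NameNode-style
# metadata, not HDFS's real placement policy or wire protocol.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a typical HDFS block size
REPLICATION = 3                  # number of copies kept of each block

machines = [f"node-{i}" for i in range(10)]
block_locations = {}             # block id -> machines holding it (the "NameNode" table)

def place_file(name: str, size_bytes: int) -> None:
    """Split a file into blocks and assign each block to REPLICATION distinct machines."""
    num_blocks = -(-size_bytes // BLOCK_SIZE)   # ceiling division
    for i in range(num_blocks):
        block_id = f"{name}#blk{i}"
        block_locations[block_id] = random.sample(machines, REPLICATION)

place_file("clickstream-2024-01-01.log", 10 * 1024**3)    # a 10 GB file
print(len(block_locations), "blocks tracked")              # 80 blocks
print(block_locations["clickstream-2024-01-01.log#blk0"])  # e.g. ['node-3', 'node-7', 'node-1']
```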
The workflow of MapReduce

The main difference between MapReduce and a traditional Unix command pipeline is that MapReduce can run in parallel across many machines, and the hand-written mapper and reducer do not need to know where the input comes from or where the output goes: the framework handles the complexity of moving data between machines.

The figure below shows the workflow of a MapReduce job. The input to the job is a directory in HDFS; each file block in that directory is treated as a separate partition and processed by a separate map task, so each input partition is typically a few hundred megabytes (depending on the HDFS block size). The MapReduce scheduler tries to run each mapper on a machine that stores a replica of its input block, as long as that machine has enough memory and CPU to run the map task. This saves the network bandwidth of copying file blocks around, reduces network load, and exploits the locality principle of distributed computing.

[Figure: the workflow of a MapReduce job]

The application code is packaged into a JAR file and uploaded to the distributed storage system. Each worker node downloads the application's JAR, launches its map task, and starts reading its input file, passing one record at a time to the mapper callback, which emits key-value pairs. The number of map tasks is determined by the number of input file blocks, but the number of reduce tasks is configured by the author of the job. To ensure that all key-value pairs with the same key are handled by the same reducer, the framework hashes the key to decide which reduce task a key-value pair belongs to.

MapReduce needs the key-value pairs to be sorted, but the dataset is usually far too large to sort on a single machine with a conventional sorting algorithm. Instead, each map task writes its output into per-reducer partitions on its local disk, chosen by the hash of the key, and sorts the key-value pairs within each partition. Whenever a mapper finishes its work, the MapReduce scheduler notifies the reducers that they can start fetching output files from that mapper. Each reducer pulls the files for its partition from every mapper and merge-sorts them, preserving the sort order; this process is called the shuffle.

Finally, the reducer calls the reduce function on these sorted key-value pairs; it can generate any number of output records, which are written back to the distributed storage system. That is the whole life cycle of a complete MapReduce job.
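To make the shuffle concrete, here is a minimal, single-process sketch of the map, partition-and-sort, merge, and reduce steps described above, counting page views per URL. It is a toy illustration of the model, not the Hadoop API; the record format and function names are made up for this example.

```python
import heapq
from collections import defaultdict
from typing import Iterator, Tuple

NUM_REDUCERS = 3   # in Hadoop this is configured by the job author

# Toy log records; in a real job these would be read from HDFS blocks.
log_lines = ["alice /home", "bob /search", "alice /search", "carol /home"]

def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (url, 1) for every page view in a log line."""
    _user, url = line.split()
    yield url, 1

def partition(key: str) -> int:
    """Hash the key to pick the reduce task, as the framework does."""
    return hash(key) % NUM_REDUCERS

# Map phase: each "map task" writes sorted runs, one per reducer partition.
partitions = defaultdict(list)            # (map_task, reducer) -> sorted run
for map_task, line in enumerate(log_lines):
    for key, value in mapper(line):
        partitions[(map_task, partition(key))].append((key, value))
for run in partitions.values():
    run.sort()

def reducer(key: str, values) -> Tuple[str, int]:
    return key, sum(values)

# Shuffle + reduce phase: each reducer merge-sorts the runs destined for it,
# then calls reduce once per key on the grouped values.
for r in range(NUM_REDUCERS):
    runs = [run for (_, dest), run in partitions.items() if dest == r]
    merged = heapq.merge(*runs)           # maintains the sort order
    current_key, current_values = None, []
    for key, value in merged:
        if key != current_key and current_key is not None:
            print(reducer(current_key, current_values))
        if key != current_key:
            current_key, current_values = key, []
        current_values.append(value)
    if current_key is not None:
        print(reducer(current_key, current_values))
```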
Chained scheduling for MapReduce jobs

The range of problems that a single MapReduce job can solve is limited, so MapReduce jobs need to be chained into workflows in which the output of one job becomes the input of the next. Hadoop's MapReduce framework chains jobs implicitly by directory name: the first MapReduce job is configured to write its output to a designated directory in HDFS, and the second job is configured to read that same directory as its input. As far as the MapReduce framework is concerned, they are two independent jobs.

The input of the next job is considered valid only if the current job completes successfully (the output of a failed MapReduce job is discarded). Different jobs therefore form a directed graph of dependencies, and a workflow scheduler is needed to execute jobs according to those dependencies; the Hadoop ecosystem has many such batch schedulers, for example Oozie, Azkaban, Luigi, and Airflow. In a large company, many different teams may run jobs that read each other's output, so tool support for managing complex data flows like these is important.

2. Business scenarios for MapReduce jobs

Let's use an example to look at a typical MapReduce-style job in detail. As shown below, on the left is a log describing user behavior, called the user activity events, and on the right is the user table from a database.

[Figure: user activity events (left) and the users table (right)]

A data analyst's task may be to correlate user activity with user profile information, for example to analyze which pages are most popular with which age groups. However, the activity log contains only the user's ID, not the full user information, so a join is required. The simplest implementation is to scan the activity events one by one and query the user database for each user ID, but this clearly performs badly: throughput is limited by the round-trip time to the database server, the effectiveness of a local cache depends heavily on the distribution of the data, and running a large number of queries in parallel may overwhelm the database. To get good throughput in a batch job, the computation must (as far as possible) stay on one machine; making a random-access network request for every record to be processed is far too slow. Moreover, querying a remote database would make the batch job nondeterministic, because the data in the remote database can change at any time.

A better approach is therefore to take a copy of the user database (extracted from the database with an ETL process, as in a data warehouse) and put it into the distributed storage system, where a tool like MapReduce can process it much more efficiently. As shown below, the mapper output is partitioned by key by the MapReduce framework and the key-value pairs are then sorted, with the effect that all activity events and the user record with the same user ID become adjacent to each other at the same reducer. The reducer can then perform the actual join logic easily: the reduce function is called once per user ID and outputs, for example, the viewed URL together with the user's age. A subsequent MapReduce job can then compute the viewer age distribution for each URL, grouped by age group.

[Figure: a reduce-side sort-merge join on user ID]
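The sketch below mirrors that reduce-side sort-merge join in miniature, in plain Python rather than through the Hadoop API: mappers emit (user_id, record) pairs from both datasets, the user record is tagged so that it sorts ahead of the activity events for the same key, and the reducer walks each sorted group and emits (url, age) pairs. The record layouts and function names are invented for the illustration.

```python
from collections import defaultdict

# Toy inputs; in a real job these come from files in distributed storage.
user_activity = [                  # click events from the log
    {"user_id": 2, "url": "/sports"},
    {"user_id": 1, "url": "/news"},
    {"user_id": 2, "url": "/news"},
]
users = [                          # an ETL'd copy of the users table
    {"user_id": 1, "date_of_birth": 1990},
    {"user_id": 2, "date_of_birth": 1985},
]

def map_user(row):
    # Tag "A" sorts before "B", so the user record arrives first in its group.
    yield row["user_id"], ("A", row["date_of_birth"])

def map_activity(event):
    yield event["user_id"], ("B", event["url"])

# Shuffle: group all values by key (the framework's partition + sort + merge).
groups = defaultdict(list)
for row in users:
    for key, value in map_user(row):
        groups[key].append(value)
for event in user_activity:
    for key, value in map_activity(event):
        groups[key].append(value)

def reduce_join(user_id, values):
    """Called once per user ID: sees the user record first, then the events."""
    values.sort()                              # secondary sort on the tag
    birth_year = None
    for tag, payload in values:
        if tag == "A":
            birth_year = payload
        else:
            yield payload, 2024 - birth_year   # (url, approximate viewer age)

for user_id in sorted(groups):
    for url, age in reduce_join(user_id, groups[user_id]):
        print(url, age)
```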
Next, let's go through a few business-level details and a few details of the MapReduce framework itself:

- Separation of business logic. In the scenario above, the most important requirement is that all activity events for the same user ID are brought together at the same reducer; this is exactly the shuffle discussed earlier, which delivers all key-value pairs with the same key to the same destination. The MapReduce programming model separates the communication and coordination of the computation from the application logic. This is the genius of the framework: it handles all of the network traffic itself, so developers can focus on the application code, and if a node fails during the job, MapReduce retries transparently without the application logic being affected.

- Data grouping. Besides joins, grouping records by key is another operation commonly needed in data systems: all records with the same key form a group, and the data within the group is then processed together. How do we implement such a grouping with MapReduce? The implementation is simple: have the map function transform each record so that it emits the desired grouping key, and the partitioning and sorting then bring all records with the same key together at the same reducer. Implemented on MapReduce, grouping and joining look very similar.

- Data skew. If the amount of data associated with a single key is very large, it becomes a challenge for the MapReduce framework, because all records for the same key are collected at the same reducer. For example, in a social network a few celebrities may have millions of followers (we saw this example back in the first chapter), so the job suffers from data skew. How can this be compensated for? Pig's skewed join first runs a sampling job to determine which keys are hot; when the join actually executes, the mappers spread the records for a hot key across a number of reducers chosen at random instead of sending them all to one reducer. Hive's skewed join optimization takes a different approach: the hot keys must be declared explicitly in the table metadata, the records related to those keys are stored separately from the rest, and a similar hot-key optimization is applied when that table is subsequently joined. A minimal sketch of the hot-key idea follows this list.
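Here is a small illustration of the hot-key ("salting") idea in the same toy style as before. It is not Pig's or Hive's actual implementation: records for keys known to be hot get a random suffix appended to the key so that they spread over several reducers, and a second pass strips the suffix and combines the partial results. The threshold, suffix count, and names are all invented for the example.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4   # how many reducers a hot key is spread over (illustrative)

follows = [("celebrity", f"fan{i}") for i in range(1_000)] + [("alice", "bob")]
hot_keys = {"celebrity"}   # in Pig this would come from a sampling job

# Pass 1: count followers per (possibly salted) key.
def map_salted(followee, follower):
    key = followee
    if followee in hot_keys:
        key = f"{followee}#{random.randrange(NUM_SALTS)}"   # spread the hot key
    yield key, 1

partial = Counter()
for followee, follower in follows:
    for key, value in map_salted(followee, follower):
        partial[key] += value     # stands in for the per-reducer reduce step

# Pass 2: strip the salt and combine the partial counts per real key.
totals = defaultdict(int)
for key, count in partial.items():
    real_key = key.split("#", 1)[0]
    totals[real_key] += count

print(dict(totals))   # {'celebrity': 1000, 'alice': 1}
```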
3. The meaning of batch processing

Having walked through the workflow of MapReduce jobs, let's return to the question: what is the result of all this processing, and why go to all this trouble in the first place? The core of batch processing is analyzing the data in a data system: scanning a large number of records, grouping and aggregating them, and writing the output to a database so that it can be presented in reports, on the basis of which consumers or analysts make decisions.

Batch processing is likewise well suited to building search indexes. Google's original use of MapReduce was to build the index for its search engine, implemented as a workflow of 5 to 10 MapReduce jobs. If you need full-text search over a set of documents, batch processing is a very efficient way to build it: the map tasks partition the documents, and each reducer builds the index for its partition and writes the index files to the distributed file system. Because serving a search index is a read-only keyword lookup, the index files are immutable once created. If the indexed set of documents changes, one option is to periodically rerun the entire indexing workflow over the whole document set and swap in the new index files once they are complete (this approach is wasteful if only a small fraction of the documents have changed).

Batch jobs that treat their input as immutable and avoid side effects, such as writing to an external database, achieve good performance and are easier to maintain. If you introduce a bug in the code and the output is wrong, you can simply roll back to the previous version of the code, rerun the job, and produce correct output again; an even simpler fix is to keep the old output in a separate directory and just switch back to it. Thanks to this ease of rollback, feature development can move faster than in an environment where mistakes cannot be undone, which supports agile software development. Batch processing also separates the processing logic from the wiring of inputs and outputs, which allows elegant reuse of the code: one team can focus on the processing logic while other teams decide when and where to run the job.

Summary

In this article we walked through the MapReduce processing framework and discussed the characteristics of batch-processing jobs. Beyond the MapReduce model there are many other computational models for data processing in data systems, and we will continue to explore them in the following articles...
