Why the traditional MapReduce framework is slow


Why previous MapReduce-based systems are slow

There are a few common reasons why MapReduce frameworks are slower than MPP databases:

    1. The expensive data materialization overhead introduced for fault tolerance.
    2. Poor data layout, such as the lack of indexes.
    3. Costlier execution strategies [1, 2].

Our experiments with Hive further confirm these points, but the gap can be narrowed through "engineering" improvements to Hive, such as replacing the storage engine (an in-memory storage engine) and improving the execution architecture (partial DAG execution). At the same time, we found that some details of a MapReduce implementation have a huge impact on performance; for example, reducing task scheduling overhead greatly improves load balancing.

Intermediate result output: MapReduce-based query engines such as Hive tend to materialize intermediate results to disk:

    • Within a MapReduce job, map tasks usually write their output to disk to guard against reduce task failures.
    • When a query is translated into MapReduce jobs, it often produces multiple stages, which rely on the underlying file system (such as HDFS) to store the output of each stage.

In the first case, the map output is written to disk partly to ensure there is enough space to hold the output of large batch tasks. The map output is not replicated to other nodes, so if the node running the map task fails, the data is lost anyway [3]. It is therefore reasonable to cache this output in memory rather than writing all of it to disk. Shark's shuffle implementation applies exactly this observation: it keeps map output in memory, greatly improving shuffle throughput. Typically, for queries such as aggregations and filters, the output is much smaller than the input, so this is a very reasonable design. The growing adoption of SSDs will also greatly improve random-read performance; for shuffles over very large data volumes, SSDs can deliver high throughput while offering more capacity than memory.
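As a rough illustration of why this works, here is a minimal Spark sketch (not Shark's actual shuffle code; the dataset and sizes are invented): the aggregated output is only a handful of keys, orders of magnitude smaller than the input, so buffering the map-side shuffle output in memory costs little.

    import org.apache.spark.{SparkConf, SparkContext}

    object AggregationSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("agg-sketch").setMaster("local[*]"))

        // 10 million input records, but only 100 distinct keys.
        val records = sc.parallelize(1 to 10000000, numSlices = 100)
          .map(i => (i % 100, 1L))

        // Each map task emits at most 100 partial sums into the shuffle --
        // tiny compared to the input, and easily held in memory.
        val counts = records.reduceByKey(_ + _)
        println(counts.count())   // 100
        sc.stop()
      }
    }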

For the second case, some execution engines extend the MapReduce model into a more general execution graph (a task DAG) whose stages can be pipelined without writing each stage's intermediate results to HDFS. These engines include Dryad [4], Tenzing [5], and Spark [6].
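A minimal sketch of the same idea in Spark, assuming a SparkContext `sc` is in scope and using placeholder HDFS paths: the whole query is one task DAG, intermediate stage output moves between executors through the shuffle, and only the final result is written back to HDFS.

    // Hypothetical log-analysis query; paths and field positions are placeholders.
    val lines  = sc.textFile("hdfs:///logs/access.log")
    val hits   = lines.map(line => (line.split(" ")(0), 1L))   // stage 1: map
    val byUser = hits.reduceByKey(_ + _)                       // stage 2: shuffle + aggregate
    val heavy  = byUser.filter { case (_, n) => n > 1000 }     // still stage 2, pipelined
    heavy.saveAsTextFile("hdfs:///out/heavy-users")            // only the final result hits HDFS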

Data format and layout: MapReduce's naive schema-on-read processing can impose a large overhead, so many systems have designed and adopted more efficient storage structures within the MapReduce model to speed up queries. Hive itself supports "partitioned tables" (a basic index-like mechanism that stores particular key ranges in particular files, avoiding a scan of the whole table), as well as a column-oriented storage structure for on-disk data [7]. In Shark we go a step further and use a memory-based columnar storage structure. Shark did not modify Spark's code to implement this; it simply stores a batch of column tuples as a single record inside Spark, and Shark itself is responsible for parsing the structure within each column tuple.
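A hedged illustration of that idea (these are not Shark's actual classes; a SparkContext `sc` is assumed): a batch of rows is stored column-wise as primitive arrays and kept as one record per partition, so Spark sees a single opaque object while the query engine interprets the columns itself.

    // Hypothetical columnar batch: one object per chunk of rows.
    case class ColumnBatch(ids: Array[Int], prices: Array[Double], names: Array[String])

    def toColumnBatch(rows: Seq[(Int, Double, String)]): ColumnBatch =
      ColumnBatch(rows.map(_._1).toArray, rows.map(_._2).toArray, rows.map(_._3).toArray)

    // Each partition becomes a single ColumnBatch record; scanning one column
    // then touches a compact primitive array instead of boxed row objects.
    val rows = sc.parallelize(Seq((1, 9.99, "a"), (2, 19.99, "b")), 1)
    val columnar = rows.mapPartitions(it => Iterator.single(toColumnBatch(it.toSeq))).cache()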

Another feature unique to Spark is control over how data is partitioned across nodes, which gives Shark a new capability: co-partitioning tables.
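A minimal sketch of co-partitioning with Spark's public RDD API (table contents are placeholders, `sc` is assumed): both sides of a join are partitioned with the same HashPartitioner, so matching keys already sit on the same node and the join itself needs no extra shuffle.

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(16)
    val orders    = sc.parallelize(Seq((1, "order-a"), (2, "order-b"))).partitionBy(part).cache()
    val customers = sc.parallelize(Seq((1, "alice"),   (2, "bob"))).partitionBy(part).cache()

    // Both RDDs share the same partitioner, so this join is shuffle-free.
    val joined = orders.join(customers)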

Finally, we have not yet tapped the RDD's random-read capability. Although RDDs support only coarse-grained write operations, reads can be as fine-grained as a single record [6], which allows an RDD to be used as an index; Tenzing uses a similar structure as a remote lookup table for join operations (remote lookup).
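A small sketch of that read path (illustrative data, `sc` assumed): `lookup` on a pair RDD with a known partitioner fetches individual keys without scanning the whole dataset, which is what makes treating an RDD as a lookup table plausible.

    import org.apache.spark.HashPartitioner

    // Hypothetical small dimension table, hash-partitioned and cached.
    val dim = sc.parallelize(Seq((1, "US"), (2, "UK"), (3, "CN")))
                .partitionBy(new HashPartitioner(8))
                .cache()

    // With a known partitioner, lookup only reads the one partition that
    // can contain the key, not the entire RDD.
    val country: Seq[String] = dim.lookup(2)   // Seq("UK")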

Execution strategy: Hive spends a great deal of time sorting data before each shuffle and writing MapReduce results to HDFS, limitations imposed by Hadoop's basic, single-pass MapReduce model. A more general execution engine like Spark can mitigate these overheads; for example, Spark supports hash-based distributed aggregation and a more general task execution graph (DAG).
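A minimal sketch of the contrast, assuming a SparkContext `sc`: `reduceByKey` aggregates through in-memory hash maps on both sides of the shuffle, and a sort is only paid for when the query actually asks for ordered output.

    // Hypothetical sales data.
    val sales  = sc.parallelize(Seq(("apples", 3.0), ("pears", 1.5), ("apples", 2.0)))
    val totals = sales.reduceByKey(_ + _)   // hash-based aggregation, no sort required
    val sorted = totals.sortByKey()         // sorting happens only if the query needs it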

In fact, to truly optimize the execution of relational queries, we find it necessary to choose execution plans based on data statistics. But because Shark treats UDFs and complex analytic functions as first-class citizens, collecting such statistics ahead of time becomes very difficult. To address this, we propose Partial DAG Execution (PDE), which lets Spark alter the remainder of the execution graph at runtime based on data statistics. PDE differs from the runtime plan rewriting of other systems (such as DryadLINQ) in that it collects fine-grained statistics over key ranges and can completely re-select a join's execution strategy, for example switching to a broadcast join, rather than only choosing the number of reduce tasks.
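The sketch below captures only the flavor of such runtime re-planning, not PDE itself: after the upstream stage has run, the size of one join side is measured and a broadcast (map-side) join is chosen if that side is small enough; the threshold, types, and data are illustrative, and `sc` is assumed.

    import org.apache.spark.rdd.RDD

    val SMALL_SIDE_ROWS = 100000L   // hypothetical threshold

    def joinWithRuntimeChoice(big: RDD[(Int, String)],
                              other: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
      if (other.count() < SMALL_SIDE_ROWS) {
        // Broadcast join: ship the small side to every task; `big` is never shuffled.
        val small = sc.broadcast(other.collectAsMap())
        big.flatMap { case (k, v) => small.value.get(k).map(w => (k, (v, w))) }
      } else {
        big.join(other)   // fall back to an ordinary shuffle join
      }
    }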

Task scheduling overhead: Probably the most surprising factor affecting Shark is a purely engineering one: the overhead of launching a task. Traditional MapReduce systems such as Hadoop are designed for batch jobs that run for hours, with each task of a job running for minutes; they execute tasks in separate system processes, and in some extreme cases the latency of submitting a task is very high. Hadoop, for example, assigns tasks to worker nodes through periodic "heartbeat" messages sent every 3 seconds, so the total task launch delay can reach 5-10 seconds. That is tolerable for a batch system, but clearly inadequate for real-time queries.

To avoid this problem, Spark uses an event-driven RPC library to launch tasks and reuses worker processes, avoiding the cost of spawning new system processes. It can launch thousands of tasks per second with a per-task launch latency below 5 milliseconds, which makes tasks of 50-100 milliseconds and jobs of around 500 milliseconds feasible. To our surprise, this improvement raised query performance not only for short queries but even for queries with longer execution times.
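A hedged micro-benchmark sketch (the numbers are not from the article and will vary by machine; `sc` is assumed): launching a few thousand trivial tasks and timing them gives a feel for the per-task scheduling overhead discussed above.

    val t0 = System.nanoTime()
    sc.parallelize(1 to 10000, numSlices = 2000).map(_ + 1).count()   // 2000 tiny tasks
    val millis = (System.nanoTime() - t0) / 1000000
    println(s"2000 tiny tasks finished in $millis ms")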

Sub-second tasks allow the engine to balance work across worker nodes even when some nodes suffer unpredictable delays (network latency or JVM garbage collection). They also help enormously with data skew: consider a hash aggregation running on 100 cores, where the key range assigned to each task must be chosen carefully, since any skewed portion of the data slows down the entire job. If the job is instead split into 1,000 tasks, the slowest task can be 10 times slower than the average without greatly affecting the job's response time. When we applied a skew-aware partitioning strategy in PDE, we were somewhat disappointed to find that it yielded a smaller improvement than simply increasing the number of reduce tasks; it does, however, make the engine noticeably more robust to pathological data skew.

In Hadoop/Hive, choosing the wrong number of tasks can easily make a query 10 times slower than an optimized execution strategy, so a great deal of work has focused on automatically selecting the number of reduce tasks [8, 9], and the effect of the reduce task count on job execution time can be compared between Hadoop/Hive and Spark. Because a Spark job can run thousands of reduce tasks with very little overhead, the impact of data skew can be reduced simply by running more tasks.
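In Spark this knob is simply the `numPartitions` argument of the shuffle operator; a minimal sketch, reusing the hypothetical `records` pair RDD from the earlier aggregation sketch:

    val coarse = records.reduceByKey(_ + _, numPartitions = 32)     // few, large reduce tasks
    val fine   = records.reduceByKey(_ + _, numPartitions = 2000)   // many small tasks dilute skew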

In fact, we have not yet explored the feasibility of sub-second tasks on much larger clusters (tens of thousands of nodes). However, for a system like Dremel [10], which routinely runs sub-second jobs on thousands of nodes, the scheduling policy can delegate work to "secondary" master nodes over subsets of the cluster when a single master cannot keep up with the required scheduling rate. Compared with coarse-grained designs, fine-grained task execution brings not only better load balancing but also fast recovery (by spreading a failed node's tasks across many other nodes) and query elasticity.

Other benefits of the fine-grained task model

While this article focuses on the fault-tolerance benefits of the fine-grained task model, the model offers other attractive features as well; below we describe two that have already been demonstrated in MapReduce systems.

Elasticity: In a traditional MPP database, once a distributed execution plan is chosen, the system must run the entire query at that degree of parallelism. In a fine-grained task system, however, nodes can be added or removed while a query is executing, and the system automatically redistributes the stalled work to other nodes, which makes the whole system highly elastic. If the database administrator needs to remove nodes, the system can simply treat them as failed nodes, or, better still, migrate the data on those nodes elsewhere first. Conversely, when query execution slows down, the database system can dynamically request additional resources to increase computing power. Amazon's Elastic MapReduce [11] already supports resizing a cluster at runtime.

Multitenancy: Like the elasticity described above, a multi-tenant architecture aims to share resources dynamically among users. In a traditional MPP database, if an important query arrives while a large query is occupying most of the cluster's resources, the options are limited to actions such as cancelling the earlier query. In a system based on a fine-grained task model, the new query only needs to wait a few seconds for the currently running tasks to finish before it begins receiving its share of resources. Facebook and Microsoft have built fair schedulers for Hadoop and Dryad [12, 13] that allow large, compute-intensive historical queries and small real-time queries to share cluster resources without starvation.
