An Overview of Distributed Computing Frameworks


This article was originally submitted to Science and Technology Online; the manuscript was rejected, so it is published here instead.

0 Introduction

With the development of the Internet and the arrival of the Web 2.0 era [1], humanity has formally entered an age of information explosion. Huge volumes of data appear in many applications; for example, the user-behavior logs of some social applications typically reach gigabytes or even terabytes. The conventional single-machine computing model cannot support such data volumes, so computation has to be distributed: a huge computational task is divided into small tasks that individual machines can sustain. This is the context in which distributed computing frameworks and cloud computing [2] appeared.

1 Background of Distributed Computing Frameworks

As the Internet has moved from Web 1.0 to today's Web 2.0, the amount of information online has grown exponentially, with the data generated each day measured in terabytes. Unlike the relational data handled by traditional storage and computation, most of this daily data is non-relational and has no fixed format, which poses a challenge to traditional relational data storage and computing [3].

Before the Web 2.0 era, the data involved in a computation could be handled entirely on one machine: with a typical server holding on the order of 100 GB of memory, caching all of the data in memory for scientific computation was feasible. Today, however, the user logs of some applications run to terabytes, which cannot all be cached in memory; even if a server's memory could be expanded that far, the operating cost would be enormous. A computational mechanism is therefore needed that spreads the work across multiple machines, so that each machine is responsible for a portion of the computation and of the data storage. This lowers the configuration required of any single machine and makes scientific computation possible on ordinary hardware.

Developing and maintaining distributed computations, however, is complex, with many changing situations to consider. During a distributed computation, the exchange of control information, the data acquisition of each task, the merging of partial results, and the rollback of failed computations must all be handled to guarantee correct operation [4]. If all of this were the developer's responsibility, it would place very high demands on programmers. Distributed computing frameworks emerged to remove this bottleneck: by encapsulating the computational details inside the framework, they allow distributed programs to be developed without reimplementing that machinery.

With a distributed computing framework, programmers can enjoy the speed of distributed computation without having to manage the many coordination problems and computational anomalies that arise along the way, which raises development efficiency dramatically.

In this paper, the current distributed computing frameworks are systematically reviewed and summarized.

2 Distributed Frameworks

2.1 The Hadoop Distributed Computing Framework

2.1.1 Framework Introduction

Hadoop is a fairly early and very powerful distributed computing framework: an open-source implementation, written in Java, of the MapReduce programming model proposed by Google [5]. The Hadoop framework consists of two parts: the computational framework MapReduce and the storage framework HDFS (Hadoop Distributed File System), which stores the data being computed.

2.1.2 Hadoop Task Execution Introduction

MapReduce is an architectural design that borrows ideas from functional programming, dividing a computation into two phases: map and reduce. MapReduce splits a large computational task into smaller tasks, assigns each small task to a compute node in the cluster, tracks the progress of each node to decide whether a task must be re-executed, and finally collects the results from each node and outputs them.
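To make the map/reduce division concrete, here is a minimal word-count sketch against Hadoop's standard org.apache.hadoop.mapreduce API (the class names are illustrative, not from the original article): the map phase emits a (word, 1) pair for every word in its input split, the framework groups the pairs by key, and the reduce phase sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs on each input split and emits (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receives all values for one key (grouped by the shuffle)
// and sums them into the final count for that word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```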

The MapReduce architecture is a master-slave design built on a JobTracker and TaskTrackers. The JobTracker is responsible for partitioning the job into concrete tasks and for monitoring them, and it decides whether a task needs to be rolled back. The TaskTrackers are responsible for executing the concrete tasks and for acquiring the data each assigned task needs; they maintain a status channel to the JobTracker and finally output the computed results.

When a job is submitted, the framework first has the JobTracker split the input; the resulting splits are sent to the TaskTrackers, which execute the map tasks. After the map phase ends, a shuffle is performed to balance the load across nodes, and finally the reduce tasks run and output the results.
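A minimal driver sketch, reusing the mapper and reducer classes sketched above, shows how a job enters this pipeline (paths are taken from the command line; names are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The framework splits the input; each split becomes one map task.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to the master (the JobTracker in classic Hadoop),
        // which schedules the map, shuffle, and reduce stages and
        // re-executes any task that fails or stalls.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```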

2.1.3 HDFS Introduction

HDFS is a distributed file system for storing large files. It is highly fault tolerant [6] and places low demands on the underlying machines. HDFS divides each large file into fixed-size blocks, stores the blocks evenly across different machines, and then replicates each block to ensure that no data is lost.

An HDFS cluster is likewise a master-slave design, built on a name node (NameNode) and data nodes (DataNodes). The NameNode is responsible for storing the metadata describing where all of the cluster's data resides, and there is only one per cluster; the DataNodes are responsible for the actual data storage, and a cluster usually contains many of them.
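This division of labor is visible from the client side. Below is a small sketch using the standard org.apache.hadoop.fs.FileSystem API (the NameNode address and file path are assumptions for illustration): the client consults the NameNode for block placement and locations, while the bytes themselves flow to and from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt");

        // Write: the client asks the NameNode where to place each block,
        // then streams the bytes to the chosen DataNodes, which replicate them.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations again come from the NameNode; the data
        // itself is fetched directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```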

2.2 The Storm Framework

2.2.1 Framework Introduction

The Storm distributed computing framework is based on stream computing and was developed in Clojure, a Lisp dialect, to process real-time big data. To a certain extent its appearance addressed the Hadoop framework's relatively high latency and the complexity of operating and maintaining Hadoop programs, and it supports the real-time, streaming computation that Hadoop does not. For real-time data analysis, Storm is very efficient.

Compared with Hadoop, Storm has more functional components, but its main functionality is built on two of them, Nimbus and the supervisors, with ZooKeeper monitoring the components' life cycles. Nimbus, similar to Hadoop's JobTracker, is responsible for assigning tasks and monitoring their status; a supervisor is deployed on every worker machine and is responsible for monitoring that machine and for starting the worker processes on it.

2.2.2 Storm Task Execution Introduction

A Storm task begins with the submission of a topology, which corresponds to a Hadoop job. Unlike a Hadoop job, a topology keeps circulating through the framework indefinitely unless the task flow is stopped by manual intervention. Each topology runs on specific worker processes, and the workers communicate through ZeroMQ [7] message queues to improve communication performance.

Concretely, a client submits a declared topology; Nimbus obtains suitable machines by interacting with ZooKeeper and assigns the tasks to specific machines; the supervisor on each machine then starts the corresponding worker processes according to the tasks assigned to it. During execution, the supervisors and every worker maintain heartbeat connections with ZooKeeper.
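The following is a minimal submission sketch against the Storm Java API (package names are org.apache.storm in Storm 1.x; older releases used backtype.storm). The counting bolt is a hypothetical component written for this illustration; TestWordSpout ships with storm-core and emits random words forever, which matches the "a topology never ends by itself" behavior described above.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: counts how often it has seen each word.
class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        int count = counts.merge(word, 1, Integer::sum);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);
        builder.setBolt("counts", new WordCountBolt(), 2)
               .fieldsGrouping("words", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2); // worker processes launched by the supervisors

        // Nimbus receives the topology and, coordinating through ZooKeeper,
        // assigns its tasks to supervisors, which start the worker processes.
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```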

2.3 The Spark Distributed Computing Framework

2.3.1 Framework Introduction

Spark [8] is a recently popular distributed computing framework, written in Scala and built around the RDD [9] (Resilient Distributed Dataset), an elastic distributed in-memory dataset. The framework overcomes the Hadoop framework's low efficiency on iterative tasks, and it also provides interactive queries during task execution, which increases the controllability of tasks. Beyond the basic computational method calls, Spark also offers many more operations than Hadoop.

Compared with the general-purpose Hadoop, the Spark framework is particularly suited to certain specific algorithms. Because Spark caches the input data, the data does not have to be re-read for each computation, which greatly accelerates the work. For calculations that require iteration, caching the intermediate data allows the whole computation to finish quickly: for some iterative jobs, such as the k-means algorithm, Spark is reported to be roughly 20 times faster. In addition, the data cached in memory remains controllable: when not enough memory is available, one can choose to cache only a certain percentage of the data. This gives the framework considerable flexibility.
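A small sketch of why caching pays off for iterative jobs, written against Spark's Java RDD API (the data and the update rule are toy assumptions in the spirit of k-means): the input is marked with cache() once, and every subsequent pass reuses the in-memory copy instead of re-reading or recomputing it.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeCachingDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-caching-demo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // cache() keeps the RDD in memory after the first pass, so the
        // iterations below scan memory instead of reloading the input.
        JavaRDD<Double> points = sc.parallelize(
                Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0)).cache();

        // Toy iterative update: repeatedly move a "centre" toward the
        // mean of the cached points, as a k-means-like algorithm would.
        double centre = 0.0;
        for (int i = 0; i < 10; i++) {
            final double c = centre;
            double meanOffset = points.map(p -> p - c).reduce(Double::sum)
                    / points.count();
            centre = c + 0.5 * meanOffset;
        }
        System.out.println("centre = " + centre);

        sc.stop();
    }
}
```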

2.3.2 Task Execution Introduction

Spark's task execution framework also schedules tasks in master-slave mode: it consists of a main structure, the master, and subordinate structures, the workers, and concrete execution is coordinated by a driver. A user program connects to the master as a driver and specifies the creation and transformation of RDDs over the data set; the RDD operations are then sent to the task execution nodes, the workers. The workers, which also store the data required by the computation, carry out the specific tasks: when an RDD operation arrives, a worker applies the received operation definition to its local data to produce the desired result, which is finally returned or stored.
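A minimal driver sketch (the master URL and input path are assumptions for illustration): the program registers with the master as a driver, declares RDD transformations lazily, and only the final action makes the workers execute the computation on their local partitions and return the result.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("line-count")
                .setMaster("spark://master:7077"); // driver registers with the master

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations: lazily describe the dataset; nothing runs yet.
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/logs/app.log");
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // Action: the computation is shipped to the workers, which run it
        // on their local partitions and send the count back to the driver.
        long count = errors.count();
        System.out.println("error lines: " + count);

        sc.stop();
    }
}
```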

3 Framework Comparison

3.1 Architecture Comparison

Although all three distributed frameworks are based on a master-slave structure, their structures differ in detail. A good architectural design not only makes a framework easier to maintain later, but also makes it easier for developers to master its mechanics, and it leaves room for performance optimization [10]. The architectures of the three distributed computing frameworks are compared in the following table:

Table 1 Architecture comparison

Tab. 1 Architecture comparison

Framework   Architecture design       Storage                   Communication
Hadoop      JobTracker/TaskTracker    HDFS                      RPC/HTTP
Storm       Nimbus/supervisor         Real-time input stream    ZeroMQ message queues
Spark       Master/workers            Memory, disk              Shared/broadcast variables

3.2 Performance Comparison

All three distributed frameworks are used at large scale in different fields of industry, and each has the computing scenarios it suits best [11]. The performance of the three computational frameworks is compared in the following table:

Table 2 Performance comparison

Tab. 2 Performance comparison

Hadoop
  Advantages: written in Java; high performance
  Disadvantages: relatively high latency; fixed processing flow
  Usage: batch processing; latency-insensitive, offline data processing

Storm
  Advantages: receives data streams in real time; higher fault tolerance; simple development; multi-language support
  Disadvantages: depends on many other components; poor memory control
  Usage: real-time and stream data processing; distributed RPC computation

Spark
  Advantages: simple algorithm implementation; data cached in memory; more general computation; interactive task execution
  Disadvantages: requires large amounts of memory; poor efficiency for incremental updates
  Usage: batch processing; iterative tasks; most big-data processing tasks

4 Other Frameworks

Besides these commonly used frameworks, many other distributed computing frameworks play a large role in various fields. Around the time the Hadoop framework appeared, Microsoft introduced the distributed computing frameworks Dryad [12] and DryadLINQ [13]. There are also a number of frameworks based on real-time data streams similar to Storm, such as Yahoo's S4 computing framework and Berkeley's D-Streams [14]; Hadoop likewise has a data-flow implementation in the form of HStreaming.

The literature [15] offers prospects for future cloud computing frameworks; [16] discusses the impact of distributed computing frameworks on current project maintenance and forecasts their effect on future software maintenance; and [17] summarizes current progress in cloud computing and directions for its further development. In future framework development, data volumes will certainly be orders of magnitude larger than today's [18], putting the scalability of computational frameworks to a severe test; the time a computation may consume will face hard limits; the complexity of the data will become harder to control [19]; and higher demands will be placed on framework architectures [20] and computation models. For specific applications [21], the different distributed frameworks will also need to be optimized and upgraded accordingly.

Combining the current research directions, the future development of distributed frameworks is likely to expand along the following lines:

1) Distributed computing frameworks will be optimized toward leaner, clearer architectures; YARN, the second generation of Hadoop's distributed computing framework, is such an optimization of the Hadoop architecture. With a good architectural design, a framework is easier to maintain and its computation process is clearer.

2) Computational models will be further optimized. This can already be seen in the progression of current distributed computing frameworks, from Hadoop's batch computing to Storm's stream computing and on to Spark's functional programming model. With a good computational model, developing applications on the framework becomes simpler and more convenient.

3) The infrastructure beneath distributed computing frameworks will also receive research attention, to better support the frameworks built on top of it. In big-data computing, transferring data distributed across different machines is very expensive, so advances in infrastructure will also drive the performance of distributed computing frameworks.

5 Conclusion

This paper has systematically reviewed the distributed computing frameworks in use on today's Internet and given a detailed introduction to the architecture and computation process of each. It then examined the frameworks in detail, from the storage of the computed data to the data communication during computation, compared their performance to draw out the strengths and weaknesses of each, and indicated the applications each framework suits. Finally, it looked ahead to the future development of distributed computing frameworks.

