Design and optimization of network architecture in Hadoop cluster environment

Source: Internet
Author: User
Keywords: big data, Hadoop

In the big data era, IT vendors working on big data tend to focus on optimizing the system software architecture, business logic, data analysis algorithms, and node performance, while neglecting the evaluation and optimization of the network links in the big data infrastructure. This article presents Cisco's experience with network architecture design and optimization in a Hadoop cluster environment.

  

Network characteristics of a big data Hadoop environment

The nodes in a Hadoop cluster are connected through the network, and the following MapReduce procedures transfer data across it.

(1) Writing data. Data is written when initial data or large chunks of data are loaded into HDFS. Each block that is written must be replicated to other nodes, and those replicas are transferred over the network.

(2) Job execution.

① Map stage. The map phase needs little network transfer. At the start of a map task, data crosses the network only when the HDFS data is not local, that is, when the data block is not stored on the node running the task and must be copied from another node.

② Shuffle stage. This is the stage of job execution in which data is transmitted over the network, and the volume depends on the job. The output of the mappers is transferred to the reducers at this point so that it can be sorted.

③ Reduce stage. Because the data the reducers need has already arrived during the shuffle phase, no network transfer is required at this stage.

④ Output replication. The MapReduce output is stored as files on HDFS. When the output is written to HDFS, the resulting replicas are transmitted over the network.

(3) Reading data. Data is read when an application, such as a website, an index, or a SQL database, reads from HDFS.

In addition, the network is very important to the Hadoop control layer, for example HDFS signaling and operations, and the MapReduce framework is likewise affected by the network.
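The phases listed above can be turned into a rough per-phase estimate of how much data crosses the network. The following is only a minimal sketch: the `JobProfile` class, its field names, and the function name are hypothetical, and the model assumes that every byte of map output is shuffled to another node and that the first HDFS replica is written locally, so only the remaining copies travel over the network. Real jobs will differ.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical description of one MapReduce job (all fields are assumptions)."""
    input_bytes: int              # data read by the mappers
    local_read_fraction: float    # share of map input stored on the local node, 0..1
    map_output_bytes: int         # intermediate data handed to the shuffle
    output_bytes: int             # final data written back to HDFS
    replication: int = 3          # HDFS replication factor (3 is the HDFS default)


def network_bytes_by_phase(job: JobProfile) -> dict:
    """Rough estimate of the bytes that cross the network in each phase."""
    # Map: only blocks that are not local have to be copied from another node.
    map_bytes = round(job.input_bytes * (1.0 - job.local_read_fraction))
    # Shuffle: assume all mapper output is sent to a reducer on another node.
    shuffle_bytes = job.map_output_bytes
    # Reduce: works on data that the shuffle has already delivered.
    reduce_bytes = 0
    # Output replication: assume the first replica is local, the rest are remote.
    replication_bytes = job.output_bytes * max(job.replication - 1, 0)
    return {
        "map": map_bytes,
        "shuffle": shuffle_bytes,
        "reduce": reduce_bytes,
        "output_replication": replication_bytes,
    }
```

Calling `network_bytes_by_phase` with profiles for different jobs makes it easy to see which phase dominates: a job with a large intermediate result is shuffle-heavy, while a job that mostly writes results is dominated by output replication.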

Five kinds of network characteristics

Cisco tested the network in a Hadoop cluster environment, and the results show that a resilient network is very important to a Hadoop cluster. The network characteristics with the greatest influence on the cluster, in descending order of impact, are: network availability and resiliency, burst handling and queue depth, the network oversubscription (overload) ratio, DataNode network access, and network latency.

(1) Network availability and resiliency. Deploy a highly redundant and scalable network that can support the growth of the Hadoop cluster. Designs with multiple links between DataNodes are better than those with one or two points of failure. Switches and routers have a proven record in the industry of providing network availability to servers.

(2) Burst handling and queue depth. Some HDFS operations and MapReduce jobs generate bursts of traffic, for example loading files into HDFS or writing result files to HDFS over the network. If the network cannot absorb a burst, it drops packets, so adequate buffering is needed to mitigate the impact of traffic bursts. Choose switches and routers whose buffers and queues can handle traffic bursts effectively.

(3) Network oversubscription (overload) ratio. A good network design must account for congestion at the key points in the network. A ToR switch that receives 20 Gbps of data from its servers but has only two 1 Gbps uplink ports (a 10:1 oversubscription ratio) will drop packets, which seriously affects cluster performance. An over-provisioned network, on the other hand, is expensive. In general, an oversubscription ratio of about 4:1 is acceptable at the server access layer, and about 2:1 between the access layer and the aggregation layer or core layer.

(4) DataNode network access. Bandwidth should be provisioned according to the cluster workload. Nodes in a typical cluster have one or two 1 Gbps ports. Moving servers to 10 Gbps is a trade-off between price and performance.

(5) Network latency. Variations in switch and router latency have limited impact on cluster performance. Application-layer latency affects tasks more than network latency does. Network latency can still have a potential impact on the application system, for example by causing unnecessary application switching.
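Two of the characteristics above lend themselves to a quick calculation: the oversubscription (overload) ratio from item (3) and the effect of buffer depth on a traffic burst from item (2). The sketch below is illustrative only. The function names and the single-queue, tick-based drop model are assumptions made for this example and do not describe how any particular switch behaves.

```python
def oversubscription_ratio(server_gbps: float, uplink_ports: int, uplink_gbps: float) -> float:
    """Ratio of the traffic offered by the servers to the switch uplink capacity."""
    uplink_capacity = uplink_ports * uplink_gbps
    if uplink_capacity <= 0:
        raise ValueError("uplink capacity must be positive")
    return server_gbps / uplink_capacity


def dropped_during_burst(arrivals, drain_per_tick: int, buffer_size: int) -> int:
    """Count packets dropped by one FIFO queue with a fixed-size buffer.

    arrivals: packets arriving in each time tick (a hypothetical traffic trace).
    drain_per_tick: packets the outgoing link can send in one tick.
    buffer_size: packets the queue can hold from one tick to the next.
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        queued += arriving
        # The link forwards as much as it can during this tick.
        queued -= min(queued, drain_per_tick)
        # Anything that does not fit in the buffer is discarded.
        if queued > buffer_size:
            dropped += queued - buffer_size
            queued = buffer_size
    return dropped


if __name__ == "__main__":
    # ToR example from the text: 20 Gbps from servers, two 1 Gbps uplinks.
    print(oversubscription_ratio(20, uplink_ports=2, uplink_gbps=1))
    # Same burst, shallow buffer versus deeper buffer.
    burst = [10, 50, 10, 0]
    print(dropped_during_burst(burst, drain_per_tick=20, buffer_size=10))
    print(dropped_during_burst(burst, drain_per_tick=20, buffer_size=30))
```

With the ToR figures from the text, the first function gives the 10:1 ratio. The second function shows why queue depth matters: for the same burst and the same link speed, the deeper buffer absorbs traffic that the shallow one has to drop.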
