In the big data era, IT vendors have focused their research on optimizing the software architecture of big data systems, business logic, data analysis algorithms, and node performance, while neglecting the evaluation and optimization of the network links in the underlying infrastructure. This article introduces Cisco's experience designing and optimizing the network architecture of Hadoop cluster environments.
Network characteristics of a big data Hadoop environment
The nodes in a Hadoop cluster are connected to one another by the network, and the following MapReduce procedures move data across it.
(1) Writing data. Data is written when initial data or chunks of data are loaded into HDFS. Each block written must be replicated to other nodes, and those replicas are transmitted over the network.
(2) Job execution.
① Map phase. The Map phase of the algorithm needs very little network transfer. At the start of a Map task, data crosses the network only when the HDFS data is not local, that is, when the block is not stored on the node and must be copied from another node.
② Shuffle phase. This is the phase of job execution in which data moves across the network, and how much moves depends on the job. The output of the Mappers is transferred to the Reducers and sorted at this point.
③ Reduce phase. The Reducer already has the data it needs from the Shuffle phase, so the network does not need to transfer data at this stage.
④ Output replication. The MapReduce output is stored as files on HDFS. When the output is written to HDFS, the replicas of the result blocks are transmitted over the network.
(3) Reading data. Data is read when an application, such as a website, an index, or an SQL database, reads from HDFS. In addition, the network is critical to Hadoop's control layer: HDFS signaling, operations and maintenance traffic, and the MapReduce framework itself all depend on it.
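The traffic sources listed above can be put into a rough back-of-the-envelope model. The Python sketch below is illustrative only and is not taken from Cisco's tests. The function names are hypothetical, the replication factor of 3 is an assumption (it is the usual HDFS default), the writer is assumed to run on a DataNode so that the first replica stays local, and all shuffle output is assumed to cross the network, which makes the shuffle figure an upper bound.

```python
def hdfs_write_network_bytes(data_bytes, replication=3, writer_on_datanode=True):
    """Estimate the bytes that cross the network when data is written to HDFS.

    Assumption: if the writer runs on a DataNode, the first replica is stored
    locally and only the remaining replicas travel over the network.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    network_copies = replication - 1 if writer_on_datanode else replication
    return data_bytes * network_copies


def job_network_bytes(input_bytes, local_fraction, shuffle_bytes,
                      output_bytes, replication=3):
    """Estimate the network traffic of one MapReduce job, phase by phase.

    local_fraction is the share of input blocks read from the local node.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    # Map phase: only non-local blocks are copied from other nodes.
    map_read = input_bytes * (1.0 - local_fraction)
    # Shuffle phase: Mapper output moves to the Reducers (upper bound).
    shuffle = shuffle_bytes
    # Reduce phase: works on data already delivered by the shuffle, so 0.
    # Output replication: replicas of the result blocks cross the network.
    output_copy = hdfs_write_network_bytes(output_bytes, replication)
    return map_read + shuffle + output_copy


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical job: 10 GiB input, 90% read locally, 4 GiB shuffled,
    # 1 GiB of output written back to HDFS.
    total = job_network_bytes(10 * gib, 0.9, 4 * gib, 1 * gib)
    print(f"estimated network traffic: {total / gib:.1f} GiB")
```

Even with such a simple model, the shuffle and the replicated writes dominate the estimate, which matches the description of the phases above.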
Five network characteristics
Cisco tested the network in a Hadoop cluster environment. The results show that a resilient network is very important to a Hadoop cluster. In decreasing order of impact, the network characteristics that significantly affect a Hadoop cluster are: network availability and resiliency, burst traffic handling and queue depth, network oversubscription ratio, DataNode network access, and network latency.
(1) Network availability and resiliency. Deploy a highly redundant, scalable network that can support the growth of the Hadoop cluster. Designs that provide multiple links between DataNodes are better than designs with one or two points of failure. Switches and routers have been proven in the industry to provide network availability for servers.
(2) Burst traffic handling and queue depth. Some HDFS operations and MapReduce jobs generate bursts of traffic, for example loading files into HDFS or writing result files to HDFS. If the network cannot absorb a burst, it drops packets, so adequate buffering softens the impact of burst traffic. Choose switches and routers whose buffering and queuing can handle traffic bursts effectively.
(3) Network oversubscription ratio. A good network design must account for congestion at the critical points of the network. A ToR switch that receives 20 Gbps from its servers but has only two 1 Gbps uplink ports (a 10:1 oversubscription ratio) can drop packets and seriously degrade cluster performance. An overprovisioned network, on the other hand, is expensive. In general, an oversubscription ratio of about 4:1 is acceptable at the server access layer, and about 2:1 between the access layer and the aggregation layer or the core layer.
(4) DataNode network access. Bandwidth should be provisioned according to the cluster workload. Nodes in a typical cluster have one or two 1 Gbps uplink ports. Whether to choose 10 Gbps servers is a trade-off between price and performance.
(5) Network latency. Variations in switch and router latency have limited impact on cluster performance. Application-layer latency affects tasks more than network latency does. Network latency can still have an impact on the application system, for example by causing unnecessary application switching.
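The oversubscription arithmetic in item (3) is easy to check with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the 4:1 threshold is the approximate access-layer guideline quoted above, not a hard limit.

```python
def oversubscription_ratio(downlink_gbps, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth on a switch."""
    if uplink_gbps <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return downlink_gbps / uplink_gbps


def within_guideline(ratio, limit):
    """True if the ratio does not exceed the chosen guideline (e.g. 4 for 4:1)."""
    return ratio <= limit


def required_uplink_gbps(downlink_gbps, target_ratio):
    """Uplink bandwidth needed to bring a switch down to the target ratio."""
    if target_ratio <= 0:
        raise ValueError("target ratio must be positive")
    return downlink_gbps / target_ratio


if __name__ == "__main__":
    # The ToR example from the text: 20 Gbps from servers, two 1 Gbps uplinks.
    ratio = oversubscription_ratio(20, 2 * 1)
    print(f"oversubscription: {ratio:.0f}:1")
    print("within 4:1 access-layer guideline:", within_guideline(ratio, 4))
    print(f"uplink needed for 4:1: {required_uplink_gbps(20, 4):.0f} Gbps")
```

For the ToR example, the ratio works out to 10:1, well above the roughly 4:1 guideline, and about 5 Gbps of uplink capacity would be needed to reach 4:1.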