How to build a high-performance Hadoop cluster for big data processing

Source: Internet
Author: User
Keywords: high-performance server, big data

More and more enterprises are using Hadoop to process big data, but the overall performance of a Hadoop cluster depends on the balance between CPU, memory, network, and storage. In this article, we explore how to build a high-performance network for a Hadoop cluster, which is key to analyzing big data.

About Hadoop

"Big Data" is a loose set of data, and the constant growth of massive amounts of data forces companies to manage them in a new way. Large data is a large collection of structured or unstructured data types. Hadoop, however, is the software architecture that Apache publishes to analyze petabytes of unstructured data and transform it into a form that other applications can manage. Hadoop makes it possible for large data processing and can help enterprises to explore new business opportunities from customer data. If you can do real-time processing or close to real-time processing, it will provide users with a strong advantage in many industries.

Hadoop is designed around the principles of Google's MapReduce and distributed file system (GFS), and can be deployed on commodity network and server hardware to form a computing cluster.

Hadoop model

Hadoop works by splitting a very large dataset into smaller units that a query processes in parallel, using the compute resources of the nodes that hold the data. When a job finishes, its results are aggregated and reported back to the user, or passed to a business analytics application for further analysis or dashboard display.

To minimize processing time, this parallel architecture "moves jobs to the data" rather than "moving data to the jobs", as traditional systems do. Once the data is stored in the distributed file system, real-time search, query, and data-mining operations access it locally: during processing, each node works only on its local data and returns only its local results, which greatly reduces overhead.

Hadoop's defining features are built-in parallel processing and near-linear scalability, which let queries run against very large datasets and return results. Structurally, Hadoop has two main parts:

The Hadoop Distributed File System (HDFS) splits data files into blocks and stores replicas of them across multiple nodes to provide fault tolerance and high performance. Besides the aggregate I/O of many nodes, performance usually depends on the block size, for example 128 MB, whereas the typical block size of a traditional Linux file system is about 4 KB.
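
As a minimal sketch of how block size shows up in the HDFS client API (the 128 MB value mirrors the text; the file path and the dfs.blocksize property name follow Hadoop 2.x conventions and are assumptions for illustration, not details from the article), a client can request a block size when writing a file and read it back afterwards:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            long blockSize = 128L * 1024 * 1024;          // 128 MB HDFS blocks vs ~4 KB local blocks
            conf.setLong("dfs.blocksize", blockSize);     // cluster default (Hadoop 2.x property name)

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/demo/large-input.dat"); // hypothetical path

            // An explicit block size can also be given per file:
            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, blockSize)) {
                out.writeBytes("sample record\n");
            }

            // Block size actually recorded for the file:
            System.out.println("block size: " + fs.getFileStatus(path).getBlockSize() + " bytes");
            fs.close();
        }
    }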

The MapReduce engine accepts analysis jobs from clients through the JobTracker node, breaks each large job into many smaller tasks in a "divide-and-conquer" fashion, and assigns them to the TaskTracker nodes in a master/slave distribution model (as shown in the following figure):

The Hadoop system has three main functional roles: client, master, and slave. The client loads data files into the system, submits analysis jobs, and retrieves the results. The master nodes have two basic functions: managing data storage across the slave nodes of the distributed file system (the NameNode), and managing map/reduce task assignment and tracking on the slave nodes (the JobTracker). The actual performance of data storage and analysis depends on the slave nodes running the DataNode and TaskTracker daemons, which communicate with and are controlled by their respective master nodes. A slave node usually holds multiple data blocks and is assigned multiple tasks during a job.
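
The divide-and-conquer flow described above is easiest to see in code. The sketch below is a minimal word-count job written against Hadoop's Java MapReduce API (the newer org.apache.hadoop.mapreduce classes rather than the JobTracker-era API; the input and output paths are illustrative): the framework splits the input across the nodes that hold it, each node runs the map function on its local split, and the reduce function aggregates the shuffled partial results.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: runs on the node holding each input split, emits (word, 1) pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: aggregates the partial counts shuffled in from the mappers.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);   // pre-aggregate locally to cut shuffle traffic
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/demo/input"));    // illustrative paths
            FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }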

Deploying Hadoop

The main hardware requirement for each node is a balance among four resources: CPU, memory, network, and storage. The most common approach, often described as the "best" solution, is to deploy relatively low-cost commodity hardware, with enough servers to tolerate any likely failures, in complete rack systems.

The Hadoop model calls for servers with direct-attached storage (DAS) rather than SAN or NAS. There are three main reasons to use DAS in a standardized cluster that scales to thousands of nodes: storage system cost, low latency, and growing storage capacity requirements, with simplicity of configuration and deployment also a major consideration. With extremely cost-effective 1 TB disks now available, the terabytes of data in a large cluster can be stored on DAS. This avoids the traditional approach of deploying a SAN, which at this scale is extremely expensive and would make the starting cost of Hadoop and its data storage daunting. A significant portion of user Hadoop deployments are built with large-capacity DAS servers, with roughly 1-2 TB per data node and roughly 1-5 TB per name (NameNode) node, as shown in the following illustration:

Source: Brad Hedlund, Dell Inc.
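
As a brief, hedged illustration of how DAS is wired into HDFS (the mount points below are assumptions; in a real deployment the property lives in hdfs-site.xml on each data node, not in client code), each DataNode lists one storage directory per locally attached disk, and the aggregate capacity those disks contribute is visible to clients through the FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class CapacityReport {
        public static void main(String[] args) throws Exception {
            // On each data node, hdfs-site.xml would list one directory per DAS disk, e.g.:
            //   dfs.datanode.data.dir = /data/disk1,/data/disk2,/data/disk3,/data/disk4
            // (the JobTracker-era Hadoop 1.x releases used the name dfs.data.dir).
            Configuration conf = new Configuration();

            // From a client, the aggregate HDFS capacity those disks provide can be read back:
            FileSystem fs = FileSystem.get(conf);
            FsStatus status = fs.getStatus();
            System.out.printf("capacity: %.2f TB, used: %.2f TB, remaining: %.2f TB%n",
                    status.getCapacity() / 1e12,
                    status.getUsed() / 1e12,
                    status.getRemaining() / 1e12);
            fs.close();
        }
    }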

For most Hadoop deployments, the rest of the infrastructure depends on components such as the Gigabit Ethernet cards built into the servers and Gigabit Ethernet switches. For the previous generation of hardware, the CPU and memory choices dictated by the cost model matched the data transfer rate of a Gigabit Ethernet interface, so deploying Hadoop over Gigabit Ethernet was a reasonable, low-cost choice.

The effect of Gigabit Ethernet on the Hadoop cluster

Gigabit Ethernet performance is a major constraint on the overall performance of a Hadoop system. With large block sizes, for example, if one node fails (or worse, an entire rack goes down), the cluster has to re-replicate terabytes of data, which can exceed the bandwidth Gigabit Ethernet can provide and drag down the performance of the entire cluster. And in a large cluster of tens of thousands of nodes, Gigabit Ethernet links can become congested even during normal operation, whenever a workload has to shuffle intermediate results between data nodes.
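
To put rough numbers on this (the 10 TB figure is an assumption for illustration, not from the article): if a failed node held 10 TB of block replicas and a single saturated 1 Gb/s link (about 125 MB/s) were the bottleneck, re-replication would take roughly 10,000,000 MB / 125 MB/s ≈ 80,000 seconds, or about 22 hours; over a 10 Gb/s link the same transfer would take a little over 2 hours. Real recovery traffic is spread across many source and destination nodes, but the per-link arithmetic shows why the network, rather than disk or CPU, often determines recovery time.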

The goal for each Hadoop data node must be a balance of CPU, memory, storage, and network resources. If any one of the four underperforms, the potential processing capacity of the system is likely to hit a bottleneck. Adding more CPU cores and memory shifts that balance against storage and the network, and raises the question of how to make the cluster's nodes process data more efficiently, deliver results faster, and still accommodate more HDFS storage within the cluster.

Fortunately, the same Moore's-law progress that drives CPU and memory development is also driving storage technology (multi-terabyte disks) and Ethernet technology (from Gigabit to 10 Gigabit and beyond). Upgrading system components such as multi-core processors, 5-20 TB of disk per node, and 64-128 GB of memory, together with network components such as 10 Gigabit Ethernet cards and switches, is the most reasonable way to rebalance these resources. 10 Gigabit Ethernet will prove its value in the Hadoop cluster, where high network utilization translates into more usable bandwidth. The following figure shows how the Hadoop cluster connects over 10 Gigabit Ethernet:

Many enterprise data centers have already migrated to 10GbE networks for server consolidation and virtualization. As more companies deploy Hadoop, they are finding that they do not have to deploy large numbers of 1U rack servers; instead they can deploy fewer but higher-performance servers, which makes it easier to scale up the number of tasks each data node can run. Many enterprises choose 2U or 4U servers (such as the Dell PowerEdge C2100) with roughly 12-16 cores and 24 TB of storage per node. A reasonable choice in this environment is to take full advantage of the 10GbE equipment already deployed by putting 10GbE network adapters in the Hadoop cluster as well.

Building a simple Hadoop cluster in an everyday IT environment is, to be sure, straightforward: although many details need fine-tuning, the basics are very simple. Building a system that balances computing, storage, and network resources is critical to the success of the project. For Hadoop clusters with dense nodes, 10 Gigabit Ethernet can provide the network capacity to match the expansion of computing and storage resources without degrading overall system performance.
