How to build a high-performance Hadoop cluster for big data processing

Source: Internet
Author: User

More and more enterprises are using Hadoop to process big data, but the overall performance of a Hadoop cluster depends on the balance between CPU, memory, network, and storage. In this article, we explore how to build a high-performance network for a Hadoop cluster, which is key to big data analysis.

About Hadoop

"Big data" is a loosely defined term for large collections of structured and unstructured data whose constant growth forces companies to manage them in new ways. Hadoop is the software framework Apache publishes to analyze petabytes of unstructured data and transform it into a form other applications can manage. Hadoop makes large-scale data processing practical and can help enterprises discover new business opportunities in customer data; the ability to process in real time, or close to it, provides a strong competitive advantage in many industries.

Hadoop is designed around Google's MapReduce and distributed file system principles and can be deployed on commodity network and server hardware to form a computing cluster.

The Hadoop model

Hadoop works by cutting a very large dataset into smaller units that are processed by separate queries. The compute resources of each node are used for parallel query processing. When the tasks finish, their results are summarized and reported to the user, or passed on for further analysis or dashboard display through a business-analysis application.

To minimize processing time, this parallel architecture has Hadoop "move the job to the data" rather than "move the data to the job," as in the traditional model. Once data is stored in the distributed file system, real-time search, query, and data-mining operations access it locally: during processing, each node sends only its local query result across the network, which reduces operating overhead.
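The network savings of "moving the job to the data" can be sketched with some back-of-the-envelope arithmetic. This is not Hadoop code; the node names and data sizes are hypothetical, chosen only to show the contrast in bytes crossing the network:

```python
# Sketch: bytes crossing the network when we "move data to the job"
# versus "move the job to the data". Node names and sizes are hypothetical.

nodes = {
    "node-1": 2_000_000_000,  # bytes of data stored locally on each node
    "node-2": 2_000_000_000,
    "node-3": 2_000_000_000,
}
RESULT_SIZE = 1_000  # bytes: a partial query result is tiny by comparison

# Traditional model: every node ships its raw data to a central processor.
bytes_moved_traditional = sum(nodes.values())

# Hadoop model: each node processes its own blocks locally and ships
# back only a small partial result for the final summary.
bytes_moved_hadoop = RESULT_SIZE * len(nodes)

print(bytes_moved_traditional)  # 6000000000
print(bytes_moved_hadoop)       # 3000
```

Under these assumed sizes, local processing moves six orders of magnitude less data over the network, which is exactly the overhead reduction the paragraph above describes.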

Hadoop's biggest features are its built-in parallel processing and linear scalability, which let it run queries over large datasets and generate results. Structurally, Hadoop has two main parts:

The Hadoop Distributed File System (HDFS) cuts data files into blocks and stores them across multiple nodes to provide fault tolerance and high performance. Beyond the aggregated I/O of many nodes, performance usually depends on the block size, such as 128MB, whereas the typical block size of a traditional Linux file system may be 4KB.
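The effect of the block size is easy to see with a quick calculation. This sketch assumes a hypothetical 1 GiB file and compares how many pieces it becomes under an HDFS-style 128 MiB block and a traditional 4 KiB filesystem block:

```python
# Sketch: number of blocks for a (hypothetical) 1 GiB file under an
# HDFS-style 128 MiB block size versus a traditional 4 KiB block size.
import math

FILE_SIZE = 1 * 1024**3  # 1 GiB in bytes

hdfs_blocks = math.ceil(FILE_SIZE / (128 * 1024**2))  # 128 MiB blocks
linux_blocks = math.ceil(FILE_SIZE / (4 * 1024))      # 4 KiB blocks

print(hdfs_blocks)   # 8
print(linux_blocks)  # 262144
```

Fewer, larger blocks mean each task streams one long sequential read instead of managing hundreds of thousands of small chunks, which is why HDFS favors big blocks for batch analysis.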

The MapReduce engine accepts analysis jobs from clients through the JobTracker node, breaks a large task into a number of smaller tasks in a "divide-and-conquer" fashion, and assigns them to the TaskTracker nodes in a master/slave distribution model (as shown in the following figure):

A Hadoop system has three main functional node types: client, master, and slave. The client injects data files into the system, submits analysis jobs through the system's master nodes, and retrieves results. The master nodes have two basic functions: managing the data stored on each slave node in the distributed file system, and managing the assignment and tracking of map/reduce tasks on the slave nodes. The actual performance of data storage and analysis depends on the slave nodes running the DataNode and TaskTracker processes, which communicate with and are controlled by their respective master nodes. A slave node usually holds multiple data blocks and is assigned multiple tasks during a job.
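The division of labor described above can be sketched in a few lines of plain Python. This is not the Hadoop API, just an illustration: the "map" function stands in for what TaskTracker nodes run against their local splits, the grouping step stands in for the framework's shuffle, and "reduce" merges the partial results:

```python
# Minimal word-count sketch of the map/shuffle/reduce flow (plain Python,
# not the Hadoop API; splits and node layout are hypothetical).
from collections import defaultdict

def map_phase(split):
    # Each mapper emits (word, 1) pairs for its own local split.
    return [(word, 1) for word in split.split()]

def reduce_phase(grouped):
    # Each reducer sums the counts for the keys assigned to it.
    return {word: sum(counts) for word, counts in grouped.items()}

splits = ["big data big", "data cluster"]  # one split per slave node

# Shuffle: group intermediate pairs by key across all mappers.
grouped = defaultdict(list)
for split in splits:
    for word, count in map_phase(split):
        grouped[word].append(count)

print(reduce_phase(grouped))  # {'big': 2, 'data': 2, 'cluster': 1}
```

In a real cluster the mappers run in parallel on separate machines against local HDFS blocks, and only the small intermediate pairs travel over the network to the reducers.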
