How to build a high-performance Hadoop cluster for big data processing

Source: Internet
Author: User

More and more enterprises are using Hadoop to process big data, but the overall performance of a Hadoop cluster depends on the balance between CPU, memory, network, and storage. In this article, we explore how to build a high-performance network for a Hadoop cluster, which is key to big data analysis.

About Hadoop

"Big data" is a loosely defined term for large collections of structured and unstructured data whose constant growth forces companies to manage them in new ways. Hadoop is the software framework Apache publishes to analyze petabytes of unstructured data and transform it into a form other applications can manage. Hadoop makes large-scale data processing practical and can help enterprises discover new business opportunities in customer data; the ability to process in real time, or close to it, provides a strong competitive advantage in many industries.

Hadoop is designed around Google's MapReduce and distributed file system principles and can be deployed on commodity network and server hardware to form a computing cluster.

The Hadoop model

Hadoop works by cutting a very large dataset into smaller units that are processed by separate queries. The compute resources of each node are used for parallel query processing. When the tasks finish, their results are summarized and reported to the user, or passed on for further analysis or dashboard display through a business-analysis application.

To minimize processing time, this parallel architecture has Hadoop "move the job to the data" rather than "move the data to the job," as in the traditional model. Once data is stored in the distributed file system, real-time search, query, and data-mining operations access it locally: during processing, each node sends only its local query result across the network, which reduces operating overhead.
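The network savings of "moving the job to the data" can be sketched with some back-of-the-envelope arithmetic. This is not Hadoop code; the node names and data sizes are hypothetical, chosen only to show the contrast in bytes crossing the network:

```python
# Sketch: bytes crossing the network when we "move data to the job"
# versus "move the job to the data". Node names and sizes are hypothetical.

nodes = {
    "node-1": 2_000_000_000,  # bytes of data stored locally on each node
    "node-2": 2_000_000_000,
    "node-3": 2_000_000_000,
}
RESULT_SIZE = 1_000  # bytes: a partial query result is tiny by comparison

# Traditional model: every node ships its raw data to a central processor.
bytes_moved_traditional = sum(nodes.values())

# Hadoop model: each node processes its own blocks locally and ships
# back only a small partial result for the final summary.
bytes_moved_hadoop = RESULT_SIZE * len(nodes)

print(bytes_moved_traditional)  # 6000000000
print(bytes_moved_hadoop)       # 3000
```

Under these assumed sizes, local processing moves six orders of magnitude less data over the network, which is exactly the overhead reduction the paragraph above describes.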

Hadoop's biggest features are its built-in parallel processing and linear scalability, which let it run queries over large datasets and generate results. Structurally, Hadoop has two main parts:

The Hadoop Distributed File System (HDFS) cuts data files into blocks and stores them across multiple nodes to provide fault tolerance and high performance. Beyond the aggregated I/O of many nodes, performance usually depends on the block size, such as 128MB, whereas the typical block size of a traditional Linux file system may be 4KB.
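The effect of the block size is easy to see with a quick calculation. This sketch assumes a hypothetical 1 GiB file and compares how many pieces it becomes under an HDFS-style 128 MiB block and a traditional 4 KiB filesystem block:

```python
# Sketch: number of blocks for a (hypothetical) 1 GiB file under an
# HDFS-style 128 MiB block size versus a traditional 4 KiB block size.
import math

FILE_SIZE = 1 * 1024**3  # 1 GiB in bytes

hdfs_blocks = math.ceil(FILE_SIZE / (128 * 1024**2))  # 128 MiB blocks
linux_blocks = math.ceil(FILE_SIZE / (4 * 1024))      # 4 KiB blocks

print(hdfs_blocks)   # 8
print(linux_blocks)  # 262144
```

Fewer, larger blocks mean each task streams one long sequential read instead of managing hundreds of thousands of small chunks, which is why HDFS favors big blocks for batch analysis.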

The MapReduce engine accepts analysis jobs from clients through the JobTracker node, breaks a large task into a number of smaller tasks in a "divide-and-conquer" fashion, and assigns them to the TaskTracker nodes in a master/slave distribution model (as shown in the following figure):

A Hadoop system has three main functional node types: client, master, and slave. The client injects data files into the system, submits analysis jobs through the system's master nodes, and retrieves results. The master nodes have two basic functions: managing the data stored on each slave node in the distributed file system, and managing the assignment and tracking of map/reduce tasks on the slave nodes. The actual performance of data storage and analysis depends on the slave nodes running the DataNode and TaskTracker processes, which communicate with and are controlled by their respective master nodes. A slave node usually holds multiple data blocks and is assigned multiple tasks during a job.
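The division of labor described above can be sketched in a few lines of plain Python. This is not the Hadoop API, just an illustration: the "map" function stands in for what TaskTracker nodes run against their local splits, the grouping step stands in for the framework's shuffle, and "reduce" merges the partial results:

```python
# Minimal word-count sketch of the map/shuffle/reduce flow (plain Python,
# not the Hadoop API; splits and node layout are hypothetical).
from collections import defaultdict

def map_phase(split):
    # Each mapper emits (word, 1) pairs for its own local split.
    return [(word, 1) for word in split.split()]

def reduce_phase(grouped):
    # Each reducer sums the counts for the keys assigned to it.
    return {word: sum(counts) for word, counts in grouped.items()}

splits = ["big data big", "data cluster"]  # one split per slave node

# Shuffle: group intermediate pairs by key across all mappers.
grouped = defaultdict(list)
for split in splits:
    for word, count in map_phase(split):
        grouped[word].append(count)

print(reduce_phase(grouped))  # {'big': 2, 'data': 2, 'cluster': 1}
```

In a real cluster the mappers run in parallel on separate machines against local HDFS blocks, and only the small intermediate pairs travel over the network to the reducers.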
