Hadoop is a highly scalable big data platform that can handle from dozens of terabytes to hundreds of petabytes of data across up to thousands of interconnected servers. This reference design implements a single-cabinet Hadoop cluster; users who need a multi-cabinet cluster can scale out easily by extending the number of servers and the network bandwidth in this design.
Hadoop application scenarios
Integrated Hadoop appliance design
Features of the Hadoop solution
Hadoop is a low-cost, highly scalable big data processing platform. It provides a stable shared storage and analysis system: storage is implemented by HDFS (distributed data storage) and data processing by MapReduce (distributed processing), while HBase serves as a real-time database alongside numerous application tools. A Hadoop system is a distributed platform consisting of up to hundreds of servers, each of which stores part of the data and performs part of the data processing.
Composition of the Hadoop cluster system
Hadoop server roles
HDFS (distributed data storage):
A distributed file system with high fault tolerance and high throughput for large-scale data. It can be built from a few to thousands of commodity servers and provides file read and write access with high aggregate input/output.
Main Features:
Builds highly reliable, fault-tolerant systems from low-cost storage and servers, with automatic data replication and self-healing
Supports GB- to TB-scale data files and provides PB-level storage capacity
Relaxed "consistency" for streaming data access, optimized for write-once, read-many workloads
High aggregate bandwidth and high concurrent access
Moving "computation" is cheaper than moving "data": storage and computation can be co-located on the same node
NameNode and DataNode
An HDFS cluster consists of one NameNode and multiple DataNodes.
The NameNode is a central server that manages the file system namespace and client access to files; it is the controller and manager of all HDFS metadata. The NameNode performs namespace operations such as opening, closing, and renaming files or directories.
A DataNode typically runs one per node and manages the storage attached to that node. DataNodes serve read and write requests from file system clients.
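To illustrate how the NameNode's metadata relates to DataNode storage, the sketch below computes how many blocks and replicas a file occupies. The block size and replication factor are illustrative common defaults, not values prescribed by this reference design:

```python
# Minimal sketch: how HDFS splits a file into blocks and replicates them.
# BLOCK_SIZE and REPLICATION are illustrative defaults (assumptions),
# not values prescribed by this reference design.
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a common HDFS 1.x default
REPLICATION = 3                 # common default replication factor

def block_replicas(file_size_bytes):
    """Return (number of blocks, total replicas stored across DataNodes)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, blocks * REPLICATION

# A 1 GB file occupies 16 blocks, stored as 48 replicas cluster-wide;
# the NameNode tracks only the metadata, the DataNodes hold the replicas.
print(block_replicas(1024 * 1024 * 1024))  # → (16, 48)
```

The NameNode never stores file data itself, which is why its memory and reliability requirements differ so sharply from those of the DataNodes.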
MapReduce (distributed data processing):
A distributed framework for big data processing that runs in parallel across a server cluster. It is designed for offline data analysis: it exploits data parallelism to distribute the computation and then aggregates the results.
Basic Features:
The framework handles splitting, distributing, and aggregating tasks; developers only need to implement the business logic
Failed distributed tasks are automatically retried; the unexpected failure of a single task does not cause the entire job to fail
Integrates with HDFS to move computation to the node where the data resides
The JobTracker is one of the most important components in the MapReduce framework: all job execution is scheduled through it. A Hadoop system is configured with only one JobTracker; an additional backup JobTracker can be added to implement high availability (HA) for MapReduce. Scheduling is carried out by two classes: a master service, the JobTracker, and multiple slave services, the TaskTrackers, running on the worker nodes.
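A concrete way to see this division of labor is a word-count job in the style of Hadoop Streaming, where the mapper and reducer are plain scripts distributed to TaskTrackers. The sketch below is illustrative and simulated locally; in a real cluster the two functions would be separate executables passed to hadoop-streaming.jar:

```python
# Minimal word-count sketch in the MapReduce style (illustrative only;
# on a real cluster the mapper and reducer would be separate executables
# launched by TaskTrackers under JobTracker scheduling).
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum counts per word. Input must be sorted by key,
    as the MapReduce shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["hadoop stores data", "hadoop processes data"]
counts = dict(reducer(sorted(mapper(lines))))
print(counts)  # → {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

The `sorted()` call stands in for the shuffle-and-sort step that the framework performs between the map and reduce phases.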
HBase (distributed database)
HBase is a distributed, column-oriented, multidimensional, table-structured real-time database. It provides high-speed read and write access to both structured and unstructured data and is designed for high-speed online services. Main features:
Supports tens of thousands of concurrent writes and queries per second
Scalable: automatic data splitting and distribution, with dynamic expansion and no downtime
Data is stored on the HDFS distributed file system and is not lost
Flexible table structure that can be changed and extended dynamically (including rows, columns, and timestamps)
Column-oriented storage with compression, effectively reducing disk I/O and increasing utilization
Multidimensional tables with four dimensions, three of which are variable, suitable for describing complex nested relationships
Network interconnection:
The Hadoop cluster uses a two-tier network topology. To get the most out of Hadoop, it is important to configure it correctly, and that includes the network topology. For multi-cabinet clusters, nodes must be mapped to cabinets; with this mapping in place, when placing MapReduce tasks on nodes, Hadoop prefers in-cabinet transfers over transfers between cabinets, and HDFS can place replicas more intelligently, balancing performance against resilience. Network locations such as nodes and cabinets are represented as a tree that reflects the "distance" between locations in the network. The NameNode uses network locations when deciding where to store block replicas, and when a map task is assigned to a TaskTracker, the JobTracker uses network locations to determine the closest replica of the map task's input.
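Hadoop learns the node-to-cabinet mapping from a user-supplied topology script (configured via `topology.script.file.name` in Hadoop 1.x): the script receives IPs or hostnames as arguments and prints one rack path per argument. A minimal sketch, assuming a hypothetical addressing convention in which the third octet of the IP identifies the cabinet:

```python
#!/usr/bin/env python
# Minimal rack-topology script sketch for Hadoop rack awareness.
# Hadoop invokes the script with one or more IPs/hostnames as arguments
# and expects one rack path per argument on stdout.
# The third-octet-to-rack convention below is a hypothetical assumption.
import sys

def rack_of(address):
    parts = address.split(".")
    if len(parts) == 4 and all(p.isdigit() for p in parts):
        return "/dc1/rack%s" % parts[2]   # e.g. 10.1.3.17 -> /dc1/rack3
    return "/default-rack"                # hostnames or unknown formats

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(rack_of(arg))
```

Without such a script, Hadoop places every node in a single default rack and loses the locality optimizations described above.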
Gigabit and 10-Gigabit Ethernet are currently the most common network technologies for Hadoop: within a cabinet, Gigabit Ethernet connects the individual nodes, and cabinets are interconnected via 10-Gigabit Ethernet. In the future, as the cost of 10-Gigabit Ethernet falls, it will also be used at the cabinet level and above. Hadoop itself can run over other interconnect technologies, such as InfiniBand, for applications that require very low latency, but Ethernet usually satisfies the majority of customer applications.
Hardware platform selection
Hadoop does not need to run on expensive, highly reliable hardware. It is designed to run on clusters of common dual-socket servers with large numbers of low-cost SATA hard drives; its I/O and data-processing performance comes from aggregation, so better processing power or storage performance can be achieved by reasonably expanding the number of nodes in the cluster or increasing the number of hard disks. At the same time, Hadoop tolerates hardware faults: data stored in the cluster and running tasks are not lost because of the failure of individual hardware. This design further reduces reliance on specialized hardware fault-tolerance technologies and lowers deployment costs.
For the various functions of a Hadoop cluster, consider the following server and network design:
Hadoop server design requirements
The NameNode coordinates data storage in the cluster and the JobTracker coordinates data computation tasks. The final node type is the Secondary NameNode; in a small cluster it can share a machine with the NameNode, while larger clusters can give it the same hardware as the NameNode. These nodes require fast response, low latency, and high reliability, so we recommend that customers run the NameNode, Secondary NameNode, and JobTracker on dual-socket Intel Xeon E5 platform servers with 48 GB of memory, SSD local storage, and enterprise-class RAID 10 disks.
For a cluster of 100 DataNodes, which requires processing capability matched to I/O performance, large storage capacity, and high network bandwidth, we recommend dual-socket Intel Xeon E5 platform servers to run the DataNodes; 32 GB of memory or more provides sufficient room for expansion.
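The raw versus usable capacity of such a DataNode tier can be estimated with simple arithmetic. The sketch below uses illustrative values for disks per node, disk size, replication factor, and overhead; these are assumptions for the example, not figures from this design:

```python
# Rough usable-capacity estimate for a DataNode tier (illustrative).
# All parameter values below are example assumptions, not figures
# prescribed by this reference design.
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3,
                       overhead=0.25):
    """Raw capacity, minus a fraction reserved for intermediate data and
    filesystem overhead, divided by the HDFS replication factor."""
    raw = nodes * disks_per_node * disk_tb
    return raw * (1 - overhead) / replication

# Example: 100 DataNodes, 6 x 2 TB SATA disks each, 3-way replication:
print(round(usable_capacity_tb(100, 6, 2.0), 1))  # → 300.0 (TB usable)
```

The point of the exercise is that 3-way replication plus working space typically leaves roughly a quarter of the raw capacity usable, which should be factored into sizing.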
When your Hadoop cluster grows beyond 20 machines, we recommend configuring the initial cluster across multiple cabinets, each with a top-of-rack Gigabit switch, and interconnecting the switches via 10-Gigabit Ethernet or InfiniBand.
Software Solution Selection
Operating system
Hadoop can flexibly support Windows, Linux, and Unix operating systems, but Linux is the most common choice in actual deployments. Among the many Linux distributions, we recommend the enterprise-class CentOS 6.3 x64 to take full advantage of the hardware platform.
Hadoop software
We recommend that industry users adopt a tested and validated commercial distribution. In this reference design we use the Intel distribution of Hadoop as the system software, which has been successfully deployed in customers' production environments, to ensure greater value from the Hadoop cluster.
Development tools
Hadoop development tools are very rich, and customers can choose according to their needs:
Hive (data warehouse): a big data warehouse engine based on Hadoop. It stores data in the distributed file system or a distributed database and uses an SQL-like language for massive-scale statistics, queries, and analysis.
ZooKeeper (coordination service): a reliable coordination system for large distributed systems. Its functions include configuration maintenance, naming services, distributed synchronization, and group services; it maintains system configuration, group membership, and naming information.
Pig (data processing): a big data analysis language and runtime platform based on Hadoop. Its architecture ensures that analysis tasks are distributed and run in parallel to meet the needs of massive data analysis.
Mahout (data mining): an extensible machine learning library that, combined with Hadoop, provides distributed data analysis capabilities.
Flume (log collection tool): a distributed, highly reliable, highly available log collection system used to collect, aggregate, and move large volumes of log data from different source systems into a centralized data store.
Sqoop (relational data ETL tool): a connector component that provides efficient bidirectional data transfer between Hadoop and structured data sources.
Management tools
Hadoop cluster applications are complex, and organizations often rely on enterprise-level support services to ensure high performance, reliability, and availability. Intel Manager for Hadoop is powerful, easy-to-use management software that simplifies the setup, management, security, and troubleshooting of Hadoop clusters, so enterprise IT staff can focus on getting the most business value from the Hadoop environment without worrying about cluster management.
Energy management: introduction to the DCM data center management platform
Intel® Data Center Manager (DCM) is a software product that monitors, manages, and optimizes the power and temperature of data center server groups. It is designed to address the following energy-efficiency challenges facing data centers:
Many data centers have run out of power capacity.
Imperfect cooling-system design leads to temperature hot spots and reduces achievable cabinet density.
Implementing power monitoring otherwise requires purchasing stand-alone, IP-addressable intelligent power strips, which are very expensive.
Accurate actual power-consumption data is unavailable, resulting in overly conservative planning and wasted resources.
Current server designs are inefficient at low load: even an idle server consumes about 50% of its maximum power.
Different OEMs support different proprietary power measurement and control protocols, making it difficult to manage all the devices in the data center with a single solution.
DCM can monitor and manage the overall power consumption of servers out-of-band, without affecting the operation of the server systems, and can propose reasonable energy-saving measures for the actual environment by analyzing historical data on server operation. Using Intel DCM power-control technology, a power-limiting policy can be enforced across the whole system according to the power the data center can supply, by adjusting CPU and memory operating frequencies and placing backup servers in their lowest power state.
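The kind of group power-limiting policy described above can be sketched as allocating a rack-level budget across servers. The proportional scheme below is a deliberately simplified illustration, not DCM's actual policy algorithm:

```python
# Simplified sketch of a group power cap: distribute a rack-level
# budget across servers in proportion to their current draw.
# This illustrates the idea only; Intel DCM's real policy engine is
# more sophisticated (priorities, Node Manager controls, scheduling).
def allocate_caps(current_watts, rack_budget):
    """Return a per-server power cap (watts) honoring rack_budget."""
    total = sum(current_watts)
    if total <= rack_budget:
        return list(current_watts)        # no throttling needed
    scale = rack_budget / total
    return [w * scale for w in current_watts]

caps = allocate_caps([300, 200, 500], rack_budget=800)
print(caps)  # each server is throttled to 80% of its current draw
```

In practice the per-server caps would then be enforced by platform mechanisms such as CPU and memory frequency scaling, as described above.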
Introduction to the DCM Console
The DCM Console is server energy-management software with a web-based graphical user interface (GUI) that provides data center energy-management functions built on the DCM software development kit (SDK).
Features and value of the DCM Console
Monitoring
Real-time monitoring of actual power consumption and inlet temperature for cabinets, rows, rooms, and user-defined physical/logical groups
Receive alerts based on custom power and temperature events
Power-estimation engine for legacy servers that lack power monitoring
Monitors Cisco EnergyWise switch power consumption
Displays server tags and serial numbers for HP, IBM, and Dell brands
Support for Cisco rack servers and UCS systems
Indicates server cooling effectiveness
Trend Analysis
Records power and temperature data and supports filtered queries of trend data
Historical data can be stored for up to 1 year for resource planning
Control
Patented Smart Group Policy engine
Multiple concurrently effective power policy types can be supported at multiple hierarchy levels
Workload priority can be used in policy directives
Policies (including power caps) can be scheduled by time of day and/or day of week
Can maintain a server group's power limit while dynamically adjusting to changing server load
Intel Node Manager 2.0 technology, which supports memory power limits and dynamic CPU core allocation
Agentless
No software agents are installed on managed nodes
Easy to integrate and co-exist
Uses IP address ranges to discover devices
Support for advanced Web Services Description Language (WSDL) APIs
Can reside on a standalone management server or coexist with ISV software on the same server
Power/temperature awareness for flexible scheduling; airflow channel and outlet temperature modeling (requires OEM support)
Outlet temperature sensors (requires OEM support)
Scalability
Can manage up to 10,000 managed nodes
Security
Uses an API that incorporates security features
Secure communication with managed nodes
Encrypt all sensitive data
The main features of the Intel Data Center management platform include:
Power monitoring: monitors power-consumption metrics at different levels, from devices, cabinets, and rows up to computer rooms and entire data centers.
Temperature monitoring: monitors data center temperatures in real time.
Power control: Implement policies for devices and groups, limiting data center power consumption.
Device discovery: finds supported devices on the network, including blades, rack servers, chassis, some power distribution units (PDUs), and uninterruptible power supplies (UPSs).
Event Management: Monitors and manages events for groups or devices.
Scalability of the reference design
Scale-out deployment:
In practical applications, processing more data faster requires growing the server cluster from a single cabinet to multiple cabinets. A multi-cabinet deployment can be extended very easily from the single-cabinet deployment in this reference design.
Performance scalability
Limited by experimental equipment, we often cannot perform full-scale performance testing for large deployments.
However, our assessment confirms that Hadoop cluster performance grows linearly with the number of server nodes, so the performance of a full deployment can be estimated from tests on a small or partial server deployment.
The following results are from Intel's lab testing of HDFS scan performance on deployments of 2 to 64 DataNodes:
Intel Hadoop HDFS Scan profiling diagram
The blue performance curve in the figure fits the formula: HDFS scan performance (MB/s) = 103.23 × (number of nodes) + 206.23
The formula's results agree closely with the actual test results (correlation coefficient R² > 0.99), demonstrating that larger-cluster performance can be estimated by testing a small number of nodes.
When implementing this reference design, customers can use the same method: collect test data from a small number of nodes, fit an empirical performance-scaling formula, and use it to estimate the performance of a larger cluster.
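Applying the fitted formula above, the expected scan performance for a given node count can be computed directly. The function below simply encodes the empirical coefficients reported in the test; extrapolating beyond the tested 2 to 64 node range is an estimate, not a guarantee:

```python
# Empirical HDFS scan performance model from the lab test above:
# performance = 103.23 * nodes + 206.23, fitted on 2 to 64 DataNodes.
# Results outside that range are extrapolations.
def hdfs_scan_perf(nodes):
    """Estimated HDFS scan performance (MB/s) for a given node count."""
    return 103.23 * nodes + 206.23

for n in (2, 16, 64):
    print(n, round(hdfs_scan_perf(n), 2))
```

For example, the model predicts roughly 6,813 MB/s for the 64-node configuration, matching the upper end of the measured range.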
* Tested on an Intel R2308GL4G platform with two Xeon E5-2640 processors, 48 GB DDR3 memory, 6 SATA 6 Gb/s HDDs (7,200 rpm), and dual Gigabit NIC teaming