Hadoop is a highly scalable big data platform that can handle from dozens of terabytes to hundreds of petabytes of data across up to thousands of interconnected servers. This reference design implements a single-cabinet Hadoop cluster; users who need a multi-cabinet cluster can scale out easily by extending the number of servers and the network bandwidth in this design.
Hadoop application scenarios
Integrated Hadoop appliance design
Features of the Hadoop solution
Hadoop is a low-cost, highly scalable big data processing platform. It provides a stable shared storage and analysis system: storage is implemented by HDFS (distributed data storage) and data processing by MapReduce (distributed processing), while HBase serves as a real-time database alongside numerous application tools. A Hadoop system is a distributed platform consisting of up to hundreds of servers, each of which stores part of the data and performs part of the data processing.
Composition of the Hadoop cluster system
Hadoop server roles
HDFS (distributed data storage):
A distributed file system with high fault tolerance and high throughput for large-scale data. It can be built from a few to thousands of commodity servers and provides file read and write access with high aggregate input/output.
Main Features:
Builds highly reliable, fault-tolerant systems from low-cost storage and servers, with automatic data replication and self-healing
Supports GB- to TB-scale data files and provides PB-level storage capacity
Relaxed "consistency" for streaming data access, optimized for write-once, read-many workloads
High aggregate bandwidth and high concurrent access
Moving "computation" is cheaper than moving "data": storage and computation can be co-located on the same node
NameNode and DataNode
An HDFS cluster consists of one NameNode and multiple DataNodes.
The NameNode is a central server that manages the file system namespace and client access to files; it is the controller and manager of all HDFS metadata. The NameNode performs namespace operations such as opening, closing, and renaming files or directories.
A DataNode typically runs one per node and manages the storage attached to that node. DataNodes serve read and write requests from file system clients.
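To illustrate how the NameNode's metadata relates to DataNode storage, the sketch below computes how many blocks and replicas a file occupies. The block size and replication factor are illustrative common defaults, not values prescribed by this reference design:

```python
# Minimal sketch: how HDFS splits a file into blocks and replicates them.
# BLOCK_SIZE and REPLICATION are illustrative defaults (assumptions),
# not values prescribed by this reference design.
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a common HDFS 1.x default
REPLICATION = 3                 # common default replication factor

def block_replicas(file_size_bytes):
    """Return (number of blocks, total replicas stored across DataNodes)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, blocks * REPLICATION

# A 1 GB file occupies 16 blocks, stored as 48 replicas cluster-wide;
# the NameNode tracks only the metadata, the DataNodes hold the replicas.
print(block_replicas(1024 * 1024 * 1024))  # → (16, 48)
```

The NameNode never stores file data itself, which is why its memory and reliability requirements differ so sharply from those of the DataNodes.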
MapReduce (distributed data processing):
A distributed framework for big data processing that runs in parallel across a server cluster. It is designed for offline data analysis: it exploits data parallelism to distribute the computation and then aggregates the results.
Basic Features:
The framework handles splitting, distributing, and aggregating tasks; developers only need to implement the business logic
Failed distributed tasks are automatically retried; the unexpected failure of a single task does not cause the entire job to fail
Integrates with HDFS to move computation to the node where the data resides
The JobTracker is one of the most important components in the MapReduce framework: all job execution is scheduled through it. A Hadoop system is configured with only one JobTracker; an additional backup JobTracker can be added to implement high availability (HA) for MapReduce. Scheduling is carried out by two classes: a master service, the JobTracker, and multiple slave services, the TaskTrackers, running on the worker nodes.
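A concrete way to see this division of labor is a word-count job in the style of Hadoop Streaming, where the mapper and reducer are plain scripts distributed to TaskTrackers. The sketch below is illustrative and simulated locally; in a real cluster the two functions would be separate executables passed to hadoop-streaming.jar:

```python
# Minimal word-count sketch in the MapReduce style (illustrative only;
# on a real cluster the mapper and reducer would be separate executables
# launched by TaskTrackers under JobTracker scheduling).
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum counts per word. Input must be sorted by key,
    as the MapReduce shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["hadoop stores data", "hadoop processes data"]
counts = dict(reducer(sorted(mapper(lines))))
print(counts)  # → {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

The `sorted()` call stands in for the shuffle-and-sort step that the framework performs between the map and reduce phases.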
HBase (distributed database)
HBase is a distributed, column-oriented, multidimensional, table-structured real-time database. It provides high-speed read and write access to both structured and unstructured data and is designed for high-speed online services. Main features:
Supports tens of thousands of concurrent writes and queries per second
Scalable: automatic data splitting and distribution, with dynamic expansion and no downtime
Data is stored on the HDFS distributed file system and is not lost
Flexible table structure that can be changed and extended dynamically (including rows, columns, and timestamps)
Column-oriented storage with compression, effectively reducing disk I/O and increasing utilization
Multidimensional tables with four dimensions, three of which are variable, suitable for describing complex nested relationships
Network interconnection:
The Hadoop cluster uses a two-tier network topology. To get the most out of Hadoop, it is important to configure it correctly, and that includes the network topology. For multi-cabinet clusters, nodes must be mapped to cabinets; with this mapping in place, when placing MapReduce tasks on nodes, Hadoop prefers in-cabinet transfers over transfers between cabinets, and HDFS can place replicas more intelligently, balancing performance against resilience. Network locations such as nodes and cabinets are represented as a tree that reflects the "distance" between locations in the network. The NameNode uses network locations when deciding where to store block replicas, and when a map task is assigned to a TaskTracker, the JobTracker uses network locations to determine the closest replica of the map task's input.
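Hadoop learns the node-to-cabinet mapping from a user-supplied topology script (configured via `topology.script.file.name` in Hadoop 1.x): the script receives IPs or hostnames as arguments and prints one rack path per argument. A minimal sketch, assuming a hypothetical addressing convention in which the third octet of the IP identifies the cabinet:

```python
#!/usr/bin/env python
# Minimal rack-topology script sketch for Hadoop rack awareness.
# Hadoop invokes the script with one or more IPs/hostnames as arguments
# and expects one rack path per argument on stdout.
# The third-octet-to-rack convention below is a hypothetical assumption.
import sys

def rack_of(address):
    parts = address.split(".")
    if len(parts) == 4 and all(p.isdigit() for p in parts):
        return "/dc1/rack%s" % parts[2]   # e.g. 10.1.3.17 -> /dc1/rack3
    return "/default-rack"                # hostnames or unknown formats

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(rack_of(arg))
```

Without such a script, Hadoop places every node in a single default rack and loses the locality optimizations described above.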
Gigabit and 10-Gigabit Ethernet are currently the most common network technologies for Hadoop: within a cabinet, Gigabit Ethernet connects the individual nodes, and cabinets are interconnected via 10-Gigabit Ethernet. In the future, as the cost of 10-Gigabit Ethernet falls, it will also be used at the cabinet level and above. Hadoop itself can run over other interconnect technologies, such as InfiniBand, for applications that require very low latency, but Ethernet usually satisfies the majority of customer applications.
Hardware platform selection
Hadoop does not need to run on expensive, highly reliable hardware. It is designed to run on clusters of common dual-socket servers with large numbers of low-cost SATA hard drives; its I/O and data-processing performance comes from aggregation, so better processing power or storage performance can be achieved by reasonably expanding the number of nodes in the cluster or increasing the number of hard disks. At the same time, Hadoop tolerates hardware faults: data stored in the cluster and running tasks are not lost because of the failure of individual hardware. This design further reduces reliance on specialized hardware fault-tolerance technologies and lowers deployment costs.
For the various functions of a Hadoop cluster, consider the following server and network design:
Hadoop server design requirements
The NameNode coordinates data storage in the cluster and the JobTracker coordinates data computation tasks. The final node type is the Secondary NameNode; in a small cluster it can share a machine with the NameNode, while larger clusters can give it the same hardware as the NameNode. These nodes require fast response, low latency, and high reliability, so we recommend that customers run the NameNode, Secondary NameNode, and JobTracker on dual-socket Intel Xeon E5 platform servers with 48 GB of memory, SSD local storage, and enterprise-class RAID 10 disks.
For a cluster of 100 DataNodes, which requires processing capability matched to I/O performance, large storage capacity, and high network bandwidth, we recommend dual-socket Intel Xeon E5 platform servers to run the DataNodes; 32 GB of memory or more provides sufficient room for expansion.
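The raw versus usable capacity of such a DataNode tier can be estimated with simple arithmetic. The sketch below uses illustrative values for disks per node, disk size, replication factor, and overhead; these are assumptions for the example, not figures from this design:

```python
# Rough usable-capacity estimate for a DataNode tier (illustrative).
# All parameter values below are example assumptions, not figures
# prescribed by this reference design.
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3,
                       overhead=0.25):
    """Raw capacity, minus a fraction reserved for intermediate data and
    filesystem overhead, divided by the HDFS replication factor."""
    raw = nodes * disks_per_node * disk_tb
    return raw * (1 - overhead) / replication

# Example: 100 DataNodes, 6 x 2 TB SATA disks each, 3-way replication:
print(round(usable_capacity_tb(100, 6, 2.0), 1))  # → 300.0 (TB usable)
```

The point of the exercise is that 3-way replication plus working space typically leaves roughly a quarter of the raw capacity usable, which should be factored into sizing.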
When your Hadoop cluster grows beyond 20 machines, we recommend configuring the initial cluster across multiple cabinets, each with a top-of-rack Gigabit switch, and interconnecting the switches via 10-Gigabit Ethernet or InfiniBand.
Software Solution Selection
Operating system
Hadoop can flexibly support Windows, Linux, and Unix operating systems, but Linux is the most common choice in actual deployments. Among the many Linux distributions, we recommend the enterprise-class CentOS 6.3 x64 to take full advantage of the hardware platform.
Hadoop software
We recommend that industry users adopt a tested and validated commercial distribution. In this reference design we use the Intel distribution of Hadoop as the system software, which has been successfully deployed in customers' production environments, to ensure greater value from the Hadoop cluster.
Development tools
Hadoop development tools are very rich, and customers can choose according to their needs:
Hive (data warehouse): a big data warehouse engine based on Hadoop. It stores data in the distributed file system or a distributed database and uses an SQL-like language for massive-scale statistics, queries, and analysis.
ZooKeeper (coordination service): a reliable coordination system for large distributed systems. Its functions include configuration maintenance, naming services, distributed synchronization, and group services; it maintains system configuration, group membership, and naming information.
Pig (data processing): a big data analysis language and runtime platform based on Hadoop. Its architecture ensures that analysis tasks are distributed and run in parallel to meet the needs of massive data analysis.
Mahout (data mining): an extensible machine learning library that, combined with Hadoop, provides distributed data analysis capabilities.
Flume (log collection tool): a distributed, highly reliable, highly available log collection system used to collect, aggregate, and move large volumes of log data from different source systems into a centralized data store.
Sqoop (relational data ETL tool): a connector component that provides efficient bidirectional data transfer between Hadoop and structured data sources.
Management tools
Hadoop cluster applications are complex, and organizations often rely on enterprise-level support services to ensure high performance, reliability, and availability. Intel Manager for Hadoop is powerful, easy-to-use management software that simplifies the setup, management, security, and troubleshooting of Hadoop clusters, so enterprise IT staff can focus on getting the most business value from the Hadoop environment without worrying about cluster management.
Energy management: introduction to the DCM data center management platform
Intel® Data Center Manager (DCM) is a software product that monitors, manages, and optimizes the power and temperature of data center server groups. It is designed to address the following energy-efficiency challenges facing data centers:
Many data centers have run out of power capacity.
Imperfect cooling-system design leads to temperature hot spots and reduces achievable cabinet density.
Implementing power monitoring otherwise requires purchasing stand-alone, IP-addressable intelligent power strips, which are very expensive.
Accurate actual power-consumption data is unavailable, resulting in overly conservative planning and wasted resources.
Current server designs are inefficient at low load: even an idle server consumes about 50% of its maximum power.
Different OEMs support different proprietary power measurement and control protocols, making it difficult to manage all the devices in the data center with a single solution.
DCM can monitor and manage the overall power consumption of servers out-of-band, without affecting the operation of the server systems, and can propose reasonable energy-saving measures for the actual environment by analyzing historical data on server operation. Using Intel DCM power-control technology, a power-limiting policy can be enforced across the whole system according to the power the data center can supply, by adjusting CPU and memory operating frequencies and placing backup servers in their lowest power state.
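The kind of group power-limiting policy described above can be sketched as allocating a rack-level budget across servers. The proportional scheme below is a deliberately simplified illustration, not DCM's actual policy algorithm:

```python
# Simplified sketch of a group power cap: distribute a rack-level
# budget across servers in proportion to their current draw.
# This illustrates the idea only; Intel DCM's real policy engine is
# more sophisticated (priorities, Node Manager controls, scheduling).
def allocate_caps(current_watts, rack_budget):
    """Return a per-server power cap (watts) honoring rack_budget."""
    total = sum(current_watts)
    if total <= rack_budget:
        return list(current_watts)        # no throttling needed
    scale = rack_budget / total
    return [w * scale for w in current_watts]

caps = allocate_caps([300, 200, 500], rack_budget=800)
print(caps)  # each server is throttled to 80% of its current draw
```

In practice the per-server caps would then be enforced by platform mechanisms such as CPU and memory frequency scaling, as described above.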
Introduction to the DCM Console
The DCM Console is server energy-management software with a web-based graphical user interface (GUI) that provides data center energy-management functions built on the DCM software development kit (SDK).
Features and value of the DCM Console
Monitoring
Real-time monitoring of actual power consumption and inlet temperature for cabinets, rows, rooms, and user-defined physical/logical groups
Receive alerts based on custom power and temperature events
Power-estimation engine for legacy servers that lack power monitoring
Monitors Cisco EnergyWise switch power consumption
Displays server tags and serial numbers for HP, IBM, and Dell brands
Support for Cisco rack servers and UCS systems
Indicates server cooling effectiveness
Trend Analysis
Records power and temperature data and supports filtered queries of trend data
Historical data can be stored for up to 1 year for resource planning
Control
Patented Smart Group Policy engine
Multiple concurrently effective power policy types can be supported at multiple hierarchy levels
Workload priority can be used in policy directives
Policies (including power caps) can be scheduled by time of day and/or day of week
Can maintain a server group's power limit while dynamically adjusting to changing server load
Intel Node Manager 2.0 technology, which supports memory power limits and dynamic CPU core allocation
Agentless
No software agents are installed on managed nodes
Easy to integrate and co-exist
Uses IP address ranges to discover devices
Support for advanced Web Services Description Language (WSDL) APIs
Can reside on a standalone management server or coexist with ISV software on the same server
Power/temperature awareness for flexible scheduling; airflow channel and outlet temperature modeling (requires OEM support)
Outlet temperature sensors (requires OEM support)
Scalability
Can manage up to 10,000 managed nodes
Security
Uses an API that incorporates security features
Secure communication with managed nodes
Encrypt all sensitive data
The main features of the Intel Data Center management platform include:
Power monitoring: monitors power-consumption metrics at different levels, from devices, cabinets, and rows up to computer rooms and entire data centers.
Temperature monitoring: monitors data center temperatures in real time.
Power control: Implement policies for devices and groups, limiting data center power consumption.
Device discovery: finds supported devices on the network, including blades, rack servers, chassis, some power distribution units (PDUs), and uninterruptible power supplies (UPSs).
Event Management: Monitors and manages events for groups or devices.
Scalability of the reference design
Scale-out deployment:
In practical applications, processing more data faster requires growing the server cluster from a single cabinet to multiple cabinets. A multi-cabinet deployment can be extended very easily from the single-cabinet deployment in this reference design.
Performance scalability
Limited by experimental equipment, we often cannot perform full-scale performance testing for large deployments.
However, our assessment confirms that Hadoop cluster performance grows linearly with the number of server nodes, so the performance of a full deployment can be estimated from tests on a small or partial server deployment.
The following results are from Intel's lab testing of HDFS scan performance on deployments of 2 to 64 DataNodes:
Intel Hadoop HDFS Scan profiling diagram
The blue performance curve in the figure fits the formula: HDFS scan performance (MB/s) = 103.23 × (number of nodes) + 206.23
The formula's results agree closely with the actual test results (correlation coefficient R² > 0.99), demonstrating that larger-cluster performance can be estimated by testing a small number of nodes.
When implementing this reference design, customers can use the same method: collect test data from a small number of nodes, fit an empirical performance-scaling formula, and use it to estimate the performance of a larger cluster.
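Applying the fitted formula above, the expected scan performance for a given node count can be computed directly. The function below simply encodes the empirical coefficients reported in the test; extrapolating beyond the tested 2 to 64 node range is an estimate, not a guarantee:

```python
# Empirical HDFS scan performance model from the lab test above:
# performance = 103.23 * nodes + 206.23, fitted on 2 to 64 DataNodes.
# Results outside that range are extrapolations.
def hdfs_scan_perf(nodes):
    """Estimated HDFS scan performance (MB/s) for a given node count."""
    return 103.23 * nodes + 206.23

for n in (2, 16, 64):
    print(n, round(hdfs_scan_perf(n), 2))
```

For example, the model predicts roughly 6,813 MB/s for the 64-node configuration, matching the upper end of the measured range.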
* Tested on an Intel R2308GL4G platform with two Xeon E5-2640 processors, 48 GB DDR3 memory, 6 SATA 6 Gb/s HDDs (7,200 rpm), and dual Gigabit NIC teaming