There is a lot to cover about the components and key steps of the Hadoop reference design, so the following sections break them down to give you a detailed introduction.
Software
Operating system: Hadoop supports any operating system that can run a Java environment. In practice, most customers choose a 64-bit version of one of the Linux distributions. For this reference design we chose the free, enterprise-class CentOS 6.3 x64.
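To confirm that a node's Java environment is in place before installing Hadoop, a trivial check such as the following can be compiled and run on the target system; the class name EnvCheck is an illustrative choice and not part of any Hadoop package.

```java
// Minimal sketch: print the OS and Java runtime a node reports, so you can
// confirm a 64-bit Linux plus working Java environment before installing Hadoop.
public class EnvCheck {
    public static void main(String[] args) {
        System.out.println("OS:   " + System.getProperty("os.name")
                + " " + System.getProperty("os.arch"));
        System.out.println("Java: " + System.getProperty("java.version")
                + " (" + System.getProperty("java.vendor") + ")");
    }
}
```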
Hadoop system: Hadoop is open-source software released under the Apache license, and customers can choose between the free open-source version and commercially supported distributions. The free open-source version still contains a considerable number of software bugs, so it requires a certain amount of development effort to verify and harden before use. For big data applications in industry, a more mature and reliable commercially supported distribution is generally recommended. Intel provides its own Intel Distribution for Apache Hadoop*, which includes a number of performance optimizations and improvements aimed at industry application needs, so our reference design uses the Intel distribution.
Network
The Hadoop system flexibly supports different Ethernet technologies, and the Intel Distribution for Apache Hadoop* also adds support for InfiniBand.
Gigabit and 10 Gigabit Ethernet are the most common choices in big data industry applications, and our reference design uses these Ethernet technologies as well.
InfiniBand is used in Hadoop deployments where data storage or processing has special low-latency requirements.
Key steps in reference design implementation
Hardware Device Deployment
Cabinet deployment
As mentioned earlier, a Hadoop solution is typically deployed in cabinets. A cabinet usually contains one or two switches, multiple servers, and the corresponding power distribution units (PDUs).
In our simulated installation environment we use only four servers. The installation process is essentially the same as for a larger deployment (except that more DataNodes have to be installed by repeating the same steps), differing only slightly in the network design needed to connect more servers.
Network Connections
In the experimental system, each server has six Ethernet ports, numbered eth0 through eth5 as seen from the rear; eth0 through eth3 are 1 Gb/s ports, while eth4 and eth5 are 10 Gb/s ports. In the preconfigured setup, all servers are connected to the same local area network, and any network port (other than the management port) can be selected to meet these needs.
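If you want to verify from software which ports are up and which addresses they carry on a given server, a small program using the standard java.net.NetworkInterface API can be run on each node; this is purely an illustrative check and not part of the Hadoop installation procedure.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

// Lists the network interfaces visible on this node (e.g. eth0..eth5) together
// with their addresses, so you can confirm which port Hadoop will bind to.
public class ListNics {
    public static void main(String[] args) throws SocketException {
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            if (!nic.isUp()) continue;  // skip interfaces that are down
            StringBuilder addrs = new StringBuilder();
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                addrs.append(addr.getHostAddress()).append(' ');
            }
            System.out.println(nic.getName() + ": " + addrs);
        }
    }
}
```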
Screen diagram:
Network environment topology:
Software deployment
The following figure depicts the components of the Intel Distribution for Apache Hadoop*; a brief HDFS usage sketch follows the list:
Distributed File System (HDFS)
Distributed Database (HBase)
Distributed Data Warehouse (Hive)
Distributed Data Analysis (Pig)
Parallel Computing Framework (MapReduce)
Distributed Synchronization Software (ZooKeeper)
Data Mining (Mahout)
Structured Data Connector (Sqoop)
Log Data Connector (Flume)
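To make the first component above concrete, here is a minimal sketch that uses the standard Hadoop FileSystem API to create, read back, and delete a small file on HDFS. It assumes a running cluster whose core-site.xml (with fs.defaultFS pointing at the NameNode) is on the classpath; the class name HdfsSmokeTest and the path /tmp/idh-smoke-test.txt are arbitrary illustrative choices.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS smoke test: create a file, read it back, then delete it.
// Assumes core-site.xml (with fs.defaultFS set to the NameNode) is on the classpath.
public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/idh-smoke-test.txt");

        try (FSDataOutputStream out = fs.create(path, true)) {   // overwrite if present
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println("Read back: " + in.readUTF());
        }
        fs.delete(path, false);   // clean up
        fs.close();
    }
}
```

The same FileSystem API is what higher-level components such as Hive, Pig, and MapReduce jobs use underneath to access data stored in HDFS.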