The environment for this configuration is Hadoop 1.2.1. Hadoop 2.0 was introduced in 2013; it builds on the Hadoop 1.0 release and improves the efficiency of cluster task scheduling, resource allocation, and fault handling.
The first change Hadoop 2.0 makes relative to Hadoop 1.0 is to HDFS. In Hadoop 1.0, an HDFS system allows only one NameNode. The GFS paper does describe a shadow node hidden in the cluster as a backup for the master, which should correspond to the SecondaryNameNode configured in Hadoop 1.0. In Hadoop 2.0, an HDFS system can have multiple NameNodes that are independent of each other, and every DataNode registers with all of them. This improves both the horizontal scalability and the availability of the system.
The other change in Hadoop 2.0 is to the MapReduce runtime framework. In Hadoop 1.0, after the JobClient submits a job to the master server, the JobTracker splits the job into tasks and hands them to the slave servers for computation. The master server's duties include assigning tasks, allocating resources, tracking task execution, and handling failed tasks. All of this is concentrated on the master server, which makes it a single point of failure and increases the probability that task allocation fails. In Hadoop 2.0, resource management and task management are separated: the master node only allocates resources and monitors job state, while other work such as job partitioning and task status detection is handed to the slave nodes. This is the latest YARN framework, and its system structure is as shown:
As can be seen, the master server is only responsible for running the ResourceManager, which manages resources. The specific tasks are managed by the ApplicationMaster, whose duties include task allocation, status tracking, and error handling.
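The division of labor described above can be sketched as a toy model. This is only an illustration of the idea, not the YARN API: the class names mirror the roles in the text (the per-job manager is called ApplicationMaster in YARN), while the methods `allocate`, `release`, and `run`, the `max_attempts` parameter, and the task names are hypothetical.

```python
class ResourceManager:
    """Master-side role: hands out containers and knows nothing about tasks."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, wanted):
        # Grant as many containers as are free, up to the number requested.
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-job role: tracks task status and retries failed tasks."""

    def __init__(self, rm, tasks, max_attempts=2):
        self.rm = rm
        self.pending = list(tasks)
        self.max_attempts = max_attempts
        self.status = {}

    def run(self, execute):
        """Run every task; execute(task, attempt) returns True on success."""
        attempts = {task: 0 for task in self.pending}
        while self.pending:
            granted = self.rm.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            for task in batch:
                attempts[task] += 1
                if execute(task, attempts[task]):
                    self.status[task] = "succeeded"
                elif attempts[task] < self.max_attempts:
                    self.pending.append(task)  # error handling: retry later
                else:
                    self.status[task] = "failed"
            self.rm.release(granted)
        return self.status


if __name__ == "__main__":
    rm = ResourceManager(total_containers=2)
    am = ApplicationMaster(rm, ["split-0", "split-1", "split-2"])
    # Hypothetical task runner: split-1 fails on its first attempt only.
    print(am.run(lambda task, attempt: not (task == "split-1" and attempt == 1)))
```

The point of the sketch is that the resource manager never inspects task state. Retry and status bookkeeping live entirely in the per-job object, so a failure in one job's manager does not take the resource allocator down with it.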
Configuring the Hadoop environment mostly comes down to a few files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
core-site.xml: Mainly configures the address that clients use to reach the cluster, that is, the default file system URI (fs.default.name in Hadoop 1.x).
hdfs-site.xml: Holds the configuration of the HDFS system, including the locations of the NameNode metadata directory and the DataNode data directory.
mapred-site.xml: Configures the JobTracker address and port, and the local directory for the intermediate results of map operations.
yarn-site.xml: A configuration file specific to Hadoop 2 that configures the YARN framework. The specific settings are documented on the official Hadoop website and are not filled in here.
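As a rough illustration of what the three Hadoop 1.x files contain, here is a minimal sketch that renders them from a dictionary. The property names (fs.default.name, dfs.name.dir, dfs.data.dir, mapred.job.tracker, mapred.local.dir) are the standard Hadoop 1.x ones. The host name `master`, the ports, and all directory paths are placeholders assumed for the example, not values from the setup described here, and `render_site_xml` is a hypothetical helper.

```python
from xml.sax.saxutils import escape

# Placeholder values: host name, ports, and paths are assumptions.
MASTER = "master"

SITE_FILES = {
    "core-site.xml": {
        # URI of the default file system (the NameNode).
        "fs.default.name": f"hdfs://{MASTER}:9000",
    },
    "hdfs-site.xml": {
        "dfs.name.dir": "/home/hadoop/hdfs/name",  # NameNode metadata
        "dfs.data.dir": "/home/hadoop/hdfs/data",  # DataNode block storage
    },
    "mapred-site.xml": {
        "mapred.job.tracker": f"{MASTER}:9001",  # JobTracker host:port
        # Local directory for intermediate map output.
        "mapred.local.dir": "/home/hadoop/mapred/local",
    },
}


def render_site_xml(properties):
    """Return the text of a Hadoop *-site.xml file for the given properties."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for filename, properties in SITE_FILES.items():
        print(f"--- {filename} ---")
        print(render_site_xml(properties))
```

Running the script prints the three files so they can be compared with the ones in the conf directory of the Hadoop installation. yarn-site.xml is left out on purpose, since it only applies to Hadoop 2.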
Configuring the Hadoop 1 environment is pretty straightforward, but the content of the Hadoop 2 configuration files changed significantly. I tried configuring 2.6 but did not succeed.
Using the Hadoop 1 cluster, I ran some performance tests. The cluster consists of one master and two slaves; each slave has a single core and 2 GB of memory. The test program is the WordCount example that ships with Hadoop.
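For readers unfamiliar with WordCount, the logic it applies can be sketched locally in a few lines. This is a minimal single-process sketch of the map and reduce steps, not the Java class bundled with Hadoop; the function names are made up for the example, and tokens are split on whitespace.

```python
from collections import Counter
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines):
    """Run the map step on every line, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_words(line) for line in lines))


if __name__ == "__main__":
    sample = ["hello hadoop", "hello world"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

On a cluster the map step runs on the slaves against separate input splits, and the emitted pairs are grouped by word before the reduce step. Here everything happens in one process.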
I generated a single 134 MB file locally. A hand-written word-count program took 17 seconds to process it locally, while the cluster took 1 minute 44 seconds for the same file. With a single file, Hadoop does not show the performance it should have. I then tested 10 such files: the local run took 3 minutes 34 seconds and the first cluster run took 3 minutes 44 seconds, so the performance is already quite close. Considering that the two slaves combined are still less powerful than my one laptop, this result is acceptable. Running the job a second time immediately after the first finished took 2 minutes 50 seconds. The result looks like this:
Hadoop Cluster Environment configuration