1. Hadoop 2.0 Brief Introduction [1]
Compared with the previous stable hadoop-1.x releases, Apache Hadoop 2.x contains significant changes that bring improvements to both HDFS and MapReduce.
HDFS: To let the name service scale horizontally, HDFS federation uses multiple independent NameNodes and namespaces. The NameNodes are federated, meaning they are independent of each other and do not need to coordinate with one another. A DataNode can store blocks for all of the NameNodes, and it registers with every NameNode in the cluster. DataNodes periodically send heartbeats and block reports to the NameNodes and carry out the commands the NameNodes return.
YARN (Next Generation MapReduce): The new architecture, introduced in hadoop-0.23, splits the two main functions of the JobTracker, resource management and job lifecycle management, into separate components. The new ResourceManager is responsible for allocating computing resources to applications and for scheduling and coordination among the applications.
An application is either a single MapReduce job in the traditional sense or a DAG (directed acyclic graph) of such jobs. The ResourceManager, together with the NodeManager that manages each machine, forms the computing fabric of the entire platform.
The per-application ApplicationMaster is, in effect, a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor the tasks.
2. Directory Structure of Hadoop 2.0 [2]
The directory structure of Hadoop 2.0 closely resembles that of a Linux system; the purpose of each directory is as follows:
(1) In the new version of Hadoop, users are divided into different groups, just as in Linux. The executables and scripts are therefore split into two parts, stored in the bin and sbin directories respectively. The sbin directory holds scripts that only the superuser has permission to run, such as start-dfs.sh, start-yarn.sh, stop-dfs.sh and stop-yarn.sh; these operate on the whole cluster, so only the superuser may use them. The scripts in the bin directory can be executed by any user; they are generally commands that operate on specific files or block pools in the cluster, such as uploading files or checking cluster usage (a short usage sketch follows this list of directories).
(2) The etc directory holds what was kept in the conf directory before 0.23.0, that is, the configuration files for Common, HDFS and MapReduce (YARN).
(3) The include and lib directories contain the header files and link libraries for developing against Hadoop's C-language interface.
(4) The libexec directory holds Hadoop's internal configuration scripts. I have not traced how all of these scripts are used; at the moment I only add the JAVA_HOME environment variable in the hadoop-config.sh file.
(5) The logs directory is not present in the downloaded package; it is created, along with the log files inside it, once Hadoop is installed and run.
(6) The share folder contains the doc documentation and, most importantly, the jar files built from the Hadoop source code; these are the jars needed to run Hadoop.
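As a quick illustration of the bin/sbin split, here is a minimal sketch (it assumes the PATH settings made in section 4 below; the file name and target directory are examples only, not taken from the article):
# superuser only: cluster-wide start/stop scripts live in sbin
start-dfs.sh
start-yarn.sh
# any user: file and cluster queries use the commands in bin
hdfs dfs -put example.txt /user/hadoop/   # upload a file to HDFS
hdfs dfsadmin -report                     # view cluster capacity and usage
# superuser only: stop the cluster again
stop-yarn.sh
stop-dfs.sh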
3. Learn about Hadoop configuration files [3]
(1) dfs.hosts records the list of machines allowed to join the cluster as DataNodes
(2) mapred.hosts records the list of machines allowed to join the cluster as TaskTrackers
(3) dfs.hosts.exclude and mapred.hosts.exclude record the machines to be removed, respectively (a decommissioning sketch follows this list)
(4) masters records the machines that run the secondary NameNode
(5) slaves records the machines that run DataNodes and TaskTrackers
(6) hadoop-env.sh records the environment variables used by the scripts that run Hadoop
(7) core-site.xml holds the Hadoop core configuration items, such as I/O settings common to HDFS and MapReduce
(8) hdfs-site.xml holds the configuration items for the HDFS daemons: the NameNode, the secondary NameNode and the DataNodes
(9) mapred-site.xml holds the configuration items for the MapReduce daemons: the JobTracker and the TaskTrackers
(10) hadoop-metrics.properties controls how metrics are published in Hadoop
(11) log4j.properties sets the properties of the system log files, the NameNode audit log, and the task logs of the TaskTracker child processes
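As a rough sketch of how the include/exclude files from items (1) to (3) are used: assuming hdfs-site.xml points the dfs.hosts.exclude property at etc/hadoop/exclude (this wiring and the hostname are assumptions for illustration, not from the article), decommissioning a DataNode looks roughly like this:
echo "datanode3.example.com" >> etc/hadoop/exclude   # mark the node for removal
hdfs dfsadmin -refreshNodes                          # tell the NameNode to re-read the lists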
4. Hadoop detailed configuration [4,5]
Download hadoop-2.0.0-alpha.tar.gz from the Hadoop official website and put it in a shared folder, then unpack it under /usr/lib with tar, as shown below.
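As a sketch, the unpack step looks like this; /mnt/hgfs/share is the VMware shared-folder path used here and will differ on other machines:
cd /usr/lib
tar -zxvf /mnt/hgfs/share/hadoop-2.0.0-alpha.tar.gz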
(1) Edit ~/.bashrc with gedit (gedit ~/.bashrc) and add:
export HADOOP_PREFIX="/usr/lib/hadoop-2.0.0-alpha"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
Save and exit, then run source ~/.bashrc to make the changes take effect.
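A quick optional check that the variables are in effect (a sketch; the exact output depends on your installation):
source ~/.bashrc
echo $HADOOP_PREFIX   # should print /usr/lib/hadoop-2.0.0-alpha
hadoop version        # should report release 2.0.0-alpha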
(2) Edit core-site.xml in the etc/hadoop directory.
(3) Edit hdfs-site.xml in the etc/hadoop directory. (The full contents of both files are given in section 5 below.)
The paths
file:/home/hadoop/workspace/hadoop_space/hadoop23/dfs/name and
file:/home/hadoop/workspace/hadoop_space/hadoop23/dfs/data
are folders on the local machine; the locations where the name table and the data blocks are stored must be written as full URIs.
(4) Create mapred-site.xml in the etc/hadoop directory with the content given in section 5 below.
The paths
file:/home/hadoop/workspace/hadoop_space/hadoop23/mapred/system and
file:/home/hadoop/workspace/hadoop_space/hadoop23/mapred/local
are folders on the local machine where MapReduce keeps its system and local data; these locations must also be written as full URIs (a sketch of the corresponding properties follows).
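The mapred-site.xml listing in section 5 is cut off before it reaches these two paths. As a sketch, they would typically be configured with properties like the following; the property names are the Hadoop 2 equivalents of the old mapred.system.dir and mapred.local.dir and are an assumption here, not text from the original article:
<property>
<name>mapreduce.jobtracker.system.dir</name>
<value>file:/home/hadoop/workspace/hadoop_space/hadoop23/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>file:/home/hadoop/workspace/hadoop_space/hadoop23/mapred/local</value>
<final>true</final>
</property>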
(5) Edit yarn-site.xml in the etc/hadoop directory; a minimal sketch follows.
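The article does not reproduce yarn-site.xml, so the following is only a minimal sketch based on the standard single-node setup of the 2.0 line (the aux-services value was still written mapreduce.shuffle in 2.0.0-alpha; later releases use mapreduce_shuffle):
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>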
(6) Create hadoop-env.sh in the etc/hadoop directory and add the required environment variables; a yarn-env.sh is also needed. A sketch of both is given below.
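The article does not show the contents of these two scripts; at minimum both usually need JAVA_HOME, so a sketch might look like this (the JDK path is an example only and must match your own installation):
# etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
# etc/hadoop/yarn-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk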
5. Detailed Description of the Configuration
1: gedit /etc/profile
export HADOOP_PREFIX="/home/hadoop"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
2: config
----- gedit etc/hadoop/core-site.xml
<configuration>
<property>
<name>io.native.lib.available</name>
<value>true</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>
</configuration>
----- gedit etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/workspace/hadoop_space/hadoop23/dfs/name</value>
<description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.</description>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/workspace/hadoop_space/hadoop23/dfs/data</value>
<description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
</description>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
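With dfs.namenode.name.dir set as above, the name directory is normally formatted once before HDFS is started for the first time. This step is not spelled out in the article, but would look like:
hdfs namenode -format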
----- gedit etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.job.tracker</name>
<value>hdfs://127.0.0.1:9001</value>
<final>true</final>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024M</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560M</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<