Downloading and installing Hadoop


 

Hadoop can be downloaded from one of the Apache download mirrors. You may
also download a nightly build, or check out the code from Subversion and
build it with Ant. Select a directory to install Hadoop under (let's say
/foo/bar/hadoop-install) and untar the tarball in that directory. A
directory corresponding to the version of Hadoop downloaded will be created
under the /foo/bar/hadoop-install directory. For instance, if version 0.6.0
of Hadoop was downloaded, untarring as described above will create the
directory /foo/bar/hadoop-install/hadoop-0.6.0. The examples in this
document assume the existence of an environment variable $HADOOP_INSTALL
that represents the path to all versions of Hadoop installed. In the above
instance, HADOOP_INSTALL=/foo/bar/hadoop-install. They further assume the
existence of a symlink named hadoop in $HADOOP_INSTALL that points to the
version of Hadoop being used. For instance, if version 0.6.0 is being used,
then $HADOOP_INSTALL/hadoop -> hadoop-0.6.0. All tools used to run Hadoop
will be present in the directory $HADOOP_INSTALL/hadoop/bin. All
configuration files for Hadoop will be present in the directory
$HADOOP_INSTALL/hadoop/conf.
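
As a concrete sketch of the layout just described (the tarball name matches the 0.6.0 example above; substitute the release and install directory you actually use):

# assumes hadoop-0.6.0.tar.gz has already been fetched from an Apache mirror
% mkdir -p /foo/bar/hadoop-install
% tar xzf hadoop-0.6.0.tar.gz -C /foo/bar/hadoop-install
# the "hadoop" symlink points at the version in use
% cd /foo/bar/hadoop-install
% ln -s hadoop-0.6.0 hadoop
# environment variable assumed throughout this document
% export HADOOP_INSTALL=/foo/bar/hadoop-install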

 

Startup scripts

 

The $HADOOP_INSTALL/hadoop/bin directory contains some scripts used to launch the Hadoop DFS and Hadoop Map/Reduce daemons. These are:

  • start-all.sh
    - Starts all Hadoop daemons: the namenode, datanodes, the jobtracker and tasktrackers.

  • stop-all.sh
    - Stops all Hadoop daemons.

  • start-mapred.sh
    - Starts the Hadoop Map/Reduce daemons: the jobtracker and tasktrackers.

  • stop-mapred.sh
    - Stops the Hadoop Map/Reduce daemons.

  • start-dfs.sh
    - Starts the Hadoop DFS daemons: the namenode and datanodes.

  • stop-dfs.sh
    - Stops the Hadoop DFS daemons.
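
For example, the DFS and Map/Reduce daemons can be brought up and taken down independently of each other; the same commands reappear in the cluster sections below:

# start HDFS first, then Map/Reduce; stop in the reverse order
% $HADOOP_INSTALL/hadoop/bin/start-dfs.sh
% $HADOOP_INSTALL/hadoop/bin/start-mapred.sh
% $HADOOP_INSTALL/hadoop/bin/stop-mapred.sh
% $HADOOP_INSTALL/hadoop/bin/stop-dfs.sh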

It is also possible to run the Hadoop daemons as Windows services using the Java Service Wrapper
(downloaded separately). This still requires Cygwin to be installed,
as Hadoop requires its df command. See the following JIRA issues for
details:

  • https://issues.apache.org/jira/browse/HADOOP-1525

  • https://issues.apache.org/jira/browse/HADOOP-1526

 

Configuration Files

 

The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop. These are:

  • hadoop-env.sh
    - This file contains some environment variable settings used by Hadoop.
    You can use these to affect some aspects of Hadoop daemon behavior,
    such as where log files are stored, the maximum amount of heap used,
    etc. The only variable you should need to change in this file is JAVA_HOME,
    which specifies the path to the Java 1.5.x installation used by Hadoop.
    (A short sketch of editing this file appears after this list.)

  • slaves
    - This file lists the hosts, one per line, where the Hadoop slave
    daemons (datanodes and tasktrackers) will run. By default this contains
    the single entry localhost.

  • hadoop-default.xml
    - This file contains generic default settings for Hadoop daemons and Map/Reduce jobs. Do not modify this file.

  • mapred-default.xml
    - This file contains site-specific settings for the Hadoop Map/Reduce
    daemons and jobs. The file is empty by default. Putting configuration
    properties in this file will override Map/Reduce settings in the
    hadoop-default.xml file. Use this file to tailor the behavior of
    Map/Reduce on your site.

  • hadoop-site.xml
    - This file contains site-specific settings for all Hadoop daemons and
    Map/Reduce jobs. This file is empty by default. Settings in this file
    override those in hadoop-default.xml and mapred-default.xml.
    This file should contain settings that must be respected by all servers
    and clients in a Hadoop installation, for instance, the location of
    the namenode and the jobtracker.
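
As a minimal sketch of the two files you usually touch first (the JAVA_HOME path below is only an assumption; point it at the Java installation you actually use):

# set JAVA_HOME in hadoop-env.sh (the path shown is illustrative)
% echo 'export JAVA_HOME=/usr/lib/jvm/java-1.5.0' >> $HADOOP_INSTALL/hadoop/conf/hadoop-env.sh
# the slaves file lists one host per line; a single-node setup keeps the default
% cat $HADOOP_INSTALL/hadoop/conf/slaves
localhost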

More details on configuration can be found on the HowToConfigure page.

 

Setting up Hadoop on a Single Node

 

This section describes how to get started by setting up a Hadoop cluster
on a single node. The setup described here is an HDFS instance with a
namenode and a single datanode, and a Map/Reduce cluster with a jobtracker
and a single tasktracker. The configuration procedures described in Basic
Configuration are just as applicable for larger clusters.

 

Basic Configuration

 

Take a pass at putting together basic configuration settings for your
cluster. Some of the settings that follow are required; others are
recommended for more straightforward and predictable operation.

  • Hadoop Environment Settings
    - Ensure that JAVA_HOME is set in hadoop-env.sh and points to the Java
    installation you intend to use. You can set other environment variables
    in hadoop-env.sh to suit your requirements. Some of the default settings
    refer to the variable HADOOP_HOME. The value of HADOOP_HOME is
    automatically inferred from the location of the startup scripts;
    HADOOP_HOME is the parent directory of the bin directory that holds the
    Hadoop scripts. In this instance it is $HADOOP_INSTALL/hadoop.

  • Jobtracker and namenode settings
    - Figure out where to run your namenode and jobtracker. Set the variable
    fs.default.name to the namenode's intended host:port. Set the variable
    mapred.job.tracker to the jobtracker's intended host:port. These settings
    should be in hadoop-site.xml. You may also want to set one or more of the
    following ports (also in hadoop-site.xml):

    • dfs.datanode.port

    • dfs.info.port

    • mapred.job.tracker.info.port

    • mapred.task.tracker.output.port

    • mapred.task.tracker.report.port

  • Data path settings
    - Figure out where your data goes. This includes settings for where the
    namenode stores the namespace checkpoint and the edits log, where the
    datanodes store filesystem blocks, storage locations for Map/Reduce
    intermediate output and temporary storage for the HDFS client.
    Default values for these paths point to various locations in /tmp.
    While this might be OK for a single-node installation, for larger
    clusters storing data in /tmp is not an option. These settings must also
    be in hadoop-site.xml. It is important for these settings to be present
    in hadoop-site.xml because they can otherwise be overridden by client
    configuration settings in Map/Reduce jobs. Set the following variables
    to appropriate values (a sketch of creating these directories follows
    the example hadoop-site.xml below):

    • dfs.name.dir

    • dfs.data.dir

    • dfs.client.buffer.dir

    • mapred.local.dir

An example hadoop-site.xml file:

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:54311</value>
</property>
<property>
<name>dfs.replication</name>
<value>8</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>
</configuration>
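
Once you have decided where the data should live, the directories named by the data path properties can be created up front. The paths below are purely illustrative; the dfs.name.dir directory in particular must exist before formatting, as described in the next section:

# illustrative paths only; substitute locations on your own disks
% mkdir -p /foo/bar/hdfs/name      # dfs.name.dir
% mkdir -p /foo/bar/hdfs/data      # dfs.data.dir
% mkdir -p /foo/bar/hdfs/client    # dfs.client.buffer.dir
% mkdir -p /foo/bar/mapred/local   # mapred.local.dir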

 

 

Formatting the namenode

 

The first step to starting up your Hadoop installation is formatting the
Hadoop filesystem, which is implemented on top of the local filesystems of
your cluster. You need to do this the first time you set up a Hadoop
installation. Do not format a running Hadoop filesystem; this will cause
all your data to be erased. Before formatting, ensure that the dfs.name.dir
directory exists. If you just used the default, then mkdir -p /tmp/hadoop-username/dfs/name
will create the directory. To format the filesystem (which simply
initializes the directory specified by the dfs.name.dir variable), run the command:
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format

 

Starting a single node cluster

 

Run the command:
% $HADOOP_INSTALL/hadoop/bin/start-all.sh

This will start up a namenode, datanode, jobtracker and tasktracker on your machine.
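
To check that the daemons came up, you can list the running Java processes with the JDK's jps tool; the process names below are typical for this generation of Hadoop but may differ between releases:

% jps
# expect entries such as NameNode, DataNode, JobTracker and TaskTracker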

 

Stopping a single node cluster

 

Run the command
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh

to stop all the daemons running on your machine.

 

Separating configuration from installation

 

In the example described above, the configuration files used by the Hadoop
cluster all lie in the Hadoop installation. This can become cumbersome when
upgrading to a new release, since all custom config has to be re-created in
the new installation. It is possible to separate the config from the
install. To do so, select a directory to house the Hadoop configuration
(let's say /foo/bar/hadoop-config) and copy all conf files to this
directory. You can either set the HADOOP_CONF_DIR environment variable to
refer to this directory or pass it directly to the Hadoop scripts with the
--config option. In this case, the cluster start and stop commands
specified in the above two sub-sections become
% $HADOOP_INSTALL/hadoop/bin/start-all.sh --config /foo/bar/hadoop-config
and
% $HADOOP_INSTALL/hadoop/bin/stop-all.sh --config /foo/bar/hadoop-config
respectively.
Only the absolute path to the config directory should be passed to the scripts.
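
A minimal sketch of this separation, reusing the directory name from the text above:

% mkdir -p /foo/bar/hadoop-config
% cp $HADOOP_INSTALL/hadoop/conf/* /foo/bar/hadoop-config
# either export the variable once ...
% export HADOOP_CONF_DIR=/foo/bar/hadoop-config
% $HADOOP_INSTALL/hadoop/bin/start-all.sh
# ... or pass the absolute path explicitly
% $HADOOP_INSTALL/hadoop/bin/start-all.sh --config /foo/bar/hadoop-config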

 

Starting up a larger cluster

 

  • Ensure
    that the Hadoop package is accessible from the same path on all nodes
    that are to be included in the cluster. If you have separated
    configuration from the install, then ensure that the config directory is
    also accessible the same way.
  • Populate the slaves
    file with the nodes to be included in the cluster, one node per line.

  • Follow the steps in the Basic Configuration
    section above.

  • Format the namenode.
  • Run the command % $HADOOP_INSTALL/hadoop/bin/start-dfs.sh
    on the node you want the namenode to run on. This will bring up HDFS
    with the namenode running on the machine you ran the command on and
    datanodes on the machines listed in the slaves file mentioned above.

  • Run the command % $HADOOP_INSTALL/hadoop/bin/start-mapred.sh
    on the machine you plan to run the jobtracker on. This will bring up
    the Map/Reduce cluster with the jobtracker running on the machine you
    ran the command on and tasktrackers running on machines listed in the
    slaves file.

  • The above two commands can also be executed with the --config
    option (see the consolidated sketch after this list).
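
A consolidated sketch of the steps above; namenode-host and jobtracker-host are hypothetical names for the machines you chose, and the slaves file is assumed to be populated already:

# on namenode-host: format once, then bring up HDFS
# (datanodes start on the hosts listed in the slaves file)
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format
% $HADOOP_INSTALL/hadoop/bin/start-dfs.sh
# on jobtracker-host: bring up Map/Reduce
# (tasktrackers start on the hosts listed in the slaves file)
% $HADOOP_INSTALL/hadoop/bin/start-mapred.sh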

 

Stopping the Cluster

The cluster can be stopped by running % $HADOOP_INSTALL/hadoop/bin/stop-mapred.sh
and then % $HADOOP_INSTALL/hadoop/bin/stop-dfs.sh
on your jobtracker and namenode respectively. These commands also accept the --config option.
