Running Hadoop under Windows

Source: Internet
Author: User
Tags: temporary file storage, xsl, rsync, ssh access, hadoop ecosystem

There are generally two ways to run Hadoop under Windows: install a Linux operating system in a virtual machine, which lets Hadoop run in a full Linux environment, or emulate a Linux environment with Cygwin. The advantage of the latter is that it is easy to use and the installation process is simple. Let's look at the second approach: how to quickly set up a Hadoop environment under Windows and study and tune Hadoop code together with the Eclipse development environment.

The entire installation process consists of the following three main steps:

    1. Installing and configuring Cygwin (http://cygwin.com/install.html)
    2. Installing and configuring Hadoop-1.2.1 (http://hadoop.apache.org/docs/stable/cluster_setup.html)
    3. Installing and configuring the Eclipse development environment
1 Installing and Configuring Cygwin

Installing Cygwin to simulate a Linux environment and then installing Hadoop on top of it is a simple and convenient approach. The steps for installing the simulated Linux environment are as follows:

1.1 Download the installation file

Download the appropriate installation files for different system types: http://cygwin.com/install.html.

My system here is Windows 7, so the file to download is Setup-x86.exe.

1.2 Installing Cygwin

The file you just downloaded is a package download and management tool that resolves dependencies the way a Linux system does; later, whenever you want to install or update software in the simulated Linux environment, you use this tool. Run it as follows:

    1. Left-click the Setup-x86.exe file to run the Setup wizard:

Cygwin installation

    1. Click "Next" button to enter the program boot installation page, there are three options, select the first network installation:
      • Network installation: Download and install packages over the network
      • Download but not install: Download packages over the network
      • Local installation: Is installed with a local package

Cygwin installation

    1. Click "Next" to enter the root directory of the selected emulated Linux system and the User's wizard page. In the Linux file system there is only one root directory, here to choose the directory is the root directory in Linux, here choose the default: C:\cygwin; The user selects the first item: All valid users of the system.

Cygwin installation

    1. Click "Next" to select the local package directory, the tool will automatically remember and will be downloaded in the future all the packages are placed in the directory specified here. I choose here: C:\Users\Administrator\Desktop\1, if you choose not to have a directory, it is good to prompt to create a directory select Yes OK.

Cygwin installation

    1. Click "Next" to select your network connection, I use the proxy server to surf the Internet, so I choose the second item: Use IE browser proxy settings. Tested Select the third entry proxy server address and port, unable to access the network normally, for unknown reasons.

Cygwin installation

    1. Click "Next", wait to download the list of mirror sites, after the download is complete, select the site to download the package.

Cygwin installation

    7. Choose a mirror that suits your situation; I selected the domestic 163 site. Click "Next" and the tool automatically downloads the package information list and then shows the package selection page:

Cygwin installation

    8. This step is important: make sure the following packages are installed:

Cygwin installation

Note: The package list shows the following columns: category, currently installed version, latest version, whether to install the binaries, whether to install the source code, size, and package name with description.

      • Base packages: base and everything under it. How: click "Default" after Base until it reads "Install".
      • SSH-related packages: openssl and openssh, because Hadoop requires SSH access. How: click "+" to expand the Net node, then click the column in front of each of these packages until a version number is shown, which marks it for installation.
      • Other packages can be selected according to your own needs; I also chose Emacs, Vim, Perl, Python, Ruby, Science, Subversion and other common tools.
    9. After selecting the packages, click "Next" to start the automatic download and installation:

      Cygwin installation

    1. Click "Next" to go to the end page of the wizard, tick Create desktop shortcut click "Finish",

      Cygwin installation

At this point you have completed the installation of the simulated Linux environment. Double-click the icon on the desktop to open the simulated Linux terminal and try a few common Linux commands to get a feel for the system. Besides common Linux commands, you can also execute Windows commands such as net start service_name. Once you are done exploring, continue with the configuration work below.
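
For instance, a few commands you might try in the Cygwin terminal (a minimal sketch; the exact output depends on your machine):

uname -a          # kernel and platform string reported by Cygwin
pwd               # current directory in the simulated file system
ls /              # list the simulated Linux root directory
net start         # a Windows command also works from this terminal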

1.3 Configuring the Cygwin SSH Service

After the Cygwin installation is complete, the SSH service needs to be configured so that Hadoop can log in over SSH without a password. The process is as follows:

Open a terminal that simulates Linux and enter the Linux environment

Execute the command: ssh-host-config

Hadoop Installation

First question: "Should privilege separation be used? (yes/no)"; enter no and press Enter.

Second question: "Do you want to install sshd as a service? (yes/no)"; enter yes and press Enter.

Third prompt: "Enter the value of CYGWIN for the daemon: []"; just press Enter.

Fourth question: "Do you want to use a different name? (yes/no)"; enter no and press Enter.

Fifth prompt: "Please enter the password for user 'cyg_server':"; enter a password here and confirm it.

Finally, the configuration is complete.

1.4 Starting the SSH service

Execute net start sshd or cygrunsrv -S sshd in the Linux terminal or at the Windows command line to start the SSH service.

Test SSH Login to this machine:

Execute the command in the terminal: ssh localhost

When prompted for a password, enter it, as shown:

Hadoop installation

1.5 Configuring SSH password-free login

Execute the command in the terminal: ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa to generate the key pair.

Execute the command: cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys to generate the authorized keys file.

Execute the command: ssh localhost to test whether you can now log in without entering a password.

Hadoop installation

1.6 Cygwin usage tips

1.6.1 Accessing a Windows disk in Cygwin

cd /cygdrive/c
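
Every Windows drive is mounted under /cygdrive, so for example (a quick sketch; the drive letters and directories depend on your machine):

ls /cygdrive                # show all mounted Windows drives (c, d, ...)
cd /cygdrive/c/Windows      # change into C:\Windows
ls -l | head                # list a few entries there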

1.6.2 Integrating Cygwin commands into Windows

Assuming Cygwin is installed in D:/develop/cygwin, add D:/develop/cygwin/bin to the system PATH variable (preferably in front of the Windows directories, so that commands with the same name, such as find, run the Cygwin version first instead of the Windows one).

Once added, you can execute tar czvf xxx.tgz directly in cmd.exe.

Basically all the commands are available, including ls, more, less, find, grep and so on.
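
With the Cygwin bin directory on PATH, a cmd.exe session might look roughly like this (a sketch; the file names are hypothetical):

C:\> ls -l
C:\> grep -i "error" app.log
C:\> find . -name "*.java"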

1.6.3 Using tgz backups

Add the Cygwin bin to the path

Build a BAT file:

@echo off
D:
cd D:/website/8thmanage
tar czvf 8thmanage.tgz 8thmanage

1.6.4 Using shell scripts from Windows

Add the Cygwin bin to the path

Create a script t.sh in the var/ directory under $CYGWIN. Note that paths inside t.sh are interpreted relative to the Cygwin root; if the script needs to access the C drive, use /cygdrive/c/.

Execute from Windows:

D:/cygwin/bin/bash d:/cygwin/var/t.sh

(This can be scheduled to run periodically.)
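
As an illustration, a hypothetical t.sh that backs up a Windows folder to a dated tarball might look like this (all paths here are examples only):

#!/bin/bash
# back up a Windows folder to a dated tarball, addressing the C and D drives via /cygdrive
SRC=/cygdrive/c/website/8thmanage
DEST=/cygdrive/d/backup
mkdir -p "$DEST"
tar czvf "$DEST/8thmanage-$(date +%Y%m%d).tgz" -C "$(dirname "$SRC")" "$(basename "$SRC")"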

1.6.5 Synchronizing Windows system users

mkpasswd -l > /etc/passwd

mkgroup -l > /etc/group

If you are in a domain, you also need to add -D domainname.

1.6.6 Installing system services

cygrunsrv
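
cygrunsrv installs any Cygwin program as a Windows service. A minimal sketch of its common operations (the service name "myservice" is only an example):

cygrunsrv -I myservice -p /usr/sbin/sshd    # install a program as the Windows service "myservice"
cygrunsrv -S myservice                      # start the service
cygrunsrv -E myservice                      # stop the service
cygrunsrv -R myservice                      # remove the service
cygrunsrv -L                                # list services installed through cygrunsrv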

1.6.7 Using rsync under Cygwin
    1. Install the rsync package.
    2. Enter Cygwin and configure the server side:

vi /etc/rsyncd.conf

... secrets file = /etc/tom.ipaddr.pas

For the contents of the configuration file, refer to the other article I wrote about rsync. Note: the password file's permissions must be 0400:

chmod 0400 /etc/tom.ipaddr.pas

    3. Start the server side:

rsync --daemon

    4. Synchronize from the client:

Under the client's Cygwin, run rsync to synchronize; for the exact commands, please refer to the other rsync article.
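
Purely for illustration (the module name, user, address and paths below are hypothetical, since the referenced article is not reproduced here), a server module and a matching client command might look roughly like this:

# /etc/rsyncd.conf on the server
use chroot = false
[web]
    path = /cygdrive/d/website
    auth users = tom
    secrets file = /etc/tom.ipaddr.pas
    read only = false

# on the client: pull the "web" module from the server
rsync -avz --password-file=/etc/tom.pas tom@192.168.1.10::web /cygdrive/d/backup/web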

1.6.8 Using sshd under Cygwin
    1. You need to have installed cygrunsrv and openssh.
    2. Run ssh-host-config -y.

When prompted for the value of CYGWIN, enter ntsec tty and press Enter.

(Alternatively, add a system environment variable CYGWIN=ntsec tty.)

    3. The sshd service is now installed as a Windows service and can be started and stopped directly from the Services panel.

(cygrunsrv -S sshd or net start sshd)

1.6.9 Chinese Display

vi ~/.bashrc

# let ls and dir display Chinese characters and colors
alias ls='ls --show-control-chars --color'
alias dir='dir -n --color'

# set a Chinese locale so that prompts appear in Chinese
export LANG="zh_CN.GBK"

# output in Chinese encoding
export OUTPUT_CHARSET="GBK"

In ~/.inputrc:

set completion-ignore-case on
set meta-flag on
set output-meta on
set convert-meta off

The Cygwin.bat script is:

@echo off
set MAKE_MODE=unix

2 Installing and Configuring Hadoop-1.2.1

2.1 Installing the JDK

JDK download: http://www.oracle.com/technetwork/java/javase/downloads/index.html

Note in particular that under Linux, paths and commands are strictly case-sensitive, and directories whose names contain spaces must be wrapped in double quotation marks (""). It is also recommended to put the JDK directly in the root of a drive rather than in the default Program Files directory.

I did not download the latest JDK here; instead I dug out a long-unused 32-bit Windows jdk1.6.0_14 on my work machine, dropped the jdk1.6.0_14 folder directly into the root of the C drive, and then configured the environment variables as follows:

JAVA_HOME=C:\jdk1.6.0_14

PATH=%JAVA_HOME%\bin;... Note: add %JAVA_HOME%\bin to the system PATH.

Open the Windows command line and run java -version; if it executes normally, the setup is OK.

I have also tried not configuring the Java environment variables under Windows, and Hadoop still works, because JAVA_HOME is also set in the Hadoop run script.

Hadoop installation

2.2 Download the latest stable version of Hadoop

Download: http://hadoop.apache.org/releases.html#Download

I downloaded the latest stable version: hadoop-1.2.1-bin.tar.gz

2.3 Planning the Hadoop directory

In the Hadoop ecosystem there are many tools you may use over time, such as developing MapReduce jobs, deploying them, and upgrading Hadoop itself, so it is worth planning the Hadoop installation directory in advance.

The Windows machine I installed Hadoop on is a virtual machine with only one partition, so the hadoop folder is placed in the root of the C drive. Here is my directory structure:

Hadoop installation

The hadoop directory sits in the root of the C drive and contains code (for source code), deploy (installation files for Hadoop and its ecosystem), and sysdata (DFS data, SecondaryNameNode checkpoint data, and the temporary file storage directory used at runtime).
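
Laid out as a tree, the plan looks roughly like this (the sub-directory names follow the description above):

C:\hadoop
    code       source code under development
    deploy     Hadoop and ecosystem installation files
    sysdata    DFS data, SecondaryNameNode data, runtime temporary files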

Extract the downloaded hadoop-1.2.1-bin.tar.gz to the directory c:\hadoop\deploy\hadoop-1.2.1
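
From the Cygwin terminal, the extraction can be done roughly as follows (the archive is assumed to be on the Administrator desktop used earlier; adjust the path to wherever you saved it):

cd /cygdrive/c/hadoop/deploy
tar -xzvf /cygdrive/c/Users/Administrator/Desktop/hadoop-1.2.1-bin.tar.gz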

2.4 Adding Hadoop Basic configuration

The complete Hadoop configuration is covered in the official documentation; here I have only set some basic options, for reference:

Note: Because my Hadoop is installed on the C drive, the root of the configuration paths in the XML configuration files below is the root of the C drive. For example, /hadoop/sysdata actually means C:\hadoop\sysdata. The root of the configuration paths in the shell scripts, however, is the root of the simulated Linux system. For example, /cygdrive/c/jdk1.6.0_14 points to C:\jdk1.6.0_14.

2.4.1 conf/hadoop-env.sh

Add JAVA_HOME:

export JAVA_HOME=/cygdrive/c/jdk1.6.0_14

2.4.2 conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://172.16.128.239:9001</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/sysdata/</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.checkpoint.dir</name>
<value>/hadoop/sysdata/namesecondary/</value>
</property>

<property>
<name>dfs.web.ugi</name>
<value>lg,lg</value>
</property>

<property>
<name>fs.checkpoint.period</name>
<value>3600</value>
<description>Set to 1 hour by default; specifies the maximum delay between two consecutive checkpoints.</description>
</property>

<property>
<name>fs.checkpoint.size</name>
<value>67108864</value>
<description>Set to 64MB by default; defines the size of the edits log file that forces an urgent checkpoint even if the maximum checkpoint delay has not been reached.</description>
</property>

</configuration>

2.4.3 hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
<description>Upper limit on the number of files a DataNode serves at the same time.</description>
</property>

<!--property>
<name>dfs.http.address</name>
<value>0.0.0.0:50070</value>
</property-->

</configuration>

2.4.4 mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>mapred.job.tracker</name>
<value>172.16.128.239:9002</value>
</property>

</configuration>

2.4.5 masters

localhost

2.4.6 slaves

localhost

2.5 Formatting the NameNode

Once the configuration files are done, we can format the NameNode, which is the first step before using Hadoop, much like formatting a hard drive before use. Open the Cygwin terminal and execute the following commands:

cd /cygdrive/c/hadoop/deploy/hadoop-1.2.1/bin

./hadoop namenode -format

Hadoop installation

2.6 Starting and testing Hadoop

After formatting the NameNode, we can execute the start-all.sh command to start Hadoop, as follows:

Execute: ./start-all.sh

Hadoop installation

Open a browser and visit the HDFS monitoring interface:

http://localhost:50070

Hadoop installation
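
To confirm that HDFS is really working, you can also run a few file system commands from the same bin directory (a quick sketch; the directory name is just an example):

./hadoop fs -mkdir /test       # create a directory in HDFS
./hadoop fs -ls /              # list the HDFS root; /test should appear
./hadoop dfsadmin -report      # show live DataNodes and capacity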

2.8 Hadoop usage tips

2.8.1 Accessing a remote NameNode:

./hadoop dfs -fs 172.16.128.239:9001 -ls /user/lg/event_videos/2013/09/

2.8.2 HDFS permission-related notes
    1. The superuser

The superuser is the user who runs the NameNode process. Broadly speaking, if you started the NameNode, you are the superuser. The superuser can do anything, because permission checks always succeed for the superuser. There is no permanent record of who has ever been the superuser; when the NameNode starts, the process determines who the superuser is right now. The HDFS superuser does not have to be the superuser of the NameNode host, nor do all clusters need to have the same superuser. Likewise, someone experimenting with HDFS on a personal workstation becomes, without any configuration, the superuser of that deployment instance. In addition, the administrator can use a configuration parameter to designate a group whose members are also superusers.

Web Server users

The identity of the web server is a configurable parameter. The NameNode has no notion of a real user for it, but the web server behaves as if it had the identity (user name and group) chosen by the administrator. Unless the chosen identity is the superuser, part of the namespace may be invisible to the web server.

    2. Configuration parameters

dfs.permissions = true

If true, permission checking is turned on. If false, permission checking is turned off, but other behavior does not change. Changing this parameter does not change the mode, owner, or group of any file or directory.

chmod, chgrp and chown always check permissions, regardless of whether permission checking is on or off. These commands are only meaningful in the context of permission checking, so there is no compatibility issue. This allows an administrator to reliably set the owners and permissions of files before turning general permission checking on.

dfs.web.ugi = webuser,webgroup

The user name used by the web server. If you set this parameter to the name of the superuser, all web clients can see all information. If you set it to an otherwise unused user, web clients can only access resources reachable through the "other" permission bits. Additional groups may be appended to form a comma-separated list.

dfs.permissions.supergroup = supergroup

The group name of the superusers.

dfs.upgrade.permission = 777

The initial mode used during an upgrade. Files are never given the x (execute) permission. In the configuration file you may also write the decimal number 511.

dfs.umask = 022

The umask used when creating files and directories. In the configuration file you may also write the decimal number 18.

    3. Default user and group when creating a directory

The owner of a newly created directory defaults to the creating user, i.e. the name that the whoami command would report on a Unix-like system, and its group defaults to the group of its parent directory, as the short sketch below illustrates.
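
A minimal sketch of these rules using the HDFS shell, run from the Hadoop bin directory (the directory name, user and group here are hypothetical):

./hadoop fs -mkdir /reports                  # owner = the user running the command, group = group of /
./hadoop fs -ls /                            # the listing shows the owner and group of /reports
./hadoop fs -chown alice:staff /reports      # chown/chgrp/chmod always apply, even with dfs.permissions=false
./hadoop fs -chmod 750 /reports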
