Objective: use Eclipse on the local machine (Win7) to operate Hadoop on a virtual machine (Red Hat) for learning and experimentation.
General workflow of the Hadoop installation:
1. Set up password-less SSH authentication in Linux.
2. Install the JDK in Linux and configure the environment variables.
3. Change the Linux machine name and configure /etc/hosts.
4. Download Hadoop 0.20.2 in Windows and modify the configuration in hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves.
5. Upload the entire modified Hadoop folder to Linux.
6. Add the Hadoop bin directory to the PATH environment variable.
7. Format Hadoop and start it.
8. Verify that everything has started and run the wordcount example.
Detailed steps:
1. Set up password-less SSH authentication in Linux
On the Linux command line, enter ssh-keygen -t rsa and press Enter:
ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/zhangtao/.ssh/id_rsa): // key storage location; press Enter to keep the default. Created directory '/home/zhangtao/.ssh'.
Enter passphrase (empty for no passphrase): // set a key passphrase; press Enter for an empty one. Enter same passphrase again: // confirm the passphrase entered in the previous step.
Go to the /root/.ssh/ directory and you will see two files: id_rsa.pub and id_rsa.
Run cp id_rsa.pub authorized_keys
Then run ssh localhost to verify that it works. Type yes the first time; after that, no prompt or password is needed.
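Put together, the whole setup can be done in one short sequence (a minimal sketch of the steps above; it assumes the key is generated for the current user and that sshd is already running):

# generate an RSA key pair with an empty passphrase at the default location
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# authorize the new public key for password-less login to this machine
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
# verify: this should log in without asking for a password
ssh localhost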
2. Install the JDK and configure the environment variables
This was already covered in a previous blog post.
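For completeness, the /etc/profile additions typically look like the following (a sketch only; it assumes the JDK is installed under /usr/java/jdk1.6.0_03, the same path used later in hadoop-env.sh):

# JDK location (adjust to your installation directory)
export JAVA_HOME=/usr/java/jdk1.6.0_03
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

After saving, run source /etc/profile and check with java -version.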
3. Change the Linux machine name
First, how to view the machine name in Linux: enter hostname on the command line and press Enter, and the current machine name is displayed:
[root@hadoopName ~]# hostname
hadoopName
[root@hadoopName ~]#
As you can see, the command-line prompt is [root@hadoopName ~]: the part after the @ symbol is the machine name, and the part before it is the current user name. Now, how to change the machine name in Red Hat Linux. The method below applies only to Red Hat; other distributions change the machine name differently.
1. Run cd /etc/sysconfig to enter the /etc/sysconfig directory.
2. Run vi network and edit the network file:
NETWORKING=yes
HOSTNAME=hadoopName
Change HOSTNAME to the machine name you want. I changed it to hadoopName, then saved the file.
3. Run cd /etc to enter the /etc directory.
4. Run vi hosts and edit the hosts file:
# Do not remove the following line, or various programs
# that require network functionality will fail.
192.168.125.131 hadoopName
127.0.0.1 localhost.localdomain localhost
Only the line with the IP address and hostname is added to the default file contents. The first field is the machine's own IP address, and the second is the hostname set in the network file above. A lot of material online says that the IP address and hostname of every machine in the cluster should be added to hosts when installing Hadoop; that is correct, but since this is a single machine, I only need to add my own. (A quick way to verify is shown after this list.)
5. After making the change, run the hostname command to view the new machine name (a reboot may be required).
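A quick check of the change (a small sketch; the hostname and IP are the ones used in this walkthrough):

hostname              # should print hadoopName
ping -c 1 hadoopName  # should resolve to 192.168.125.131 via /etc/hosts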
4. Download Hadoop 0.20.2 in Windows and modify the configuration in hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves
These are the most important operations: downloading Hadoop and modifying the configuration files.
Download Hadoop 0.20.2; the downloaded file is hadoop-0.20.2.tar.gz. Decompress it.
After decompression, enter the conf directory and modify the hadoop-env.sh file by adding the following line:
export JAVA_HOME=/usr/java/jdk1.6.0_03
In fact, this line already exists in hadoop-env.sh; it is just commented out by default. Simply remove the comment and change JAVA_HOME to your Java installation directory.
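For reference, the relevant lines in conf/hadoop-env.sh before and after the edit look roughly like this (a sketch; the commented-out default path may differ slightly between releases, and the JDK path is the one used in this walkthrough):

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/java/jdk1.6.0_03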
It should be noted that before version 0.20.0, conf contained a single hadoop-site.xml file. From 0.20.0 onward there is no hadoop-site.xml; it is replaced by three files: core-site.xml, hdfs-site.xml, and mapred-site.xml. The modifications to the three files follow. First, core-site.xml.
The default core-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
Change it to the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/hadooptmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.133.128:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
Modify hdfs-site.xml.
The default hdfs-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
Change it to the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
</configuration>
Modify mapred-site.xml.
The default mapred-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>
Change it to the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.133.128:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>
After modifying these three files, a few important points:
1. core-site.xml corresponds to a core-default.xml, hdfs-site.xml to an hdfs-default.xml, and mapred-site.xml to a mapred-default.xml. The three default files hold the default configuration; the three site files we just edited override selected entries in the defaults.
2. The two important directory structures of the Hadoop distributed file system, where the namenode stores the namespace and where the datanode stores data blocks (along with other storage locations), are all based on hadoop.tmp.dir. For example, the namenode's namespace is stored in ${hadoop.tmp.dir}/dfs/name and the datanode's data blocks in ${hadoop.tmp.dir}/dfs/data.
Therefore, once hadoop.tmp.dir is set, the other important directories sit under it; it acts as a root directory. I set it to /usr/local/hadoop/hadooptmp; of course, this directory must exist.
3. fs.default.name specifies the machine the namenode runs on and its port: hdfs://192.168.133.128:9000. It must be written exactly in this format. Many documents online say the IP address can also be written as localhost, but I suggest writing the IP address, because later, when connecting Eclipse on Windows to Hadoop, the connection will fail if localhost is used.
4. mapred.job.tracker specifies the machine the jobtracker runs on and its port: 192.168.133.128:9001. The format differs from the previous one (no hdfs:// scheme), and it must also be written this way. The same advice about localhost versus the IP address applies.
5. dfs.replication sets the number of data block replicas. The default is 3; since I have only one machine here, there can be only one replica, so I changed it to 1. Next, modify the masters and slaves files.
In the masters file, write the IP address of the machine in the cluster where the namenode runs: 192.168.133.128. Do not write localhost; if you do, Eclipse on Windows will not be able to connect to Hadoop.
In the slaves file, write the IP addresses of all datanode machines in the cluster. Because this is a single machine, it is again 192.168.133.128, and again it is best not to write localhost.
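For this single-node setup, each of the two files ends up containing just the one IP address (the same value used in the config files above):

conf/masters:
192.168.133.128

conf/slaves:
192.168.133.128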
5. Upload the entire modified Hadoop folder to Linux
After the files above are modified, copy the entire hadoop directory to Linux. Remember to create a directory for it first; the directory I created is /usr/local/hadoop. Copy the whole hadoop directory into it, so it looks like this:
[root@hadoopName hadoop]# cd /usr/local/hadoop
[root@hadoopName hadoop]# ls
hadoop-0.20.2 hadooptmp
/usr/local/hadoop contains two entries: hadoop-0.20.2, the Hadoop root directory, and hadooptmp, the hadoop.tmp.dir directory described above.
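How the folder is copied over is up to you; one possibility (an illustrative sketch only, assuming the VM's SSH service is reachable from Windows, for example via scp in Cygwin or any SFTP client, and using the IP from the config files, so substitute your own):

# create the target directories on the Linux VM
ssh root@192.168.133.128 "mkdir -p /usr/local/hadoop/hadooptmp"
# copy the modified Hadoop tree from Windows into /usr/local/hadoop (run from the folder containing hadoop-0.20.2)
scp -r hadoop-0.20.2 root@192.168.133.128:/usr/local/hadoop/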
6. Add the Hadoop bin directory to the environment variables
Add the hadoop command to the PATH so that it can be executed directly from the command line.
The operation is the same as adding the Java bin directory to the environment variables.
1. Run cd /etc to enter the /etc directory.
2. Run vi profile and add the following line to the profile file:
export PATH=/usr/local/hadoop/hadoop-0.20.2/bin:$PATH
3. Run chmod +x profile to make profile executable.
4. Run source profile to apply the contents of profile.
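A quick check that the PATH change took effect (a small sketch; hadoop version is available in the 0.20.2 release):

source /etc/profile
hadoop version    # should print Hadoop 0.20.2 plus build information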
7. Format Hadoop and start Hadoop
Format Hadoop:
Run hadoop namenode -format on the command line.
Start Hadoop:
On the command line, run start-all.sh, or run start-dfs.sh followed by start-mapred.sh.
If a permission error occurs, grant execute permission to the affected file with chmod (e.g. chmod 777 <file name>).
Enter jps on the command line. If output like the following appears, the startup was successful.
If jps fails with "command not found", add the Java bin directory to the PATH environment variable.
[root@hadoopName ~]# jps
4505 NameNode
4692 SecondaryNameNode
4756 JobTracker
4905 Jps
4854 TaskTracker
4592 DataNode
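As an additional check, the 0.20.x daemons also expose web interfaces on their default ports (substitute your VM's IP address):

http://192.168.133.128:50070   # NameNode / HDFS status page
http://192.168.133.128:50030   # JobTracker / MapReduce status page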
After startup, a data directory is generated in the dfs folder under /usr/local/hadoop/hadooptmp; this is where the datanode's data blocks are stored. Because I am using a single machine, name and data end up on the same machine. In a cluster, the machine running the namenode has only the name folder, and the datanode machines have only the data folder.
Run hadoop fs -ls to view the file and directory structure of the current HDFS distributed file system.
To create a folder, run hadoop fs -mkdir testdir, then run hadoop fs -ls again; it shows /user/root/testdir with root as the owner, so the default working directory in HDFS for this user is /user/root. You could also attach any number of additional VMs as extra nodes.
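Finally, to run the wordcount example mentioned in the workflow, something along these lines should work (a sketch; the input file and output directory names are illustrative, and the examples jar name matches the 0.20.2 release):

cd /usr/local/hadoop/hadoop-0.20.2
hadoop fs -put README.txt testdir                          # upload any text file into the testdir folder created above
hadoop jar hadoop-0.20.2-examples.jar wordcount testdir wc-output
hadoop fs -cat wc-output/part-*                            # view the resulting word counts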