Hadoop single-machine pseudo-distributed deployment
Because we do not have many machines available, we can deploy a Hadoop cluster on a single virtual machine; this is called a pseudo-distributed cluster. In any case, the goal here is to record the Hadoop deployment process and the problems encountered, and then test the environment with a simple program.
1. Install Java, download the Hadoop package, and configure the Hadoop environment variables.
Set JAVA_HOME to the Java installation directory, and add the directory containing the Hadoop executables to the system PATH so that the hadoop command can be run directly from the shell. Hadoop 2.6.0 is used here.
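For example, with Hadoop unpacked under /home/bkjia/workplace/hadoop/hadoop-2.6.0, lines like the following could be added to ~/.bashrc (the Java path below is only illustrative; use the actual installation directory on your machine):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386   # illustrative path; adjust to your Java install
export HADOOP_HOME=/home/bkjia/workplace/hadoop/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin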
2. Set up SSH
SSH is needed because Hadoop starts the daemon processes on every machine in the slaves list over SSH. Although we call this a pseudo-distributed installation, Hadoop is still started as a cluster; it is just that all the "machines" in the cluster are the same machine. The default SSH port is 22, so you can check whether port 22 is listening to see whether sshd is installed and running. For Hadoop to be able to start processes over SSH, passwordless SSH login is required.
Running ssh bkjia@127.0.0.1 (which also confirms that the SSH server and client are installed on the local machine) currently looks like this:
bkjia@bkjia:~/workplace$ ssh bkjia@127.0.0.1
bkjia@127.0.0.1's password:
Welcome to Ubuntu 13.10 (GNU/Linux 3.11.0-12-generic i686)
* Documentation: https://help.ubuntu.com/
Last login: Mon Jan 19 15:03:01 2015 from localhost
That is, you have to enter the user's password every time. To configure passwordless login, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The first command generates a key pair: -t specifies the key type (DSA here), -P specifies the passphrase (empty here), and -f specifies the path of the generated key file. The second command appends the generated public key to the current host's authorized_keys file. After this, the ssh command above can connect to the host without asking for a password.
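If ssh still asks for a password after this, overly loose permissions on ~/.ssh are a common cause (sshd ignores keys in group- or world-writable files). This step is not in the original notes, but tightening the permissions and retrying usually helps:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ssh bkjia@127.0.0.1   # should now log in without a password prompt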
3. Configure Hadoop's environment file etc/hadoop/hadoop-env.sh
This is Hadoop's environment configuration file. In it, JAVA_HOME must be set to the Java installation directory.
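For example, the relevant line in etc/hadoop/hadoop-env.sh would look something like this (the path is only illustrative):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386   # illustrative path; use the real Java directory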
4. Configure the etc/hadoop/core-site.xml configuration file
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/bkjia/workplace/hadoop/data</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<host address>:9000</value>
  </property>
</configuration>
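For a pseudo-distributed setup where everything runs on one machine, the host address can simply be localhost (this concrete value is an example, not taken from the original configuration), i.e.:
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>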
5. Configure the MapReduce configuration file etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value><host address>:9001</value>
  </property>
</configuration>
6. Configure HDFS configuration file etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/bkjia/workplace/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/bkjia/workplace/hadoop/hdfs/data</value>
  </property>
</configuration>
7. Format the HDFS file system and start all modules.
hadoop namenode -format
This command formats the HDFS file system.
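In Hadoop 2.x this invocation still works but is reported as deprecated in favor of the hdfs script, so the same step can also be written as (an equivalent form, not the one used in the original run):
hdfs namenode -format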
Then execute ./sbin/start-all.sh. At this point a problem occurs, as shown below:
Starting namenodes on [Java HotSpot(TM) Client VM warning: You have loaded library /home/bkjia/workplace/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Some checking shows that this is caused by a platform mismatch: the downloaded Hadoop release ships 64-bit native libraries, but my machine is 32-bit. Therefore, Hadoop has to be compiled manually.
bkjia@bkjia-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ file /home/bkjia/workplace/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0
/home/bkjia/workplace/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=random, not stripped
bkjia@bkjia-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ uname -a
Linux bkjia.netease.com 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:12:00 UTC 2013 i686 i686 i686 GNU/Linux
Compiling Hadoop requires Maven, and the build also depends on protobuf, so protobuf must be downloaded and installed first. The protobuf version must be 2.5.0 or later, otherwise the compilation fails. The Maven version must be later than 3.0.2; you can simply download a Maven binary distribution to do the build.
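The version checks and the native build can be run with commands along these lines; the Maven profile flags below follow the usual Hadoop source-build conventions (BUILDING.txt) and are a sketch rather than a transcript of the original session:
protoc --version
mvn -version
cd /home/bkjia/workplace/hadoop/hadoop-2.6.0-src
mvn package -Pdist,native -DskipTests -Dtar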
The following problem occurred during the build:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (make) on project hadoop-common: An Ant BuildException has occured: exec returned: 1
[ERROR] around Ant part ...<exec dir="/home/bkjia/workplace/hadoop/hadoop-2.6.0-src/hadoop-common-project/hadoop-common/target/native" executable="cmake" failonerror="true">... @ 4:152 in /home/bkjia/workplace/hadoop/hadoop-2.6.0-src/hadoop-common-project/hadoop-common/target/antrun/build-main.xml
Install the following two libraries:
sudo apt-get install zlib1g-dev
sudo apt-get install libssl-dev
Another problem occurs: org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (dist) on project hadoop-hdfs-httpfs: An Ant BuildException has occured: exec returned: 2
Googling suggests this is caused by Apache Forrest not being installed. I had no idea what Forrest was, so I went straight to the official website (http://forrest.apache.org/) and downloaded the latest version; it installs simply by unpacking (is this one of the advantages of Java? Installing a C++ program used to take three steps). Then set the FORREST_HOME environment variable, add Forrest's bin directory to PATH, and recompile.
Even with Forrest installed, the error still occurred, so I decided not to compile this module at all. In hadoop-hdfs-project/pom.xml, comment out the module entry as <!-- <module>hadoop-hdfs-httpfs</module> --> and recompile; the build then completes smoothly. I do not know whether skipping this module has any impact on Hadoop.
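Spelled out, the edit in hadoop-hdfs-project/pom.xml just wraps that single entry in an XML comment and leaves the other module entries untouched, roughly:
<modules>
  <!-- ...other <module> entries unchanged... -->
  <!-- <module>hadoop-hdfs-httpfs</module> -->
</modules>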
The compiled hadoop-2.6.0 now sits under the hadoop-2.6.0-src/hadoop-dist/target/ directory, and we can use it to overwrite the previously downloaded Hadoop directory. Because this is a freshly compiled program, you need to run hadoop namenode -format again to initialize HDFS; otherwise errors will still occur.
The following error occurs when running start-dfs.sh again:
Starting namenodes on [bkjia.netease.com]
bkjia.netease.com: Error: JAVA_HOME is not set and could not be found.
localhost: Error: JAVA_HOME is not set and could not be found.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: Error: JAVA_HOME is not set and could not be found.
My first thought was that JAVA_HOME set with export in the shell is only valid for the current session, so it is not visible in the new session ssh opens when logging in from a new terminal. (To make it valid for all sessions of the current user, add export JAVA_HOME=XXX to ~/.bashrc; to make it valid for all users, add the export at the end of /etc/profile.) However, even after doing this, a new session opened via ssh shows a valid JAVA_HOME value, yet the error persists, so a more direct fix is needed. In libexec/hadoop-config.sh, find the line that prints "Error: JAVA_HOME is not set and could not be found." This message is printed when the script cannot find the directory set by JAVA_HOME, so add an export JAVA_HOME=xxx statement just before that if check to make sure JAVA_HOME has a value at that point. After that, the error no longer occurs from any terminal. However, if JAVA_HOME lives at different paths on different machines, each machine needs its own modification, which is indeed a problem.
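The edit in libexec/hadoop-config.sh looks roughly like the sketch below; the Java path is illustrative, and the exact shape of the surrounding if block may differ slightly between Hadoop versions:
# added just above the JAVA_HOME check in libexec/hadoop-config.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386   # illustrative path
if [[ -z $JAVA_HOME ]]; then
  echo "Error: JAVA_HOME is not set and could not be found." 1>&2
  exit 1
fi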
Run the ./sbin/start-dfs.sh and ./sbin/start-yarn.sh scripts again to start all the processes Hadoop needs, and use jps to check that the following processes are running:
bkjia@bkjia:~/workplace/hadoop$ jps
8329 SecondaryNameNode
8507 ResourceManager
8660 Jps
8143 DataNode
8023 NameNode
8628 NodeManager
At this point the single-machine version of Hadoop has been compiled and deployed. A real cluster deployment should be similar, and extending this to multiple machines over SSH is straightforward. Now let's run a simple test program.
When learning a new language we start with "hello world", and the word count program is MapReduce's "hello world". Below we create a file of English text and then count how many times each word appears in it. The text is the overview page of the official Hadoop documentation:
Apache Hadoop 2.6.0
Apache Hadoop 2.6.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.4.1.
Here is a short overview of the major features and improvements.
Common
Authentication improvements when using an HTTP proxy server. This is useful when accessing WebHDFS via a proxy server.
A new Hadoop metrics sink that allows writing directly to Graphite.
Specification work related to the Hadoop Compatible Filesystem (HCFS) effort.
HDFS
Support for POSIX-style filesystem extended attributes. See the user documentation for more details.
Using the OfflineImageViewer, clients can now browse an fsimage via the WebHDFS API.
The NFS gateway has received a number of supportability improvements and bug fixes. The Hadoop portmapper is no longer required to run the gateway, and the gateway is now able to reject connections from unprivileged ports.
The SecondaryNameNode, JournalNode, and DataNode web UIs have been modernized with HTML5 and Javascript.
YARN
YARN's REST APIs now support write/modify operations. Users can submit and kill applications through REST APIs.
The timeline store in YARN, used for storing generic and application-specific information for applications, supports authentication through Kerberos.
The Fair Scheduler supports dynamic hierarchical user queues, user queues are created dynamically at runtime under any specified parent-queue.
First, create a new file and copy the English content to the file:
cat > test
Then place the newly created test file on the HDFS file system as the mapReduce input file:
./bin/hadoop fs -put ./test /wordCountInput
This HDFS command copies the local file test to the file wordCountInput in the root directory of HDFS. Use the ls command to check whether it succeeded:
bkjia@bkjia-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ ./bin/hadoop fs -ls /
Found 1 items
-rw-r--r-- 1 bkjia supergroup 1400 /wordCountInput
The MapReduce examples live in share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar, a jar that packages several test programs; we use its wordcount program to do the word counting.
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /wordCountInput /wordCountOutput
This command uses Hadoop MapReduce to run the wordcount program in that jar. Its input is the HDFS file /wordCountInput (if this path is a directory, the input is all files in it), and its output goes into the /wordCountOutput directory on HDFS. A lot of INFO messages are printed during execution; here is part of the output:
15/01/20 13:09:29 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/01/20 13:09:29 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/01/20 13:09:29 INFO input.FileInputFormat: Total input paths to process: 1
15/01/20 13:09:30 INFO mapreduce.JobSubmitter: number of splits: 1
15/01/20 13:09:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local810038734_0001
...
15/01/20 13:09:33 INFO mapred.MapTask: Starting flush of map output
15/01/20 13:09:33 INFO mapred.MapTask: Spilling map output
...
15/01/20 13:09:34 INFO mapreduce.Job: map 100% reduce 0%
...
15/01/20 13:09:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local810038734_0001_r_000000_0
15/01/20 13:09:35 INFO mapred.LocalJobRunner: reduce task executor complete.
15/01/20 13:09:35 INFO mapreduce.Job: map 100% reduce 100%
15/01/20 13:09:36 INFO mapreduce.Job: Job job_local810038734_0001 completed successfully
15/01/20 13:09:36 INFO mapreduce.Job: Counters: 38
...
File Input Format Counters
Bytes Read = 1400
File Output Format Counters
Bytes Written = 1416
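One practical note about rerunning this job (not something that came up in the original run): MapReduce refuses to start if the output directory already exists, so before running it again the old output has to be removed, for example:
./bin/hadoop fs -rm -r /wordCountOutput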
Then let's take a look at the result directory:
bkjia@bkjia-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ ./bin/hadoop fs -ls /wordCountOutput
Found 2 items
-rw-r--r-- 1 bkjia supergroup 0 2015-01-20 13:09 /wordCountOutput/_SUCCESS
-rw-r--r-- 1 bkjia supergroup 1416 /wordCountOutput/part-r-00000
We can see two files under this directory; part-r-00000 is our result:
bkjia@bkjia-VirtualBox:~/workplace/hadoop/hadoop-2.6.0$ ./bin/hadoop fs -cat /wordCountOutput/part-r-00000
Hadoop 5
The 5
a 4
and 7
for 4
is 5
now 3
proxy 2
release 3
the 9
to 4
user 3
Here we only show the words that appear at least twice in the file, together with their counts. The job correctly counts the words in the file, and from here we can write our own MapReduce programs to implement all kinds of computations; a sketch of such a program follows below.
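As an illustration of what such a program looks like, here is a minimal WordCount written against the Hadoop 2.x MapReduce API. It mirrors the standard example shipped in the examples jar rather than reproducing its exact source, so treat it as a sketch:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner is optional but cheap here
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It would be compiled against the Hadoop jars, packaged into a jar, and run the same way as the example above, e.g. ./bin/hadoop jar wordcount.jar WordCount /wordCountInput /someOtherOutput (the jar name and output path here are just placeholders).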