Hadoop, which for many people is synonymous with distributed computing, is an open-source project that originated from two Google white papers. Like Linux a decade ago, Hadoop was initially very simple, but with the rise of big data in recent years it has gained a stage on which to show its full value. This is exactly why Hadoop is widely regarded in the industry as the next Linux.
1. Brief Introduction to the Hadoop Environment Based on VMware
This article describes how to install a Hadoop cluster based on multiple VMware virtual machines. This small cluster lets you study Hadoop's working processes on a local computer. Some people may question whether results obtained on such a small virtual machine cluster carry over: will programs written here work correctly on large clusters? Yes.
One feature of Hadoop is linear scalability: if the current data volume takes one unit of processing time, doubling the data volume doubles the processing time, and if the processing capacity is doubled as well, the processing time returns to one unit.
Normally, building a Hadoop cluster requires a large number of servers. But where can we find such servers at home? One option is to gather several PCs and install Linux on each of them.
Of course, there is a simpler way: find one reasonably powerful computer, install virtual machine software on it, create several virtual machines, and connect them into a small internal LAN. On this network we can install Linux, Java, and the Hadoop programs, giving us a simple Hadoop research system for software development and debugging. Programs developed on this small distributed cluster can be transplanted seamlessly to a cluster running the same version of Hadoop (compatibility between different Hadoop versions is not good; in particular, the APIs changed noticeably between earlier and later releases).
The following is the Hadoop virtual machine system built on my notebook. The network topology is as follows:
Virtual machine 0, machine name: db, IP: 192.168.186.10
Virtual machine 1, machine name: red, IP: 192.168.186.11
Virtual machine 2, machine name: mongdb, IP: 192.168.186.12
Virtual machine 3, machine name: nginx, IP: 192.168.186.13
The four virtual machines are interconnected through a virtual switch, and the development machine is also connected to that switch. The virtual switch is in turn connected to ADSL, so the entire system can access the Internet directly.
The following are typical configurations of several virtual machines:
Configuration of db: as shown above, this machine has a relatively high memory allocation. It is the master server of the cluster and needs more memory, so it is configured with 1.3 GB of RAM.
The configuration of red is as follows; mongdb and nginx use the same configuration:
This machine is configured like db, except with less memory: 188 MB here, which is enough for a debugging environment.
2. Configure the NIC IP Addresses in the VMware Environment
Static IP configuration is used to avoid the unnecessary confusion caused by DHCP assigning new IP addresses after a restart. The configuration is as follows:
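As a minimal sketch (assuming a RHEL/CentOS-style system, consistent with the yum and setup commands used later, and assuming the first NIC is eth0), the static IP for db might be set in /etc/sysconfig/network-scripts/ifcfg-eth0 roughly like this; adjust the device name, addresses, and gateway to your own virtual network:

DEVICE=eth0
BOOTPROTO=static          # use a static address instead of DHCP
ONBOOT=yes                # bring the interface up at boot
IPADDR=192.168.186.10     # db's address from the topology above
NETMASK=255.255.255.0
GATEWAY=192.168.186.2     # assumed gateway of the virtual switch/NAT

After editing, restart the network service (service network restart) for the change to take effect.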
2. Hosts file configuration
[root@db ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
192.168.186.10  db
192.168.186.11  red
192.168.186.12  mongdb
192.168.186.13  nginx
Make sure that the hosts file on every machine is configured as shown above.
A simple way to do this: after configuring one machine, write a script that automatically copies the hosts file to the other machines. The script is as follows:
[root@db ~]# cat update_hosts.sh
#!/bin/sh
# Copy the local hosts file to every other node in the cluster
for host in red mongdb nginx; do
    echo $host
    scp /etc/hosts root@${host}:/etc/
done
[root@db ~]#
The script should be created under the root account, given executable permission (chmod a+x *.sh), and then run as root. It automatically copies the hosts file to the other machines.
After the above steps are complete, log on to each machine and ping every other server. If any ping fails, check the configuration carefully.
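As a hypothetical convenience (not part of the original steps), the connectivity check can be scripted from any one node; the host names are the four defined in the hosts file above:

#!/bin/sh
# Ping every node once and report whether it is reachable
for host in db red mongdb nginx; do
    if ping -c 1 $host > /dev/null 2>&1; then
        echo "$host is reachable"
    else
        echo "$host is NOT reachable"
    fi
done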
3. Java Configuration
Check whether Java is correctly installed on each virtual machine server and whether the Java environment variables are configured.
For example, entering java -version (mark 1 in the figure) and seeing output similar to that at mark 2 indicates that Java is installed correctly.
At the same time, use the command env | grep JAVA_HOME (mark 3) to check whether the environment variable is configured correctly. If the Java environment variables are not configured, you need to configure them.
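A minimal sketch of the two checks on one machine (the exact version string will differ depending on the installed OpenJDK build):

java -version             # should print the installed Java version
env | grep JAVA_HOME      # should print JAVA_HOME if the variable is set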
Run the following command to install Java:
yum install java-1.7.0-openjdk
Run the following command to configure the environment variables.
vi /etc/profile
After opening the file, add the following content at the end:
JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.25/jre
JRE_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.25/jre
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME JRE_HOME PATH CLASSPATH
Save and exit. Finally, run the following command to make the configuration take effect.
source /etc/profile
Then run the checks above again to confirm that Java and the environment variables are in place. If something still does not work, search the Internet for the specific error.
4. Passwordless SSH Configuration
Hadoop manages servers remotely through SSH; the Hadoop management scripts use SSH to start and stop the daemons on each node.
For details on how to configure passwordless SSH login, see:
Hadoop 1.2.1 Pseudo-Distributed Mode Configuration
http://www.iigrowing.cn/hadoop1-2-1-pseudo-distributed-wei-fen-bu-mo-shi-pei-zhi.html
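The usual approach is sketched below only as a reminder; it assumes the commands are run on db as the account that will start Hadoop (root is used here for illustration) and that ssh-copy-id is available:

ssh-keygen -t rsa                  # generate a key pair; accept the defaults and an empty passphrase
ssh-copy-id root@red               # append the public key to each node's authorized_keys
ssh-copy-id root@mongdb
ssh-copy-id root@nginx
ssh red hostname                   # should print the host name without asking for a password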
5. Disable related firewalls
While the Hadoop programs are running, the nodes need to communicate with each other extensively, so the firewall must be dealt with for this access to work. The simplest approach is to turn off the firewall on every virtual machine in the cluster.
In each virtual machine, start the setup program.
Select the firewall configuration item.
In the dialog box shown below, change the highlighted options, and then select OK to exit.
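If you prefer the command line to the setup tool, the same result can be achieved on a RHEL/CentOS-style system (an assumption based on the yum and setup commands used in this article) roughly as follows:

service iptables stop      # stop the firewall immediately
chkconfig iptables off     # prevent it from starting again on boot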
6. Download the Hadoop Program
The process is omitted.
3. Configure the Hadoop Distributed Cluster
1. Download the Hadoop program onto the VM db and decompress it to the /work/apps/hadoop directory. I trust you can do this step yourself; if not, search online for instructions.
2. Configure hadoop Environment Variables
Go to the conf directory and edit the hadoop-env.sh file.
Modify the JAVA_HOME setting:
JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.25/jre
Note that this value must match the JAVA_HOME configured during the Java setup above.
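In conf/hadoop-env.sh the setting is normally an export statement that ships commented out; a minimal sketch, using the same OpenJDK path as above:

# conf/hadoop-env.sh
# The java implementation to use: uncomment the line and point it at the JDK/JRE installed above
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.25/jre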
3. Create directories
The name directory stores the namenode's data: the metadata of the HDFS file system.
The data directory stores the datanode's data: the actual file blocks.
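A minimal sketch of creating these directories under the installation path used in this article (the tmp directory is included because core-site.xml below points hadoop.tmp.dir at it):

mkdir -p /work/apps/hadoop/name    # namenode metadata (dfs.name.dir)
mkdir -p /work/apps/hadoop/data    # datanode block storage (dfs.data.dir)
mkdir -p /work/apps/hadoop/tmp     # temporary files (hadoop.tmp.dir)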
4. Configure the core-site.xml File
vi core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>  <!-- temporary file directory -->
    <value>/work/apps/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>  <!-- address of the namenode server -->
    <value>hdfs://db:9000</value>
  </property>
  <property>
    <name>fs.trash.interval</name>  <!-- how long deleted files are kept in the trash (recycle bin) -->
    <value>1440</value>
    <description>Number of minutes between trash checkpoints.
    If zero, the trash feature is disabled.
    </description>
  </property>
</configuration>
How do you know which properties can be set in this file and what they mean?
Each site file corresponds to a default configuration file, located in:
Open the file as follows:
The default file lists every property and explains its meaning. Note, however, that the default file is only for reference: it is not the file you configure, and settings written there are not the intended way to change the system. Overrides belong in core-site.xml.
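For reference only: in Hadoop 1.x the default files are normally bundled with the distribution; for example, core-default.xml is packed inside the hadoop-core jar, and HTML versions of the default property tables usually ship under the docs directory. A hedged sketch of viewing them, assuming unzip is installed:

cd /work/apps/hadoop
unzip -p hadoop-core-*.jar core-default.xml | less    # view the bundled default core properties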
5. Configure hdfs-site.xml
Enter the following command: vi hdfs-site.xml
Pay attention to the directories created above when filling in the following configuration.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>  <!-- where the namenode stores its data -->
    <value>/work/apps/hadoop/name</value>
    <!-- A comma-separated list of directories can be given here; the system writes the metadata
         to all of them synchronously to keep it safe. It is best to put these directories on
         different physical disks to improve I/O performance, and ideally to also write a copy
         to another server via NFS so the metadata is preserved without errors. -->
  </property>
  <property>
    <name>dfs.data.dir</name>  <!-- where the datanode stores its data -->
    <value>/work/apps/hadoop/data</value>
    <!-- A comma-separated list of disk directories; the datanode rotates data across them when
         storing blocks. For example, block 1 of a file may be placed in directory A and block 2
         in directory B, which makes full use of the disks and improves system performance. -->
  </property>
  <property>
    <name>dfs.replication</name>  <!-- number of copies of each file block -->
    <value>3</value>
  </property>
</configuration>
6. Configure the mapred-site.xml File
vi mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>  <!-- address of the job tracker -->
    <value>db:9001</value>
  </property>
</configuration>
7. Configure the masters and slaves files
View the content of the following two files:
The masters file stores the secondary namenode server configuration.
The slaves file stores the list of servers that run the datanode and tasktracker daemons.
These two files do not actually need to be distributed to the slave nodes, but to keep things simple they are not excluded from the copy script below; the slaves just end up with a couple of extra configuration files that they do not use. A sketch of typical contents follows.
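A minimal sketch of the two files for the topology used in this article (it is assumed here that db also acts as the secondary namenode, since no separate machine was set aside for it):

conf/masters:
db

conf/slaves:
red
mongdb
nginx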
4. Test the Hadoop System
1. Distribute the configured Hadoop system to each server
Create the following script:
[root@db apps]# vi scp_hadoop.sh
The script content is as follows:
#!/bin/sh
# Copy the configured hadoop directory to every other node
for host in red mongdb nginx; do
    echo $host
    scp -r /work/apps/hadoop sch@${host}:/work/apps/
done
After saving and exiting, make the file executable (chmod a+x *.sh).
Then run the script under the appropriate account. It copies the configured Hadoop program to the other servers.
2. Start the hadoop System
Go to the Hadoop directory.
Run the following command to format the HDFS file system, as shown in the figure:
bin/hadoop namenode -format
Then run bin/start-all.sh to start the Hadoop system; the related output looks like this:
3. Verify hadoop startup results
Run the following command to check the started Java processes:
ps -ef | grep java | awk '{print $1, $9, $11}'
The jps tool is not available because of a problem with the installed OpenJDK version, so the command above is used for now to check the Java processes.
Verify the Java processes on the other servers in the same way:
The figures show the result of the Java process check after logging on to each of the servers.
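As a small, hypothetical convenience (not part of the original steps), the same check can be run from db against every node over SSH, relying on the passwordless login configured earlier:

#!/bin/sh
# Show the owner and main class of every Java process on each node
for host in db red mongdb nginx; do
    echo "== $host =="
    ssh $host "ps -ef | grep java | grep -v grep | awk '{print \$1, \$9, \$11}'"
done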
Some errors did occur during the whole process; by checking the relevant logs and handling each issue individually, the cluster finally passed debugging.
After all, Hadoop is not an ordinary program, and you cannot expect to simply pick it up and use it. It requires careful study and continuous practice, and the most valuable outcome of the debugging work is improved skill and a deeper understanding of Hadoop.