Configure Hadoop and Hive


Recently, Hadoop and Hive were successfully configured on five Linux servers. A Hadoop cluster requires one machine as the master node; the remaining machines are slave nodes (the master node can also be configured as a slave node). Hive only needs to be configured and used on the master node.

1. Hadoop configuration

Hadoop configuration is relatively simple, because Hadoop does not need to be installed beyond unpacking the archive. The following describes the installation and configuration steps in detail, taking Hadoop 0.19.2 as an example.

(1) Download the hadoop-0.19.2.tar.gz file from the Hadoop site and extract it; place the resulting hadoop-0.19.2 directory under /search so that, together with the soft link created next, it is reachable as /search/hadoop (if you build it in another directory, remember to adjust the configuration items below accordingly).

Enter the following command to create the soft link: $ ln -s hadoop-0.19.2 hadoop (the advantage is that if you later switch to another Hadoop version, you only need to repoint the link instead of reconfiguring).
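As a minimal sketch of step (1) plus the soft link, assuming the archive was downloaded to /search:

$ cd /search
$ tar -xzf hadoop-0.19.2.tar.gz
$ ln -s hadoop-0.19.2 hadoop     # /search/hadoop now resolves to the versioned directory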

(2) Both Hadoop and Hive require machine names. Run the hostname command to set each machine's name. For example, to set the machine name of 10.10.60.139 to hadoop1, enter $ hostname hadoop1.

Modify the /etc/hosts file and add the machine names of all nodes in the Hadoop cluster. The master node and all slave nodes must be added; otherwise, problems may occur. For example, add:

10.10.60.139 hadoop1

10.10.60.140 hadoop2

10.10.63.32 hadoop3

10.10.65.48 hadoop4

10.10.67.36 hadoop5

(3) Because the master machine needs to SSH to all slave nodes, the following configuration is required on all machines. In this article, hadoop1 is the master node.

Open the /etc/ssh/sshd_config file and make sure the SSH daemon is not limited to the SSH2 protocol (the rsa1 identity key used below requires protocol 1); if necessary, change Protocol 2 to Protocol 1. If the file was modified, run service sshd restart to restart the SSH service.

Type the following commands:

$ cd ~/.ssh/

$ ssh-keygen -t rsa1 -C "hadoop_1" -f /root/.ssh/identity

$ cat identity.pub >> authorized_keys

The local public key is now appended to the authorized_keys file. At this point, ssh localhost should log in to the local machine without a password (if ssh to the machine's own IP succeeds but ssh localhost fails, open the /etc/hosts.allow file and add 127.0.0.1).

To allow the master node to SSH to the slave nodes without a password, every slave machine needs to copy (for example with rsync) the master's /root/.ssh/identity.pub file and append it (with cat) to its local /root/.ssh/authorized_keys file. After that, you can log on to any slave machine from the master node via ssh <slave IP> without entering a password.
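A minimal sketch of that distribution step, run on the master node, assuming the slave host names from the /etc/hosts entries above and that root can still reach the slaves (for example by password) for this initial copy:

$ for h in hadoop2 hadoop3 hadoop4 hadoop5; do
>   rsync /root/.ssh/identity.pub $h:/tmp/master_identity.pub
>   ssh $h 'cat /tmp/master_identity.pub >> /root/.ssh/authorized_keys'
> done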

(4) Modify environment variables

Open the /etc/profile file and add:

export HADOOP_HOME=/search/hadoop

export PATH=$HADOOP_HOME/bin:$PATH

Save the file and run $ source /etc/profile so that the environment variables take effect.
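As a quick sanity check that the variables are in effect:

$ echo $HADOOP_HOME        # should print /search/hadoop
$ which hadoop             # should resolve to /search/hadoop/bin/hadoop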

(5) Modify two settings in the $HADOOP_HOME/conf/hadoop-env.sh configuration file:

export JAVA_HOME=<local JDK or JRE path> (if no JDK is installed, you can run yum install java-1.6.0-openjdk-devel; note that it must be Java 1.6 or later, otherwise Hadoop cannot run properly)

# Hadoop heap size, 2 GB here; change this based on the machine's configuration.

export HADOOP_HEAPSIZE=2000

(6) Modify the $HADOOP_HOME/conf/hadoop-site.xml file as follows. Note that the fs.default.name and mapred.job.tracker configuration items must use the machine name of the master node rather than its IP address; otherwise, an error occurs when running Hive.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop1:9000</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop1:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/search/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/search/hadoop/filesystem/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories, the name table is replicated in all of them.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/search/hadoop/filesystem/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, data will be stored in all named directories.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created.</description>
  </property>
</configuration>

(7) Modify $HADOOP_HOME/conf/masters on all machines and put the IP address of the master node in it.

Modify $HADOOP_HOME/conf/slaves on all machines, listing the IP address of each slave node in the cluster, one per line; examples of both files are shown below.

Slaves file example:

10.10.60.139

10.10.60.140

10.10.63.32

10.10.65.48

10.10.67.36
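Masters file example (this cluster's master node, hadoop1):

10.10.60.139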

(8) Run the $ hadoop namenode -format command on all machines to format the NameNode.

(9) Run $HADOOP_HOME/bin/start-all.sh on the master node to start Hadoop. Note that you do not need to run this command on the slave nodes.

Enter the jps command in a shell to view the started Hadoop processes, for example:

11304 DataNode

15763 Jps

11190 NameNode

11516 JobTracker

11636 TaskTracker

11437 SecondaryNameNode

Note that the master node must show the NameNode, SecondaryNameNode, and JobTracker processes, and each slave machine must show the DataNode and TaskTracker processes, to confirm a successful startup.

To stop Hadoop, run $HADOOP_HOME/bin/stop-all.sh.

Hadoop web query interfaces

http://<master IP address>:50070/dfshealth.jsp

http://<master IP address>:50030/jobtracker.jsp
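For example, for this cluster, whose master node hadoop1 is at 10.10.60.139:

http://10.10.60.139:50070/dfshealth.jsp

http://10.10.60.139:50030/jobtracker.jsp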

Common Hadoop commands

hadoop dfs -ls lists the contents of a directory; by default, if no path is specified, it lists the current user's home path (for root, /user/root);

hadoop dfs -rmr xxx deletes a directory recursively; most of the commands are easy to pick up;

hadoop dfsadmin -report gives a cluster-wide view of the DataNodes;

hadoop job followed by additional parameters operates on currently running jobs, for example -list and -kill;

hadoop balancer balances disk usage across the DataNodes (see the usage sketch below).
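As a brief usage sketch (the job ID shown is hypothetical):

$ hadoop dfsadmin -report                    # summary of DataNode status and capacity
$ hadoop job -list                           # list currently running jobs
$ hadoop job -kill job_200910101010_0001     # kill a job by its (hypothetical) ID
$ hadoop balancer                            # rebalance blocks across DataNodes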

2. Hive configuration

Hive is a piece of software that provides data warehouse and SQL functionality on top of the Hadoop distributed computing platform, which simplifies ad-hoc queries over the massive data stored in Hadoop. Hive provides a query language called QL, which is based on SQL and easy to use. A copy of Hive is already included in the Hadoop directory, but its version is relatively old and is not recommended. Configure Hive on the master node as follows:

(1) Download hive-0.5.0-bin.tar.gz from the official website and extract it to produce the hive-0.5.0-bin folder; then create a soft link named hive to it in the $HADOOP_HOME/ directory.
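A minimal sketch of this step, assuming the archive was downloaded into $HADOOP_HOME:

$ cd $HADOOP_HOME
$ tar -xzf hive-0.5.0-bin.tar.gz
$ ln -s hive-0.5.0-bin hive      # $HIVE_HOME will be $HADOOP_HOME/hive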

(2) Add environment variables in /etc/profile:

export HIVE_HOME=$HADOOP_HOME/hive

export PATH=$HIVE_HOME/bin:$PATH

(3) Run the following commands to create the HDFS directories Hive needs:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

(4) At this point you can already use Hive through its CLI. However, this mode only supports single-user testing. In practical applications, Hive metadata (the schema) is usually stored in MySQL so that multiple users can be supported.

To do that, modify the following configuration items in the $HIVE_HOME/conf/hive-default.xml file:

Configuration item: javax.jdo.option.ConnectionURL
Value: jdbc:mysql://<MySQL host>/<database>?createDatabaseIfNotExist=true

Configuration item: javax.jdo.option.ConnectionDriverName
Value: com.mysql.jdbc.Driver

Configuration item: javax.jdo.option.ConnectionUserName
Value: <your MySQL user name>

Configuration item: javax.jdo.option.ConnectionPassword
Value: <your MySQL password>

(Reference article http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx)
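For reference, a sketch of the same four items as XML properties in the Hive configuration file; the host, database name, user name, and password here are placeholders, not values from this article:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://MYSQL_HOST/METASTORE_DB?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>MYSQL_USER</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>MYSQL_PASSWORD</value>
</property>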

(5) Download the mysql-connector-java-5.1.11-bin.jar file (the MySQL JDBC driver) and put it in the $HIVE_HOME/lib directory. Hive is now fully configured.
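As a quick smoke test of the completed setup (the table name is hypothetical):

$ hive
hive> SHOW TABLES;
hive> CREATE TABLE test_log (line STRING);
hive> SHOW TABLES;
hive> DROP TABLE test_log;
hive> quit;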
