Install and configure CDH4 Impala


Impala, built on CDH, provides real-time queries over HDFS and HBase, using query statements similar to Hive's.
It consists of several components:
Clients: Hue, ODBC clients, JDBC clients, and the impala-shell, which issue interactive queries against Impala.
Hive Metastore: stores the metadata that tells Impala about the structure and other properties of the data.
Cloudera Impala: runs on each datanode, coordinates and distributes parallel query tasks, and returns query results to the client.
HBase and HDFS: data storage.

Environment
Hadoop-2.0.0-cdh4.1.2
Hive-0.9.0-cdh4.1.2
Install Impala using yum
Add the yum repository:

[cloudera-impala]
name=Impala
baseurl=http://archive.cloudera.com/impala/redhat/5/x86_64/impala/1/
gpgkey=http://archive.cloudera.com/impala/redhat/5/x86_64/impala/RPM-GPG-KEY-cloudera
gpgcheck=1

Save it as a .repo file in the /etc/yum.repos.d directory.
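For reference, a minimal sketch of creating the repository file from the shell (the file name cloudera-impala.repo is my own choice; any name ending in .repo works):

# Write the repository definition into /etc/yum.repos.d
sudo tee /etc/yum.repos.d/cloudera-impala.repo <<'EOF'
[cloudera-impala]
name=Impala
baseurl=http://archive.cloudera.com/impala/redhat/5/x86_64/impala/1/
gpgkey=http://archive.cloudera.com/impala/redhat/5/x86_64/impala/RPM-GPG-KEY-cloudera
gpgcheck=1
EOF
# Confirm yum can see the new repository
yum repolist | grep -i impala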

Note that the CDH, Hive, and Impala versions must match; check the Impala official website for the compatible versions.
Impala needs a large amount of memory and a 64-bit machine (as far as I recall, 32-bit machines are not supported), and only certain Linux distributions and versions are supported.

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/PDF/Installing-and-Using-Impala.pdf

Install CDH4

http://archive.cloudera.com/cdh4/cdh/4/

Both CDH and Hive can be downloaded here.
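For convenience, a sketch of fetching the tarballs with wget, assuming the archive file names simply follow the version strings above:

wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.2.tar.gz
wget http://archive.cloudera.com/cdh4/cdh/4/hive-0.9.0-cdh4.1.2.tar.gz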

Three machines:
master: namenode, secondarynamenode, resourcemanager, impala-state-store, impala-shell, and hive
slave1: datanode, nodemanager, impala-server, and impala-shell
slave2: datanode, nodemanager, impala-server, and impala-shell
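All three hostnames must resolve on every machine. A sketch of the /etc/hosts entries, assuming master is 192.168.200.114 (the state-store address used later) and placeholder addresses for the slaves:

# /etc/hosts on master, slave1, and slave2 (slave addresses are placeholders)
192.168.200.114  master
192.168.200.115  slave1
192.168.200.116  slave2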

Hadoop Configuration
Configure on the master machine
Add the following to $HADOOP_HOME/etc/hadoop/core-site.xml:

<property>
<name>io.native.lib.available</name>
<value>true</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

Add the following to $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/cloudera/hadoop/dfs/name</value>
<description>Determines where on the local filesystem the DFS namenode should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.</description>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/cloudera/hadoop/dfs/data</value>
<description>Determines where on the local filesystem a DFS datanode should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
<final>true</final>
</property>
<property>
<name>dfs.http.address</name>
<value>fca-vm-arch-proxy1:50070</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>fca-vm-arch-proxy1:50090</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

Add the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml:

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.job.tracker</name>
<value>hdfs://fca-vm-arch-proxy1:9001</value>
<final>true</final>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560m</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>50</value>
</property>

Add the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/jdk1.6.0_22

System Environment Variables
Add the following to $HOME/.bash_profile:
export JAVA_HOME=/jdk1.6.0_22
export JAVA_BIN=${JAVA_HOME}/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/home/hadoop/cloudera/hadoop-2.0.0-cdh4.1.2
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export PATH=$PATH:${JAVA_HOME}/bin:${HADOOP_HOME}/sbin:${HIVE_HOME}/bin
export JAVA_HOME JAVA_BIN PATH CLASSPATH JAVA_OPTS
export HADOOP_LIB=${HADOOP_HOME}/lib
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

Run source $HOME/.bash_profile to make the variables take effect.
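An optional sanity check that the new variables are visible in the shell:

echo $HADOOP_HOME                  # should print /home/hadoop/cloudera/hadoop-2.0.0-cdh4.1.2
${HADOOP_HOME}/bin/hadoop version  # should report 2.0.0-cdh4.1.2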

Yarn Configuration

Add the following to $HADOOP_HOME/etc/hadoop/yarn-site.xml:

<property>
<name>yarn.resourcemanager.address</name>
<value>fca-vm-arch-proxy1:9002</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>fca-vm-arch-proxy1:9003</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>fca-vm-arch-proxy1:9004</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

Add the following to $HADOOP_HOME/etc/hadoop/slaves:
slave1
slave2

Copy the CDH directory and .bash_profile from the master node to slave1 and slave2, set up the environment variables there, and configure passwordless SSH login from the master to the slaves. These steps are covered in detail elsewhere, so I will not repeat them here.
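For completeness, a minimal sketch of that setup, assuming the hadoop user exists on all three machines:

# On master, as the hadoop user: create a key (if none exists) and push it to the slaves
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2
# Copy the CDH directory and the profile to each slave
scp -r /home/hadoop/cloudera hadoop@slave1:/home/hadoop/
scp -r /home/hadoop/cloudera hadoop@slave2:/home/hadoop/
scp ~/.bash_profile hadoop@slave1:~/
scp ~/.bash_profile hadoop@slave2:~/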

Start HDFS and YARN

After the preceding steps are completed, log in to the master machine as the hadoop user and run the following commands in sequence:
hdfs namenode -format
start-dfs.sh
start-yarn.sh
Then run the jps command to check that:
the master has started the namenode, resourcemanager, and secondarynamenode processes;
slave1 and slave2 have started the datanode and nodemanager processes.
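For reference, the jps output should look roughly like this (the PIDs are placeholders):

# On master
$ jps
12001 NameNode
12002 SecondaryNameNode
12003 ResourceManager
12004 Jps
# On slave1 and slave2
$ jps
13001 DataNode
13002 NodeManager
13003 Jps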

Install Hive
Hive only needs to be installed on the master. impala-state-store relies on Hive to read metadata, and the Hive metastore in turn stores that metadata in a relational database, so MySQL is installed as well.
Download Hive

http://archive.cloudera.com/cdh4/cdh/4/

Decompress Hive
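A sketch of the extraction, assuming the tarball downloaded above and the HIVE_HOME path used below:

tar -xzf hive-0.9.0-cdh4.1.2.tar.gz -C /home/hadoop/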

Add the following to $HOME/.bash_profile:
export HIVE_HOME=/home/hadoop/hive-0.9.0-cdh4.1.2
export PATH=$PATH:${JAVA_HOME}/bin:${HADOOP_HOME}/sbin:${HIVE_HOME}/bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_LIB=$HIVE_HOME/lib

Run source $HOME/.bash_profile to make the environment variables take effect.
Copy mysql-connector-java-5.1.8.jar into the hive/lib directory.
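The hive-site.xml below points the metastore at a MySQL database named hive on master using the root account. A hedged sketch of preparing MySQL for that (the password is a placeholder matching the config; createDatabaseIfNotExist in the JDBC URL creates the database itself on first use):

# Run on the machine hosting MySQL; the grant lets root connect from the master host
mysql -u root -p <<'SQL'
GRANT ALL PRIVILEGES ON hive.* TO 'root'@'master' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
SQL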

Add the following to $HIVE_HOME/conf/hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
<description>Thrift URI for the remote metastore. Used by the metastore client to connect to the remote metastore.</description>
</property>
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against the metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
<description>Password to use against the metastore database</description>
</property>
<property>
<name>hive.security.authorization.enabled</name>
<value>false</value>
<description>Enable or disable the Hive client authorization</description>
</property>
<property>
<name>hive.security.authorization.createtable.owner.grants</name>
<value>all</value>
<description>The privileges automatically granted to the owner whenever a table gets created.
An example like "select,drop" will grant select and drop privileges to the owner of the table.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>${user.home}/hive-logs/querylog</value>
</property>
Because the Hive metastore is accessed as a remote service, hive.metastore.local is set to false.
hive.metastore.uris specifies the connection to the remote metastore.

Verify that the installation succeeded
After completing the preceding steps, verify that Hive is installed correctly.

Run hive on the master command line and enter show databases;. Output like the following indicates that Hive is working:
> hive
hive> show databases;
OK
Time taken: 18.952 seconds

Impala Installation

Install impala-state-store on the master:
sudo yum install impala-state-store
Install impala-shell on the master:
sudo yum install impala-shell

Configure Impala
Modify /etc/default/impala:

IMPALA_STATE_STORE_HOST=192.168.200.114
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala

IMPALA_STATE_STORE_ARGS=" -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}"
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}"

ENABLE_CORE_DUMPS=false

LIBHDFS_OPTS=-Djava.library.path=/usr/lib/impala/lib
MYSQL_CONNECTOR_JAR=/home/hadoop/cloudera/hive/hive-0.9.0-cdh4.1.2/lib/mysql-connector-java-5.1.8.jar
IMPALA_BIN=/usr/lib/impala/sbin
IMPALA_HOME=/usr/lib/impala
HIVE_HOME=/home/hadoop/cloudera/hive/hive-0.9.0-cdh4.1.2
# HBASE_HOME=/usr/lib/hbase
IMPALA_CONF_DIR=/usr/lib/impala/conf
HADOOP_CONF_DIR=/usr/lib/impala/conf
HIVE_CONF_DIR=/usr/lib/impala/conf
# HBASE_CONF_DIR=/etc/impala/conf

Copy Hadoop's core-site.xml and hdfs-site.xml and Hive's hive-site.xml to /usr/lib/impala/conf.
Add the following to the copied core-site.xml:

<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.read.shortcircuit.skip.checksum</name>
<value>false</value>
</property>

Add the following to both the copied hdfs-site.xml and Hadoop's own hdfs-site.xml:

<property>
<name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>750</value>
</property>
<property>
<name>dfs.block.local-path-access.user</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.file-block-storage-locations.timeout</name>
<value>3000</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader.local</name>
<value>true</value>
</property>
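Because Hadoop's own hdfs-site.xml changed, HDFS has to be restarted for the datanodes to pick up the new settings; a sketch using the scripts from earlier:

stop-dfs.sh
start-dfs.sh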

Copy mysql-connector-java-5.1.8.jar to /usr/lib/impala/lib.
Copy mysql-connector-java-5.1.8.jar to /var/lib/impala.
Copy /usr/lib/impala/lib/*.so* to $HADOOP_HOME/lib/native/.
Install Impala on slave1 and slave2:
sudo yum install impala
sudo yum install impala-server
sudo yum install impala-shell
Copy hive-site.xml, core-site.xml, and hdfs-site.xml from the master to slave1 and slave2, and copy the same jars as on the master.
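A sketch of that distribution step, assuming the /usr/lib/impala/conf layout configured on the master and an account that can write to those directories on the slaves:

# Run on the master for each slave
for host in slave1 slave2; do
  scp /usr/lib/impala/conf/core-site.xml /usr/lib/impala/conf/hdfs-site.xml \
      /usr/lib/impala/conf/hive-site.xml ${host}:/usr/lib/impala/conf/
  scp /usr/lib/impala/lib/mysql-connector-java-5.1.8.jar ${host}:/usr/lib/impala/lib/
done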

Start the Hive metastore
On the master, run hive --service metastore.

Start the Impala statestore
On the master, run statestored -log_dir=/var/log/impala -state_store_port=24000.

Start impalad on slave1 and slave2
sudo /etc/init.d/impala-server start

Check /var/log/impala/statestored.INFO to confirm the statestore started successfully; errors are written to statestored.ERROR.
Note that you must first start the Hive metastore and the Impala statestore on the master, and only then start impala-server (impalad) on slave1 and slave2.
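Putting the startup order together as one sketch (running the master services in the background with nohup is my own choice, not part of the original steps):

# On master: the Hive metastore first, then the Impala statestore
nohup hive --service metastore > /tmp/hive-metastore.log 2>&1 &
nohup statestored -log_dir=/var/log/impala -state_store_port=24000 > /tmp/statestored.out 2>&1 &
tail -n 20 /var/log/impala/statestored.INFO   # confirm the statestore came up cleanly
# On slave1 and slave2, only after both master services are running
sudo /etc/init.d/impala-server start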

Test the installation

Run on the master:
impala-shell
[Not connected] > connect slave1;
[slave1:21000] > use hive;
Query: use hive
[slave1:21000] > show tables;
OK
No errors here means the installation succeeded.
If you insert data on slave1, you must run refresh <table name> on slave2 for the new data to become visible there; unlike the plain refresh mentioned in some articles online, the table name must be given.
Whether data written outside the shell needs the same refresh has not been tested.
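An illustrative impala-shell session for the refresh behaviour, assuming a table named test in the hive database (the table name is a placeholder):

-- after inserting data into test from slave1, run on slave2:
[slave2:21000] > use hive;
[slave2:21000] > refresh test;
[slave2:21000] > select count(*) from test;   -- now reflects the newly inserted data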

Notes

Impala may report an error when inserting data, for example:

hdfsOpenFile(hdfs://fmaster:9000/user/hive/warehouse/test/.2038125373027453036......

This is a permission issue: the Impala services were started with sudo (that is, as root), and while the hadoop user has permission to add, delete, modify, and query the test table's directory, root does not.

Solution:

hdfs dfs -chmod -R 777 /user/hive/warehouse/test
