Hive Deployment and Optimization Configuration

1.1 Installation

1. Hive depends on a working Hadoop environment, so install Hadoop first; Hive is generally installed on the NameNode of the Hadoop cluster.
2. Download the latest Hive release (0.7.1 at the time of writing) from http://www.apache.org/dyn/closer.cgi/hive/
3. Extract the installation package to a directory, such as /home/hadoop/hive.
4. Prepare the metadata store.
By default Hive stores its metadata in an embedded Derby database, which does not allow multiple users to open the Hive shell at the same time, so MySQL is recommended instead.
1) Download the MySQL JDBC driver (http://download.csdn.net/detail/linjsh/2853585) and copy mysql-connector-java-5.1.11-bin.jar to the /home/hadoop/hive/lib directory;
2) Copy hive-default.xml in the /home/hadoop/hive/conf directory and rename the copy to hive-site.xml:
cp /home/hadoop/hive/conf/hive-default.xml /home/hadoop/hive/conf/hive-site.xml
3) Configure hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against Metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepasswd</value>
<description>password to use against Metastore database</description>
</property>
In the above configuration, javax.jdo.option.ConnectionURL specifies the host, port, and database used when connecting to MySQL.
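Note that the account referenced above (user hive, password hivepasswd) must exist in MySQL with rights on the hive database (the createDatabaseIfNotExist=true parameter creates the database itself on first connection). A minimal sketch in the MySQL client, assuming MySQL runs on the same host as Hive:
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost' IDENTIFIED BY 'hivepasswd';
FLUSH PRIVILEGES;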
After the configuration is complete, run /home/hadoop/hive/bin/hive from the shell to enter the Hive console. Once the "hive>" prompt appears, enter the SHOW TABLES; command; if it executes without errors, the installation was successful.
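A successful first session looks roughly like this (the table list is empty on a fresh install):
$ /home/hadoop/hive/bin/hive
hive> SHOW TABLES;
OK
Time taken: ... seconds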

1.2 Basic Hive Operations

1) Create a table: CREATE TABLE pokes (foo INT, bar STRING);
2) Create a table with a partition column ds:
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
3) Show all tables: SHOW TABLES;
4) Show tables matching a regular expression: SHOW TABLES '.*s';
5) Add a column to a table: ALTER TABLE pokes ADD COLUMNS (new_col INT);
6) Add a column with a column comment: ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
7) Rename a table: ALTER TABLE events RENAME TO 3koobecaf;
8) Load data from a local file into a table:
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
9) Load local data into a given partition:
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
10) Load HDFS data into a given partition (see the usage sketch after this list):
LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
11) Write query results to a local directory:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
12) Create and manage indexes on a table
Create an index on table userloginlog:
CREATE INDEX index_user_login_log ON TABLE userloginlog (userid) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
Build (or rebuild) the index on table userloginlog:
ALTER INDEX index_user_login_log ON userloginlog REBUILD;
Show the indexes on table userloginlog:
SHOW INDEX ON userloginlog;
Drop the index index_user_login_log from table userloginlog:
DROP INDEX index_user_login_log ON userloginlog;
13) Start the Hive server on port 10000:
bin/hive --service hiveserver 10000
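As a usage sketch tying these operations together: after the partition loads in steps 9 and 10, the partition column ds can be filtered on like any other column, and the same statement can be run non-interactively through the CLI's -e (inline statement) and -f (script file) options; the script file analysis.q below is a hypothetical name:
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
$ bin/hive -e "SELECT a.foo FROM invites a WHERE a.ds='2008-08-15'"
$ bin/hive -f analysis.q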

1.3 Steps for Data Analysis with Hive

1. Create a table whose columns correspond to the fields of the log file, for example:
CREATE TABLE userloginlog (userid STRING, char_id STRING, user_name STRING,
user_ip STRING, login_time STRING, product_id STRING,
region_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
LOCATION '/home/hive/hivedata/login/';
2. Import the log file data into the Hive table. This can be done in two ways: either with the LOAD DATA command:
LOAD DATA LOCAL INPATH '/home/hive/test/login.log' OVERWRITE INTO TABLE userloginlog;
or by uploading the log files directly into the table's HDFS directory with a shell command:
hadoop dfs -put /home/hive/test/login/* /home/hive/hivedata/login/
3. Run analysis queries against the Hive table. For example, to count the number of user logins per region and save the result to the local directory /tmp/result_out:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/result_out' SELECT region_id, COUNT(userid) FROM userloginlog GROUP BY region_id;
Joins against other tables (here a useractivelog table) work the same way:
SELECT COUNT(userid) FROM useractivelog JOIN userloginlog ON (useractivelog.user_id = userloginlog.userid);
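The /tmp/result_out directory written by the INSERT OVERWRITE above will contain one or more plain text files (typically named 000000_0, 000001_0, and so on), with columns separated by Hive's default Ctrl-A delimiter; they can be inspected with ordinary shell tools:
cat /tmp/result_out/000000_0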

2.2 SQL Optimization

1. Column pruning
When reading data, read only the columns the query actually needs and ignore the rest. For example, in the query:
SELECT a, b FROM t WHERE e < 10;
where t contains five columns (a, b, c, d, e), columns c and d are ignored and only a, b, and e are read.
This option is enabled by default: hive.optimize.cp = true
2. Partition pruning
Reduce the number of partitions scanned during a query. For example, in the following queries:
SELECT * FROM (SELECT c1, COUNT(1)
FROM t GROUP BY c1) subq
WHERE subq.prtn = 100;

SELECT * FROM t1 JOIN
(SELECT * FROM t2) subq ON (t1.c1 = subq.c2)
WHERE subq.prtn = 100;
the predicate on the partition column prtn is pushed down into the subquery, so only the prtn = 100 partition is scanned.
This option is enabled by default: hive.optimize.pruner = true
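Both pruning optimizations can be toggled per session from the Hive CLI; a quick sketch:
set hive.optimize.cp=true;
set hive.optimize.pruner=true;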
3. Join
When writing a query with a JOIN operation, put the table or subquery with fewer rows on the left side of the join operator (a sketch follows below). In the reduce phase of a join, the tables on the left side of the join operator are buffered in memory while the rightmost table is streamed, so placing the smaller table on the left effectively reduces the chance of an out-of-memory (OOM) error.
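A sketch of this rule, assuming a small (hypothetical) regions dimension table joined to the large userloginlog table: the small table is placed on the left, and the STREAMTABLE hint can additionally make the streamed (large) table explicit:
SELECT /*+ STREAMTABLE(l) */ r.region_id, COUNT(l.userid)
FROM regions r JOIN userloginlog l ON (r.region_id = l.region_id)
GROUP BY r.region_id;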
4. Merge small files
Too many small files put a strain on HDFS (every file occupies NameNode memory) and seriously hurt processing efficiency; you can eliminate this effect by merging the output files of map and reduce tasks. For the small-files problem in general, Hadoop itself offers several solutions, namely Hadoop Archives, SequenceFile, and CombineFileInputFormat. See http://dongxicheng.org/mapreduce/hdfs-small-files-solution/ for details.
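Within Hive itself, job output can also be merged through its merge settings; a minimal sketch (option names as of this Hive generation; defaults may differ between versions):
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;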
5. Create an index
In our tests, creating an index did not improve query speed (see the results in section 3.4).

2.3 Partition-Based Optimization

Reference: http://www.tbdata.org/archives/1765
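The idea, in brief: partition a large fact table on a column such as the day, so that queries filtering on that column scan only the relevant partitions. A minimal sketch based on the login table from section 1.3 (the table name userloginlog_p and partition column dt are assumptions for illustration):
CREATE TABLE userloginlog_p (userid STRING, char_id STRING, user_name STRING,
user_ip STRING, login_time STRING, product_id STRING, region_id STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/hive/test/login.log' OVERWRITE INTO TABLE userloginlog_p PARTITION (dt='2011-04-30');
SELECT COUNT(userid) FROM userloginlog_p WHERE dt='2011-04-30';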

3.1 Test Environment

Machine   Role     System                                   Hardware
hadoop1   Master   CentOS 5.5, Linux 2.6.18-194.32.1.el5    Intel(R) Celeron(R) CPU 2.80GHz, 1 GB memory
hadoop2   Slave    CentOS 5.5, Linux 2.6.18-194.32.1.el5    VM (Pentium(R) Dual-Core CPU 2.5GHz), 1 GB memory
hadoop3   Slave    CentOS 5.5, Linux 2.6.18-194.32.1.el5    Intel(R) Xeon(R) E5606 @ 2.13GHz (4 CPUs), 4 GB memory
3.2 Test Data

User logs covering the 36 days from April 30 to June 4; the files total 4.3 GB. The data is stored as MySQL data files partitioned by day (one table per day, 36 tables in total).

3.3 Importing Test Data

1. Create a database named hive in MySQL and copy the data files into the hive subdirectory of the MySQL data directory, so that the hive database contains the 36 tables;
2. Export the 36 tables to CSV text files with a script; the per-table command looks like:
SELECT * FROM login_game_20110430 INTO OUTFILE '/var/lib/mysql/output/log1.csv' FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
SELECT * FROM login_game_20110501 INTO OUTFILE '/var/lib/mysql/output/log2.csv' FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
and so on; the resulting CSV files are log1.csv, log2.csv, ..., log36.csv;
3. In the Hive command line, create a table whose columns correspond to the log file fields:
CREATE TABLE userloginlog (userid STRING, char_id STRING, user_name STRING,
user_ip STRING, login_time STRING, product_id STRING,
region_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
LOCATION '/home/hive/hivedata/login/';
The directory given in the LOCATION clause is where the table's data is stored in HDFS.
4. Bulk-import log1.csv, log2.csv, ..., log36.csv into the directory holding the userloginlog table's data files, i.e. the LOCATION directory specified when the table was created, /home/hive/hivedata/login/:
hadoop dfs -put /home/hive/test/datafrommysql/sum.csv /home/hive/hivedata/login/ (the upload took 9 minutes)
After these steps the data is fully imported and testing can begin.
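A quick sanity check after the import is a full count over the new table, which runs a MapReduce job across all 36 days of data; a minimal sketch:
SELECT COUNT(1) FROM userloginlog;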

3.4 Test Results

Query: SELECT * FROM userloginlog WHERE uid='…'
Hive: 3 min 4 sec without an index; 3 min 39 sec with an index (building the index took 23 minutes)
MySQL: 47 sec without an index; under 1 sec with an index

Query: SELECT COUNT(userid) FROM userloginlog
Hive: 3 min 40 sec
MySQL: under 1 sec
