Hadoop + Hive for Data Warehousing & Some Tests


Background Requirements and Status Quo
The current log setup can hardly be called a system: the logs simply sit on several servers, shared over NFS and handled by ad-hoc operations, which causes many problems:
a) Data storage is messy and lacks systematic directory management;
b) Storage space is limited and expansion is cumbersome;
c) CV/PV logs are stored in scattered locations, so merging them is inconvenient;
d) Media service log data is stored centrally; the volume is large and a lightweight backup is difficult;
e) Data loss has occurred and is not recoverable;
f) Data extraction performance is low and often becomes an operational bottleneck.


Requirements Analysis
a) Log capture & storage
Log name    Daily log size    Backup    Administrator
.............

b) Log preprocessing
See attachment: only the PV & CV log is written; there is no other log format.
Excel: ****log.xlsx
c) Data preprocessing
d) Data normalization (cleaning)
Missing data handling
In the PV log, attribute values such as areaid, ip, referurl, and open are found to be empty in many records. Empty attribute values can be handled in the following ways (a minimal HiveQL sketch follows the list):
Ignore the record. If a record is missing an attribute value, it is excluded from the data mining process, especially when the missing value is in the primary classification attribute. This method is not very effective, particularly when the proportion of records missing each attribute varies widely.
Fill the missing value manually. Generally speaking, this is time-consuming and not feasible for large datasets with many omissions.
Fill the missing value with a default. All missing values of an attribute are filled with a predetermined value.
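
For illustration only, a minimal HiveQL sketch of the default-value approach. The pv_log table, its columns (areaid, ip, referurl, open), the pv_log_clean target table, and the default values are all assumptions based on the attributes listed above:

-- Copy the PV log into a cleaned table, substituting fixed defaults for
-- attributes that are NULL or empty (table, column names and defaults are assumed).
CREATE TABLE pv_log_clean (areaid STRING, ip STRING, referurl STRING, open STRING);
INSERT OVERWRITE TABLE pv_log_clean
SELECT
  CASE WHEN areaid   IS NULL OR areaid   = '' THEN '-1'      ELSE areaid   END,
  CASE WHEN ip       IS NULL OR ip       = '' THEN '0.0.0.0' ELSE ip       END,
  CASE WHEN referurl IS NULL OR referurl = '' THEN 'direct'  ELSE referurl END,
  CASE WHEN open     IS NULL OR open     = '' THEN '0'       ELSE open     END
FROM pv_log;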
Inconsistent data handling
Real-world data often contains inconsistencies between records. Some of these can be resolved manually with the help of external references, for example inconsistencies in server-side encodings; preprocessing can help correct inconsistencies introduced when the data was encoded.
e) Data conversion
Data conversion mainly means normalizing the data. For example, consider a customer information database with an age attribute and a wage attribute: the wage values are much larger than the age values. Without normalization, a distance computed on the wage attribute will obviously far exceed one computed on the age attribute, which means the wage attribute's contribution to the overall distance between data objects is incorrectly amplified. Min-max scaling is a common remedy (a sketch follows).
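
As an illustration, a minimal HiveQL sketch of min-max scaling. The customer table, its id/age/wage columns, and the numeric bounds are all hypothetical; in practice the bounds would first be obtained with a query such as SELECT MIN(wage), MAX(wage) FROM customer.

-- Scale age and wage into [0, 1] so neither dominates a distance computation.
-- The bounds below are illustrative placeholders.
SELECT
  id,
  (age  - 18.0)   / (65.0    - 18.0)   AS age_norm,
  (wage - 1000.0) / (50000.0 - 1000.0) AS wage_norm
FROM customer;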
f) Data reduction (consolidation)
Complex analysis over the full contents of a large database usually takes so much time that it becomes impractical, especially when interactive data mining is required. Data reduction techniques help derive a compact dataset from the original large dataset while keeping the reduced dataset faithful to the original, so that mining the reduced dataset is more efficient and yields essentially the same results as mining the original. For log data, a typical reduction is a pre-aggregated daily rollup, sketched below.
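
A minimal sketch of such a rollup in HiveQL. The pv_log table with a dt partition and an areaid column is an assumption carried over from the attributes mentioned earlier; pv_daily_summary is a hypothetical target table.

-- Reduce raw PV records to one row per area per day.
CREATE TABLE pv_daily_summary (areaid STRING, pv BIGINT) PARTITIONED BY (dt STRING);
INSERT OVERWRITE TABLE pv_daily_summary PARTITION (dt='2010-04-16')
SELECT areaid, COUNT(1)
FROM pv_log
WHERE dt = '2010-04-16'
GROUP BY areaid;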

About Hadoop
........
Hadoop family
The Hadoop project consists of the following subprojects (member name: use):
Hadoop Common: the low-level module of the Hadoop system, providing utilities such as configuration-file and logging support for the other subprojects.
Avro: an RPC project hosted by Doug Cutting, somewhat like Google's Protobuf and Facebook's Thrift. Avro is intended as Hadoop's future RPC layer, making Hadoop RPC communication faster and its data structures more compact.
Chukwa: a monitoring system for large Hadoop-based clusters, contributed by Yahoo.
HBase: an open-source, distributed, column-oriented database built on the Hadoop distributed file system.
HDFS: the distributed file system.
Hive: similar to Cloudbase, a set of software built on the Hadoop distributed computing platform that provides data warehouse SQL functionality: summarization of the massive data stored in Hadoop and simple ad-hoc queries. Hive provides QL, a query language based on SQL that is easy to use.
MapReduce: the implementation of the MapReduce programming framework.
Pig: a SQL-like high-level query language built on MapReduce; it compiles operations into MapReduce map and reduce stages, and users can define their own functions. It was developed by Yahoo's grid computing department as a clone of Google's Sawzall project.
ZooKeeper: an open-source implementation of Google's Chubby. It is a reliable coordination system for large distributed systems, covering configuration maintenance, naming, distributed synchronization, group services, and so on. ZooKeeper's goal is to encapsulate these complex and error-prone key services and deliver an easy-to-use interface and a high-performance, functionally stable system to users.

Hadoop installation
h) Operating system
Linux 2.6.31-20-generic, Ubuntu 9.10
i) Required software
ssh
apt-get install openssh-server

rsync
apt-get install rsync

Java 1.6
apt-get install sun-java6-jre sun-java6-jdk

ant
apt-get install ant
j) Configure the environment for password-free SSH login:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Note: a single machine does not need the scp step below; appending the local public key to ~/.ssh/authorized_keys is enough.
scp ~/.ssh/id_dsa.pub hadoop@*.*.*.*:/home/hadoop/id_dsa.pub
cat ~/id_dsa.pub >> ~/.ssh/authorized_keys
Test the login:
ssh localhost or ssh *.*.*.*

k) Installation
i. Download from the official website (not covered here).
ii. We install Hadoop under /usr/local/:
tar zxvf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop

iii. Configure Hadoop (the files below start from the official default configuration; this is a single-node setup. For a cluster, see: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html)
conf/core-site.xml:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>


conf/hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>


conf/mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

iv. Format Hadoop (format the HDFS namenode).
If the error "Error: JAVA_HOME is not set." appears, JAVA_HOME has not been configured.
Configure JAVA_HOME globally:
vi /etc/environment
Add JAVA_HOME and /usr/local/hadoop/bin (to the PATH):
JAVA_HOME="/usr/lib/jvm/java-6-sun"
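
The format command itself is not shown in the original; for this 0.20.x release it would typically be run from the Hadoop directory as:

# formats a new HDFS filesystem (this destroys any existing HDFS data)
bin/hadoop namenode -format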
v. Start Hadoop
start-all.sh
vi. Check whether Hadoop is running normally
netstat -nl | more
tcp6  0  0 127.0.0.1:9000  :::*  LISTEN
tcp6  0  0 127.0.0.1:9001  :::*  LISTEN
tcp6  0  0 :::50090        :::*  LISTEN
tcp6  0  0 :::50070        :::*  LISTEN

vii. Testing
hadoop fs -put CHANGES.txt input/
hadoop fs -ls input
This example runs the grep example job, which counts how many times each word matching the pattern occurs:
hadoop jar hadoop-*-examples.jar grep input output '[a-z]+'
root@hadoop-desktop:/usr/local/hadoop# hadoop fs -cat output/* | more
cat: Source must be a file.
3828.
1969 via
1375 to
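
For a plain word count (rather than the regex-based grep example), the same examples jar also ships a wordcount driver. A minimal sketch, reusing the input/ directory created above; the output directory name is arbitrary:

# counts occurrences of every word in input/ and writes the result to output-wc/
hadoop jar hadoop-*-examples.jar wordcount input output-wc
hadoop fs -cat output-wc/* | more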


l) API introduction
See attachment (Word): Hadoop's API.docx
Hive installation
n) Download the latest version from the official site (not covered here).
o) Decompress:
tar zxvf hive-0.5.0-bin.tar.gz
ln -s hive-0.5.0-bin hive
p) Configure the Hive environment:
vi /etc/environment
HIVE_HOME="/usr/local/hive/"
q) Create the Hive storage directories:
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse
r) Start Hive:
hive
This enters the hive> prompt.
Create a pokes table:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Load test data; the loaded file has 2 columns:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> SELECT COUNT(1) FROM pokes;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0018, tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0018
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0018
2010-04-13 16:32:12,188 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:32:29,536 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:32:38,768 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:32:44,916 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0018
OK
500
Time taken: 38.379 seconds

hive> SELECT count(bar), bar FROM pokes GROUP BY bar;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0017, tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0017
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0017
2010-04-13 16:26:55,791 Stage-1 map = 0%, reduce = 0%
2010-04-13 16:27:11,165 Stage-1 map = 100%, reduce = 0%
2010-04-13 16:27:20,268 Stage-1 map = 100%, reduce = 33%
2010-04-13 16:27:25,348 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0017
OK
3 val_0
1 val_10
...............
Time taken: 37.979 seconds

s) Hive API:
Reference: http://hadoop.apache.org/hive/docs/current/api/org/apache/hadoop/hive/conf/
t) Hive with MySQL as the metastore
Reference: http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx
The idea is to keep the metastore in MySQL so that the list of tables is still available even if HDFS hangs. It does not feel necessary: if HDFS is dead, the metadata alone is of no use anyway.
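
For reference, the article cited above essentially comes down to pointing Hive's javax.jdo.option.* connection properties at a MySQL database in conf/hive-site.xml. A minimal sketch; the host name, database name, and credentials below are placeholders, and the MySQL JDBC driver jar must be placed in hive/lib:

<configuration>
<!-- JDBC connect string for the metastore database (placeholder host/db) -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
</property>
<!-- MySQL JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- placeholder credentials -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive_password</value>
</property>
</configuration>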

Some tests for Hive & Hadoop:
v) Loading gz- or bz2-compressed data: space & time comparison:
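The DDL for the page_test and page_test2 tables is not shown in the original; it was presumably something along the following lines. The column list is hypothetical; only the dt partition column is implied by the LOAD statements below:

-- hypothetical schema for the partitioned page-view test tables
CREATE TABLE page_test2 (ip STRING, url STRING, referurl STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';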

hive> LOAD DATA LOCAL INPATH 'ok.txt.gz' OVERWRITE INTO TABLE page_test2 PARTITION (dt='2010-04-16');
Copying data from file:/usr/local/ok.txt.gz
Loading data to table page_test2 partition {dt=2010-04-16}
OK
Time taken: 3.649 seconds
The following is the size of the block file Hadoop stores for this Hive table:
root@hadoop:/tmp/hadoop-root/dfs/data/current# du -ch blk_-945326243445352181
22M blk_-945326243445352181
22M total

w) Loading the uncompressed text file:
hive> LOAD DATA LOCAL INPATH 'ok.txt' OVERWRITE INTO TABLE page_test PARTITION (dt='2010-04-17');
Copying data from file:/usr/local/ok.txt
Loading data to table page_test partition {dt=2010-04-17}
OK
Time taken: 41.593 seconds
The following is the size of one of the block files Hadoop stores for this Hive table:
root@hadoop:/tmp/hadoop-root/dfs/data/current# du -ch blk_7538941016314062501
64M blk_7538941016314062501
64M total

x) Source file sizes:
root@hadoop:/usr/local# du -ch ok.txt
196M ok.txt
196M total

root@hadoop:/usr/local# du -ch ok.txt.gz
22M ok.txt.gz
22M total

y) Hive query comparison:
From the results, the query over the compressed data is a little faster than over the uncompressed data, which is curious.
Query over the gz import with a partition, using Hive QL: hive> SELECT COUNT(1) FROM page_test2 a WHERE a.dt='2010-04-16';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0026, tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0026
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0026
2010-04-16 13:43:39,435 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:47:30,921 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0026
OK
17166483
Time taken: 239.447 seconds
Query over the txt import with a partition, using Hive QL: hive> SELECT COUNT(1) FROM page_test a WHERE a.dt='2010-04-16';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201004131133_0025, tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201004131133_0025
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201004131133_0025
2010-04-16 13:37:11,927 Stage-1 map = 0%, reduce = 0%
2010-04-16 13:42:01,382 Stage-1 map = 100%, reduce = 22%
2010-04-16 13:42:13,683 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201004131133_0025
OK
17166483
Time taken: 314.291 seconds
A query over the txt data without a partition was not recorded; it took more than 400 seconds.

Hive Development
a) Start the Hive service:
Start the Hive service on port 10000:
HIVE_PORT=10000 ./bin/hive --service hiveserver

b) Check whether the service is started:
netstat -nl | grep 10000

c) Write a test program:
This is the official example. It compiles, but execution produced an error that I have not tracked down yet.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testhivedrivertable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: the filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}
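
To actually run the example, the Hive and Hadoop jars have to be on the classpath. A rough sketch, assuming HIVE_HOME as configured above and HADOOP_HOME pointing at /usr/local/hadoop; the exact jar names vary by release:

# compile and run against the jars shipped in hive/lib and the hadoop core jar (paths are assumptions)
javac -cp "$HIVE_HOME/lib/*" HiveJdbcClient.java
java -cp ".:$HIVE_HOME/lib/*:$HADOOP_HOME/hadoop-0.20.2-core.jar:$HADOOP_HOME/lib/*" HiveJdbcClient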
