Hadoop 2.2.0 + Hive + Sqoop + MySQL Data Processing Case


I. Description of the Business

Using Hadoop 2 and other open-source frameworks, local log files are processed, and the metrics needed after processing (PV, UV ...) are imported back into a relational database (MySQL). A Java program then processes the result data and organizes it into reports for display.

II. Why Use Hadoop (Development Motivation)

The existing log processing approach computes everything in a pure Java program. For big-data-sized log files, this scheme has shortcomings in processing speed, I/O usage, data backup, and resource utilization.

In contrast, using Hadoop to accomplish data storage and computing has the following benefits:

(1) High reliability: Hadoop supports multi-copy backup of data.
(2) High scalability: Hadoop distributes data and computing tasks across the available machines, and these clusters can easily be extended to thousands of nodes.
(3) Efficiency: By distributing the data, Hadoop can process it in parallel on the nodes where the data resides, which makes processing very fast.
(4) High fault tolerance: Hadoop automatically keeps multiple copies of the data and automatically reassigns failed tasks.

(5) High resource utilization: The administrator can set up different resource scheduling schemes (YARN) according to the current server configuration, so as to maximize resource utilization.

III. Data Processing Flow Chart

IV. Category Contribution Rate Case Process

1. Case Business Objective:

For the whole day of December 16, 2014, compute the top 20 item category entries by number of user clicks.

2. Development Steps

(1) Upload the December 16, 2014 log (approximately 16 GB) to the /input/log_2014-12-16 directory of the HDFS file system.
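A possible upload command, as a minimal sketch; the local file name access_2014-12-16.log is an assumption, and only the HDFS target directory comes from the text:

# create the target directory on HDFS and upload the local log file
hdfs dfs -mkdir -p /input/log_2014-12-16
hdfs dfs -put /data/logs/access_2014-12-16.log /input/log_2014-12-16/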

(2) Write the MapReduce program that cleans the data

Program Logic Analysis:

Filter for the log lines whose URL field contains "?id=...", which indicates that the user clicked an entry carrying an item ID. The goal of the program is to keep only the rows containing such a URL and discard the rows that do not. The cleaned data is stored in the /output/cleanitem_2014_12_16 directory of HDFS (approximately 140 MB after cleaning).
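A minimal map-only sketch of this cleaning step, assuming the log is plain text and that a line counts as a click when it contains "?id="; the class name is illustrative, not the original program:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cleaning job: emit a log line unchanged only if its URL carries an item id.
public class CleanItemMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // keep only the click lines that contain "?id="
        if (line.toString().contains("?id=")) {
            context.write(line, NullWritable.get());
        }
    }
}

The driver would set the number of reduce tasks to 0 so that the filtered lines are written directly to /output/cleanitem_2014_12_16.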

(3) Map the cleaned data to a Hive table

In the Hive client, create an external partitioned table mmb_ordercontribute whose columns correspond one-to-one to the fields of the log, plus one extra partition column. The partition column plays a role similar to an index in MySQL, so queries filtered by date are efficient. The mmb_ordercontribute statement in Hive is along the lines of the sketch below, with the table location pointing to the cleaned data directory.
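A minimal sketch of such a DDL; only itemid is taken from the later queries, while the other column names, the partition column dt, and the tab delimiter are assumptions:

-- hypothetical column list; only itemid is referenced by the later queries
CREATE EXTERNAL TABLE IF NOT EXISTS mmb_ordercontribute (
    log_time STRING,
    user_id  STRING,
    itemid   STRING,
    url      STRING
)
PARTITIONED BY (dt STRING)                        -- extra date partition column
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';   -- delimiter of the cleaned file (assumed)

-- point the 2014-12-16 partition at the cleaned data directory on HDFS
ALTER TABLE mmb_ordercontribute
    ADD PARTITION (dt = '2014-12-16')
    LOCATION '/output/cleanitem_2014_12_16';

Because the table is EXTERNAL, dropping it later leaves the cleaned files in /output/cleanitem_2014_12_16 untouched.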

(4) Manipulating Hive table data

After the data has been mapped, the table can be processed with SQL-like statements, much as in MySQL, for example:

Count the total number of entry accesses by all users for the day:
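A minimal HiveQL sketch of such a count, assuming the partition column dt from the DDL sketch above:

-- total number of item-entry clicks for 2014-12-16
SELECT COUNT(1)
FROM mmb_ordercontribute
WHERE dt = '2014-12-16';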

The result of the count is: 904423

Although the query looks like MySQL, the way it is processed is worlds apart from MySQL: Hive parses the statement into a MapReduce program to do the counting. The total CPU time Hive spent on this query was about 6 minutes, but spread across the logical CPUs the actual elapsed time was 28.5 seconds; excluding the parsing and preparation of the MapReduce job, the execution time of the MapReduce itself was very small.

However, Hive does not map every SQL statement to a MapReduce program. The one exception (without extra configuration) is SELECT * FROM table_name LIMIT n, because this only fetches the first few rows of the table.
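For example, a fetch-only statement of this form against the table above:

-- returned directly from HDFS without launching a MapReduce job
SELECT * FROM mmb_ordercontribute LIMIT 10;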

(5) Process the cleaned data with Hive and export it to a temporary table

The data that is really needed is extracted in this step, namely the top 20 item categories by user clicks. The counted result is stored in a temporary Hive table, item_20141216, ready for later use.

SQL statement:

CREATE TABLE item_20141216 AS
SELECT itemid, COUNT(1) AS nums
FROM mmb_ordercontribute
GROUP BY itemid
ORDER BY nums DESC;

The resulting table item_20141216 is stored under Hive's default data warehouse directory:

Note: Each row of data actually consists of two parts, the item ID and the count, but it looks like a single number because the Hive table field separator is not a space or a TAB but \001.
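To make the invisible \001 separator visible, one can dump a few rows of the underlying file; the part file name 000000_0 is a typical name and an assumption here, and cat -A renders \001 as ^A:

# show the first rows with non-printing characters made visible
hdfs dfs -cat /user/hive/warehouse/db1.db/item_20141216/000000_0 | head -n 3 | cat -A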

(6) Use Sqoop to export the temporary table data to the corresponding MySQL table

Create a table in MySQL that corresponds to item_20141216

CREATE TABLE `item_contribute` (
  `itemId` varchar(11) DEFAULT NULL COMMENT 'entry id',
  `Nums` int(11) DEFAULT NULL COMMENT 'number of occurrences'
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Go to the Sqoop client and execute the export command with the connection information:

sqoop export --connect jdbc:mysql://master:3306/test --username hive --password 123456 --table item_contribute --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/db1.db/item_20141216'

You will see that the data in the MySQL table is populated:
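A quick check from the MySQL client, as a minimal sketch (the ORDER BY and LIMIT are only for viewing the top entries):

SELECT itemId, Nums FROM item_contribute ORDER BY Nums DESC LIMIT 20;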

(7) Finally, a Java program can be used to organize the data into a report for display.

