Hadoop 2.2.0 + Hive + Sqoop + MySQL Data Processing Case


I. Description of the Business

Using Hadoop 2 and other open-source frameworks, local log files are processed, and the metrics needed after processing (PV, UV ...) are imported back into a relational database (MySQL). A Java program then processes the result data and organizes it into reports for display.

II. Why Use Hadoop (Development Motivation)

The existing log processing approach computes everything in a pure Java program. For big-data-sized log files, this scheme has shortcomings in processing speed, I/O usage, data backup, and resource utilization.

In contrast, using Hadoop to accomplish data storage and computing has the following benefits:

(1) High reliability: Hadoop supports multi-copy backup of data.
(2) High scalability: Hadoop distributes data and computing tasks across the available machines, and these clusters can easily be extended to thousands of nodes.
(3) Efficiency: By distributing the data, Hadoop can process it in parallel on the nodes where the data resides, which makes processing very fast.
(4) High fault tolerance: Hadoop automatically keeps multiple copies of the data and automatically reassigns failed tasks.

(5) High resource utilization: The administrator can set up different resource scheduling schemes (YARN) according to the current server configuration, so as to maximize resource utilization.

III. Data Processing Flow Chart

IV. Category Contribution Rate Case Process

1. Case Business Objective:

For the whole day of December 16, 2014, compute the top 20 item category entries by number of user clicks.

2. Development Steps

(1) Upload the December 16, 2014 log (approximately 16 GB) to the /input/log_2014-12-16 directory of the HDFS file system.
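A possible upload command, as a minimal sketch; the local file name access_2014-12-16.log is an assumption, and only the HDFS target directory comes from the text:

# create the target directory on HDFS and upload the local log file
hdfs dfs -mkdir -p /input/log_2014-12-16
hdfs dfs -put /data/logs/access_2014-12-16.log /input/log_2014-12-16/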

(2) Write the MapReduce program that cleans the data

Program Logic Analysis:

Filter for the log lines whose URL field contains "?id=...", which indicates that the user clicked an entry carrying an item ID. The goal of the program is to keep only the rows containing such a URL and discard the rows that do not. The cleaned data is stored in the /output/cleanitem_2014_12_16 directory of HDFS (approximately 140 MB after cleaning).
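A minimal map-only sketch of this cleaning step, assuming the log is plain text and that a line counts as a click when it contains "?id="; the class name is illustrative, not the original program:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cleaning job: emit a log line unchanged only if its URL carries an item id.
public class CleanItemMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // keep only the click lines that contain "?id="
        if (line.toString().contains("?id=")) {
            context.write(line, NullWritable.get());
        }
    }
}

The driver would set the number of reduce tasks to 0 so that the filtered lines are written directly to /output/cleanitem_2014_12_16.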

(3) Map the cleaned data to a Hive table

In the Hive client, create an external partitioned table mmb_ordercontribute whose columns correspond one-to-one to the fields of the log, plus one extra partition column. The partition column plays a role similar to an index in MySQL, so queries filtered by date are efficient. The mmb_ordercontribute statement in Hive is along the lines of the sketch below, with the table location pointing to the cleaned data directory.
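A minimal sketch of such a DDL; only itemid is taken from the later queries, while the other column names, the partition column dt, and the tab delimiter are assumptions:

-- hypothetical column list; only itemid is referenced by the later queries
CREATE EXTERNAL TABLE IF NOT EXISTS mmb_ordercontribute (
    log_time STRING,
    user_id  STRING,
    itemid   STRING,
    url      STRING
)
PARTITIONED BY (dt STRING)                        -- extra date partition column
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';   -- delimiter of the cleaned file (assumed)

-- point the 2014-12-16 partition at the cleaned data directory on HDFS
ALTER TABLE mmb_ordercontribute
    ADD PARTITION (dt = '2014-12-16')
    LOCATION '/output/cleanitem_2014_12_16';

Because the table is EXTERNAL, dropping it later leaves the cleaned files in /output/cleanitem_2014_12_16 untouched.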

(4) Manipulating Hive table data

After the data has been mapped, the table can be processed with SQL-like statements, much as in MySQL, for example:

Count the total number of entry accesses by all users for the day:
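A minimal HiveQL sketch of such a count, assuming the partition column dt from the DDL sketch above:

-- total number of item-entry clicks for 2014-12-16
SELECT COUNT(1)
FROM mmb_ordercontribute
WHERE dt = '2014-12-16';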

The result of the count is: 904423

Although the query looks like MySQL, the way it is processed is worlds apart from MySQL: Hive parses the statement into a MapReduce program to do the counting. The total CPU time Hive spent on this query was about 6 minutes, but spread across the logical CPUs the actual elapsed time was 28.5 seconds; excluding the parsing and preparation of the MapReduce job, the execution time of the MapReduce itself was very small.

However, Hive does not map every SQL statement to a MapReduce program. The one exception (without extra configuration) is SELECT * FROM table_name LIMIT n, because this only fetches the first few rows of the table.
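For example, a fetch-only statement of this form against the table above:

-- returned directly from HDFS without launching a MapReduce job
SELECT * FROM mmb_ordercontribute LIMIT 10;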

(5) Process the cleaned data with Hive and export it to a temporary table

The data that is really needed is extracted in this step, namely the top 20 item categories by user clicks. The counted result is stored in a temporary Hive table, item_20141216, ready for later use.

SQL statement:

CREATE TABLE item_20141216 AS
SELECT itemid, COUNT(1) AS nums
FROM mmb_ordercontribute
GROUP BY itemid
ORDER BY nums DESC;

The resulting table item_20141216 is stored under Hive's default data warehouse directory:

Note: Each row of data actually consists of two parts, the item ID and the count, but it looks like a single number because the Hive table field separator is not a space or a TAB but \001.
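To make the invisible \001 separator visible, one can dump a few rows of the underlying file; the part file name 000000_0 is a typical name and an assumption here, and cat -A renders \001 as ^A:

# show the first rows with non-printing characters made visible
hdfs dfs -cat /user/hive/warehouse/db1.db/item_20141216/000000_0 | head -n 3 | cat -A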

(6) Use Sqoop to export the temporary table data to the corresponding MySQL table

Create a table in MySQL that corresponds to item_20141216

CREATE TABLE `item_contribute` (
  `itemId` varchar(11) DEFAULT NULL COMMENT 'entry id',
  `Nums` int(11) DEFAULT NULL COMMENT 'number of occurrences'
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Go to the Sqoop client and execute the export command with the connection information:

sqoop export --connect jdbc:mysql://master:3306/test --username hive --password 123456 --table item_contribute --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/db1.db/item_20141216'

You will see that the data in the MySQL table is populated:
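A quick check from the MySQL client, as a minimal sketch (the ORDER BY and LIMIT are only for viewing the top entries):

SELECT itemId, Nums FROM item_contribute ORDER BY Nums DESC LIMIT 20;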

(7) Finally, a Java program can be used to organize the data into a report for display.

