Pentaho Works with Big Data (VII): Extracting Data from a Hadoop Cluster


I. Extracting data from HDFS to an RDBMS
1. Download the sample file from the address below.
http://wiki.pentaho.com/download/attachments/23530622/weblogs_aggregate.txt.zip?version=1&modificationDate=1327067858000

2. Use the following command to put the extracted weblogs_aggregate.txt file into the /user/grid/aggregate_mr/ directory in HDFS.
hadoop fs -put weblogs_aggregate.txt /user/grid/aggregate_mr/
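To confirm the upload, you can list the target directory and look at the first few lines of the file (an optional quick check, using the same path as above):

hadoop fs -ls /user/grid/aggregate_mr/
hadoop fs -cat /user/grid/aggregate_mr/weblogs_aggregate.txt | head -n 5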
3. Open PDI and create a new transformation, as shown in Figure 1.


Figure 1

4. Edit the 'Hadoop File Input' step, as shown in Figures 2 to 4.


Figure 2


Figure 3


Figure 4

Description:
. For how PDI connects to the Hadoop cluster, see http://blog.csdn.net/wzy0623/article/details/51086821.
. Use tab as the delimiter character.

5. Edit the 'Table Output' step, as shown in Figure 5.


Figure 5

Description:
. mysql_local is a previously created connection to the local MySQL database; its settings are shown in Figure 6.


Figure 6

. The 'Database fields' tab does not need to be set.

6. Execute the following script to create the MySQL table:
use test;
create table aggregate_hdfs (
    client_ip varchar(15),
    year smallint,
    month_num tinyint,
    pageviews bigint
);
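To confirm the table was created with the expected columns, an optional check from the MySQL client:

use test;
describe aggregate_hdfs;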

7. Save and run the transformation; the execution log is shown in Figure 7.


Figure 7

As you can see from Figure 7, the transformation has been executed successfully.

8. Query the MySQL table; the result is shown in Figure 8.


Figure 8

As you can see from Figure 8, the data has been extracted from HDFS into the MySQL table.
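The exact statement behind Figure 8 is not reproduced here; queries along these lines (hypothetical examples) would show the row count and a sample of the loaded data:

use test;
select count(*) from aggregate_hdfs;
select * from aggregate_hdfs limit 10;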

II. Extracting data from Hive to an RDBMS
1. Execute the following script to create the Hive table:
create table weblogs (
    client_ip string,
    full_request_date string,
    day string,
    month string,
    month_num int,
    year string,
    hour string,
    minute string,
    second string,
    timezone string,
    http_verb string,
    uri string,
    http_status_code string,
    bytes_returned string,
    referrer string,
    user_agent string
)
row format delimited
fields terminated by '\t';
2. Download the sample file from the address below.
http://wiki.pentaho.com/download/attachments/23530622/weblogs_parse.txt.zip?version=1&modificationDate=1327068013000

3. Use the following command to put the extracted weblogs_parse.txt file into the Hive table's warehouse directory in HDFS, /user/hive/warehouse/test.db/weblogs/.
hadoop fs -put weblogs_parse.txt /user/hive/warehouse/test.db/weblogs/
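Because the file goes directly under the table's warehouse directory, Hive sees the data immediately; you can confirm the file landed with an optional check:

hadoop fs -ls /user/hive/warehouse/test.db/weblogs/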
At this point, the data in the Hive table is as shown in Figure 9.


Figure 9
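Figure 9 shows a query over the new table; a minimal spot check along the same lines (a hypothetical example using columns from the DDL above):

select client_ip, year, month_num from weblogs limit 5;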

4. Open PDI and create a new transformation, as shown in Figure 10.


Figure 10

5. Edit the 'Table input' step, as shown in Figure 11.


Figure 11

Description: hive_101 is a previously created Hive database connection; its settings are shown in Figure 12.


Figure 12

Description: For how PDI connects to Hadoop Hive 2, see http://blog.csdn.net/wzy0623/article/details/50903133.
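The exact SQL in the 'Table input' step appears in Figure 11 and is not reproduced here; an aggregation along these lines (a sketch, not necessarily the query in the figure) would produce the five columns the target MySQL table expects:

select client_ip, year, month, month_num, count(*) as pageviews
from weblogs
group by client_ip, year, month, month_num;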

6. Edit the 'Table output' step, as shown in Figure 13.


Figure 13

Description:
. mysql_local is the previously created connection to the local MySQL database; its settings are shown in Figure 6.
. The 'Database fields' tab does not need to be set.

7. Execute the following script to create the MySQL table:
use test;
create table aggregate_hive (
    client_ip varchar(15),
    year varchar(4),
    month varchar(10),
    month_num tinyint,
    pageviews bigint
);
8. Save and run the transformation; the execution log is shown in Figure 14.


Figure 14

As you can see from Figure 14, the transformation has been executed successfully.

9. Query the MySQL table; the result is shown in Figure 15.


Figure 15

As you can see from Figure 15, the data has been extracted from the Hive table into the MySQL table.
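A query such as the following (a hypothetical example) would confirm the result, for instance by listing the client IPs with the most pageviews:

use test;
select * from aggregate_hive order by pageviews desc limit 10;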

Reference:
http://wiki.pentaho.com/display/BAD/Extracting+Data+from+HDFS+to+Load+an+RDBMS
http://wiki.pentaho.com/display/BAD/Extracting+Data+from+Hive+to+Load+an+RDBMS
