Data Warehouse Practice Based on the Hadoop Ecosystem -- ETL (Part 1)


I. Data Extraction with Sqoop
1. Sqoop Introduction

Sqoop is a tool for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases. It graduated from the Apache Incubator in March 2012 and is now an Apache top-level project. There are two generations of Sqoop: Sqoop1 and Sqoop2. The latest stable release of Sqoop1 is 1.4.6, and the latest release of Sqoop2 is 1.99.6. Note that 1.99.6 is not compatible with 1.4.6, and Sqoop2 is still not feature-complete, so it is not recommended for deployment in a production environment.
(1) Sqoop1

The Sqoop1 architecture diagram is shown below.


The design goals of the first generation of Sqoop were simple:
    • Import and export data between enterprise data warehouses, relational databases, document systems, and Hive/HDFS.
    • Use a client-based model.
    • Connectors use the JDBC drivers supplied by the database vendors.
    • No centralized metadata store.
    • Map-only jobs, with no reduce tasks; data transfer and transformation are handled by the mappers.
    • Sqoop jobs can be scheduled and managed with Oozie.
Sqoop1 is developed in Java and is entirely client-driven. It relies heavily on JDBC and can import and export data with simple command-line invocations. For example:
# import
sqoop import --connect jdbc:mysql://localhost/testdb --table person --username test --password ****
The above command triggers the following sequence of tasks:
    • Generate the SQL code.
    • Execute the SQL code.
    • Generate a map-only job.
    • Execute the map job.
    • Transfer the data to HDFS.
# export
sqoop export --connect jdbc:mysql://localhost/testdb --table CLIENTS_INTG --username test --password *** --export-dir /user/localadmin/clients
The above command triggers the following sequence of tasks:
    • Generate a map-only job.
    • Execute the map job.
    • Transfer the data from the /user/localadmin/clients path in HDFS.
    • Generate the SQL code.
    • Insert the data into the CLIENTS_INTG table.
Sqoop1 offers many convenient features, such as importing directly into Hive, HDFS, or HBase from the command line. Its connectors cover the most popular databases: Oracle, SQL Server, MySQL, Teradata, PostgreSQL, and more.
The main problems of Sqoop1 include:
    • A bewildering variety of command-line arguments.
    • Insecure connections (for example, passwords written directly on the command line; a safer alternative is sketched below).
    • No metadata store, so jobs can only be configured and managed locally, which limits reuse.
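On the password problem specifically, Sqoop1 can read the password from a file instead of taking it on the command line (the --password-file option, or -P to prompt interactively). Below is a minimal sketch; the /user/etl/sqoop.pwd path is an assumption for illustration:
# store the password in HDFS in a file readable only by its owner (path is illustrative)
echo -n "mypassword" > sqoop.pwd
hdfs dfs -put sqoop.pwd /user/etl/sqoop.pwd
hdfs dfs -chmod 400 /user/etl/sqoop.pwd
# reference the password file instead of passing --password on the command line
sqoop import --connect jdbc:mysql://localhost/testdb --table person --username test --password-file /user/etl/sqoop.pwd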
(2) Sqoop2

The Sqoop2 architecture diagram is shown below.


The Sqoop2 architecture is considerably more complex than Sqoop1's; it was designed to solve Sqoop1's problems.
Ease of Use
Sqoop1 must be installed and configured on the client, whereas Sqoop2 is installed and configured on the server side. This means connectors are configured in one place, managed by an administrator role and used by an operator role. Likewise, the JDBC drivers and database connections only need to be configured on a single server. Sqoop2 also provides a web-based service: the front end is a command-line interface (CLI) and a browser UI, and the back end is a metadata repository. Users can run imports and exports through the web interface, avoiding misconfigured options and error-prone steps. In addition, Sqoop2 integrates Hive and HBase on the server side. Oozie manages Sqoop tasks through the REST API, so when a new Sqoop connector is installed, it does not need to be installed in Oozie as well.
Extensibility
In Sqoop2, connectors are no longer restricted to the JDBC vocabulary (where a database, table, and so on must be specified); they can even define their own vocabulary. For example, Couchbase does not need to specify a table name, only to overload it during fill or unload operations. Common functionality has been extracted from the connectors, leaving them responsible only for data transfer. This common functionality is implemented in the reduce phase, which ensures that connectors benefit from future improvements. Connectors no longer need to provide downstream functionality such as integration with other systems, so connector developers no longer need to know about every feature Sqoop supports.
Security
Currently, users run Sqoop by executing the 'sqoop' command, and the security of a Sqoop job depends mainly on trusting the user who runs it. Sqoop2 instead operates as an application-based service that controls access to external systems through connection objects assigned to different roles. For further security, Sqoop2 no longer generates code, no longer requires direct access to Hive or HBase, and does not expose job execution to every client. Sqoop2 treats connections as first-class objects. Once a connection containing credentials has been created, it can be reused by many import and export jobs. Connections are created by administrators and used by operators, which prevents end-user privilege sprawl. Connections can also be restricted to only certain basic operations, such as import and export. Resources are managed by limiting the total number of simultaneously open connections and by an option to disable connections.

2. Sqoop in CDH 5.7.0
Sqoop in CDH 5.7.0 includes both Sqoop1 and Sqoop2: the Sqoop1 version is 1.4.6 and the Sqoop2 version is 1.99.5. Sqoop2 still lacks some Sqoop1 features, so Cloudera recommends using Sqoop2 only when it fully satisfies your requirements, and otherwise continuing with Sqoop1. The differences between Sqoop1 and Sqoop2 in CDH 5.7.0 are summarized below.

Feature: Connectors for all major RDBMSs
    Sqoop1: Supported.
    Sqoop2: Not supported. Workaround: use the generic JDBC connector, which has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle databases. The connector should work with any JDBC-compatible database, but its performance is lower than that of Sqoop1's dedicated connectors.

Feature: Kerberos integration
    Sqoop1: Supported.
    Sqoop2: Not supported.

Feature: Data transfer from an RDBMS to Hive or HBase
    Sqoop1: Supported.
    Sqoop2: Not supported. Workaround: use a two-step method (see the sketch after this table). 1. Import the data from the RDBMS into HDFS. 2. Manually load the data into Hive or HBase with the appropriate tools or commands (such as Hive's LOAD DATA statement).

Feature: Data transfer from Hive or HBase to an RDBMS
    Sqoop1: Not supported. Workaround: use a two-step method. 1. Extract the data from Hive or HBase into HDFS (as text or Avro files). 2. Use Sqoop to export the output of the previous step to the RDBMS.
    Sqoop2: Not supported. Same workaround as for Sqoop1.
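To make the two-step RDBMS-to-Hive workaround above concrete, here is a minimal sketch; the staging directory /tmp/person_stage and the Hive table rds.person are assumptions for illustration:
# step 1: import the table from the RDBMS into a staging directory in HDFS
sqoop import --connect jdbc:mysql://localhost/testdb --table person --username test --password **** --target-dir /tmp/person_stage
# step 2: load the staged files into an existing Hive table whose field delimiter matches Sqoop's output (comma by default)
hive -e "load data inpath '/tmp/person_stage' into table rds.person"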


3. Extracting Data using Sqoop
This example uses Sqoop1 to extract data from a MySQL database into Hive.
There are two ways to move data from the source into the data warehouse (the rds database in this example): the data warehouse can pull the data from the source, or the source can push the data to the data warehouse. Important factors in choosing an extraction method are the availability and volume of the operational data, together with whether you extract all of the data or only the data that has changed since the last extraction. Consider the following two questions:
    • Which parts of the source data need to be extracted into the data warehouse? There are two approaches: full extraction and change data capture.
    • In which direction does the extraction flow? There are two modes: pull (the data warehouse pulls from the source) and push (the source pushes to the data warehouse).

Full Extraction and Change Data Capture (CDC)

If the data volume is small and easy to handle, a full extraction of the source data is generally used (all file records or all rows of the database table are extracted into the data warehouse). This approach suits reference data, such as postal codes; reference data is typically the source of dimension tables. If the source data volume is large, extracting everything is not feasible, and only the changed data (data that has changed since the last extraction) can be extracted. This extraction pattern, known as change data capture (CDC), is typically used to extract the transactional data of an operational system, such as sales orders.

CDC techniques can be broadly divided into two kinds: intrusive and non-intrusive. Intrusive means that the CDC operation affects the performance of the source system; any CDC approach that executes SQL statements against the source in any way can be considered intrusive. Of the four commonly used CDC methods (timestamp-based, snapshot-based, trigger-based, and log-based), three are intrusive and only the log-based method is not. The following table summarizes the characteristics of the four CDC approaches; a SQL sketch of the timestamp-based approach follows the table.

Characteristic                                 Timestamp    Snapshot    Trigger    Log
Can distinguish inserts from updates           No           Yes         Yes        Yes
Detects multiple updates within a cycle        No           No          Yes        Yes
Can detect deletes                             No           Yes         Yes        Yes
Non-intrusive                                  No           No          No         Yes
Supports real time                             No           No          Yes        Yes
Requires a DBA                                 No           No          Yes        Yes
Does not depend on the database                Yes          Yes         No         No
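As a concrete illustration of the timestamp-based approach, the minimal SQL sketch below pulls only the rows inserted during one day; it uses the sales_order table and entry_date column from this example, with an illustrative date range:
-- timestamp-based CDC: select only rows whose insert timestamp falls within the extraction window
select *
  from sales_order
 where entry_date >= '2015-01-02'
   and entry_date < '2015-01-03';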


Pull Data from the Source, or Have the Source Push Data
If the data source only needs to wait passively for the data warehouse to extract from it, pull mode can be used. However, the source data must be available and ready whenever the data warehouse wants to extract it. If the timeliness of the extracted data matters, or if you want the source to send the data as soon as it is ready, push mode, in which the data source pushes the data, should be used. If the data source is protected and pull access is forbidden, the only option is for the data source to push the data.

The following table summarizes the source data tables used by the dimension and fact tables in this example, along with their extraction modes.

Source data table    Extraction mode
customer             Full extraction, pull
product              Full extraction, pull
sales_order          CDC (timestamp-based), pull


(1) Overwrite import
The customer and product tables are pulled in full. ETL usually runs periodically at a fixed interval, so with full extraction each import must overwrite the data loaded the previous time. Sqoop provides the --hive-overwrite parameter for this. Another benefit of --hive-overwrite is that it makes the operation idempotent. An idempotent operation is one whose effect is the same whether it is executed once or many times, so the import can simply be rerun after a failure or a bug fix without worrying about corrupting the system through repeated execution.
The specific commands are as follows:
sqoop import --connect jdbc:mysql://cdh1:3306/source?useSSL=false --username root --password mypassword --table customer --hive-import --hive-table rds.customer --hive-overwrite
sqoop import --connect jdbc:mysql://cdh1:3306/source?useSSL=false --username root --password mypassword --table product --hive-import --hive-table rds.product --hive-overwrite
(2) Incremental Import
Sqoop provides an incremental import mode that imports only rows newer than those already imported. The parameters in the following table control incremental imports.

Parameter               Description
--check-column (col)    Specifies the column to examine when determining which rows to import. (The column should not be of CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR type.)
--incremental (mode)    Specifies how Sqoop determines which rows are new. Valid values are append and lastmodified.
--last-value (value)    Specifies the maximum value of the check column from the previous import.

Sqoop supports two types of incremental import: append and lastmodified. The --incremental parameter specifies which type to use.
Append mode should be used when new rows of the imported table have a continuously increasing row ID. The row ID column is specified with --check-column, and Sqoop imports the rows whose check column value is greater than the value given by --last-value.
The other table-update strategy supported by Sqoop is called lastmodified mode. It should be used when rows of the source table may be modified, and each modification sets a last-modified column to the current timestamp. Rows whose check column holds a timestamp newer than the one given by --last-value are imported.
At the end of an incremental import, the --last-value to use for the next import is printed. When running subsequent imports, you should specify --last-value in this way to ensure that only new or modified data is imported. This can be automated by running the incremental import as a saved job, which is the preferred mechanism for performing recurring incremental imports.
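For comparison with the append-mode job used later in this example, here is a minimal sketch of a lastmodified-mode import; the table name, the last_update_ts column, the id merge key, and the target directory are illustrative assumptions rather than part of this example's schema:
# lastmodified mode: pick up rows whose last-modified timestamp is newer than --last-value
sqoop import --connect jdbc:mysql://cdh1:3306/source?useSSL=false --username root --password mypassword --table some_table --incremental lastmodified --check-column last_update_ts --last-value '2015-01-01 00:00:00' --target-dir /tmp/some_table_inc --merge-key id
When the target directory already exists, --merge-key makes Sqoop run a merge job so that each key keeps only its newest record instead of being appended as a duplicate.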

With this basic understanding of Sqoop incremental imports, let's see how they are used to extract data in this example. The sales_order table is extracted with a timestamp-based CDC pull. This assumes that a sales order record never changes once it has been stored, or that any changes can be ignored; in other words, sales_order is a table to which data is only appended over time. The sales_order table has two time-related fields: order_date, the time the order was placed, and entry_date, the time the order row was actually inserted into the table. When "late facts" are discussed later, you will see that these two times can differ. Which field should be used as the CDC timestamp? Imagine this scenario: a sales order has an order date of January 1, 2015 but is actually inserted into the table on January 2, 2015, and the ETL runs at midnight every day to extract the previous day's data. If the data were extracted by order_date, the condition would be where order_date >= '2015-01-02' and order_date < '2015-01-03', and the ETL run at midnight on January 3, 2015 would never capture this new order. Therefore entry_date should be used as the CDC timestamp.

Now test the incremental import:
1) First perform an initial full import as the baseline for the increments.
sqoop import --connect jdbc:mysql://cdh1:3306/source?useSSL=false --username root --password mypassword --table sales_order --hive-import --hive-table rds.sales_order --hive-overwrite
2) Create a Sqoop incremental import job.
sqoop job --create myjob_1 -- import --connect "jdbc:mysql://cdh1:3306/source?useSSL=false&user=root&password=mypassword" --table sales_order --columns "order_number, customer_number, product_code, order_date, entry_date, order_amount" --where "entry_date >= date_add(current_date(), interval -1 day) and entry_date < current_date()" --hive-import --hive-table rds.sales_order --incremental append --check-column entry_date --last-value '1900-01-01'
3) Insert a row of data into the source database.
insert into source.sales_order values (null, 7, 3, date_add(current_date(), interval -1 day), date_add(current_date(), interval -1 day), 10000);
commit;
4) Execute the Sqoop job.
sqoop job --exec myjob_1
5) Query the rds database in Hive.
select * from sales_order order by order_number desc;
As the query results show, the new row has been appended to the rds.sales_order table.
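The saved job created in step 2 can also be listed and inspected with Sqoop's job tool; --show prints the job's stored parameters, including the --last-value recorded after each run:
# list all saved jobs, then show the definition of the job created above
sqoop job --list
sqoop job --show myjob_1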

This covers the general approach to extracting data with Sqoop. Sqoop has many more command-line parameters; see the official documentation at http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html for details.
