ETL Learning Notes: PostgreSQL

Source: Internet
Author: User
Tags: postgresql

ETL is "Extract"," Transform","Load" Three words initials that is "extract ","convert " ," Loading ", but we are often referred to as data extraction for the day.

ETL is the core and soul of BI/DW (Business Intelligence / Data Warehouse). It integrates data and improves its value according to unified rules, and it is responsible for moving data from the source systems into the target data warehouse, making it an essential step in building a data warehouse.

ETL consists of three aspects:

Extract: read data from the various original business systems; this is the prerequisite for all subsequent work.

" conversion ": According to the pre-designed rules will be extracted data conversion, so that the original heterogeneous data format can be unified together.

Load: import the converted data into the data warehouse on a schedule, either incrementally or in full.

ETL is the process by which business-system data is extracted, cleaned, and transformed before being loaded into the data warehouse. Its purpose is to integrate an enterprise's scattered, messy, and non-standard data and provide a basis for the enterprise's decision analysis. ETL is an important part of any BI project: typically it consumes at least one third of the project's time, and the quality of the ETL design bears directly on the success or failure of the BI project.

ETL design is divided into three parts: data extraction, data cleaning and transformation, and data loading, and the design work proceeds from these same three parts. Data is extracted from the various sources into the ODS (Operational Data Store); some cleaning and conversion can already happen during this step, and the extraction method should be chosen to make the ETL run as efficiently as possible. Of the three parts, the "T" (transform: cleaning and converting) takes the longest, generally accounting for about two thirds of the total ETL workload. Data loading is usually done immediately after cleaning, writing the results straight into the DW (Data Warehouse).

There are several ways to implement ETL, three of them common. The first uses ETL tools such as Oracle's OWB, SQL Server 2000 DTS, SQL Server 2005 SSIS, Informatica, and so on; the second uses hand-written SQL; the third combines ETL tools with SQL. The first two each have advantages and disadvantages: tools let you set up an ETL project quickly and shield you from complex coding tasks, which raises speed and lowers difficulty, but they lack flexibility; the SQL approach is flexible and can improve ETL runtime efficiency, but the coding is complex and the technical bar is higher. The third approach combines the advantages of the other two and can greatly improve both the development speed and the efficiency of the ETL.

I. Data Extraction (Extract)

Much of this part's work happens during the research phase: first understand how many business systems the data comes from, what DBMS each business system's database server runs, whether there is manually maintained data and how much, whether there is unstructured data, and so on. Once this information has been collected, the data extraction can be designed.

1. Data sources that use the same DBMS as the DW

This type of data source is relatively easy to handle. In general, the DBMS (SQL Server, Oracle) provides a database link feature: establish a direct link between the DW database server and the original business system, and the data can be accessed by simply writing a SELECT statement.
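In PostgreSQL, the closest equivalent of such a database link is a foreign data wrapper. Below is a minimal sketch using the postgres_fdw extension, assuming both the DW and the source system run PostgreSQL; the host, credentials, table definitions, and the ods.orders target table are all illustrative, not taken from the article.

```sql
-- Minimal sketch: both sides run PostgreSQL; names below are hypothetical
-- and ods.orders is assumed to already exist in the DW database.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER billing_src
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'billing-db.example.com', dbname 'billing', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER billing_src
    OPTIONS (user 'etl_reader', password 'secret');

-- Expose the source table locally, then extract with a plain SELECT.
CREATE FOREIGN TABLE src_orders (
    order_id   bigint,
    amount     numeric(12,2),
    created_at timestamp
) SERVER billing_src OPTIONS (schema_name 'public', table_name 'orders');

INSERT INTO ods.orders (order_id, amount, created_at)
SELECT order_id, amount, created_at
FROM   src_orders;
```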

2. Data sources that use a different DBMS than the DW

For this type of data source, a database link can generally be established through ODBC, for example between SQL Server and Oracle. If a database link cannot be established, there are two alternatives: export the source data to a .txt or .xls file with a tool and then import those files into the ODS, or go through a program interface.

3. For file-type data sources (.txt, .xls), business staff can be trained to use database tools to import the data into a designated database, from which it is then extracted; alternatively, this can be done with a tool, as in the sketch below.
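For plain-text files, PostgreSQL's built-in COPY command covers the import step. A hedged sketch follows; the file path and staging table are illustrative, and for a client-side file psql's \copy works the same way.

```sql
-- Staging table for a delimited supplier file (columns assumed).
CREATE TABLE ods.supplier_stage (
    supplier_code text,
    supplier_name text,
    region        text
);

-- Server-side bulk load; use psql's \copy instead for a client-side file.
COPY ods.supplier_stage (supplier_code, supplier_name, region)
FROM '/data/incoming/suppliers.txt'
WITH (FORMAT csv, DELIMITER ',', HEADER true, ENCODING 'UTF8');
```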

4. Issues with incremental updates

For systems with large data volumes, incremental extraction must be considered. In general, the business system records the time at which each transaction occurs, and this can be used as the incremental flag: before each extraction, determine the maximum time recorded in the ODS, then fetch from the business system all records with a later time. This relies on the business system's timestamps; in practice, business systems often have no timestamps, or only partial ones.
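A minimal sketch of this high-water-mark pattern in SQL, reusing the hypothetical src_orders foreign table from the earlier sketch; the fallback date is arbitrary.

```sql
-- 1) Find the newest timestamp already loaded into the ODS.
-- 2) Pull only the source rows strictly newer than that mark.
INSERT INTO ods.orders (order_id, amount, created_at)
SELECT s.order_id, s.amount, s.created_at
FROM   src_orders AS s
WHERE  s.created_at > (
           SELECT COALESCE(MAX(created_at), TIMESTAMP '1900-01-01')
           FROM   ods.orders
       );
```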

II. Data Cleaning and Transformation (Clean, Transform)

In general, the data warehouse is divided into two layers, ODS and DW. Common practice is to clean the data on the way from the business systems into the ODS, filtering out dirty and incomplete data, and then to transform it on the way from the ODS into the DW, computing and aggregating according to the business rules.

  1. Data cleaning

The task of data cleansing is to filter out data that does not meet requirements and hand the filtered results to the responsible business department, which confirms whether each record should be discarded or corrected by the business unit before being extracted.

Non-conforming data falls mainly into three kinds: incomplete data, erroneous data, and duplicate data. (SQL sketches for all three follow the list.)

(1) Incomplete data: data in which information that should be present is missing, such as a missing supplier name or branch name, missing customer region information, or detail rows in the business system that do not match their master record. This kind of data is filtered out and, grouped by what is missing, written to separate Excel files submitted to the customer, who is asked to complete it within a specified time. Only after it has been completed is it written to the data warehouse.

(2) Erroneous data: these errors arise because the business system is not robust enough and writes input straight to the backend database without validation, for example numeric fields containing full-width digit characters, string data followed by a carriage return, incorrect date formats, or out-of-range dates. This data must also be sorted by type. Problems such as full-width characters or invisible characters before and after the data can only be found by writing SQL statements; the customer is then asked to correct them in the business system before re-extraction. Incorrect date formats and out-of-range dates will cause the ETL run to fail; these must be picked out of the business-system database with SQL and handed to the business department with a deadline for correction, then extracted again once corrected.

(3) Duplicate data: for this type of data, especially in dimension tables, export all fields of the duplicated records so the customer can confirm and consolidate them.
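Hedged SQL sketches of the three checks just described; every table and column name below is an assumption for illustration, not taken from a real system.

```sql
-- (1) Incomplete data: a mandatory attribute is missing, or detail rows
--     have no matching master record.
SELECT *
FROM   ods.supplier_stage
WHERE  supplier_name IS NULL OR supplier_name = '';

SELECT d.*
FROM   ods.order_detail AS d
LEFT   JOIN ods.orders  AS o ON o.order_id = d.order_id
WHERE  o.order_id IS NULL;

-- (2) Erroneous data: full-width digits (U+FF10..U+FF19) hiding in a
--     numeric-looking column, and dates not in the expected YYYY-MM-DD form.
SELECT * FROM ods.supplier_stage WHERE supplier_code ~ '[０-９]';
SELECT * FROM ods.raw_events     WHERE raw_date !~ '^\d{4}-\d{2}-\d{2}$';

-- (3) Duplicate data: surface repeated keys in a dimension table
--     for the customer to review.
SELECT supplier_code, COUNT(*) AS occurrences
FROM   ods.supplier_stage
GROUP  BY supplier_code
HAVING COUNT(*) > 1;
```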

Data cleansing is a recurring process that cannot be finished in a few days; problems can only be identified and solved continuously. Whether a record is filtered out or corrected generally requires the customer's confirmation. Filtered data should be written to Excel files or to a data table (a sketch follows below); in the early stage of ETL development, a daily email of the filtered data can be sent to the business units, prompting them to correct the errors as soon as possible, and it also serves as a basis for verifying the data later. One caution: do not filter out useful data. Verify every filtering rule carefully, and have it confirmed by the user.
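One way to keep the filtered records as a verification basis, as the paragraph suggests, is a quarantine table that each cleaning rule writes into. Everything here is illustrative; the rule name and source table are hypothetical.

```sql
CREATE SCHEMA IF NOT EXISTS etl;

-- Quarantine table: one row per rejected record per rule.
CREATE TABLE etl.rejected_rows (
    rejected_at timestamp DEFAULT now(),
    rule_name   text,    -- which filtering rule fired
    source_tbl  text,    -- where the row came from
    row_payload jsonb    -- the offending row, kept verbatim
);

-- Example: capture rows rejected by the "missing supplier name" rule.
INSERT INTO etl.rejected_rows (rule_name, source_tbl, row_payload)
SELECT 'missing_supplier_name', 'ods.supplier_stage', to_jsonb(s)
FROM   ods.supplier_stage AS s
WHERE  s.supplier_name IS NULL OR s.supplier_name = '';
```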

 2. Data conversion

The tasks of data transformation are mainly the conversion of inconsistent data, the conversion of data granularity, and the calculation of certain business rules.

(1) Inconsistent data conversion: this is an integration step that unifies the same kind of data across different business systems. For example, the same vendor may be coded XX0001 in the billing system but YY0001 in the CRM; after extraction these are unified into a single code (see the first sketch after this list).

(2) Data granularity conversion: business systems generally store very detailed data, while data warehouse data is used for analysis and does not need that level of detail. In general, business-system data is aggregated to the granularity of the data warehouse, as in the second sketch after this list.

(3) Calculation of business rules: different enterprises have different business rules and different data indicators, and these indicators cannot always be computed by simple addition and subtraction. In such cases the indicators need to be computed in the ETL and stored in the data warehouse for use in analysis.
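Minimal sketches of the first two conversions, under assumed table and column names: a mapping table that unifies vendor codes across systems, and a GROUP BY that rolls detailed rows up to the warehouse's daily grain.

```sql
-- (1) Inconsistent codes: map each system's vendor code to a unified code.
CREATE TABLE dw.vendor_code_map (
    source_system text,
    source_code   text,
    unified_code  text,
    PRIMARY KEY (source_system, source_code)
);

INSERT INTO dw.vendor_code_map VALUES
    ('billing', 'XX0001', 'V0001'),
    ('crm',     'YY0001', 'V0001');

-- During the transform step, translate each source code to the unified one.
SELECT m.unified_code, s.amount
FROM   ods.billing_sales  AS s
JOIN   dw.vendor_code_map AS m
       ON  m.source_system = 'billing'
       AND m.source_code   = s.vendor_code;

-- (2) Granularity: aggregate detailed rows to the DW's daily grain.
INSERT INTO dw.daily_sales (sale_date, vendor_code, total_amount, order_count)
SELECT created_at::date, vendor_code, SUM(amount), COUNT(*)
FROM   ods.billing_sales
GROUP  BY created_at::date, vendor_code;
```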

III. ETL Logs and Warnings

1. ETL Log

ETL logs fall into three categories.

The first is the execution-process log: a record, in journal form, of every step during the ETL run, noting each step's start time and the number of data rows it affected.

The second is the error log: when a module fails, write an error log recording the time of the error, the module that failed, and the error message.

The third is the overall log, which records only the ETL start time, the end time, and whether the run succeeded. If an ETL tool is used, it will generate some logs automatically, and these can also serve as part of the ETL log.

The purpose of logging is to be able to know the ETL's running state at any time and, if an error occurs, to know where it happened.
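A hedged sketch of the three log types as plain tables that each ETL job writes into; the schema and column names are illustrative, not a standard.

```sql
-- Per-step execution log: start time and rows affected for every step.
CREATE TABLE etl.step_log (
    run_id        bigint,
    step_name     text,
    started_at    timestamp,
    rows_affected bigint
);

-- Error log: when the error happened, which module, and the message.
CREATE TABLE etl.error_log (
    run_id     bigint,
    module     text,
    errored_at timestamp,
    message    text
);

-- Overall log: only the run's start/end times and whether it succeeded.
CREATE TABLE etl.run_log (
    run_id      bigint PRIMARY KEY,
    started_at  timestamp,
    finished_at timestamp,
    succeeded   boolean
);
```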

2. Sending Warnings

If the ETL fails, it should not only write an error log but also send a warning to the system administrator. There are many ways to send a warning; a common one is to email the system administrator with the error message attached, making it easier to troubleshoot.

ETL is a key part of a BI project and a long-term process; only by continuously identifying and solving problems can the ETL run more efficiently and supply accurate, effective data for the later stages of the BI project.

Postscript

In a data warehouse system, ETL is the key link. Viewed broadly, ETL is a data integration solution; viewed narrowly, it is a tool for moving data around. Looking back over years of work, data migration and conversion have indeed made up a great deal of it, but those jobs were basically one-off tasks or involved small amounts of data. In a data warehouse system, however, ETL rises to a certain theoretical height, different from the earlier ad-hoc use of tools. The difference shows in the name itself: the handling of data has been divided into three steps, with E, T, and L standing for extraction, transformation, and loading respectively.

In fact, the ETL process is simply the flow of data from various data sources to various targets; in the data warehouse, though, it takes this more formal, three-step shape.

Available ETL Tools for PostgreSQL

1. Benetl is a free ETL tool for the PostgreSQL database that also supports MySQL. It is used to extract data from CSV, TXT, and Excel files, transform it, and import it into the database.

2. Kettle: a basic introduction to PostgreSQL operations

An ETL (extract, transform, load) tool is a tool for operations such as migrating, cleaning, and processing database data.
