Introduction to ETL Technology: ETL and the Data Warehouse
What is ETL?
ETL is the abbreviation of the three words Extract, Transform, Load: extracting data from a source, transforming it, and loading it into a target. ETL is most commonly used with data warehouses, but its targets are not limited to them. Speaking of ETL, you must first talk about the data warehouse.
Background Knowledge:
Information is an important resource of modern enterprises and the basis for scientific management and decision analysis. At present, most enterprises spend a great deal of money and time building online transaction processing (OLTP) business systems and office automation systems (such as the various operation support systems in the telecom industry, or shopping-website systems) to record all kinds of transaction data. According to statistics, data volume doubles every two to three years. This data contains enormous commercial value, yet enterprises generally pay attention to only about 2% to 4% of the total. As a result, they fail to make full use of their existing data resources, wasting time and money and missing the best opportunities for key business decisions.
In this context, the data warehouse (DW) emerged: a strategic collection of all types of data that can support the decision-making process at every level of an enterprise. A data warehouse integrates the data generated by OLTP systems in order to extract commercial value and provide decision support. For example, in the telecom industry, China Telecom runs systems that record, every day, information such as customer complaints, broadband groups, and the recovery times of customer numbers. This data sits in the production environment; each month it is integrated into a data warehouse and processed into reports. One use of these reports is to make a rough judgment about a user's tendency to churn (cancel the number and leave the network). This is the value of the data warehouse.
So how can we get data into a data warehouse? One of the technologies used is ETL.
The following is a detailed definition of ETL:
As the core and soul of the data warehouse, ETL (Extract-Transform-Load) integrates data according to unified rules and improves its value. It is responsible for moving data from the data sources into the target data warehouse and is an essential step in implementing one. If the data warehouse's model design is the blueprint of a building and the data is the brick, then ETL is the construction process. The most difficult parts of such a project are user requirement analysis and model design, but the design and implementation of ETL rules is the largest workload, accounting for roughly 60% to 80% of the entire project. This is a general consensus drawn from much practice at home and abroad.
ETL is the process of data extraction (Extract), cleaning (Cleaning), transformation (Transform), and loading (Load). It is an important part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the data warehouse according to a pre-defined data warehouse model.
Therefore, how an enterprise uses technical means to convert data into information and knowledge has become a major bottleneck in improving its core competitiveness, and ETL is one of the principal technical means for doing so.
Within a data warehouse system, ETL is a key link. Viewed broadly, ETL is a data integration solution; viewed narrowly, it is a tool for moving data from one place to another.
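To make the extract, clean/transform, and load steps concrete, here is a minimal hand-rolled sketch in plain JDBC, the kind of work an ETL tool automates. The connection strings, table names, and the trivial cleaning rule are all hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal hand-rolled ETL: extract rows from a source table, apply a trivial
// cleaning/transformation rule, and load them into a target table.
// All connection details and table names are made-up placeholders.
public class MiniEtl {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection(
                     "jdbc:oracle:thin:@srchost:1521:orcl", "src_user", "src_pass");
             Connection dst = DriverManager.getConnection(
                     "jdbc:oracle:thin:@dsthost:1521:dwh", "dst_user", "dst_pass");
             Statement extract = src.createStatement();
             ResultSet rs = extract.executeQuery(
                     "SELECT id, name FROM src_customer");               // Extract
             PreparedStatement load = dst.prepareStatement(
                     "INSERT INTO dst_customer (id, name) VALUES (?, ?)")) {
            while (rs.next()) {
                String raw = rs.getString("name");
                String name = raw == null ? "" : raw.trim().toUpperCase(); // Clean/Transform
                load.setInt(1, rs.getInt("id"));
                load.setString(2, name);
                load.executeUpdate();                                      // Load
            }
        }
    }
}
```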
Common ETL tools include Informatica, DataStage, BeeLoad, and Kettle. Since we currently use only Kettle, Kettle is what is described here.
Kettle is an open-source ETL tool written in pure Java that runs on Windows, Linux, and Unix. From version 3 onward it is "green" software: simply unzip it, with no installation required.
Note that Kettle needs a JRE on the machine in order to run.
Kettle is an ETL toolset that lets you manage data from different databases; through its graphical user environment you describe what you want done rather than how to do it.
Kettle has two types of script files: transformations and jobs. A transformation performs the actual data conversion, while a job controls the overall workflow.
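Since Kettle is written in pure Java, a saved transformation (a .ktr file) can also be executed outside the GUI through the Kettle API. A minimal sketch, assuming a transformation saved as demo.ktr (a placeholder name) and the Kettle engine libraries on the classpath:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

// Runs a saved transformation file; "demo.ktr" is a placeholder path.
public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                    // initialize the Kettle runtime
        TransMeta meta = new TransMeta("demo.ktr");  // load the transformation definition
        Trans trans = new Trans(meta);
        trans.execute(null);                         // start all steps (no arguments)
        trans.waitUntilFinished();                   // block until the transformation ends
        if (trans.getErrors() > 0) {
            System.err.println("The transformation finished with errors.");
        }
    }
}
```

Jobs can be run in a similar way through the corresponding job classes of the same API.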
Now let's look at a Kettle transformation file: the simplest ETL example (on Windows).
This transformation extracts data from one table in an Oracle database and loads it into another table.
On the canvas, the Table Input step sits on the left and the Table Output step on the right, connected by a green hop line; both steps can be dragged in from the left-side step tree.
Double-click the Table Input step and enter the following information in the dialog that opens:
Step name: the name shown below the step icon.
Get SQL select statement: click this button and the Oracle tables are displayed in a tree view; double-click a table and a query for it is added to the blank SQL box.
Database connection: if no connection exists yet, create a new one; an existing connection can be edited. After you click Edit, the dialog shown on the right opens, with the following fields:
Connection name: a name for this connection; a common convention is to include the database IP address and instance name in it.
Connection type: the type of database you want to query, such as MySQL or Oracle.
Access method: the driver type, normally Native (JDBC).
Host name of the server: the IP address of the database server.
Database name: the database instance name.
Port number: the database port.
User name and password: entered in the fields shown below.
After filling in all the information, click Test to check whether the connection succeeds. Once the connection succeeds, click OK and you are returned to the dialog shown on the left above.
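Under the hood, the Test button does roughly the equivalent of opening a JDBC connection with the values you entered. A sketch in plain Java (the host, port, SID, user, and password below are made-up placeholders, and the Oracle JDBC driver must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Roughly what the Test button does: try to open a JDBC connection.
// Host, port, SID, user, and password are made-up placeholders.
public class TestOracleConnection {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@192.168.1.10:1521:orcl";
        try (Connection conn = DriverManager.getConnection(url, "scott", "tiger")) {
            System.out.println("Connection OK: "
                    + conn.getMetaData().getDatabaseProductVersion());
        }
    }
}
```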
The empty box below "SQL" is where you write the SQL statement that fetches the data you want (or use the "Get SQL select statement" button to generate it automatically). After writing it, click Preview (choose a small number of rows) to check that the data looks right. If the preview shows data, your Table Input step is configured.
Enable lazy conversion: when checked, data conversion is delayed, so data read from the database keeps its original character set instead of being forcibly converted to UTF-8 by default.
Replace variables in script: when checked, any variable in the SQL statement is replaced by its value from the environment, which is handy for date-stamped table names; for example, SELECT * FROM table_${MONTH} resolves to table_201407 when MONTH is set to 201407 (a sketch of supplying such a variable follows below).
Limit size: the default is 0, meaning no limit; any value greater than 0 caps the rows returned by Table Input at that number, no matter what the SQL says.
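As a sketch of the variable replacement just described: if the Table Input SQL is SELECT * FROM table_${MONTH} and "Replace variables in script" is checked, the variable can be supplied before the transformation runs, for example through the Kettle API (the variable name MONTH and the file name are hypothetical):

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

// Supplies the ${MONTH} variable referenced in a Table Input SQL statement
// such as: SELECT * FROM table_${MONTH}
public class RunMonthlyExtract {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta meta = new TransMeta("monthly_extract.ktr");  // placeholder path
        Trans trans = new Trans(meta);
        trans.setVariable("MONTH", "201407");  // table_${MONTH} -> table_201407
        trans.execute(null);
        trans.waitUntilFinished();
    }
}
```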
Now let's look at the table output:
Double-click the Table Output step to open its configuration page.
Step name: the name shown below the Table Output icon.
Database connection: as with Table Input, create a new connection if none exists, or edit an existing one.
Target schema: the schema (user) that owns the target table.
Target table: the target table or view; you can type it in yourself or pick it with the Browse button.
Commit size: the number of records committed at a time, which is also the upper limit on the size of a batch insert.
Truncate table: when checked, a truncate is performed on the target table before the insert.
Ignore insert errors: a non-batch-insert option; when rows are inserted one at a time, the rows after an error can still be inserted normally.
Use batch update for inserts: check this to insert in batches; otherwise rows are inserted one by one.
After you click Edit next to the database connection, enter the TNS information, user name, and password of the target database just as for Table Input, click Test to verify the connection, and click OK.
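To see what "Commit size" and batch inserts mean at the JDBC level, here is a rough sketch (the connection details and the table and column names are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Rough JDBC-level sketch of "commit size" plus batch inserts.
// Connection details and table/column names are made-up placeholders.
public class BatchLoadSketch {
    public static void main(String[] args) throws Exception {
        int commitSize = 1000;  // corresponds to the Commit size field
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dsthost:1521:dwh", "user", "pass")) {
            conn.setAutoCommit(false);  // commit manually, every commitSize rows
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO target_table (id) VALUES (?)")) {
                int pending = 0;
                for (int id = 1; id <= 10_000; id++) {
                    ps.setInt(1, id);
                    ps.addBatch();            // "use batch update for inserts"
                    if (++pending == commitSize) {
                        ps.executeBatch();    // send the batch to the database
                        conn.commit();        // one commit per commitSize rows
                        pending = 0;
                    }
                }
                ps.executeBatch();            // flush the final partial batch
                conn.commit();
            }
        }
    }
}
```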
Return to the main interface and click OK; the transformation is now configured. Click Run:
The log shows that the transformation has finished, along with the time consumed, the speed (average rows per second), the total number of records inserted, and so on.