Keywords: data integration, ETL, big data preprocessing
Background: Users can be acquired on different platforms, and the same user may appear on more than one of them. For example, you may have registered on platform A first and on platform B later. The tables that store the data on each platform may also have different fields. The most representative case is the merger of Meituan and Dianping: the data of the two food delivery platforms had to be integrated before it could deliver greater commercial value. This is data integration.
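The situation above can be sketched in a few lines of Python. This is only an illustration: the field names (user_id, phone, nickname, uid, mobile, name) and the choice of the phone number as the matching key are assumptions, not something the platforms mentioned here are known to use.

```python
# Hypothetical user tables from two platforms with different field names.
platform_a = [
    {"user_id": 1, "phone": "13800000001", "nickname": "lee"},
    {"user_id": 2, "phone": "13800000002", "nickname": "wang"},
]
platform_b = [
    {"uid": "b-9", "mobile": "13800000002", "name": "Wang W."},
    {"uid": "b-10", "mobile": "13800000003", "name": "zhao"},
]


def normalize_a(row):
    """Map a platform A row onto the shared schema."""
    return {"phone": row["phone"], "name": row["nickname"], "sources": ["A"]}


def normalize_b(row):
    """Map a platform B row onto the shared schema."""
    return {"phone": row["mobile"], "name": row["name"], "sources": ["B"]}


def integrate(rows_a, rows_b):
    """Merge both platforms into one user list, keyed by phone number."""
    merged = {}
    normalized = [normalize_a(r) for r in rows_a] + [normalize_b(r) for r in rows_b]
    for row in normalized:
        existing = merged.get(row["phone"])
        if existing is None:
            merged[row["phone"]] = row
        else:
            # Same person on both platforms: keep the first name seen
            # and record every platform the user came from.
            existing["sources"].extend(row["sources"])
    return list(merged.values())


users = integrate(platform_a, platform_b)
for user in users:
    print(user)
```

A real integration would need a more careful matching rule, since one phone number is not always one person.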
Generally speaking, the work of a data engineer includes implementing data ETL and data mining algorithms. The algorithm side is easy to understand: it means finding the "gold" in the data warehouse through data mining algorithms.
ETL covers three processes: data extraction, transformation (also called conversion), and loading.
Extraction pulls data from existing data sources.
Transformation processes the original data, for example by joining table input 1 and table input 2 to form a new table.
Loading writes the transformed data to the destination.
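As a minimal sketch of these three processes, the following uses Python's built-in sqlite3 module with in-memory databases. The table names (users, orders, user_totals) and the sample rows are hypothetical; they stand in for "table input 1" and "table input 2".

```python
import sqlite3


def extract(conn):
    """Extract: read rows from the two existing source tables."""
    users = conn.execute("SELECT id, name FROM users").fetchall()
    orders = conn.execute("SELECT user_id, amount FROM orders").fetchall()
    return users, orders


def transform(users, orders):
    """Transform: join the two inputs into new rows of (name, total)."""
    names = dict(users)
    totals = {}
    for user_id, amount in orders:
        totals[user_id] = totals.get(user_id, 0) + amount
    return [(names[uid], total) for uid, total in sorted(totals.items()) if uid in names]


def load(conn, rows):
    """Load: write the transformed rows to the destination."""
    conn.execute("CREATE TABLE IF NOT EXISTS user_totals (name TEXT, total REAL)")
    conn.executemany("INSERT INTO user_totals VALUES (?, ?)", rows)
    conn.commit()


# Hypothetical source data.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'lee'), (2, 'wang');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")
target = sqlite3.connect(":memory:")

users, orders = extract(source)
load(target, transform(users, orders))
result = target.execute("SELECT name, total FROM user_totals ORDER BY name").fetchall()
print(result)
```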
Depending on the order and location of the transformation, data integration can be divided into two architectures: ETL and ELT.
The ETL process is Extract-Transform-Load. After the data is extracted from the source, the transformation is performed first, and the result of the transformation is then written to the destination.
The ELT process is Extract-Load-Transform. After extraction, the results are first written to the destination, and the transformation step is then completed using the aggregation and analysis capabilities of the database or an external computing framework such as Spark.
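To make the difference in order concrete, here is a small ELT sketch, again using sqlite3 as a stand-in for the destination database. The raw rows and the table names (raw_orders, user_totals) are assumptions for illustration. The raw data is loaded unchanged, and the transformation is then run as SQL inside the destination.

```python
import sqlite3

# Hypothetical rows already extracted from a source system.
raw_rows = [(1, "lee", 10.0), (1, "lee", 5.0), (2, "wang", 7.5)]

warehouse = sqlite3.connect(":memory:")

# Extract + Load: write the raw rows to the destination unchanged.
warehouse.execute("CREATE TABLE raw_orders (user_id INTEGER, name TEXT, amount REAL)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: let the destination database do the aggregation.
warehouse.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, name, SUM(amount) AS total
    FROM raw_orders
    GROUP BY user_id, name
""")
warehouse.commit()

totals = warehouse.execute("SELECT name, total FROM user_totals ORDER BY name").fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(totals, raw_count)
```

Note that the raw table is still available in the destination after the transformation, so the original data can be queried again later.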
ETL is currently the mainstream architecture for data integration, but ELT is expected to be adopted by more and more teams in the future.
The biggest difference between ELT and ETL is that ELT is "heavy on extraction and loading, light on transformation", so a lighter-weight solution can be used to build the data integration platform. On the one hand, this saves time. On the other hand, ELT gives BI analysts unrestricted access to the entire original data, which makes BI more flexible in data processing and better able to support the business.
What are the ETL tools?
Open source software: Kettle, Talend, Apatar, Scriptella, DataX, Sqoop, etc.
Using Kettle
Kettle is an open source ETL tool written in pure Java. It runs on Windows and Linux and can be used without installation.
Before using Kettle, you also need to install database software and the Java Runtime Environment (JRE).
Kettle migrates data between databases through a visual interface. It works with two kinds of scripts: Transformation and Job.
Transformation: A container that defines data operations, where a data operation is the process from data input to data output. In day-to-day work, a task is broken down into jobs, and each job is broken down into multiple transformations.
Job: A larger container than a Transformation. It organizes transformations to complete a piece of work.
A Transformation can be divided into three stages: input, intermediate transformation, and output.
There are two main concepts in a Transformation: Step and Hop.
Step: The smallest unit of a Transformation. A transformation might, for instance, contain four steps: table input, value mapping, removing duplicate records, and table output.
Hop: A line that connects Steps within a Transformation. It represents the flow of data.
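Kettle transformations are built in its visual designer, not in code, but the Step and Hop idea can be approximated in plain Python. The sketch below is an analogy only and does not use any Kettle API: each function plays the role of a Step, and passing one step's output stream to the next plays the role of a Hop. The function names and sample rows are hypothetical.

```python
def table_input(rows):
    """Step: emit rows from the source (a list stands in for a database table)."""
    for row in rows:
        yield dict(row)  # copy so later steps do not modify the source


def value_mapping(rows, field, mapping):
    """Step: replace coded values in one field with their mapped values."""
    for row in rows:
        row[field] = mapping.get(row[field], row[field])
        yield row


def remove_duplicates(rows, key_fields):
    """Step: drop rows whose key fields have already been seen."""
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            yield row


def table_output(rows, target):
    """Step: write rows to the destination (a list stands in for a table)."""
    for row in rows:
        target.append(row)
    return target


source_rows = [
    {"name": "lee", "gender": "M"},
    {"name": "wang", "gender": "F"},
    {"name": "lee", "gender": "M"},
]

# Each assignment below acts as a hop: the output of one step flows into the next.
stream = table_input(source_rows)
stream = value_mapping(stream, "gender", {"M": "male", "F": "female"})
stream = remove_duplicates(stream, ["name", "gender"])
output_table = table_output(stream, [])
print(output_table)
```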
How to create a Job: A complete task is built by connecting the transformations you have created with a Job. A Job involves two concepts: Job Entry and Hop.
Job Entry: The internal execution unit of a Job. Each Job Entry performs a specific task, such as calling a transformation or sending an email.
Hop: The line connecting Job Entries. It can also specify whether the next entry is executed conditionally.
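The relationship between Job Entries and conditional Hops can also be sketched in plain Python. This is a simplified analogy, not Kettle's own API or file format: the entry names, the run_job helper, and the condition labels ("always", "success", "failure") are all hypothetical.

```python
def run_job(entries, hops, start):
    """Run job entries from `start`, following the first hop whose condition matches.

    entries: dict mapping entry name -> callable returning True (success) or False (failure)
    hops: list of (from_entry, to_entry, condition) tuples, where condition is
          "always", "success", or "failure"
    """
    executed = []
    current = start
    while current is not None:
        succeeded = entries[current]()
        executed.append(current)
        next_entry = None
        for source, target, condition in hops:
            if source != current:
                continue
            if (condition == "always"
                    or (condition == "success" and succeeded)
                    or (condition == "failure" and not succeeded)):
                next_entry = target
                break
        current = next_entry
    return executed


entries = {
    "start": lambda: True,
    "call_transformation": lambda: False,  # simulate a transformation that fails
    "send_email": lambda: True,
    "finish": lambda: True,
}
hops = [
    ("start", "call_transformation", "always"),
    ("call_transformation", "finish", "success"),
    ("call_transformation", "send_email", "failure"),
]

executed = run_job(entries, hops, "start")
print(executed)
```

Because the simulated transformation fails, the job follows the "failure" hop and runs the email entry instead of finishing normally.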