The kettle of database data formatting Spoon

Source: Internet
Author: User

Objective

With the growing variety of databases and the increasingly complex format of database backups, data formatting has always been a commonplace issue. According to the library backup file format so many, both SQL, and Bak, as well as TXT and so on. There are many kinds of database, Mysql,oracle,sql server and so on, how to manage these databases? Yesterday leaked the access format of the database, today leaked the Excel format of the database, tomorrow, the release of the database SQL format. To format so many kinds of database, pressure mountain Ah! Why is it so complicated to build a local database?

Here comes the kettle. First of all, we must say Pentaho this enterprise. Pentaho is mainly committed to the analysis, collation and management of big data, and the tools developed by this company are open source! Yes, it's open source. Anyone can view the source code for this project and make changes and research on it. The company has developed many management tools or frameworks, and the most famous is kettle. Kettle is a foreign open source ETL tool, written in Java, can be run on Windows, Linux, UNIX, data extraction is efficient and stable. ELT is all called extraction, transformation Loading, wherein the text is interpreted as extracting, converting and loading. Kettle This tool contains SPOON,PAN,CHEF,ENCR and kitchen so five basic set up.

SPOON allows you to design the ETL conversion process (transformation) through a graphical interface.

PAN allows you to run ETL transformations designed by spoon in batches (for example, using a time scheduler). Pan is a background-executed program with no graphical interface.

CHEF allows you to create a task (job). Tasks are more conducive to automating the complex work of updating data warehouses by allowing each transformation, task, script, and so on. Tasks by allowing each transition, task, script, and so on. The task will be checked to see if it is running correctly.

KITCHEN allows you to bulk use a task designed by chef (for example, using a time scheduler). Kitchen is also a program that runs in the background.

ENCR This script is used to encrypt the password that is used to connect the database password with the cluster

Today we are mainly talking about the spoon, here will do a basic explanation, in order to achieve the role.

1. Basic Installation

Since kettle is written by Java code, the first thing to run kettle is to download and install the JDK and set up the environment variables. The kettle is the "portal".

After the download is complete, you will get a RAR compression package, please unzip the compressed package.

After the decompression is done in the file directory, you can see a few bat files and sh files, here is the way the Kettle tool opens.

Here is a common sense of popularity.

Users of Windows systems, open kettle with a bat file

Users of Linux system please open kettle with sh file

The main point here is a basic explanation of kettle spoon. Wait a few seconds after running Spoon.bat or spoon.sh to see the basic interface of Kettle spoon.

2. Basic modeling New Job

Before we start on data management, we need to create a new job. Click on the file and create a job in the new job option.

In the core object here, we can see a few basic functions and modules in kettle spoon.

Data entry

There are many backup formats for databases, such as Bak,sql,txt,csv and so on. Here need to have a basic management of these files, here is the database input module, we have to first set up their input module. The following figure is a basic module of several inputs on the kettle spoon.

Here's a basic tutorial on the input module, let's take a look at how to input the database backup file. You can edit the module by clicking the left mouse button and dragging it into the job box.

Suppose I'm going to put a leaked database into a txt format, the first thing you should look at is the delimiter and qualifier for the field.

Then drag a text file to enter the module into the job.

Double-click the module to set it up.

Then click Browse, select the database file you want to import, and click Add.

Fill in the delimiters at the content by the delimiter that already knows the database field. Here you can also set the text display format, qualifiers, and so on, in addition to the delimiter settings.

Then you get the table table. Within this setting, you can also edit the table names of table tables.

After the setup is complete, we need to preview the entire table to see if our settings are correct.

Of course, this is just a very simple txt unencrypted backup database, Kettlespoon also provides data input formats such as ACCESSS,SQL,CSV, and even provides database input to the database. For example, if you want to enter data from an Oracle database into a MySQL database, you may need to back up the data and then import it into the MySQL database. But in the kettle spoon, you can transfer directly in the data stream, eliminating the intermediate step directly. This feature can be used in the table input module.

Data processing

Through the above steps, the data has been successfully entered, but we also need to organize the data. such as adding sequences, adding fields, data encryption, and so on. Let's go on to give an example here. Because the local database has a huge amount of data, you need to index the data. The key to index data is the ID, each data needs a different ID to index, and the value of the ID can not be a normal int type, must be the bigint type. There is a need for a large amount of data for a standardized collation and processing.

Assuming that I have 199 data in my database, I need to redo an ID sequence for each new data, with an ID starting value of 200 and an additional two numbers for the ID value between each piece of data. Although the feeling is very complicated, but kettle spoon can easily solve this problem.

Drag an Add sequence module from the conversion to the job, then left-click the text file input module, and hold down the SHIFT key while dragging the mouse in any direction, this time you will see a line. Link it to the Add sequence module.

Double-click the Add Sequence module to enter the Settings screen, and set the field name to newid in the name of the value. Since the starting value of ID is 200, the starting value is set to 200. Because the ID between each data is incremented by two numbers, the growth is set to 2 according to this one. Finally click OK.

The setting for such an increment sequence is complete. Of course, this example is just a very simple function, there are a lot of data processing functions waiting for everyone to try.

Data output

The data input also has, the processing step also has, then the data is output. There may be some requirements for the data output, such as I don't want some fields, or I want to format the data. The kettle here can help you. Kettle even provides a variety of data output modules, you can check its output menu.

Let's continue with the example we've just given. Suppose the new data I just just the newly generated id,usename and password fields, the other I do not, and the data to output to Excel table, and the font of the data is Arial, size is 10.

Then we can build an Excel output module, then the module needs to be connected with the previous Add sequence module.

Then double-click the Excel module to set it up. Browse there to select the Save path for Excel.

Then set the font format for Excel here in the format and so on.

Then in the field select Get field First, then delete the unwanted fields, and sort the fields.

Then click Start in the top left corner to start the output of the data

is the collated data. You can see that the ID of the data starts at 200 and increases by two numbers per piece of data. Both field requirements and font formatting requirements have been met. Because of my own Excel Setup problem, all the numbers behind will have two small data points, this does not affect.

Summarize

Actually just showed a very basic kettle data modeling, and more complex, I show you

Multiple backup databases entered in a single Excel file

Most data is synced online and encrypted to a remote server

Single file adds constants and sequences and synchronizes to multiple databases simultaneously

Kettle is very strong, and this article is only about kettle spoon, there are pen,chef and so on. This article is mainly to make a contribution, so that everyone can understand and preliminary use of kettle this tool. Kettle not only can do basic data warehousing and collation, you can even write your own data management module, script, and used in Kittle, such as data matching module, data distribution module.

The kettle of database data formatting Spoon

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.