Kettle Timed Execution (ETL Tool)

Source: Internet
Author: User

1. Kettle can be used across platforms.
For example, on AIX (IBM's commercial UNIX operating system; the same steps apply on Linux and other UNIX systems), run Kettle as follows:
1) Change to the Kettle deployment directory.
2) Run chmod +x *.sh to add execute permission to all shell scripts.
3) To execute a transformation, run ./pan.sh -file=?.ktr -debug=debug -log=log.log from the Kettle directory,
where -file is the path of the transformation file to run, -debug is the level of the log output, and -log is the path of the log output.
4) Likewise, to execute a job, replace ./pan.sh with ./kitchen.sh and point -file at the job file; the other parts are unchanged.
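
The steps above can be sketched as a small helper that assembles the command line for either script. This is only an illustration: the option names are copied from the steps above and may differ in your Kettle version, and build_kettle_command and the file names example.ktr and example.kjb are hypothetical.

```python
def build_kettle_command(script, file_path, level="debug", log_path="log.log"):
    """Return the argument list for a Pan or Kitchen run.

    The option names (-file, -debug, -log) are taken from the steps above;
    check them against the help output of your own Kettle version.
    """
    return [
        script,
        f"-file={file_path}",
        f"-debug={level}",
        f"-log={log_path}",
    ]


# Hypothetical file names, for illustration only.
pan_cmd = build_kettle_command("./pan.sh", "example.ktr")
kitchen_cmd = build_kettle_command("./kitchen.sh", "example.kjb")
print(" ".join(pan_cmd))
print(" ".join(kitchen_cmd))
```

Only the script name and the file change between a transformation and a job, which matches step 4.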

2. Using Kettle environment variables.
In a transformation, Core Objects --> Job --> Set Variables lets you set environment variables. This is very helpful for converting between absolute and relative paths, and Kettle's cross-platform use relies on it to a large extent.
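
To illustrate why a variable helps with paths, here is a minimal sketch in Python rather than in Kettle itself. Kettle references variables as ${NAME}, and string.Template happens to use the same notation. The variable name BASE_DIR, the directories, and the file name are assumptions made up for the example.

```python
from string import Template

# Hypothetical path stored in a transformation; ${BASE_DIR} stands for a
# variable that is set once per environment (for example with Set Variables).
path_template = Template("${BASE_DIR}/input/orders.csv")


def resolve(base_dir):
    """Substitute the variable to get the path used on one platform."""
    return path_template.substitute(BASE_DIR=base_dir)


windows_path = resolve("C:/etl")
unix_path = resolve("/opt/etl")
print(windows_path)  # C:/etl/input/orders.csv
print(unix_path)     # /opt/etl/input/orders.csv
```

The stored path never changes; only the value of the variable differs between Windows and UNIX.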

3. Using other functions.
Other functions include database stored procedure calls, stream lookup, value mapping, row aggregation, and so on.

4. Kettle's timing function.
The Start entry of a job has a timer function that can schedule runs daily, weekly, or at other intervals, which is very helpful for periodic ETL.

A. When you log in using a repository, the default username and password are admin/admin.

B. When a job is stored in a repository (a repository is usually backed by a database), run it with Kitchen.bat using the following command line:
Kitchen.bat /rep kettle /user admin /pass admin /job <job name>

C. When the job is stored in the file system rather than in a repository, run it with Kitchen.bat using the following command line:
Kitchen.bat /norep /file USER-TRANSFER-JOB.KJB

D. Once you can run a job from the command line, you can use the Windows Task Scheduler or a Linux scheduler such as cron to run it at regular intervals.

E. If the following error occurs:

Unexpected error during transformation metadata load
No Repository defined!

check your command line against the steps above to resolve it.
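
For item D, a scheduler usually needs to know whether the run succeeded. The following is a minimal sketch of a wrapper that a scheduled task could call. It assumes the usual convention that exit code 0 means success; run_job is a hypothetical helper, and the demonstration uses a stand-in command so that it runs without Kettle installed.

```python
import subprocess
import sys


def run_job(command):
    """Run one command line and report whether it exited with code 0."""
    result = subprocess.run(command)
    return result.returncode == 0


# Demonstration with a stand-in command. A real call would pass the Kitchen
# command line from item B or C instead, for example:
#   ["Kitchen.bat", "/norep", "/file", "USER-TRANSFER-JOB.KJB"]
stand_in = [sys.executable, "-c", "raise SystemExit(0)"]
print(run_job(stand_in))
```

The scheduler entry then only has to start this wrapper at the desired interval and can alert on a False result.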



5. Kettle experience: logging.
Kettle has a bug in its log handling. As readers of my previous post may have seen, once the log grows past roughly 49 MB (not exactly 50 MB, and not exactly 49 MB), Kettle stops automatically. I did not find a corresponding setting or limit in the source code, and the cause is still unknown. Because nothing is written to the log at that point, the problem is hard to trace.
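
Since the cause is unknown, one cautious option is simply to watch the log size from outside Kettle. The sketch below is not a fix and not a Kettle feature. The threshold is the approximate figure reported above, and is_near_limit, log_is_near_limit, and the 0.9 margin are hypothetical choices.

```python
import os

# Approximate figure from the observation above; the exact limit is unknown.
SIZE_LIMIT_BYTES = 49 * 1024 * 1024


def is_near_limit(size_bytes, limit=SIZE_LIMIT_BYTES, margin=0.9):
    """Return True when a size has reached `margin` of the limit."""
    return size_bytes >= limit * margin


def log_is_near_limit(path, limit=SIZE_LIMIT_BYTES, margin=0.9):
    """Check a log file on disk; a missing file counts as not near the limit."""
    if not os.path.exists(path):
        return False
    return is_near_limit(os.path.getsize(path), limit, margin)


print(is_near_limit(48 * 1024 * 1024))
```

A scheduled check like this could trigger archiving of log.log before the next run starts.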

6. Improving Kettle efficiency.
As an ETL tool, Kettle cannot avoid efficiency problems, and you will run into them when the input is a large data source. There are several remedies:
1) Create indexes on the database side. Indexing the fields that are queried on the database side can greatly improve query efficiency. In my case, without an index the query averaged 4 records per second; after creating the index it handled 1,300 records per second.
2) Choose between database lookup and stream lookup according to the situation. A database lookup runs one query against the target table for every input record. A stream lookup reads the target table into memory and then looks up each input record there. So when the input is large and the lookup table is small (a few hundred records), use a stream lookup: with the target table in memory, lookups are much faster (memory reads and writes are hundreds of times faster than hard disk, and the database's own overhead widens the gap further). Conversely, when the target table is large, a database lookup is still recommended; otherwise hundreds of MB of memory are used up at once, which is alarming.
3) Use JavaScript steps with care. JavaScript itself is not very efficient, so remember that the script runs once for every record and consider the time each run takes.
4) Watch the database commit size. Committing after every record and committing after every 100 records have very different effects on efficiency.
5) Watch the form of the SQL statement in the table input. Some people like to write all the joins into the table input, selecting from many tables or nesting IN within IN; this runs into the problem described in 2) and needs attention.
6) Watch the log output. For example, if you choose a database update and the logging level is debug, the background writes logs furiously, which greatly reduces speed.
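
Points 1), 2), and 4) can be illustrated outside Kettle with a small sketch using Python's built-in sqlite3 module. It shows the ideas only, not Kettle's implementation, and it makes no timing claims. The table names, data, and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region (code TEXT, name TEXT)")
conn.execute("CREATE TABLE target (name TEXT)")
conn.executemany("INSERT INTO region VALUES (?, ?)",
                 [("N", "North"), ("S", "South")])
# Point 1: index the column that the lookups filter on.
conn.execute("CREATE INDEX idx_region_code ON region (code)")
conn.commit()

input_rows = ["N", "S", "N", "N"]  # stands in for a large input stream


def database_lookup(rows):
    """Point 2, database lookup: one query against the table per input row."""
    out = []
    for code in rows:
        hit = conn.execute(
            "SELECT name FROM region WHERE code = ?", (code,)).fetchone()
        out.append(hit[0] if hit else None)
    return out


def stream_lookup(rows):
    """Point 2, stream lookup: read the small table into memory once."""
    cache = dict(conn.execute("SELECT code, name FROM region"))
    return [cache.get(code) for code in rows]


def insert_in_batches(rows, commit_size):
    """Point 4: commit once per `commit_size` rows; return the commit count."""
    commits = 0
    for i, name in enumerate(rows, start=1):
        conn.execute("INSERT INTO target VALUES (?)", (name,))
        if i % commit_size == 0:
            conn.commit()
            commits += 1
    if len(rows) % commit_size:
        conn.commit()  # flush the final partial batch
        commits += 1
    return commits


names = stream_lookup(input_rows)
print(names == database_lookup(input_rows))
print(insert_in_batches(names, commit_size=100))
```

Both lookups return the same names; the difference is that the stream version issues one query in total while the database version issues one per input row. A larger commit size likewise reduces the number of commits for the same rows.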

7. Common debugging mistakes.
Kettle provides a number of debugging facilities, but some common mistakes are still worth avoiding.
1) Path problems. My most common problem is that debugging succeeds under Windows but deployment to UNIX fails because I forgot to change the Windows paths to UNIX paths.
2) Choosing the wrong database output mode on the output side. Three database output methods are provided: table output, insert/update, and update, each with advantages and disadvantages. Table output is a pure insert, so it raises an error if there is duplicate data. Insert/update and update write a lot of log output in the background when they update data, which can make them inefficient.
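
The difference between a pure insert and insert/update in point 2) can be sketched with sqlite3. This mimics the behavior described above and is not Kettle's own code. The table, the rows, and the function names table_output and insert_update are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")


def table_output(row):
    """Pure insert: raises sqlite3.IntegrityError on a duplicate key."""
    conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", row)


def insert_update(row):
    """Update the row if the key exists, otherwise insert it."""
    cur = conn.execute(
        "UPDATE customer SET name = ? WHERE id = ?", (row[1], row[0]))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", row)


table_output((1, "Ann"))
try:
    table_output((1, "Ann B."))  # same key again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

insert_update((1, "Ann B."))  # key exists: the row is updated
insert_update((2, "Bob"))     # key missing: the row is inserted
rows = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
print(duplicate_rejected, rows)
```

The extra lookup per row is the price of insert/update, which is why a pure insert is preferable when the data is known to contain no duplicates.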
