Kettle Series 3: Optimizing Very Slow Database Repository Reads

Source: Internet
Author: User

Environment: Windows 7; JVM memory set to 14 GB; Kettle 5.1, later upgraded to 5.4; Oracle as the repository.

Problem background: We manage Kettle job runs through a web page. The page is only a management interface; stopping the web project does not affect running jobs, which are executed by a background program. As the number of jobs grew to three or four hundred, jobs started running at an unacceptably slow rate.

Solution 1:

Investigating the problem, testing showed that once a (scheduled) job has run, the job itself is no longer read from the repository, but every transformation called by the job is re-read from the repository on each run. I found the class org.pentaho.di.job.entries.trans.JobEntryTrans, the job entry that represents a transformation. Debugging through the code confirms that it really does reload the transformation from the repository every time it runs, because Kettle clones the job entry on each run and then reloads it at runtime (stepping through the code makes this clearer). Why it clones I do not know, and I did not dare to tamper with that code; presumably some stateful attributes are involved.

The idea was to attack the problem at its specific source. My first thought was to optimize the transformation-reading process itself, but instead of attempting that, I took a more targeted and simpler approach: I built a static map inside JobEntryTrans to cache the transformations that have been read. The lifetime of this cache roughly matches that of the corresponding job, because every time the JobEntryTrans entry is read from the database, the cached transformation it references is cleared; if more than one job references the same transformation, the cache entry's lifetime is shorter than the job's.

This solution requires very little code change. When reading a transformation, first check the cache; on a miss, read the repository and cache the result, so the next read uses the cache directly. In addition, the repository-loading method of JobEntryTrans is extended to clear the cache entry for the corresponding transformation.
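The caching pattern described above can be sketched generically. This is an illustration of the idea only, not actual Kettle code: the class `TransCache` and its method names are hypothetical, standing in for the static map added to JobEntryTrans.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/**
 * Minimal sketch of the caching pattern (illustrative, not Kettle API):
 * a map keyed by "<directory>/<name>" that serves cached transformation
 * metadata, plus an eviction hook called whenever the job entry is
 * re-read from the repository.
 */
class TransCache<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();

    // Mirrors getTransMeta: return cached value if present, otherwise load and cache.
    T get(String key, Supplier<T> loader) {
        return cache.computeIfAbsent(key, k -> loader.get());
    }

    // Mirrors loadRep: evict the entry so the next run reloads fresh metadata.
    void evict(String key) {
        cache.remove(key);
    }

    boolean contains(String key) {
        return cache.containsKey(key);
    }
}
```

In JobEntryTrans this map is static, so all job entries in the JVM share one cache, which is why eviction on every repository re-read is needed to keep edits visible.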

Cache what, exactly? In what form? To minimize the effect on the original logic, I first cached the XML, i.e. the result of TransMeta's getXML() method, and parsed that XML on subsequent reads. This worked fine in the test environment, but in production there were always transformations that misbehaved and ran abnormally, and production is hard to debug. So I switched to caching the TransMeta object itself and reusing it directly each time, which is simpler code anyway. What exactly that object holds I did not have the energy to analyze carefully, so I tried it with a let's-see attitude. Testing showed very good results: the problems seen with cached XML were gone, and after a period of operation no major issues appeared, apart from the logging looking slightly off. The program is still in use, and this has basically not affected Kettle's ability to do its work.

The code change is not large, so I just paste it here. It modifies two methods of the JobEntryTrans class in Kettle 5.4:

  // Load the job entry from the repository
  public void loadRep( Repository rep, IMetaStore metaStore, ObjectId id_jobentry,
      List<DatabaseMeta> databases, List<SlaveServer> slaveServers ) throws KettleException {
    try {
      // ... the source code here is identical to the official version ...
      passingAllParameters = rep.getJobEntryAttributeBoolean( id_jobentry, "pass_all_parameters", true );
      if ( transMetaMap.containsKey( getDirectory() + "/" + getName() ) ) {
        logBasic( "The transformation has been cached, removing the cache entry now: " + getDirectory() + "/" + getName() );
        transMetaMap.remove( getDirectory() + "/" + getName() );
      }
    } catch ( KettleDatabaseException dbe ) {
      throw new KettleException( "Unable to load job entry of type 'trans' from the repository for id_jobentry=" + id_jobentry, dbe );
    }
  }

  public TransMeta getTransMeta( Repository rep, IMetaStore metaStore, VariableSpace space ) throws KettleException {
    try {
      TransMeta transMeta = null;
      switch ( specificationMethod ) {
        case FILENAME:
          long start = new Date().getTime();
          String filename = space.environmentSubstitute( getFilename() );
          logBasic( "Loading transformation from XML file [" + filename + "]" );
          transMeta = new TransMeta( filename, metaStore, null, true, this, null );
          log.logBasic( transMeta.getName() + ", time spent reading the transformation from file: " + ( new Date().getTime() - start ) );
          break;
        case REPOSITORY_BY_NAME:
          if ( transMetaMap.containsKey( getDirectory() + "/" + getName() ) ) {
            logBasic( "The transformation has been cached, using the cache directly: " + getDirectory() + "/" + getName() );
            transMeta = transMetaMap.get( getDirectory() + "/" + getName() );
          } else {
            String transname = space.environmentSubstitute( getTransname() );
            String realDirectory = space.environmentSubstitute( getDirectory() );
            logBasic( BaseMessages.getString( PKG, "JobTrans.Log.LoadingTransRepDirec", transname, realDirectory ) );
            if ( rep != null ) {
              //
              // It only makes sense to try to load from the repository when the
              // repository is also filled in.
              //
              // It reads the last revision from the repository.
              //
              RepositoryDirectoryInterface repositoryDirectory = rep.findDirectory( realDirectory );
              transMeta = rep.loadTransformation( transname, repositoryDirectory, null, true, null );
              transMetaMap.put( getDirectory() + "/" + getName(), transMeta );
              logBasic( "Read the transformation from the repository and cached it: " + getDirectory() + "/" + getName() );
            } else {
              throw new KettleException( BaseMessages.getString( PKG, "JobTrans.Exception.NoRepDefined" ) );
            }
          }
          break;
        case REPOSITORY_BY_REFERENCE:
          if ( transObjectId == null ) {
            throw new KettleException( BaseMessages.getString( PKG, "JobTrans.Exception.ReferencedTransformationIdIsNull" ) );
          }
          if ( rep != null ) {
            // Load the last revision
            transMeta = rep.loadTransformation( transObjectId, null );
          }
          break;
        default:
          throw new KettleException( "The specified object location specification method '" + specificationMethod + "' is not yet supported in this job entry." );
      }
      if ( transMeta != null ) {
        // Copy parent variables to this loaded variable space.
        transMeta.copyVariablesFrom( this );
        // Pass repository and metastore references
        transMeta.setRepository( rep );
        transMeta.setMetaStore( metaStore );
      }
      return transMeta;
    } catch ( Exception e ) {
      throw new KettleException( BaseMessages.getString( PKG, "JobTrans.Exception.MetaDataLoad" ), e );
    }
  }

Summary of Solution 1:

1. This change solves the repeated loading of transformations: after the first load they are cached, and there is a mechanism to clear the cache and reload. The effect in practice: the first run is still very slow, but after that jobs feel like they fly. Runs that previously took about 10 minutes even with no data to process now take only a few seconds, which shows that this repeated loading is the dominant reason Kettle ran slowly here, even though we have not solved it perfectly.

2. The first run after startup is still slow, and the logging may still have minor issues. For latency-sensitive requirements this may not completely solve the problem: after restarting the daemon, job delays can still reach an hour. Of course, that only happens at restart; judge whether your business can accept it.

Solution 2:

The high latency caused by restarting the daemon under Solution 1 was still intolerable for our business, so we started to consider other approaches.

From further attempts it became clear that optimizing Kettle's own process of reading transformations from the database repository was not feasible: the repository-reading path is complex and involves a lot of logic. (I read the relevant code closely and timed each key operation; after analyzing where the time goes, it is hard to optimize.) Testing also showed that reading a transformation from a file repository is nearly 100 times faster than from a database repository, so we decided on the following approach:
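The per-operation timing mentioned above can be done with a tiny harness like the following. This is illustrative only; `Timing.timeMillis` is not Kettle API, just a helper for wrapping a call such as `rep.loadTransformation(...)` and logging how long it took, in the same spirit as the `new Date().getTime()` measurement in the pasted code.

```java
/**
 * Minimal timing helper of the kind used to measure each key
 * repository operation (illustrative, not Kettle API).
 */
class Timing {
    static long timeMillis(Runnable op) {
        long start = System.nanoTime();
        op.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Wrapping each suspect call this way is how we established that file-repository reads were far faster than database-repository reads in our environment.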

1. Jobs are still created and edited against the database repository; nothing changes there.

2. The background program also uses the database repository when fetching job information, but at the final step, when the job is actually run, it fetches the job from the file repository and runs that copy. This requires that when a job starts, the two repositories hold the same job; jobs are matched by name and path, which is how we confirm they are the same job. The job that actually runs is the one in the file repository.

3. This requires synchronizing the file repository with the database repository:

1) Manually export the database repository into the designated file repository.

2) The page controls job start and stop. Only when the page requests a job start does the background automatically synchronize that job's information to the file repository (still to be implemented); nothing else triggers synchronization, not even a daemon restart. This way, restarting the daemon no longer leaves jobs sitting idle for tens of minutes.

3) When we modify a job and want the change to take effect, we stop the job on the page and start it again, which triggers the synchronization.
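The synchronization rule in the steps above can be sketched as follows. This is a schematic illustration, not Kettle code: `RepoSync` and its fields are hypothetical, with plain maps standing in for the two repositories, and jobs keyed by "&lt;path&gt;/&lt;name&gt;" as described in point 2.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the sync rule (illustrative, not Kettle API): the database
 * repository is the source of truth for editing; a job is copied into
 * the file repository only when the page requests a start, and the
 * runner always executes the file-repository copy.
 */
class RepoSync {
    final Map<String, String> dbRepo = new HashMap<>();   // "<path>/<name>" -> job definition
    final Map<String, String> fileRepo = new HashMap<>(); // "<path>/<name>" -> job definition

    // Called only from the page's "start job" handler; a daemon
    // restart never calls this, so it never triggers a sync.
    void startJob(String key) {
        String definition = dbRepo.get(key);
        if (definition == null) {
            throw new IllegalStateException("Job not found in database repository: " + key);
        }
        fileRepo.put(key, definition); // sync happens here, and only here
        run(fileRepo.get(key));        // the running job is the file-repository copy
    }

    void run(String definition) {
        // hand off to the scheduler (omitted)
    }
}
```

Stopping and restarting a job on the page, as in step 3, therefore re-runs `startJob` and picks up the edited definition from the database repository.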

The above is only the idea for solving the problem; the implementation is approximate, and repository synchronization still has problems. This approach does not actually make database repository reads faster; solving that would require someone with a much deeper understanding of Kettle.

Results of this solution:

1. Fixed the high latency when the background program restarts (the main problem).

2. Indirectly relieved pressure on the database repository, so we can create and change jobs faster.

3. Gave us a backup of the repository.

4. Provided a scheme for synchronizing repositories. In fact, the file repository works much like the cache in Solution 1, only the caching is more thorough.

