Recently my advisor asked me to give a junior classmate a training session on Kettle, and I was instantly embarrassed: I had only just learned Kettle myself, had barely scratched the surface, and the last time I used it was a year ago...
There was no way around it, so I had to study it again. Fortunately, I had written a few documents earlier and kept a few lines of code, so I decided to put them on the blog, where they will be easier for me to look up later.
Data Cleansing:
Data cleansing refers to the discovery and correction of identifiable errors in a data file, including checking data consistency, handling invalid values and missing values, and so on.
As the name suggests, dirty data is either "washed away" (discarded) or "cleaned" (corrected).
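To make the discard-or-correct idea concrete, here is a minimal Python sketch. It is not taken from Kettle: the sample records, the field names, and the rules (a row without a name is discarded, an unparseable or out-of-range age is treated as missing, city names are trimmed and capitalized) are all assumptions chosen for illustration.

```python
# Hypothetical sample data; the fields and rules below are illustrative assumptions.
RAW_ROWS = [
    {"name": "Alice", "age": "30", "city": "Beijing"},
    {"name": "Bob", "age": "-5", "city": "Shanghai"},     # invalid value
    {"name": "Carol", "age": "", "city": " hangzhou "},   # missing value, messy text
    {"name": "", "age": "41", "city": "Shenzhen"},        # missing key field
]


def parse_age(text):
    """Return the age as an int, or None if it is missing or invalid."""
    try:
        age = int((text or "").strip())
    except ValueError:
        return None
    return age if 0 <= age <= 150 else None


def clean_row(row):
    """Return a corrected copy of the row, or None if it must be discarded."""
    name = (row.get("name") or "").strip()
    if not name:
        return None  # "washed away": a row without a name cannot be repaired
    return {
        "name": name,
        "age": parse_age(row.get("age")),                  # "cleaned": bad ages become None
        "city": (row.get("city") or "").strip().title(),   # normalize whitespace and case
    }


def clean_rows(rows):
    """Clean every row and keep only the ones that survived."""
    cleaned = (clean_row(row) for row in rows)
    return [row for row in cleaned if row is not None]


if __name__ == "__main__":
    for row in clean_rows(RAW_ROWS):
        print(row)
```

In a real project the rules would come from the data's own consistency requirements; the point here is only that each dirty record is either dropped or repaired.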
Like putting an elephant into a refrigerator, data cleansing can generally be divided into three steps:
ETL: Extract-Transform-Load. This describes the three stages of building a data warehouse: data extraction, data transformation, and data loading.
However, data cleansing is generally understood to refer only to the data transformation stage.
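The three stages can be sketched in a few lines of plain Python. This is only an illustration of the extract-transform-load idea using the standard library, not how Kettle implements it; the CSV content, the `payments` table, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or another database.
SOURCE_CSV = """id,amount,currency
1,10.50,usd
2,,usd
3,7.25,eur
"""


def extract(csv_text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows with a missing amount, convert types, normalize case."""
    result = []
    for row in rows:
        if not row["amount"]:
            continue  # the cleansing step: discard the incomplete row
        result.append((int(row["id"]), float(row["amount"]), row["currency"].upper()))
    return result


def load(rows, connection):
    """Load: write the transformed rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    connection.commit()


def run_etl(csv_text, connection):
    """Run the three stages in order."""
    load(transform(extract(csv_text)), connection)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(SOURCE_CSV, conn)
    print(conn.execute("SELECT * FROM payments ORDER BY id").fetchall())
    # prints [(1, 10.5, 'USD'), (3, 7.25, 'EUR')]
```

Note that all of the cleansing happens inside `transform`, which matches the narrower view that data cleansing is the transformation stage of ETL.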
Kettle:
An open-source ETL tool written in pure Java.
In Chinese, Kettle is simply called "water kettle": the project's lead programmer, Matt, wanted to pour all kinds of data into one kettle and have it flow out in a specified format.
Downloads and usage help are available at: http://community.pentaho.com/projects/data-integration/
If you are interested in studying the Kettle source code, you can download it here:
SVN address: svn://source.pentaho.org/svnkettleroot
Note: SVN only holds version 5.0 and earlier; after that the project migrated to GitHub.
Git address: https://github.com/pentaho/pentaho-kettle/
If you are interested in secondary development on top of Kettle (extending or customizing it), the online API documentation may be useful:
Online help manual: http://javadoc.pentaho.com/kettle/
Kettle Beginner's Study Notes 1: Preparatory Knowledge