Use Kettle to convert an XML document into a data table structure

Source: Internet
Author: User


Kettle offers two steps for reading and parsing XML files: Get data from XML and XML Input Stream (StAX). The Get data from XML step parses the document with DOM, which builds the whole document in memory and is therefore a poor choice for large files. The XML Input Stream (StAX) step parses the document as a stream, so it can handle large and complex files and load the data quickly. We therefore recommend the XML Input Stream (StAX) step.
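The difference between the two parsing styles can be illustrated outside Kettle. The following is a minimal sketch in Python, using the standard library's ElementTree as a stand-in; it is an analogy and not Kettle's own implementation. The names SAMPLE, read_dom and read_stream are made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (assumption: element names as in this article).
SAMPLE = b"""<?xml version="1.0"?>
<Timeseries>
  <Measurement year="2000" type="parmname" text="parmname">
    <Item name="A">8.5</Item>
    <Item name="B">9.8</Item>
  </Measurement>
  <Measurement year="2001" type="parmname" text="parmname">
    <Item name="A">12.2</Item>
    <Item name="B">9.4</Item>
  </Measurement>
</Timeseries>"""


def read_dom(source):
    """DOM style: the whole tree is built in memory before any row is produced."""
    root = ET.parse(source).getroot()
    return [
        (m.get("year"), i.get("name"), float(i.text))
        for m in root.iter("Measurement")
        for i in m.iter("Item")
    ]


def read_stream(source):
    """Streaming style: handle each Measurement once its end tag has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Measurement":
            for item in elem.iter("Item"):
                yield (elem.get("year"), item.get("name"), float(item.text))
            elem.clear()  # discard the finished element's children and attributes


dom_rows = read_dom(io.BytesIO(SAMPLE))
stream_rows = list(read_stream(io.BytesIO(SAMPLE)))
print(stream_rows)
```

Both functions return the same rows. The streaming version never needs the full content of the document at once, which is the property that makes the StAX-based step attractive for large files.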

The following example shows how to use this step. The content of the source XML file is as follows:

<?xml version="1.0"?>
<Timeseries>
  <Measurement year="2000" type="parmname" text="parmname">
    <!-- The value of the item named A in 2000 is 8.5 -->
    <Item name="A">8.5</Item>
    <Item name="B">9.8</Item>
  </Measurement>
  <Measurement year="2001" type="parmname" text="parmname">
    <Item name="A">12.2</Item>
    <Item name="B">9.4</Item>
  </Measurement>
  <Measurement year="2002" type="parmname" text="parmname">
    <Item name="A">11.1</Item>
    <Item name="B">7.2</Item>
  </Measurement>
</Timeseries>

 

The format of the data to be parsed into a data table is as follows:

 

The transformation uses the following steps:

  1. XML Input Stream (StAX): loads the XML document in streaming mode

  2. Filter rows: removes irrelevant document elements

  3. Switch / Case: separates level 1 (Measurement) from level 2 (Item)

  4. Row denormaliser: converts the multiple level 2 rows into one row (rows to columns)

  5. Merge Join: merges the child elements into the level 1 row (adds columns)
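The five steps above can be sketched in plain Python to show how the rows change shape along the way. This is only an illustration under several assumptions: the row fields (kind, level, mid, name, value) are hypothetical and are not the field names Kettle produces, and the target layout (one row per Measurement with the columns year, type, text, A and B) is assumed.

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby

SAMPLE = b"""<?xml version="1.0"?>
<Timeseries>
  <Measurement year="2000" type="parmname" text="parmname">
    <Item name="A">8.5</Item>
    <Item name="B">9.8</Item>
  </Measurement>
  <Measurement year="2001" type="parmname" text="parmname">
    <Item name="A">12.2</Item>
    <Item name="B">9.4</Item>
  </Measurement>
  <Measurement year="2002" type="parmname" text="parmname">
    <Item name="A">11.1</Item>
    <Item name="B">7.2</Item>
  </Measurement>
</Timeseries>"""


def xml_events(source):
    """Step 1: read the document as a stream and emit one flat row per event."""
    mid = 0  # hypothetical running id of the current Measurement
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "Measurement":
            mid += 1
            for key, val in elem.attrib.items():
                yield {"kind": "ATTRIBUTE", "level": 1, "mid": mid, "name": key, "value": val}
        elif event == "end" and elem.tag == "Item":
            yield {"kind": "ELEMENT", "level": 2, "mid": mid,
                   "name": elem.get("name"), "value": float(elem.text)}
        else:
            # Events that carry no data (root element, closing tags, and so on).
            yield {"kind": event.upper(), "level": 0, "mid": mid, "name": elem.tag, "value": None}


def denormalise(rows):
    """Step 4: turn the name/value rows of each Measurement into one wide row."""
    for mid, group in groupby(rows, key=lambda r: r["mid"]):
        wide = {"mid": mid}
        for row in group:
            wide[row["name"]] = row["value"]
        yield wide


def merge(parents, children):
    """Step 5: combine both streams; they are sorted by mid with one row per key."""
    for parent, child in zip(parents, children):
        assert parent["mid"] == child["mid"]
        row = {**parent, **child}
        del row["mid"]
        yield row


rows = [r for r in xml_events(io.BytesIO(SAMPLE)) if r["value"] is not None]  # step 2
level1 = [r for r in rows if r["level"] == 1]  # step 3: Measurement attributes
level2 = [r for r in rows if r["level"] == 2]  # step 3: Item values
table = list(merge(denormalise(level1), denormalise(level2)))
for row in table:
    print(row)
```

In this sketch the attribute rows of level 1 are pivoted with the same helper as the Item rows, so that each Measurement ends up as exactly one row on both sides of the merge.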

The example can be downloaded here.

 

Row denormaliser step

The easiest way to understand how this step works is to preview the results step by step (in version 5.x you can view the data stream directly).
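As a rough picture of what a rows-to-columns conversion does, the following Python sketch pivots name/value rows into one row per group. The function name and its parameters (group_field, key_field, value_field) are hypothetical and only loosely mirror the idea of a group field, a key field and a value field; the input rows use the values from the sample document.

```python
from itertools import groupby
from operator import itemgetter

# One input row per Item, already ordered by the grouping field.
rows = [
    {"year": "2000", "name": "A", "value": 8.5},
    {"year": "2000", "name": "B", "value": 9.8},
    {"year": "2001", "name": "A", "value": 12.2},
    {"year": "2001", "name": "B", "value": 9.4},
    {"year": "2002", "name": "A", "value": 11.1},
    {"year": "2002", "name": "B", "value": 7.2},
]


def denormalise(rows, group_field, key_field, value_field):
    """Collapse consecutive rows with the same group value into one wide row."""
    result = []
    for group_value, group in groupby(rows, key=itemgetter(group_field)):
        wide = {group_field: group_value}
        for row in group:
            # The content of the key field becomes the new column name.
            wide[row[key_field]] = row[value_field]
        result.append(wide)
    return result


wide_rows = denormalise(rows, "year", "name", "value")
for row in wide_rows:
    print(row)
```

Because groupby only groups consecutive rows, this sketch relies on the input being ordered by the grouping field.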

Merge Join step

This step merges two streams that come from different sources. The principle is the same as a join in SQL, except that Kettle works on row streams rather than tables. It is essential that both streams are sorted by the key used in the join. In this example the rows are already in order when the XML file is loaded, so no "Sort rows" step is needed.
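Why the sort order matters becomes clear from a small sketch of a merge join over two streams. The Python code below is an illustration only, not Kettle's implementation; it assumes an inner join with at most one row per key in each stream, and the field name id is made up.

```python
def merge_join(left, right, key):
    """Inner join of two streams that are both sorted ascending by `key`.

    Assumption: each key occurs at most once per stream.
    """
    left_it, right_it = iter(left), iter(right)
    l, r = next(left_it, None), next(right_it, None)
    while l is not None and r is not None:
        if l[key] == r[key]:
            yield {**l, **r}
            l, r = next(left_it, None), next(right_it, None)
        elif l[key] < r[key]:
            l = next(left_it, None)  # left side is behind, advance it
        else:
            r = next(right_it, None)  # right side is behind, advance it


measurements = [
    {"id": 1, "year": "2000"},
    {"id": 2, "year": "2001"},
    {"id": 3, "year": "2002"},
]
items = [
    {"id": 1, "A": 8.5, "B": 9.8},
    {"id": 2, "A": 12.2, "B": 9.4},
    {"id": 3, "A": 11.1, "B": 7.2},
]

joined = list(merge_join(measurements, items, "id"))

# Feeding an unsorted stream does not raise an error; rows are silently lost.
unsorted_result = list(merge_join(measurements, list(reversed(items)), "id"))
print(len(joined), len(unsorted_result))
```

Each stream is read only once and only the current row of each side is held, which is why the algorithm cannot recover when a key arrives out of order.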
