Extracting Data from HTML in FME with HTMLExtractor

Source: Internet
Author: User

How can we keep growing the volume of data in the data center and increase the value of data mining? This is something we think about constantly. Some data comes from internal production, and some can come from the Internet, where the volume is huge and the formats are diverse. Many FME users have shared approaches in earlier blog posts, such as JSON, XML, and regular expressions. But how do you extract data from loosely structured HTML? I searched Baidu and found no article on doing this in FME. Since I had some time today, I am writing up a short note on HTML extraction, mainly for my own reference.
This time I want to extract sample data from a Land and Resources Bureau system: the transaction results and the parcel information, which look like this:

Figure 1: List of trading results

Figure 2: Parcel Information



Figure 3: Conversion workspace

Figure 4: Extracted data
This workspace uses several transformers: PythonCreator, HTTPCaller, HTMLExtractor, PythonCaller, StringSearcher, StringReplacer, AttributeExposer, AttributeRenamer, and AttributeRemover.
This article focuses on HTMLExtractor. The transformer's parameters are shown below:

Figure 5: HTMLExtractor parameters
The parameters labeled in the figure are, in order:
1. HTML Input: the source of the HTML. It can be content, meaning the HTML comes from an incoming attribute, parameter, or similar, or it can be a file, meaning the HTML is read from an existing HTML file.
2. HTML Content: this example uses content as the source. With HTTPCaller, the HTML is stored in the _response_body attribute. If a file is the source, set the path to the HTML file instead.
3. Target Attribute: the name of the attribute (or list) that will hold the result of HTML parsing.
4. CSS Selector: the CSS selector to match. Selectors play a role similar to regular expressions but are easier to use, especially for parsing HTML.
5. Tag Part/HTML Attribute: can be set to the value (the content inside the tag), the whole element (the tag together with its value), or the name of an attribute on the matched tag, such as the href attribute of an <a> tag.
6. Return Format: if set to List Attribute, all matches are returned as a list; if set to First Match, only the first match is returned.
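To make parameters 5 and 6 more concrete, here is a small Python sketch of the same ideas outside FME. It is only an analogy built on the standard library's xml.etree.ElementTree, not how HTMLExtractor is implemented. The snippet, the file names page1.html and page2.html, and the variable names are all made up for illustration, and the snippet is deliberately well-formed so that an XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet used only for illustration.
SNIPPET = '<div><a href="page1.html">First</a><a href="page2.html">Second</a></div>'

root = ET.fromstring(SNIPPET)
links = root.findall(".//a")  # every <a> element below the root

# "Value": only the text inside each matched tag.
values = [a.text for a in links]

# "Whole": the tag together with its content.
whole = [ET.tostring(a, encoding="unicode") for a in links]

# A named attribute of the matched tag, here href.
hrefs = [a.get("href") for a in links]

# "First match" instead of a list: only the first matching element is used.
first_value = root.find(".//a").text

print(values)
print(whole)
print(hrefs)
print(first_value)
```

The list results correspond roughly to the List Attribute return format, and first_value corresponds to First Match.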


For example, here is the HTML source I want to match for the transaction results:
<tr class="TR2" onmouseover="this.className='TR3';" onmouseout="this.className='TR2';">
<td height="to" align="left" class="TD1">2</td>
<td class="TD1" align="left">bq2-19-87</td>
<td class="TD1" align="left">state-owned construction land use right</td>
<td class="TD1" align="left">158.51 million USD</td>
<td class="TD1" align="left">158.51 million USD</td>
<td class="TD1" align="left">Xi'an ODA Real Estate Development Co., Ltd.</td>
<td class="TD1" align="left">2017-04-27 16:00</td>
<td class="TD1" align="center" style="color: #FF0000; cursor: pointer;" onclick="window.open('publics/Resourceframe.jsp?id=933&lx=l', '', 'left=10,top=10,width=890,height=650,scrollbars=yes,resizable=yes,status=yes')">has been sold</td>
</tr>


I want to extract the cell values (marked in red in the original page). A simple CSS selector is enough to match them, but before writing it, it usually pays to tidy up and analyze the HTML source, find features that can be used for matching, improve the accuracy of the match, and reduce the amount of unwanted data that gets extracted.
The HTML source contains a large number of <td> tags, so matching td directly will not work. After some analysis I found a usable feature and wrote this CSS selector: tr[onmouseover] td. It matches every td tag inside a tr tag that has an onmouseover attribute.
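To show what the selector tr[onmouseover] td picks out, the following Python sketch reproduces the same matching rule with the standard library's html.parser. It is an illustration of the selector's meaning only, not FME's implementation. The class name RowCellCollector, the function extract_cells, and the shortened sample table (including its extra header row) are assumptions made for this example.

```python
from html.parser import HTMLParser


class RowCellCollector(HTMLParser):
    """Collect the text of <td> elements nested in a <tr> that has onmouseover."""

    def __init__(self):
        super().__init__()
        self.tr_stack = []   # one flag per open <tr>: does it carry onmouseover?
        self.in_td = False   # True while inside a matching <td>
        self.buffer = []     # text fragments of the current <td>
        self.cells = []      # collected cell values

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.tr_stack.append(any(name == "onmouseover" for name, _ in attrs))
        elif tag == "td" and any(self.tr_stack):
            self.in_td = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self.in_td:
            self.cells.append("".join(self.buffer).strip())
            self.in_td = False
        elif tag == "tr" and self.tr_stack:
            self.tr_stack.pop()

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)


def extract_cells(html):
    """Return the text of every td under a tr that has an onmouseover attribute."""
    parser = RowCellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Shortened, hypothetical sample: the first row has no onmouseover attribute.
SAMPLE = """
<table>
<tr><td>header cell that should be ignored</td></tr>
<tr class="TR2" onmouseover="this.className='TR3';">
<td class="TD1">2</td>
<td class="TD1">bq2-19-87</td>
<td class="TD1">has been sold</td>
</tr>
</table>
"""

print(extract_cells(SAMPLE))
```

Only the cells of the second row are returned, because the first row lacks the onmouseover attribute that the selector requires.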
It is that simple. The extracted data still contains a small amount of noise, which can then be cleaned up with other transformers.
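In the workspace this cleanup is done with FME transformers. Purely as an illustration of the kind of cleanup involved, the Python sketch below normalizes whitespace in a flat list of cell values and groups them into rows. The function name clean_cells, the column count, and the sample values (including the second parcel number and the stray entry) are assumptions for this example, not data from the real page.

```python
def clean_cells(cells, columns):
    """Normalize whitespace and group a flat list of cell values into rows.

    Rows that do not have exactly `columns` values are dropped.
    """
    normalized = [" ".join(cell.split()) for cell in cells]
    rows = [normalized[i:i + columns] for i in range(0, len(normalized), columns)]
    return [row for row in rows if len(row) == columns]


# Hypothetical raw list, as a list-style extraction result might look.
raw = ["2", " bq2-19-87 ", "has been\n sold", "3", "bq2-19-88", "has been sold", "stray"]

rows = clean_cells(raw, 3)
for row in rows:
    print(row)
```

Dropping incomplete rows is a deliberate simplification. If the real table can contain empty cells, a stricter check per column would be safer.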
Regular expressions have also been very popular lately. They are undeniably powerful, but for some jobs there is a simpler way, and using them would be like killing a chicken with a sledgehammer. For HTML, writing a CSS selector and parsing the data with the HTMLExtractor transformer is more agile and efficient.
