Convert a Web page to an XML data source

Last Update:2017-02-28 Source: Internet

Author: User


This article provides you with a powerful and flexible way to extract and combine meaningful data from existing HTML files.

HTML and the Web have permanently changed the way people communicate. HTML is convenient for people: it makes information easy to view and navigate. Unfortunately, it is far less convenient for communication between computers, because the presentation markup layered over a Web page makes it difficult for computer systems to find and use the data. XML emerged to fill this gap, promising to do for communication between computer systems what HTML did for people, and it has become a common language for exchanging information between different systems. Using the simple programming techniques described in this article, you can convert any HTML page into an XML data source.

Example scenario
Imagine a florist who buys flowers every week from three wholesalers at whatever price each one is charging. Each week, the owner checks each wholesaler's Web site to find the lowest price. The owner wants to combine the price information from the three wholesalers into a single Web page to simplify the search.

Here's how to extract information from three Web pages and combine it into a single XML document. We created three flower wholesalers' pages for this example:

<>
<>
<>

For demonstration purposes, the prices on each page change each time the page is accessed. In addition, because Web pages often place related data in HTML tables, the sample pages and application focus on retrieving information from tables rather than from other tags.

Solution
The following listing shows a sample XML file, flowers.xml, containing the data you need:

<Flowers>
  <Flower>
    <Vendor>FakeFlowers</Vendor>
    <Name>Daffodil</Name>
    <Price>$2.00</Price>
  </Flower>
  <Flower>
    <Vendor>FictitiousFlowers</Vendor>
    <Name>Daffodil</Name>
    <Price>$5.00</Price>
  </Flower>
  <Flower>
    <Vendor>PretendFlowers</Vendor>
    <Name>Daffodil</Name>
    <Price>$3.50</Price>
  </Flower>
</Flowers>
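Once the prices are in this form, the florist's weekly comparison becomes a small query. The following is a minimal Python sketch (not the article's Visual Basic) that reads a document shaped like flowers.xml and picks the lowest-priced offer. The helper names `parse_price` and `cheapest_offer` are made up for illustration, and the sketch assumes every price is a dollar amount such as `$2.00`.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Sample data in the same shape as flowers.xml.
FLOWERS_XML = """
<Flowers>
  <Flower><Vendor>FakeFlowers</Vendor><Name>Daffodil</Name><Price>$2.00</Price></Flower>
  <Flower><Vendor>FictitiousFlowers</Vendor><Name>Daffodil</Name><Price>$5.00</Price></Flower>
  <Flower><Vendor>PretendFlowers</Vendor><Name>Daffodil</Name><Price>$3.50</Price></Flower>
</Flowers>
"""


def parse_price(text):
    """Turn a string such as '$2.00' or '$ 5.00' into a Decimal."""
    return Decimal(text.replace("$", "").strip())


def cheapest_offer(xml_text):
    """Return (vendor, name, price) for the lowest-priced Flower element."""
    root = ET.fromstring(xml_text)
    offers = [
        (f.findtext("Vendor"), f.findtext("Name"), parse_price(f.findtext("Price")))
        for f in root.findall("Flower")
    ]
    return min(offers, key=lambda offer: offer[2])


if __name__ == "__main__":
    vendor, name, price = cheapest_offer(FLOWERS_XML)
    print(f"{name}: {vendor} at ${price}")
```

Prices are parsed as `Decimal` rather than `float` so that currency values compare exactly.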

Now you need to write code to extract the name and price of each flower directly from the appropriate wholesaler's Web site. One solution is to place special tags in the XML document and later replace them with values from the sites. This approach is similar to the way XSL works. To do this, you can define a new meta-language that lets you add replaceable tags to the XML.
This new meta-language needs to accomplish the following tasks:

Identify the document, so that a processor knows it uses the language
Provide a way to specify the Web page that contains the data you want to retrieve
Specify how to retrieve a specific data element from each page

The following example takes the previous XML file and adds new meta-language tags to accomplish the three tasks listed above:
<wg:Document xmlns:wg="">
  <Flowers>
    <wg:Template url="">
      <Flower>
        <Vendor>FakeFlowers</Vendor>
        <Name><wg:GetTableElement pos="1" row="8" col="1"/></Name>
        <Price><wg:GetTableElement pos="1" row="8" col="4"/></Price>
      </Flower>
    </wg:Template>
    <wg:Template url="">
      <Flower>
        <Vendor>FictitiousFlowers</Vendor>
        <Name><wg:GetTableElement pos="1" row="6" col="2"/></Name>
        <Price><wg:GetTableElement pos="1" row="6" col="3"/></Price>
      </Flower>
    </wg:Template>
    <wg:Template url="">
      <Flower>
        <Vendor>PretendFlowers</Vendor>
        <Name><wg:GetTableElement pos="1" row="3" col="1"/></Name>
        <Price><wg:GetTableElement pos="1" row="3" col="4"/></Price>
      </Flower>
    </wg:Template>
  </Flowers>
</wg:Document>
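To make the substitution idea concrete, here is a minimal Python sketch of how a processor might expand such a template. It is not the article's Visual Basic implementation. The namespace URI `urn:example:webgather` and the URL `http://wholesaler.example/prices` are placeholders invented for the sketch, because the listing above leaves both blank. The `get_cell` callable is also hypothetical: it stands in for whatever code fetches a page and reads a table cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the article's listing leaves the real one blank.
WG = "{urn:example:webgather}"

TEMPLATE = """
<wg:Document xmlns:wg="urn:example:webgather">
  <Flowers>
    <wg:Template url="http://wholesaler.example/prices">
      <Flower>
        <Vendor>FakeFlowers</Vendor>
        <Name><wg:GetTableElement pos="1" row="8" col="1"/></Name>
        <Price><wg:GetTableElement pos="1" row="8" col="4"/></Price>
      </Flower>
    </wg:Template>
  </Flowers>
</wg:Document>
"""


def append_text(out, text):
    """Append text after whatever has already been written into out."""
    if not text:
        return
    if len(out):  # out has children, so the text follows the last child
        out[-1].tail = (out[-1].tail or "") + text
    else:
        out.text = (out.text or "") + text


def expand(node, out, url, get_cell):
    """Copy node into out, resolving WebGather tags along the way."""
    if node.tag == WG + "GetTableElement":
        # Placeholder: replace the tag with the cell value from the current page.
        cell = get_cell(url, int(node.get("pos")), int(node.get("row")), int(node.get("col")))
        append_text(out, cell)
        return
    if node.tag in (WG + "Document", WG + "Template"):
        # Wrappers vanish from the output; Template supplies the URL for its children.
        target = out
        url = node.get("url", url)
    else:
        target = ET.SubElement(out, node.tag, dict(node.attrib))
    append_text(target, node.text)
    for child in node:
        expand(child, target, url, get_cell)
        append_text(target, child.tail)


def transform(template_xml, get_cell):
    """Return the template as plain XML with every placeholder filled in."""
    holder = ET.Element("holder")
    expand(ET.fromstring(template_xml), holder, None, get_cell)
    children = list(holder)
    if len(children) != 1:
        raise ValueError("expected exactly one root element inside wg:Document")
    children[0].tail = None
    return ET.tostring(children[0], encoding="unicode")


def fake_get_cell(url, pos, row, col):
    """Canned values standing in for a real page fetch."""
    cells = {(1, 8, 1): "Daffodil", (1, 8, 4): "$2.00"}
    return cells[(pos, row, col)]


if __name__ == "__main__":
    print(transform(TEMPLATE, fake_get_cell))
```

Passing the cell lookup in as a function keeps the template logic separate from page retrieval, so the expansion can be exercised with canned values as above.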

The template document has a wrapper element called Document that encloses the original XML. The Document element declares a namespace for the new meta-language, which is called WebGather. The WebGather language elements are defined in the WebGatherSchema.xml file:

<!-- WebGatherSchema -->
<Schema xmlns="urn:schemas-microsoft-com:xml-data">
  <element type='Document'>
    <element type='Template'>
      <attribute type='url'/>
      <element type='GetTableElement'>
        <attribute type='pos'/>
        <attribute type='row'/>
        <attribute type='col'/>
      </element>
    </element>
  </element>
</Schema>

WebGatherSchema allows three types of XML elements: Document, Template, and GetTableElement. The Template element has a single attribute, url, which identifies the source Web page that contains the data. A GetTableElement tag is a placeholder for the content of a table cell in the page identified by the enclosing Template element. It has three attributes: pos gives the index of the table within the HTML Web page, where the first table is 1; row and col identify the cell in that table that contains the data.
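The pos, row, and col lookup can be illustrated with a short Python sketch built on the standard `html.parser` module (the article's own engine uses the Internet Explorer control instead). The names `TableCollector` and `get_table_element` are invented for this example. The sketch assumes that row and col count from 1 like pos, and that the page contains no nested tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def get_table_element(html, pos, row, col):
    """Return the text of one cell; pos, row and col are assumed to count from 1."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables[pos - 1][row - 1][col - 1]


SAMPLE_PAGE = """
<html><body>
<table><tr><td>layout</td></tr></table>
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Daffodil</td><td>$2.00</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    print(get_table_element(SAMPLE_PAGE, 2, 2, 1))
    print(get_table_element(SAMPLE_PAGE, 2, 2, 2))
```

Addressing cells by position is simple but brittle: if a wholesaler adds a row or a layout table, the pos, row, and col values in the template have to be updated.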

Implementation
The meta-language needs an engine to execute it. I used a Visual Basic DLL project containing a single class called Metagather. The class exposes one public method, Transform, which receives an XML string containing WebGather tags, replaces those tags with values from the specified Web pages, and returns the resulting XML string. The class uses the Microsoft Internet Explorer control to retrieve each Web page, reading the URL of the page that contains the data from the Template tag in the XML string parameter.

Private Function LoadPage _
    (ByVal strURL As String) As Boolean
    ' Initialize the download complete flag
    m_bDownloadComplete = False

    ' Load the page to make sure it's
    ' not the cached version
    m_IE.Navigate strURL, 4

    ' Wait until document finishes loading
    While m_IE.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Wend

    ' Check if you ended up on the error page
    If m_IE.Document.Title = _
        "The page cannot be found" Or _
        m_IE.Document.Title = "No page to display" _
    Then
        LoadPage = False
    Else
        LoadPage = True
    End If
End Function

The LoadPage function calls the Navigate method of the Microsoft Internet control to retrieve the page content from the Internet. Navigate is asynchronous, so you must wait for the page to load before continuing. You can do this by looping until a module-level flag indicates that loading is complete. This flag is set when the DownloadComplete event fires. The DownloadComplete event normally fires whenever the Navigate method is invoked, so the flag is guaranteed to be set to True even if the navigation fails, and the loop then exits.
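The same wait-for-completion pattern can be sketched in Python with a `threading.Event` taking the place of the module-level flag. This is an illustration only, not a port of the Visual Basic class. The `navigate(url, on_complete)` callable is a hypothetical stand-in for the browser control, `fake_navigate` simulates it, and the timeout is an addition that the original loop does not have.

```python
import threading

# Titles that the original listing treats as error pages.
ERROR_TITLES = {"The page cannot be found", "No page to display"}


def load_page(navigate, url, timeout=5.0):
    """Start an asynchronous load and block until it reports completion.

    navigate(url, on_complete) is a hypothetical stand-in for the browser
    control: it must return at once and later call on_complete(title).
    """
    done = threading.Event()
    outcome = {}

    def on_complete(title):
        outcome["title"] = title
        done.set()  # plays the role of the module-level flag

    navigate(url, on_complete)
    if not done.wait(timeout):  # unlike the original loop, give up eventually
        return False
    return outcome["title"] not in ERROR_TITLES


def fake_navigate(url, on_complete):
    """Simulated browser that finishes on a background thread shortly afterwards."""
    title = "No page to display" if "missing" in url else "Wholesale prices"
    threading.Timer(0.05, on_complete, args=(title,)).start()


if __name__ == "__main__":
    print(load_page(fake_navigate, "http://wholesaler.example/prices"))
    print(load_page(fake_navigate, "http://wholesaler.example/missing"))
```

Blocking on an event with a timeout avoids spinning the CPU and prevents the caller from hanging forever if the completion callback never arrives.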

This article is an English version of an article originally published in Chinese on aliyun.com.
