This article describes a powerful and flexible way to extract and combine meaningful data from existing HTML files.
HTML and the Web have permanently changed the way people communicate. HTML is convenient for people: it makes information easy to view and navigate. Unfortunately, it is far less convenient for communication between computers, because the presentation markup layered into a Web page makes it difficult for computer systems to find and use the data. XML emerged to play the same role for communication between computer systems, and it has become the common language in which information flows between them. Using the simple programming techniques described in this article, you can convert any HTML page into an XML data source.
Example description
Imagine a florist who buys flowers from three wholesalers each week, choosing on price. Each week, the owner visits each wholesaler's Web site to find the lowest price. The owner wants to combine the price information from the three wholesalers into a single Web page to simplify the search.
Here's how to extract information from three Web pages and combine it into a single XML document. We created three flower wholesalers' pages for this example:
<>
<>
<>
For demonstration purposes, the prices on each page change every time the page is accessed. In addition, because Web pages often place related data in HTML tables, the sample pages and application focus on retrieving information from tables rather than from other tags.
Solution
The sample XML file, flowers.xml, contains the data you need: the vendor, name, and price of each flower.
Now you need to write code that extracts the name and price of each flower directly from the appropriate wholesaler's Web site. One solution is to place special tags in the XML document and replace them later with values from the sites. This approach is similar to XSL. To do this, you can define a new meta language that lets you add replaceable tags to the XML.
This new meta language must accomplish the following tasks:
Identify the document, so that a processor knows it uses the language
Provide a way to specify the Web page that contains the data you want to retrieve
Specify how to retrieve a specific data element from each page
The following example takes the previous XML file and adds new meta language tags that accomplish the three tasks listed above:
<wg:Document xmlns:wg="">
  <Flowers>
    <wg:Template url="">
      <Flower>
        <Vendor>FakeFlowers</Vendor>
        <name><wg:GetTableElement pos="1" row="8" col="1"/></name>
        <price><wg:GetTableElement pos="1" row="8" col="4"/></price>
      </Flower>
    </wg:Template>
    <wg:Template url="">
      <Flower>
        <Vendor>FictitiousFlowers</Vendor>
        <name><wg:GetTableElement pos="1" row="6" col="2"/></name>
        <price><wg:GetTableElement pos="1" row="6" col="3"/></price>
      </Flower>
    </wg:Template>
    <wg:Template url="">
      <Flower>
        <Vendor>PretendFlowers</Vendor>
        <name><wg:GetTableElement pos="1" row="3" col="1"/></name>
        <price><wg:GetTableElement pos="1" row="3" col="4"/></price>
      </Flower>
    </wg:Template>
  </Flowers>
</wg:Document>
The second XML example wraps the original XML in an enclosing element called Document. The Document element declares a namespace for the new meta language, which is called WebGather. The WebGather language elements are defined in the WebGatherSchema.xml file.
WebGatherSchema allows three kinds of XML elements: Document, Template, and GetTableElement. The Template element has a single attribute, url, which identifies the source Web page that contains the data. The GetTableElement tag is a content placeholder for one cell of a table in the page named by the enclosing Template element. It has three attributes: pos is the index of the table within the HTML page, where the first table is 1; row and col identify the cell in that table that contains the data.
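To make the pos, row, and col addressing concrete, here is a minimal Python sketch (not the article's Visual Basic code) that reads one table cell from an HTML string. The names TableCellFinder and get_table_element and the sample table are hypothetical, and the sketch assumes all three indexes are 1-based and that tables are not nested.

```python
from html.parser import HTMLParser


class TableCellFinder(HTMLParser):
    """Collects the text of one cell, addressed by 1-based table, row, and column."""

    def __init__(self, pos, row, col):
        super().__init__()
        self.target = (pos, row, col)
        self.table = 0  # index of the current table in document order
        self.row = 0    # index of the current row within that table
        self.col = 0    # index of the current cell within that row
        self.in_cell = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table += 1
            self.row = 0
        elif tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            self.col += 1
            self.in_cell = (self.table, self.row, self.col) == self.target

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.text.append(data)


def get_table_element(html, pos, row, col):
    """Return the trimmed text of the addressed cell, or "" if it does not exist."""
    finder = TableCellFinder(pos, row, col)
    finder.feed(html)
    finder.close()
    return "".join(finder.text).strip()


# Hypothetical sample page: one table with a header row and two data rows.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Rose</td><td>1.25</td></tr>
  <tr><td>Tulip</td><td>0.80</td></tr>
</table>
"""

print(get_table_element(SAMPLE, pos=1, row=3, col=2))  # prints 0.80
```

Addressing cells by position keeps the meta language simple, but it also means a template has to be updated whenever a wholesaler changes the layout of its page.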
Implementation
The meta language needs an engine to process it. I used a Visual Basic DLL project containing a single class called MetaGather. The class exposes one public method, Transform, which receives an XML string containing WebGather tags, replaces those tags with values from the specified Web pages, and returns the resulting XML string. The class uses the Microsoft Internet Explorer control to retrieve each Web page, reading the URL of the page that contains the data from the Template tag in the XML string parameter.
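The tag replacement that Transform performs can be sketched in Python as follows. This is an illustration under stated assumptions, not the MetaGather source: the namespace URI, the example URL, and the lookup callback (which stands in for loading a page and reading a table cell) are all hypothetical, and the sketch assumes the Document element wraps exactly one root element.

```python
import xml.etree.ElementTree as ET

WG = "urn:example:webgather"  # hypothetical namespace URI; the article's is not shown


def transform(xml_text, lookup):
    """Return plain XML with every WebGather tag replaced by page data.

    lookup(url, pos, row, col) is a caller-supplied function returning cell text.
    """
    document = ET.fromstring(xml_text)
    if document.tag != f"{{{WG}}}Document":
        raise ValueError("not a WebGather document")
    result = document[0]  # assumes one wrapped root element, such as <Flowers>
    _expand(result, lookup, url=None)
    return ET.tostring(result, encoding="unicode")


def _expand(parent, lookup, url):
    for child in list(parent):
        index = list(parent).index(child)
        if child.tag == f"{{{WG}}}Template":
            # Resolve placeholders inside the template, then splice its
            # children into the parent in place of the Template element.
            _expand(child, lookup, child.get("url"))
            parent.remove(child)
            for offset, grandchild in enumerate(list(child)):
                parent.insert(index + offset, grandchild)
        elif child.tag == f"{{{WG}}}GetTableElement":
            value = lookup(url, int(child.get("pos")),
                           int(child.get("row")), int(child.get("col")))
            _replace_with_text(parent, index, value + (child.tail or ""))
        else:
            _expand(child, lookup, url)


def _replace_with_text(parent, index, text):
    # ElementTree keeps text on the parent or on the preceding sibling's tail.
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text
    del parent[index]


SOURCE = (
    f'<wg:Document xmlns:wg="{WG}"><Flowers>'
    '<wg:Template url="http://vendor.example/prices">'
    '<Flower><Vendor>FakeFlowers</Vendor>'
    '<name><wg:GetTableElement pos="1" row="8" col="1"/></name>'
    '<price><wg:GetTableElement pos="1" row="8" col="4"/></price>'
    '</Flower></wg:Template></Flowers></wg:Document>'
)


def fake_lookup(url, pos, row, col):
    # Stand-in for loading the page and reading a table cell.
    return {1: "Rose", 4: "1.25"}[col]


print(transform(SOURCE, fake_lookup))
```

Passing the page access in as a function keeps the replacement logic independent of how pages are fetched, so it can be exercised without a network connection.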
The LoadPage function calls the Navigate method of the Microsoft Internet Explorer control to retrieve the page content from the Internet. Navigate is asynchronous, so you must wait for the page to load before continuing. You can do this by looping until a module-level flag is set. The flag is set when the DownloadComplete event fires. The DownloadComplete event always fires after Navigate starts a navigation, whether the navigation finishes, is halted, or fails. This way, even if the navigation fails, the flag is guaranteed to be set to True, and the loop exits.
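The wait-on-a-flag pattern can be sketched in Python, with a background thread standing in for the browser control. The PageLoader class, the fetch callable, and the timeout are hypothetical additions for illustration; the article's code is Visual Basic and relies on the control's DownloadComplete event instead.

```python
import threading
import time


class PageLoader:
    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable: url -> page text
        self._done = False    # plays the role of the module-level flag
        self.content = None

    def _navigate(self, url):
        # Asynchronous, like Navigate: returns at once and works in the background.
        def work():
            try:
                self.content = self._fetch(url)
            except Exception:
                self.content = None  # a failed load leaves no content
            finally:
                # Runs on success and on failure, like DownloadComplete.
                self._on_download_complete()

        threading.Thread(target=work, daemon=True).start()

    def _on_download_complete(self):
        self._done = True

    def load_page(self, url, timeout=30.0):
        """Start the load, then loop until the completion flag is set."""
        self._done = False
        self.content = None
        self._navigate(url)
        deadline = time.monotonic() + timeout  # safety net, not in the article
        while not self._done:
            if time.monotonic() > deadline:
                raise TimeoutError(url)
            time.sleep(0.01)
        return self.content


loader = PageLoader(lambda url: "<html>ok</html>")
print(loader.load_page("http://vendor.example/prices"))
```

Because the completion handler runs in a finally clause, the loop ends whether the load succeeds or fails, which mirrors the guarantee described above. In idiomatic Python, a threading.Event would normally replace the polled flag.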