(From: http://www.ibm.com/developerworks/cn/xml/x-wbdm)
June 01, 2001
It is undeniable that the World Wide Web is by far the world's richest and most dense source of information. However, its structure makes it difficult to exploit that information in any systematic way. The methods and tools described in this article enable developers who are familiar with the most common web technologies to quickly and conveniently extract the web-published information they need.
The rapid growth of the World Wide Web in the information age has resulted in the massive distribution of public information of every kind. Unfortunately, although HTML, the main carrier of this information, provides a convenient way to present information to human readers, it is a poor structure for services or applications that need to automatically extract the relevant data and be driven by it.
Many methods have been tried to solve this problem. Most of them map parts of the HTML page to code, and that code fills the information from the web page into a database. Although these methods may provide some benefit, they become impractical for two reasons: first, they require developers to spend time learning a query language that cannot be used anywhere else; second, they are not robust enough to handle the inevitable simple changes to the target web pages.
In this article, we discuss a web-based data mining method built on standard web technologies: HTML, XML, and Java. This method is as powerful as, and in some cases more powerful than, dedicated alternatives, and anyone already familiar with web technologies needs only a small effort to get good results. The article also provides a good deal of code to get you started on data extraction.
HTML: Advantages and Disadvantages
HTML is usually a difficult medium for programs to process. Most of the content of a web page describes formatting that is irrelevant to a data-driven system, and because titles and other server-side scripting are inserted dynamically, the document structure may change every time the page is fetched. The problem is compounded by the fact that a major portion of all web pages is malformed; as a result, current web browsers are deliberately lax in their HTML parsing.
Despite these problems, HTML still has advantages for data mining. The data you are interested in can usually be isolated within nested <table> or <div> tags of a single HTML tree, which allows the extraction process to work on a small part of the document. In the absence of client-side scripts, there is only one way to define a drop-down menu or other data list. These aspects of HTML let us concentrate on the data extraction itself once we have the data in a usable format.
Background Technology
The key to the data mining technique described here is to convert existing web pages into XML, or rather into XHTML, which may be the more accurate term, so that some of the many tools for processing XML-structured data can be used to retrieve the data we want.
Fortunately, there is a solution that corrects for weak HTML page design. Tidy, a library available for several programming languages, is a freely available product for correcting common errors in HTML documents and producing equivalent, well-formed documents. Tidy can also be used to render these documents in XHTML, a subset of XML (see References).
The sample code in this article is written in Java; compiling and running it requires the Tidy JAR file on your classpath. It also requires the XML libraries Xerces and Xalan, made available by the Apache project. These two libraries are based on code donated by IBM and handle XML parsing and XSL transformation, respectively. All three libraries can be obtained free of charge on the web; to find them, follow the links above or see the references later in this article. An understanding of the Java programming language, XML, and XSL transformations will help you follow the examples; references for these technologies appear at the end of this article.
Overview and Examples
We use an example to describe how data extraction works. Suppose we are interested in tracking the temperature and humidity levels measured in Seattle, WA every day over a period of several months. If no off-the-shelf software reports this information in a way that meets our needs, we still have the opportunity to collect it from any of numerous public websites.
Figure 1 illustrates the extraction process as a whole. A web page is retrieved and processed to produce a data set that is then merged into the existing data set.
Figure 1. Overview of the Extraction Process
With only a few steps, we can have a suitable and reliable system for collecting our information. The steps are listed here as a brief overview of the process shown at a higher level in Figure 1.
- Identify the data source and map it to XHTML.
- Find reference points within the data.
- Map the data to XML.
- Merge the results and process the data.
Each of these steps is described in detail below, along with the code necessary to execute it.
Get source information in XHTML format
To extract data, you must know where to find it. In most cases, the source is obvious. If we wanted to collect the titles and URLs of articles on developerWorks, we would use http://www.ibm.com/developerworks as our target. In the weather example, we have several possible sources of information. We will use Yahoo! Weather in the example, but any other source would serve just as well. We will track the data at the URL http://weather.yahoo.com/forecast/Seattle_WA_US_f.html. Figure 2 shows a screenshot of this page.
Figure 2. The Yahoo! Weather web page for Seattle, WA
When considering an information source, weigh the following factors:
- Does the source generate reliable data over a reliable network connection?
- How long will the source continue to exist: a week, a month, even a year?
- How stable is the source's layout structure?
While we are looking for a robust solution that works in a dynamic environment, our job is easiest when we extract from the most reliable and stable sources we can find.
Once the source is determined, the first step of the extraction process is to convert the data from HTML to XML. We will accomplish this, along with other XML-related tasks, by constructing a Java class named XMLHelper that is composed of static helper functions. The full source of this class is available in the files XMLHelper.java and XMLHelperException.java. We will build up the methods of this class as the article progresses.
We use the functionality provided by the Tidy library in the method XMLHelper.tidyHTML(). This method accepts a URL as a parameter and returns an XML Document as the result. Check carefully for exceptions when calling this method, or any other XML-related method. Listing 1 shows the code that performs these operations, and Figure 3 shows the result: the XML for the weather page displayed in Microsoft Internet Explorer's XML viewer.
Listing 1
/**
 * Retrieve an HTML page, convert the source to XML,
 * and write the result to a file.
 */
public static void main(String args[]) {
    try {
        Document doc = XMLHelper.tidyHTML("http://weather.yahoo.com/forecast/Seattle_WA_US_f.html");
        XMLHelper.outputXMLToFile(doc, "xml" + File.separator + "weather.xml");
    } catch (XMLHelperException xmle) {
        // ... do something ...
    }
}
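Listing 1 relies on XMLHelper.tidyHTML(), whose full source accompanies the article. To give a feel for how the Tidy library is driven, here is a minimal sketch of what such a method might look like using the JTidy API (org.w3c.tidy.Tidy); the string constructor on XMLHelperException is an assumption of ours, not taken from the article's code:

import java.io.InputStream;
import java.net.URL;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

/**
 * Sketch only: fetch a page and let Tidy turn it into a
 * well-formed DOM. The article's real tidyHTML() may differ.
 */
public static Document tidyHTML(String url) throws XMLHelperException {
    try {
        InputStream in = new URL(url).openStream();  // fetch the raw HTML
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit XHTML rather than HTML
        tidy.setQuiet(true);          // suppress progress messages
        tidy.setShowWarnings(false);  // pages we mine are rarely clean
        return tidy.parseDOM(in, null);  // null: no serialized output needed
    } catch (Exception e) {
        // assumed constructor; see XMLHelperException.java for the real one
        throw new XMLHelperException("Unable to tidy page at " + url);
    }
}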
Figure 3. The Yahoo! Weather web page converted to XHTML
Search for data reference points
Note that the vast majority of the information, in both the web page view and the source XHTML, means nothing to us. The next task is to find a specific region in the XML tree from which we can extract our data without caring about the extraneous information. For more complex extractions, we may need to find several instances of such regions on a single page.
The simplest way to accomplish this is usually to examine the web page first and then work with the XML. Simply looking at the page tells us that the information we are looking for is located in the upper-middle region of the page. Even with only a passing familiarity with HTML, it is easy to infer that the data we want is probably contained within a single <table> element, and that this table probably always contains words such as "Appar Temp" and "Dewpoint", regardless of the day's data.
Noting what we have observed, we now consider the XHTML generated from the page. A text search for "Appar Temp" (as shown in Figure 4) reveals that the text is indeed inside a table that contains all the data we need. We will use that table as our reference point, or anchor.
Figure 4. Locating the anchor by searching for the table containing the text "Appar Temp"
Now we need a way of finding this anchor. Since we are going to use XSL to transform our XML, we can use an XPath expression for the task. We might naively use the following expression:
/html/body/center/table[6]/tr[2]/td[2]/table[2]/tr/td/table[6]
This expression specifies the path from the root <html> element down to the anchor. This brittle approach makes the extraction extremely vulnerable to any change in the page layout. A better approach is to specify the anchor in terms of its surrounding content. Using that approach, we reconstruct the XPath expression as:
//table[starts-with(tr/td/font/b, 'Appar Temp')]
or, better yet, taking advantage of XSL's ability to convert an XML tree into a string:
//table[starts-with(normalize-space(.), 'Appar Temp')]
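Before committing an anchor expression to a stylesheet, it can be worth testing it directly against the tidied document. Xalan ships a helper class, org.apache.xpath.XPathAPI, that evaluates XPath expressions against a DOM. The following standalone check is our own addition, not part of the article's code:

import org.apache.xpath.XPathAPI;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class AnchorTest {
    public static void main(String[] args) throws Exception {
        // Parse the tidied page saved to disk by Listing 1
        Document doc = XMLHelper.parseXMLFromURLString("file://weather.xml");
        // Evaluate the content-based anchor expression
        Node anchor = XPathAPI.selectSingleNode(doc,
            "//table[starts-with(normalize-space(.), 'Appar Temp')]");
        System.out.println(anchor == null
            ? "Anchor not found - has the page layout changed?"
            : "Anchor located: <" + anchor.getNodeName() + ">");
    }
}

If the expression stops matching, the page layout has probably changed and the stylesheet needs updating.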
Map Data to XML
With this anchor, we can create the code that actually extracts our data. This code takes the form of an XSL file. The purpose of the XSL file is to identify the anchor, specify how to get from the anchor to the data we are looking for (in short hops), and construct the XML output in the format we want. The process is actually much simpler than it sounds. Listing 2 shows the XSL code that performs the extraction, which is also available as an XSL text file.
The <xsl:output> element simply tells the processor that the desired result of the transformation is XML. The first <xsl:template> creates the outer elements of the result and applies templates to search for the anchor. The second <xsl:template> ensures that only the content we explicitly match ends up in the output. The last <xsl:template> defines the anchor in its match attribute and then tells the processor how to hop from the anchor to the temperature and humidity data we are trying to mine.
Of course, writing the XSL alone does not finish the job; we also need a tool that performs the transformation. So we use XMLHelper to parse the XSL and perform the transformation. The methods that execute these tasks are named parseXMLFromURLString() and transformXML(). Listing 3 shows the code that uses these methods.
Listing 2
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output version="1.0" indent="yes" encoding="UTF-8" omit-xml-declaration="no" method="xml"/>
  <xsl:template match="/html">
    <result>
      <weather>
        <xsl:apply-templates/>
      </weather>
    </result>
  </xsl:template>
  <xsl:template match="text()"></xsl:template>
  <xsl:template match="table[starts-with(normalize-space(.), 'Appar Temp')]">
    <temperature>
      <xsl:value-of select="tr/td[2]/font"/>
    </temperature>
    <!-- humidity is assumed to sit in the fourth cell of the same row -->
    <humidity>
      <xsl:value-of select="tr/td[4]/font"/>
    </humidity>
  </xsl:template>
</xsl:stylesheet>
Listing 3
/**
 * Retrieve the XHTML file written to disk in Listing 1
 * and apply our XSL transformation to it. Write the
 * result to disk as XML.
 */
public static void main(String args[]) {
    try {
        Document xhtml = XMLHelper.parseXMLFromURLString("file://weather.xml");
        Document xsl = XMLHelper.parseXMLFromURLString("file://xsl/weather.xsl");
        Document xml = XMLHelper.transformXML(xhtml, xsl);
        XMLHelper.outputXMLToFile(xml, "xml" + File.separator + "result.xml");
    } catch (XMLHelperException xmle) {
        // ... do something ...
    }
}
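The transformation method itself need not be long. As an illustration only, transformXML() could be implemented on top of the TrAX API (javax.xml.transform) that Xalan supports; the exception handling here, and the assumption that XMLHelperException takes a string, are ours:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;

/**
 * Sketch only: apply an XSL stylesheet (already parsed to a DOM)
 * to an XML document and return the result as a new DOM.
 */
public static Document transformXML(Document xml, Document xsl)
        throws XMLHelperException {
    try {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer(new DOMSource(xsl));
        DOMResult result = new DOMResult();  // empty result: a new Document is created
        transformer.transform(new DOMSource(xml), result);
        return (Document) result.getNode();
    } catch (Exception e) {
        throw new XMLHelperException("XSL transformation failed");
    }
}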
Merging and processing results
If we were extracting the data only once, we would now be finished. However, we do not want to know the temperature at just one point in time; we want it at several different points. So we need to repeat the extraction process over and over, merging the results into a single XML data file. We could use XSL again, but instead we will write a merging method in the XMLHelper class. The mergeXML() method allows us to merge the data from the current extraction into an archive file containing previously extracted data.
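The real mergeXML() ships with the article in XMLHelper.java. Purely as an illustration of the idea, a merge along the following lines would work, assuming a signature of mergeXML(archive, latest) and that each extraction contributes children under the archive's root element (both assumptions are ours):

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Sketch only: append each child of the latest result's root
 * element (for example, a <weather> entry) to the archive's root.
 */
public static void mergeXML(Document archive, Document latest) {
    NodeList entries = latest.getDocumentElement().getChildNodes();
    for (int i = 0; i < entries.getLength(); i++) {
        // importNode copies the node into the archive's document context
        Node copy = archive.importNode(entries.item(i), true);
        archive.getDocumentElement().appendChild(copy);
    }
}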
The code that runs the entire process is provided in the file WeatherExtractor.java. Scheduling the program's execution is left to the reader, because the system-dependent tools for such tasks are generally superior to simple programmatic approaches. Figure 5 shows the result of running WeatherExtractor once a day.
Figure 5. Web extraction results
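To make the end-to-end flow concrete, here is a hypothetical outline of what a driver like WeatherExtractor might look like. The archive file name and the mergeXML() signature are our assumptions; the real class accompanying the article may differ in detail:

import java.io.File;
import org.w3c.dom.Document;

public class WeatherExtractor {
    public static void main(String[] args) {
        try {
            // 1. Fetch the page and convert it to XHTML
            Document xhtml = XMLHelper.tidyHTML(
                "http://weather.yahoo.com/forecast/Seattle_WA_US_f.html");
            // 2. Parse the stylesheet and extract today's data
            Document xsl = XMLHelper.parseXMLFromURLString("file://xsl/weather.xsl");
            Document latest = XMLHelper.transformXML(xhtml, xsl);
            // 3. Merge into the running archive (file name is hypothetical)
            Document archive = XMLHelper.parseXMLFromURLString("file://xml/archive.xml");
            XMLHelper.mergeXML(archive, latest);
            XMLHelper.outputXMLToFile(archive, "xml" + File.separator + "archive.xml");
        } catch (XMLHelperException xmle) {
            xmle.printStackTrace();
        }
    }
}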
Conclusion
In this article, we have described and demonstrated the fundamentals of a robust method for extracting information from the greatest source of information in existence, the World Wide Web. We have also discussed the coding tools that enable any Java developer to begin writing extractions of his or her own with minimal effort and extraction experience. Although the example in this article focuses on extracting weather information for Seattle, WA, nearly all of the code that appears here is reusable for any data extraction. In fact, aside from a few changes to the WeatherExtractor class, the only work required for other data mining projects is writing the XSL transformation code (which, incidentally, never needs to be compiled).
The method is as simple as it sounds. By wisely choosing reliable data sources, and by selecting anchors within them that are content-dependent rather than format-dependent, you can have a low-maintenance, reliable data extraction system; depending on your level of experience and the amount of data to be extracted, you can have it installed and running in under an hour.
References
- For more information, see the original article on the developerWorks global site.
- Tidy for Java is maintained by Sami Lempinen and can be downloaded from SourceForge.
- The XML libraries, Xerces and Xalan, are available from the Apache project website.
- For more information about XML, see the developerWorks XML zone.
- Many tutorials on XSL and XPath are available; you can find them with your favorite web search engine.
- Jussi Myllymaki published a paper on the relationship between web crawling and data extraction in the ANDES system at WWW10 in Hong Kong.
- Here are some tips on personalizing your website and maximizing its performance.
- "Manage Web site performance" describes how to fine-tune website performance, from browsers to database servers and legacy systems.
Author Profile
Jared Jackson has been working at the IBM Almaden Research Center since graduating from Harvey Mudd College in May 2000. Jared is also a graduate student in the Computer Science Department at Stanford University. He can be reached at jjared@almaden.ibm.com.
Jussi Myllymaki joined the IBM Almaden Research Center as a Research Staff Member in 1999. He received his doctorate from the University of Wisconsin at Madison. He can be reached at jussi@almaden.ibm.com.