Design and Implementation of Web Page Data Extraction Software

Source: http://www.xzbu.com/7/view-3009321.htm
Author: User
Tags: wrappers

With the rapid development of the Internet, the Web has become one of the main channels through which people obtain information, and most enterprises and organizations publish their information through web pages. To make full use of this information rather than merely browsing it, applications must extract the data of interest from HTML pages and convert it into formatted data with a definite structure. The task of Web wrapper software is to extract HTML data and convert it into structured data; applications built on Web wrappers can then access web data the same way they access information in a database. The Web wrapper is therefore a key component of a Web data integration architecture. Starting from the concept design of Web wrappers, this paper designs Web page data extraction wrapper software based on current web technology, information processing technology, and artificial intelligence technology. Through an information extraction experiment on new book publication pages, the paper analyzes the algorithm and system performance and verifies the feasibility and efficiency of the wrapper software.

I. Concept Design of Web Wrappers

Wrapper definition: Given a Web data source S containing a series of web pages P (where P = {p1, p2, …, pn}), find a mapping W that maps the pages P of S to a data set R and still extracts the data correctly when the structure of pi, i ∈ {1, …, n}, does not change much. The mapping W is generally referred to as the Web wrapper.
Functionally, a wrapper is a program that extracts data from a specific semi-structured Web data source according to specific extraction rules. Extraction rules are the core of a wrapper; they are used to extract the relevant information from each HTML document.
During maintenance, wrapper verification is performed first, and then the maintenance process proper begins. When a page changes, the data extracted by the wrapper may be incorrect or may fail to be extracted at all, which triggers the maintenance routine. In essence, maintenance re-establishes the extraction rules on the new page, completing the automatic repair of the wrapper.
Web information extraction identifies the data of interest to users within the unstructured or semi-structured information contained in web pages and converts it into a form with clearer structure and semantics (XML, relational data, object-oriented data, etc.). Information extraction can be understood as the process of extracting information from the text to be processed, forming structured data, and storing it in a database for users to query and use. To extract and convert information, a Web wrapper must therefore have four capabilities: (1) information positioning: determining the location of the required information in the document; (2) data extraction: extracting data field by field from the text content; (3) data organization: organizing the extracted data according to the correct structure and semantics; (4) maintainability: when a web page changes, the Web wrapper can still extract data correctly. Accordingly, we designed an efficient Web wrapper algorithm as follows:
Input:
- config.xml configuration file: extraction rule definitions for Web data source S;
- S: Web data source;
- P: web pages of Web data source S, where P = {p1, p2, …, pn};
- T: DOM trees generated by HTML parsing, where T = {T1, …, Tn};
- B: information blocks to be extracted, where B = {B1, …, Bm};
- express: extraction expression;
Output:
- R: data extraction result set R = R1 ∪ R2 ∪ … ∪ Rn
① Parse the config.xml configuration using JDOM;
② R = ∅ (empty result set);
③ for (int i = 1; i <= n; i++)
{
    Parse pi in S to obtain Ti, that is, pi → Ti;
    Locate the information blocks Bj in Ti, that is, Ti → Bj, where j ∈ {1, …, m};
    // Perform the following operations on each Bj in pi:
④   for (int j = 1; j <= m; j++)
    {
        Use the expression express to extract data from Bj, recorded as Rij = {rj1, …, rjk},
        where k is the number of fields the data model generates when extracting data from S;
    }
⑤   Ri = Ri1 ∪ Ri2 ∪ … ∪ Rim, where i ∈ {1, …, n}, and add Ri to R;
}
⑥ return R = R1 ∪ R2 ∪ … ∪ Rn
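Expressed in Java (the implementation language suggested by the JDOM and HtmlParser libraries used below), the skeleton of this loop looks as follows. This is a minimal sketch: the locateBlocks and extractFields methods are hypothetical placeholders for the data extraction module described in section II, and pages and blocks are represented here as plain HTML strings.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch of the wrapper's top-level loop (steps ② through ⑥).
// locateBlocks and extractFields are hypothetical hooks for the
// data extraction module of section II.
public abstract class WebWrapper {

    protected abstract List<String> locateBlocks(String pageHtml);       // Ti → Bj
    protected abstract Map<String, String> extractFields(String block);  // Bj → Rij

    public List<Map<String, String>> extract(List<String> pages) {
        List<Map<String, String>> result = new ArrayList<Map<String, String>>(); // ② R = empty set
        for (String page : pages) {                    // ③ for each page pi in S
            for (String block : locateBlocks(page)) {  // ④ for each block Bj in Ti
                result.add(extractFields(block));      //    Rij = {rj1, …, rjk}
            }                                          // ⑤ Ri = Ri1 ∪ … ∪ Rim
        }
        return result;                                 // ⑥ R = R1 ∪ R2 ∪ … ∪ Rn
    }
}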

II. Design of Web Wrapper Software

Based on the preceding algorithm, the Web wrapper consists of three modules: the predefinition module, the data extraction module, and the data organization module. The predefinition module and the data extraction module are the core components of the Web wrapper.
1. Predefinition module. The predefinition module mainly defines the extraction rules. The Web wrapper designed in this paper is a rule-based extraction model. For maintainability and reusability, the wrapper uses an XML configuration file (config.xml) to define how information is located and extracted. If a Web data source page changes, only the configuration file (config.xml) for that data source needs to be changed to maintain the Web wrapper; as long as the organization of the web page changes little, wrapper maintenance can be handled easily and quickly. The template of the predefined extraction rule configuration file config.xml is as follows:
<?xml version="1.0" encoding="gb2312"?>
<config>
<url>Web source page address</url>
<beginpage>start page</beginpage>
<endpage>end page</endpage>
<tag>tag</tag>
<index>index number</index>
<regex>regular expression</regex>
</config>
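Step ① of the algorithm reads this file with JDOM. A minimal sketch of that step follows, assuming the lowercase element names of the template above; the ExtractionRule holder class and the ConfigLoader name are illustrative conveniences, not part of JDOM.

import java.io.File;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

// Sketch: parse the predefined extraction rule in config.xml with JDOM (step ①).
public class ConfigLoader {

    public static ExtractionRule load(File configFile) throws Exception {
        Document doc = new SAXBuilder().build(configFile); // build the XML document tree
        Element root = doc.getRootElement();               // <config>
        ExtractionRule rule = new ExtractionRule();
        rule.url = root.getChildText("url");               // Web source page address
        rule.beginPage = Integer.parseInt(root.getChildText("beginpage"));
        rule.endPage = Integer.parseInt(root.getChildText("endpage"));
        rule.tag = root.getChildText("tag");               // tag enclosing the information block
        rule.index = Integer.parseInt(root.getChildText("index"));
        rule.regex = root.getChildText("regex");           // field-level extraction pattern
        return rule;
    }
}

// Hypothetical holder for one rule; fields mirror the config.xml template.
class ExtractionRule {
    String url, tag, regex;
    int beginPage, endPage, index;
}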
2. Data extraction module. As the core of the Web wrapper, the data extraction module performs information positioning and information extraction. Page parsing handles HTML document files and can be done with the HtmlParser library. HtmlParser is a pure-Java HTML parsing library that does not depend on other Java libraries; it is a good tool for parsing and analyzing HTML, and with it one can capture web page data or modify HTML content as needed. This module locates the information to be extracted, that is, it determines the position of the target information block within the document. A minimal positioning sketch with HtmlParser follows, assuming the <tag> and <index> values from config.xml select the n-th occurrence of a given HTML tag as the information block Bj (the class name BlockLocator is illustrative):
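import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

// Sketch: locate the information block Bj on page pi with HtmlParser,
// taking the tag name and occurrence index from the extraction rule.
public class BlockLocator {

    public static String locateBlock(String pageUrl, String tag, int index) throws Exception {
        Parser parser = new Parser(pageUrl);                    // fetch and parse pi → Ti
        NodeList nodes = parser.parse(new TagNameFilter(tag));  // all nodes with the given tag
        if (index < 0 || index >= nodes.size()) {
            throw new IllegalArgumentException("no block at index " + index);
        }
        return nodes.elementAt(index).toHtml();                 // Bj as an HTML fragment
    }
}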
After the information is located, the required data is parsed field by field according to the regular expression in the defined extraction rule. Regular expressions are a powerful tool for pattern matching and replacement. A regular expression is a text pattern consisting of ordinary characters and special characters (called metacharacters); it describes one or more strings to match when searching a body of text. Using a regular expression as a template, a character pattern is matched against the searched string, so the required data can be parsed out field by field.
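A minimal field extraction sketch using java.util.regex, assuming the configured <regex> pattern carries one capture group per field, so each match of the pattern yields one record Rij (the class name FieldExtractor and the field-name array are illustrative):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: parse the located block Bj field by field with a regular expression.
// Each match of the pattern yields one record; capture group g becomes field g.
public class FieldExtractor {

    public static List<Map<String, String>> extract(String blockHtml, String regex,
                                                    String[] fieldNames) {
        List<Map<String, String>> records = new ArrayList<Map<String, String>>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(blockHtml);
        while (m.find()) {                                 // one match = one record Rij
            Map<String, String> record = new LinkedHashMap<String, String>();
            for (int g = 1; g <= m.groupCount() && g <= fieldNames.length; g++) {
                record.put(fieldNames[g - 1], m.group(g).trim());
            }
            records.add(record);
        }
        return records;
    }
}

For example, a hypothetical pattern such as <li>(.+?)/(.+?)</li> applied with field names {"title", "author"} would turn each list item of a new book page into a two-field record.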
3. Data organization module. The Web wrapper extracts structured data from semi-structured information and stores it, so saving the extracted data in a structured form is also a key part of the Web wrapper. The data organization module completes the processing of the extraction results. We organize the extraction results in XML format: XML offers a good data storage format, extensibility, and a high degree of structure, and it interacts easily with databases, so the extracted information can be processed further later, for example for search and classification.
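A minimal data organization sketch with JDOM, writing the result set R as a pretty-printed XML file; the <records>/<record> element names are illustrative, not prescribed by the design:

import java.io.FileWriter;
import java.util.List;
import java.util.Map;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

// Sketch: organize the extraction result set R as a structured XML file with JDOM.
public class ResultWriter {

    public static void write(List<Map<String, String>> records, String path) throws Exception {
        Element root = new Element("records");
        for (Map<String, String> record : records) {
            Element rec = new Element("record");           // one extracted record Rij
            for (Map.Entry<String, String> field : record.entrySet()) {
                rec.addContent(new Element(field.getKey()).setText(field.getValue()));
            }
            root.addContent(rec);
        }
        FileWriter out = new FileWriter(path);
        try {
            new XMLOutputter(Format.getPrettyFormat()).output(new Document(root), out);
        } finally {
            out.close();                                   // always release the file handle
        }
    }
}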

III. Algorithm Verification and Result Analysis

To verify the feasibility and efficiency of the wrapper and its applicability to Web information integration, we followed the above design to define extraction rules and Web wrappers for the new book recommendation and publication pages of Tsinghua University Press, Metallurgical Industry Press, Peking University Press, and other publishers, and carried out book information extraction tests. Ce, Te, and Fe denote the number of correct items extracted, the number of correct items not extracted, and the number of incorrect items extracted; R denotes the recall rate, and P denotes the precision rate, where R = Ce / (Ce + Te) and P = Ce / (Ce + Fe). The experimental results calculated with R and P are shown in Table 1:

The experimental results in the table show that both the recall rate and the precision rate of the wrapper can reach nearly 100%. Analysis shows that the publication page of Tsinghua University Press is displayed as a list, so a secondary extraction was used to obtain the detailed data; a few books were not extracted because they were positioned differently on the page. From the overall experimental results, however, the Web wrapper design is feasible and efficient.
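For reference, the two metrics translate directly into code; a minimal sketch using the counts defined above:

// Recall and precision from the extraction counts:
// ce = correct items extracted, te = correct items missed, fe = incorrect items extracted.
public class Metrics {
    public static double recall(int ce, int te)    { return (double) ce / (ce + te); }
    public static double precision(int ce, int fe) { return (double) ce / (ce + fe); }
}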
