Information extraction based on Web development mode

Source: Internet
Author: User

Information extraction is one of the most important tasks in Internet natural language processing, and its accuracy directly affects all subsequent processing. The goal of information extraction is to remove the noise and obtain the valuable parts of a Web page: the title, time, body, links, and so on.

 

Introduction to the mainstream algorithms

There are many approaches to Web information extraction. Grouped by algorithm, they are: template-based, information-amount-based, vision-based, semantics-based, and statistics-based. Grouped by how the HTML is processed, they split into line-block-based and DOM-tree-based methods. The main ones are introduced below.

1. Template-based. A list of URL patterns and their HTML templates is generally maintained by hand. When a URL matches a URL pattern, the corresponding HTML template is used to extract the information. This method delivers quick results with high accuracy and works well when only a small number of sites are crawled; template-setup tools can reduce the workload, but crawling a large number of sites requires substantial manpower to maintain the template list.

2. Based on the amount of information (the notion of information amount is explained below). One representative is body extraction based on the line-block distribution function: compute how the text is distributed across the lines of the page source and take the lines where most of the text concentrates. Another family scores each line by text density, roughly text length divided by tag count. A further variant builds a DOM tree and turns the line function into an evaluation function over DOM nodes. For news-style sites this works very well, but two complications remain. First, the block with the most information is not always the body; for example, a copyright notice or site description may sit right below the body, and it must somehow be removed. Second, some pages, such as game-download sites, split their information into structured data, a description, and operating instructions; the information is dispersed rather than concentrated, and such pages need special handling.
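As an illustration of the line-block idea, here is a minimal Python sketch. It is my own toy code under simplifying assumptions (a fixed window, no smoothing or thresholds), not an implementation of any specific published algorithm:

```python
import re

def extract_by_line_blocks(html, window=3):
    """Toy line-block extraction: strip all tags, then find the run of
    consecutive source lines where text is most densely distributed."""
    # Drop scripts/styles first, then all remaining tags.
    text = re.sub(r'(?is)<(script|style).*?</\1>', '', html)
    text = re.sub(r'(?s)<[^>]+>', '', text)
    lines = [ln.strip() for ln in text.splitlines()]
    # Block function: total text length over a window of consecutive lines.
    blocks = [sum(len(l) for l in lines[i:i + window])
              for i in range(len(lines))]
    if not blocks:
        return ''
    # Start at the densest window and grow while lines stay non-empty
    # (a real implementation would use a threshold on the block value).
    start = max(range(len(blocks)), key=blocks.__getitem__)
    end = start
    while end + 1 < len(lines) and lines[end + 1]:
        end += 1
    while start > 0 and lines[start - 1]:
        start -= 1
    return '\n'.join(l for l in lines[start:end + 1] if l)
```

In practice, navigation and footer blocks are usually separated from the body by blank lines in the stripped source, which is what stops the region growth.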

3. Vision-based page segmentation (VIPS), a concrete block-based algorithm from Microsoft Research Asia used in Microsoft's Bing search engine. I like this algorithm because it contains two good ideas: segmenting the page into blocks by visual cues, and merging blocks by visual similarity. Because visual processing is complex, it needs CSS and JavaScript engines and a browser kernel library to process the HTML, so performance may not be high. Moreover, the result only tells you how many blocks the page divides into and each block's position and size; deciding which block is the body still requires further computation.

4. Semantics-based body extraction, which locates the body block from anchor texts, the page title, and other error-prone signals. Such algorithms are effective to a degree but remain limited.

5. Statistics-based, for example news body extraction combining chunking with statistics, and noise removal based on the similarity of same-level Web pages. The former uses statistics to find the body block within a single page; the latter exploits the similarity among different pages under the same URL path to strip out shared noise, so the two differ. Statistics reduce the errors caused by the idiosyncrasies of individual pages and improve accuracy.

 

From the Web developer's perspective

All the methods above reason from the rules visible in the finished page, and each solves part of the problem. But the root of the matter is that Web pages are developed by Web engineers, so studying their development habits and patterns is the most fundamental route to information extraction. Having done Web development myself, I have summarized several patterns (modes) useful for extraction:

Mode 1: Similar pages share one template. Web sites are generally built on CMS systems (such as EmpireCMS), blog systems (such as WordPress), or forum systems (such as Discuz!). Whatever the system, pages of the same type are generated, statically or dynamically, from the same template plus backend data: the structure is identical while the content differs. Even a site redesign is applied uniformly; purely hand-made pages are now very rare.

Mode 2: Information with different functions goes into different block tags. All blocks are marked with block-level (grouping) tags; HTML tags with block behavior include DIV, TABLE, FORM, CENTER, UL, LI, and so on.

Mode 3: Repeating structures come from loops. List data, forum posts, and blog comments are usually fetched as rows of data and then output row by row in a loop.

Mode 4: Blocks are organized by information. First, styles distinguish functions: the navigation, body, related articles, comments, left navigation, and right-hand advertisements all have different styles, while posts and their replies share the same style. Second, the more related two blocks are, the closer they sit: the body, related articles, and comments are near one another, while the body is far from the advertisements on the right.

Mode 5: Whether because the developer's skill is limited or because the site is deliberately aggressive, much body text is not clean, with advertisements eagerly sandwiched inside the body.

Based on this analysis, and drawing on the reference algorithms above, this article proposes an information extraction algorithm based on Web development patterns that addresses both accuracy and cleanliness. Note: accuracy here means the completeness of the extracted body; cleanliness means the body contains no noise.

Description of the information extraction algorithm based on Web development patterns

1. Following Mode 1, collect n (n >= 1) pages under the same domain or path, since pages under the same domain or path are likely to share a template. If n = 1, the method degrades to single-page extraction, which is difficult for very short pages such as one-line news flashes and short blog posts; extracting from a group of similar pages together solves this problem much better.

2. Following Mode 2, build n DOM trees from the block tags in the HTML. Not every HTML tag becomes a node in this tree; only the block tags, the visually meaningful nodes, do. This is sufficient for information extraction and also improves efficiency. The figure below shows the five DOM trees that were built.

Figure 1
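Step 2 can be sketched with the standard library's html.parser. The BLOCK_TAGS set below is an assumption extending the tags listed in Mode 2; text inside inline tags is folded into the nearest enclosing block:

```python
from html.parser import HTMLParser

# Assumed block/grouping tags, extending the Mode 2 list.
BLOCK_TAGS = {'div', 'table', 'form', 'center', 'ul', 'ol',
              'li', 'p', 'tr', 'td'}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], []

class BlockTreeBuilder(HTMLParser):
    """Builds a DOM tree whose nodes are only block-level tags;
    inline tags and their text fold into the nearest block node."""
    def __init__(self):
        super().__init__()
        self.root = Node('root')
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            node = Node(tag, self.cur)
            self.cur.children.append(node)
            self.cur = node

    def handle_endtag(self, tag):
        # Tolerant matching: pop up to the nearest open node of this tag.
        if tag in BLOCK_TAGS:
            n = self.cur
            while n is not self.root and n.tag != tag:
                n = n.parent
            if n is not self.root:
                self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.text.append(data.strip())

def build_block_tree(html):
    builder = BlockTreeBuilder()
    builder.feed(html)
    return builder.root
```

Ignoring non-block tags is also what gives the tag fault tolerance discussed near the end of the article: a mismatched inline tag never corrupts the tree.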

3. Determine whether the n DOM trees are similar, mainly by selecting the higher layers of each tree and checking whether their structures match. Merge the node features of the similar trees; the n trees may belong to more than one template, in which case merge each group into its own tree and run one computation per group. The features are: body text length, number of links, text length inside links, image sizes, and number of tags. Call the merged DOM tree D. If a node at the same path has exactly the same features across different DOM trees, it can be ignored, which removes duplicated noise such as copyright notices and site descriptions.
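A minimal sketch of the features and the top-layer similarity test in step 3. The Block class, the depth parameter, and the reduced feature set are my own illustrative choices; link count, anchor-text length, and image size would be accumulated the same way, and image size additionally needs rendering or HTTP metadata:

```python
class Block:
    """Minimal block node for illustration: a tag, child blocks, text pieces."""
    def __init__(self, tag, children=(), text=()):
        self.tag = tag
        self.children = list(children)
        self.text = list(text)

def node_features(node):
    """Aggregate per-subtree features used when merging similar trees.
    Only text length and tag count are shown here; the other features
    from step 3 would be summed over the subtree in the same way."""
    feats = {'text_len': sum(len(t) for t in node.text), 'tags': 1}
    for child in node.children:
        sub = node_features(child)
        feats['text_len'] += sub['text_len']
        feats['tags'] += sub['tags']
    return feats

def similar_structure(a, b, depth=2):
    """Compare the top `depth` layers of two block trees by tag layout."""
    if a.tag != b.tag or len(a.children) != len(b.children):
        return False
    if depth <= 1:
        return True
    return all(similar_structure(x, y, depth - 1)
               for x, y in zip(a.children, b.children))
```

Two pages generated from one template keep the same tag layout while the text differs, so similar_structure holds and their features can be merged into the tree D.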

4. Following Mode 4, merge similar body blocks in D. In Figure 2, nodes 10, 11, and 12 are body blocks (identifiable from their node features) under the same parent, so they can be merged into node 7. This step exists because some blogs and sites spread the body across several blocks; without the merge, the extracted body would be incomplete.

Figure 2

5. Following Mode 3, merge consecutive identical blocks in D, mainly to handle comments and forum posts. In Figure 3, nodes 2, 3, 4, 5, and 6 have the same structure and are merged into node 2. Without this merge, only one small block would be extracted and the information would be incomplete. At the same time, blocks produced by loops need down-weighting: on some blog pages the comments would otherwise score higher than the post itself, and the comments rather than the post would be extracted.

Figure 3
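Step 5 could look like the following sketch, which fingerprints each block by its shallow structure and collapses runs of structurally identical consecutive siblings. The Block class and the depth-2 signature are my own illustrative choices:

```python
class Block:
    """Minimal block node for illustration: a tag, child blocks, text pieces."""
    def __init__(self, tag, children=(), text=()):
        self.tag = tag
        self.children = list(children)
        self.text = list(text)

def signature(node, depth=2):
    """Shallow structural fingerprint of a block node."""
    if depth == 0:
        return node.tag
    return (node.tag, tuple(signature(c, depth - 1) for c in node.children))

def collapse(run):
    """Fold a run of identical-structure siblings into its first node,
    keeping all text and child blocks under that node."""
    head = run[0]
    for other in run[1:]:
        head.text.extend(other.text)
        head.children.extend(other.children)
    return head

def merge_repeats(node):
    """Collapse consecutive children with identical structure
    (comment lists, forum replies) into one node, recursively."""
    merged, run = [], []
    for child in node.children:
        merge_repeats(child)
        if run and signature(run[0]) == signature(child):
            run.append(child)
        else:
            if run:
                merged.append(collapse(run))
            run = [child]
    if run:
        merged.append(collapse(run))
    node.children = merged
    return node
```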

6. Find the block with the largest amount of information. A word on the concept: information amount is a quantitative measure of how much the text, links, pictures, videos, animations, and their styles convey to the user. Frankly, it asks what the page wants to give the user: a content page gives content, while a navigation page gives links, and the two use different information-amount formulas. Figure 4 shows the structure of a page. Root node 1 has three children, nodes 2, 3, and 4; by the information-amount formula node 3 scores highest, so take node 3. Under node 3 are nodes 7 and 8; node 7 is the larger; under node 7 is node 11, so node 11 is taken as the body node. Why not take node 15? Two possibilities: either node 15 was already merged into node 11 in step 4, or node 15 carries far less information than node 11 and is not selected.

Information amount = body-text information + link information + picture information + video (including Flash) information + tag information
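One way to instantiate the formula as code. The weights below are my own illustrative assumptions, not values from the article: for a content page, plain text and media add information while links and bare markup subtract it:

```python
def information_amount(feats, weights=None):
    """Score a block as a weighted sum of its feature signals.
    Default weights suit content pages; a hub-page formula would
    weight links positively instead."""
    w = weights or {'text_len': 1.0, 'anchor_len': -0.5, 'links': -5.0,
                    'images': 20.0, 'videos': 30.0, 'tags': -1.0}
    return sum(w.get(k, 0.0) * v for k, v in feats.items())
```

Swapping in a weight set that rewards links turns the same function into the hub-page variant mentioned above.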

Figure 4

7. Following Mode 5, the body node found above may contain more than just the body text: noise such as embedded advertisements or an excess of related-link blocks. The body block then needs further cleaning to remove this noise. For a forum, if only the posts themselves are wanted and not the user information, the repeating structure of the replies can be exploited: compute the variance of the information amount of each sub-block across the replies. Sub-blocks with large variance are post content (post lengths differ greatly), while sub-blocks with small variance are user information (user information blocks barely differ).
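The variance test in step 7 can be sketched as follows, assuming each reply has already been aligned by position (via the repeating structure) and reduced to a list of sub-block text lengths; the mean-of-variances threshold is a crude illustrative cut:

```python
from statistics import pvariance

def classify_reply_columns(reply_blocks):
    """reply_blocks: one list of sub-block text lengths per reply,
    aligned by position. High variance across replies marks post
    content; low variance marks boilerplate such as user information."""
    columns = list(zip(*reply_blocks))           # same position across replies
    variances = [pvariance(col) for col in columns]
    threshold = sum(variances) / len(variances)  # crude cut; tune in practice
    return ['post' if v > threshold else 'user-info' for v in variances]
```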


Analysis of the algorithm's advantages

1. Blocks are delimited by grouping tags and the CSS attached to them, instead of VIPS-style color, size, and position information. This simplifies the computation and is efficient, and in practice the results are good. Taking color, position, and size into account would further improve accuracy; whether CSS needs to be processed depends on actual requirements.

2. Extraction exploits the structural similarity of different pages under the same template, and the similarity of the loop blocks within a page, rather than working on a single page or a single block; the accuracy gain is more than 3% (estimated). Hard cases become tractable: one-sentence news flashes, extracting only the post content in a forum (leaving out the user information and signature on the left), extracting a blog post without its comments, and so on. Processing information by observing a group of similar structures is an idea that extends to other kinds of page extraction.

3. The algorithm is fairly general. By choosing how similar blocks are merged and designing a suitable information-amount formula, it can serve a variety of extraction scenarios: body text, pictures and video, links, and structured content.

4. As a further optimization, once structural similarity is established, the structural features of the pages can be retained as a per-site extraction template, to fall back on when a page cannot be extracted directly. This matters especially for forums and blogs, where the page structures are fixed and the pages numerous.


Practical results

In practice, for the body extraction part, sampling tests over data from tens of thousands of sites (news, blog, and forum sites) show an accuracy above 96%.

For the hub-page link analysis part, where the goal is the links on a hub page that need crawling, excluding the hot-topic and navigation links on either side, tests over tens of thousands of hub pages reached 92% accuracy. Taking block position into account would improve this further.


Some other problems in information extraction

Tag fault tolerance: the algorithm does not interpret attribute content, CSS, or scripts; it only needs tag matching, and even mismatched tags do not matter much as long as the information can still be extracted.

Encoding detection: extract the charset from the HTTP header or the HTML meta tags; failing that, Mozilla's charset detection component can identify the encoding automatically. It is advisable to convert everything to UTF-8.
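A minimal sketch of the header-then-meta charset lookup. The regexes are simplifications, and a statistical detector such as Mozilla's would be the final fallback rather than a hard-coded utf-8 default:

```python
import re

def sniff_charset(raw, http_content_type=''):
    """Look for a charset in the HTTP Content-Type header, then in the
    HTML <meta> tags within the first 4 KB; fall back to utf-8."""
    m = re.search(r'charset=([\w-]+)', http_content_type, re.I)
    if m:
        return m.group(1).lower()
    head = raw[:4096].decode('ascii', errors='ignore')
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    return 'utf-8'
```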

Language recognition: using the UTF-8 code ranges of languages such as Japanese and Korean, the distribution of a page's characters across those ranges indicates the probable language.

Title extraction and purification: combine the anchor text with the page title, truncate the title by rules, and remove meaningless suffixes such as "_ News Center _ Sina"; the common title suffix shared by similar pages can also be detected and removed.
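A possible sketch of the separator-based purification. The separator set and the character-overlap heuristic are my own illustrative choices; splitting on '-' is a simplification that would break hyphenated titles:

```python
import re

def clean_title(page_title, anchor_text=''):
    """Split the <title> on common site-name separators and keep the
    segment that best matches the anchor text (longest one otherwise)."""
    parts = [p.strip() for p in re.split(r'[_|\-»]+', page_title) if p.strip()]
    if not parts:
        return page_title.strip()
    if anchor_text:
        # Prefer the segment sharing the most characters with the anchor.
        return max(parts, key=lambda p: len(set(p) & set(anchor_text)))
    return max(parts, key=len)
```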

Date and time recognition: match with regular expressions in the area near the body. If several dates match, keep those after a threshold (say, the year 2000) and take the most recent one that does not exceed the current time.
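These rules translate directly into a regex-based sketch; the pattern below is a simplification covering only the years 2000-2099 in a few common formats:

```python
import re
from datetime import datetime

# Year-month-day with -, /, . or CJK date-marker separators.
DATE_RE = re.compile(r'(20\d{2})[-/年.](\d{1,2})[-/月.](\d{1,2})')

def extract_publish_date(text, now=None):
    """Keep matched dates after 2000 and not in the future,
    then prefer the latest remaining candidate."""
    now = now or datetime.now()
    candidates = []
    for y, mth, d in DATE_RE.findall(text):
        try:
            dt = datetime(int(y), int(mth), int(d))
        except ValueError:          # e.g. 2021-13-40 from a false match
            continue
        if datetime(2000, 1, 1) <= dt <= now:
            candidates.append(dt)
    return max(candidates, default=None)
```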

Image extraction: extract the large image links in the body area; for the image's description, take the text below the image, the text around it, or its caption text.

Link extraction: extract the block with the most links. For hub pages of the link + summary + thumbnail style, the text and images can be weighted into the calculation. Hub pages are varied, but extracting them is less difficult than extracting the body.

Others: to be added as they come to mind.
