VIPS: Vision-based Page Segmentation algorithm [the core page segmentation algorithm of Microsoft's next-generation search engine]

Source: Internet
Author: User
Reprinted from http://blog.csdn.net/tingya; please credit the source when reprinting. Original paper: http://www.ews.uiuc.edu/~Dengcai2/tr-2003-79.pdf

1. Problem statement

With the rapid growth of the Internet, the Web has become the largest information source in the world. As a carrier of information, it is now an important tool for work, study, daily life, and entertainment, and it lets people share large amounts of information across the boundaries of time and space. How to obtain that information, however, remains a common problem. At the most basic level the Web is made up of countless web pages, so obtaining those pages amounts to obtaining the information on the Web. Many current Web information retrieval techniques rest on this idea.

Treating the whole page as the basic unit of information is not reasonable, though. Although authors usually put related content on the same page, a page generally covers more than one topic. A Sina page, for example, may contain sports news, health news, advertisements, navigation links, and more, spread over different parts of the page. To obtain Web information more accurately, we therefore need to extract the semantic structure within a given page.

Web page semantic extraction has many applications. For example, to overcome the limits of keyword search in accessing Web information, many researchers use database techniques and build wrappers that turn web data into structured data. The first task in building a wrapper is to split a web document into a number of data blocks. At present, most of this work relies on adaptive methods.
If the semantic content structure of the page were available, building a wrapper would be much simpler, and the semantic information would be easy to extract.

Another application of semantic block extraction is the search engine. Link analysis is an extremely important task for search engines. For most search engines today, the basic premise of link analysis is that if a link exists between two pages, the two pages as a whole must be related. In most cases, however, a link from page A to page B only means that some part of page A may be related to some part of page B. Many algorithms, such as PageRank and HITS, are built on the former assumption. Defining a link between two whole pages is coarser than defining it between parts of the two pages. A search engine therefore needs to divide a page into multiple semantic blocks in order to interpret links more accurately. Some work has already addressed this issue, but it analyzes page structure using the DOM tree alone. The DOM tree does not fully reflect the semantic structure of the page, so this approach still has shortcomings.

A further potential application of semantic page segmentation is Web access from mobile terminals. Most web pages are designed for desktops and are not suited to direct access from mobile devices, which usually have small screens and limited computing power. Two approaches are currently used: page conversion on the server, and page thumbnails. The former segments and converts the page the user requests and sends the result to the mobile device. The latter generates a thumbnail of the whole page, divided into a number of regions.
A user who is interested in a particular region can then request the content of that region. With these two approaches, Web access from a mobile terminal is largely workable, but the core problem in both is how to segment the page semantically.

Much work has been done on effective page segmentation. [Chakrabarti et al. 2002] extracts structured information from the HTML DOM tree. However, because HTML syntax is so flexible, most web pages do not fully comply with the W3C specifications, which can lead to errors in the DOM tree structure. More importantly, the DOM tree was introduced to describe layout for display in the browser, not to describe the semantic structure of the page. For example, two nodes with the same parent in the DOM tree are not necessarily related semantically, while two semantically related nodes may sit in different branches of the DOM tree. The semantic information of a page therefore cannot be obtained completely by analyzing the DOM tree alone.

From a human point of view, a user looking at a web page naturally treats each semantic block as a single object and does not care how the page is structured internally. When identifying semantic blocks, users rely on visual cues such as background color, font color and size, borders, and the spacing between logical blocks. If we make full use of these visual cues together with the DOM tree when segmenting a page, we can make up for the shortcomings of using the DOM tree alone.

In this paper we propose VIPS, the Vision-based Page Segmentation algorithm, to extract the semantic structure of a given web page. This semantic structure is a hierarchy in which each node represents a semantic block. Each semantic block carries a DoC (degree of coherence) value that describes how closely related the content inside the block is.
The larger the DoC value, the more tightly related the content inside the block; the smaller the value, the looser the relationship. VIPS makes full use of the layout features of web pages: it first extracts all suitable page blocks from the DOM tree, and then detects all separators between these blocks, both horizontal and vertical. Finally, based on these separators, the semantic structure of the page is rebuilt. Each semantic block can in turn be divided into smaller semantic blocks with the same algorithm, so VIPS as a whole is top-down and highly efficient.

2. Related work

Omitted.

3. Visual content structure of a web page

As in [Chen et al. 2001], VIPS first defines the notion of a "basic object". In general, the leaf nodes of the DOM tree are defined as basic objects, because they cannot be divided any further. We then introduce a vision-based content structure in which each node is called a "block". A block is either a basic object or a combination of basic objects. Note that a block in the visual content structure does not necessarily correspond to a node in the DOM tree.

Similar to the document description structure in [Tang et al. 1999], the structure of a web page in VIPS is defined as follows. A page is viewed as a triple Ω = (O, Φ, δ), where:

O = (Ω1, Ω2, ..., ΩN) is the set of all semantic blocks on the given page. These blocks do not overlap. Each block Ωi can itself be described by a triple Ωi = (Oi, Φi, δi) of the same form, and so on recursively.

Φ = (φ1, φ2, ..., φT) is the set of all separators on the current page. Once two semantic blocks on a page are determined, the separator between them is determined as well. A separator in VIPS is not merely notional: separators are horizontal or vertical, and each has a definite width and height.

δ = (ε1, ε2, ..., εM) describes the relationships between pairs of blocks in O. The relationship can be written as δ = O × O → Φ ∪ {null}. Each εk is a pair such as (Ωi, Ωj), indicating that there is a separator between blocks Ωi and Ωj.
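The triple described above maps naturally onto a small recursive data structure. The following Python sketch is one possible representation, not the authors' implementation; the class names `Block` and `Separator`, their fields, and the helper `separator_between` are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Separator:
    """A separator with a pixel extent (field names are hypothetical)."""
    start: int
    end: int
    horizontal: bool = True
    weight: int = 0


@dataclass
class Block:
    """A block in the sense of the triple (O, Phi, delta)."""
    name: str
    children: List["Block"] = field(default_factory=list)       # O: sub-blocks
    separators: List[Separator] = field(default_factory=list)   # Phi
    # delta: maps a pair of child indices (i, j) to the separator between them
    relation: Dict[Tuple[int, int], Separator] = field(default_factory=dict)

    def separator_between(self, i: int, j: int) -> Optional[Separator]:
        """Look up delta for two children; None plays the role of null."""
        key = (min(i, j), max(i, j))
        return self.relation.get(key)


# A page with two top-level blocks and one separator between them.
page = Block("page", children=[Block("VB1"), Block("VB2")])
sep = Separator(start=100, end=110)
page.separators.append(sep)
page.relation[(0, 1)] = sep
```

Because every child is itself a `Block`, the same structure describes the whole page and any sub-block, which mirrors the recursive definition in the text.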

Figure 1 shows the vision-based content structure of a Yahoo page, giving both the layout structure of the page and its visual content structure. At the first level, the original page is divided into four large visual blocks, VB1 to VB4, and three separators, numbered 1 to 3, are detected between them (five were detected originally; the top and bottom ones are discarded). These four visual blocks are not the final result of the segmentation. The final semantic blocks have to be constructed further from the four detected blocks and three separators: some blocks may need to be merged, and some separators discarded. For VB2, for example, three sub-blocks and two separators can be detected inside it, as shown in Figure 1.

For each block, VIPS defines a DoC (degree of coherence) value that reflects how closely related the content inside the block is. DoC has two important properties:
1) The greater the DoC value, the more closely related and consistent the content inside the block; the smaller the value, the looser the relationship.
2) In the hierarchy, the DoC value of a child block is never smaller than that of its parent block.

In VIPS the DoC value ranges from 1 to 10, although this range can be changed. Before segmenting a page, we set a predefined threshold PDoC (permitted degree of coherence) to control the granularity of the resulting semantic blocks. When the DoC value of a semantic block reaches PDoC, the block is not divided any further. The smaller the PDoC, the coarser the resulting blocks; the larger the PDoC, the finer. In Figure 1, for example, with a suitable PDoC value the block VB2_1 will not be divided any further. Different applications can choose different PDoC values to suit their needs.
The main purpose of vision-based page segmentation is to segment the given page semantically. The nodes of the visual content structure produced by the segmentation are therefore always semantic units that carry a certain meaning. In Figure 1(a), for example, VB2_1_1 represents the directory links of the Yahoo pet store, while VB2_2_1 and VB2_2_2 are two different comics.

4. VIPS algorithm description

This section describes the VIPS algorithm in detail. In general, the visual content structure of a page is obtained by combining the DOM tree with visual cues. The whole segmentation process is illustrated in Figure 2. It has three steps: page block extraction, separator detection, and content structure construction. Together these three steps form one round of semantic block detection. The page is first divided into a few relatively large semantic blocks, and the hierarchy of these blocks is recorded. Each detected semantic block can then be segmented further in the same way, until its DoC value reaches the preset PDoC value.

In each round, the DOM tree of the current block and its visual information are obtained. Starting from the root node of that DOM tree, the block extraction process detects page blocks using the visual information. Each DOM node (nodes 1 to 7 in Figure 3b) is checked to decide whether it forms a single page block. If it does not (nodes 1, 3, and 4 in Figure 3b), its child nodes are checked in the same way. Each extracted page block (nodes 2, 5, 6, and 7 in Figure 3b) is assigned a DoC value based on its internal visual properties. When all page blocks of the current round have been extracted, they are placed in the page block pool.

Separator detection then works on these page blocks. All horizontal and vertical separators between the blocks are identified and given a width and height. Based on these separators, the layout hierarchy of the page is rebuilt: some page blocks are merged to form semantic blocks. At that point all semantic blocks of the current round have been detected. Whether another round is needed depends on whether any semantic block at this level has a DoC value smaller than PDoC.
For semantic blocks with DoC >= PDoC, segmentation stops; for the others it continues. For example, if the DoC value of semantic block C is smaller than PDoC, the block is treated as a new sub-page and the segmentation algorithm is run on it, finally dividing it into two parts, C1 and C2, as shown in Figures 4a and 4b.
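The stopping rule described above (keep dividing a block until its DoC reaches PDoC) can be sketched as a small recursive driver. This is only an illustration under stated assumptions: the `Block` class, `segment`, `leaves`, and especially `toy_split` are hypothetical names, and `toy_split` stands in for a full round of block extraction, separator detection, and content structure construction.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Block:
    name: str
    doc: int                                   # degree of coherence, 1..10
    children: List["Block"] = field(default_factory=list)


def segment(block: Block, pdoc: int,
            split: Callable[[Block], List[Block]]) -> Block:
    """Divide blocks recursively until every leaf satisfies doc >= pdoc.

    `split` is a placeholder for one round of VIPS on a single block.
    """
    if block.doc >= pdoc:
        return block                           # coherent enough: stop here
    parts = split(block)
    if not parts:
        return block                           # nothing left to divide
    block.children = [segment(part, pdoc, split) for part in parts]
    return block


def toy_split(block: Block) -> List[Block]:
    # Illustration only: two children, each slightly more coherent.
    return [Block(f"{block.name}_{i}", block.doc + 2) for i in (1, 2)]


def leaves(block: Block) -> List[Block]:
    if not block.children:
        return [block]
    return [leaf for child in block.children for leaf in leaves(child)]


tree = segment(Block("C", doc=4), pdoc=7, split=toy_split)
leaf_names = [b.name for b in leaves(tree)]
```

The recursion terminates here only because the toy splitter always raises the DoC of the children; a real splitter would need the same guarantee, or an explicit depth limit.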

Once all semantic blocks have been extracted, the visual content structure of the whole page is complete. For the example above, the final content hierarchy is shown in Figure 5. The following sections describe page block extraction, separator detection, and content structure construction in detail.

4.1 Semantic block extraction

In this step we aim to extract all visual blocks contained in the current sub-page. In general, every node in the DOM tree can represent a visual block. However, some HTML tags such as <TABLE> and <P> are normally used only to organize data, so they are not suitable for representing a single visual block. Such nodes are not extracted themselves; their child nodes are examined instead. In addition, because HTML syntax is so flexible, many web pages do not strictly follow the W3C HTML specification, so the DOM tree does not always reflect the true relationship between DOM nodes. For each extracted visual block, we set its DoC value according to its internal visual differences. The whole iterative extraction process is described by the algorithm in Figure 6.

How do we decide whether a given node should be divided further? We consider the following:
1) Properties of the DOM node itself, for example its tag, its background color, and the size and shape of the page block it represents.
2) Properties of its child nodes, for example their tags, the background color and foreground color of the regions they represent, the size of those regions, and the number of different kinds of children.

Based on the W3C HTML 4.0 specification, we divide DOM nodes into two categories: inline nodes and line-break nodes. An inline node is a node whose tag affects the appearance of text without causing a line break, such as <B>, <BIG>, <EM>, <FONT>, <I>, <STRONG>, and <U>. Such a node usually affects only how the text looks, not how it is laid out. A line-break node is any node that is not an inline node.

In addition, based on how nodes are displayed in the browser and on the properties of their children, we give the following definitions:
1) Valid node: a node that can be displayed in the browser. In general, the width and height of a valid node are non-zero. A node that carries no useful information is also treated as invalid. In Figure 7, the second and fourth <TD> child nodes of <TR> are invalid.
2) Text node: text in the HTML that is not enclosed by any tag.
3) Virtual text node (defined recursively): a text node is a virtual text node, and an inline node whose child nodes are all text nodes or virtual text nodes is also a virtual text node. Wrapping text in tags such as <B>, <BIG>, or <I> changes only how it is displayed in the browser, not its nature as text.
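The node categories defined above can be expressed as a few predicates. The sketch below is a simplified illustration, not the authors' code: the `Node` class and its fields are hypothetical, the inline tag set contains only the tags named in the text, validity is approximated by non-zero size (or non-blank text for text nodes), and an inline node with no children is assumed not to be a virtual text node.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Inline tags named in the text; the HTML specification lists more.
INLINE_TAGS = {"b", "big", "em", "font", "i", "strong", "u"}


@dataclass
class Node:
    tag: Optional[str]                 # None marks a bare text node
    width: int = 0
    height: int = 0
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def is_text(node: Node) -> bool:
    return node.tag is None


def is_inline(node: Node) -> bool:
    return node.tag is not None and node.tag.lower() in INLINE_TAGS


def is_line_break(node: Node) -> bool:
    # Every element that is not inline counts as a line-break node.
    return node.tag is not None and not is_inline(node)


def is_valid(node: Node) -> bool:
    # Simplification: an element is valid if it has non-zero width and height;
    # a text node is valid if it carries non-blank text.
    if is_text(node):
        return bool(node.text.strip())
    return node.width > 0 and node.height > 0


def is_virtual_text(node: Node) -> bool:
    # Recursive definition: a text node, or an inline node whose children
    # are all text nodes or virtual text nodes.
    if is_text(node):
        return True
    return (is_inline(node) and bool(node.children)
            and all(is_virtual_text(c) for c in node.children))


bold = Node("b", 40, 12, children=[Node(None, text="hello")])
nested = Node("font", 40, 12, children=[bold, Node(None, text=" world")])
cell = Node("td", 0, 0)
```

Here `nested` is a virtual text node because both of its children are, while the empty, zero-sized `cell` is a line-break node that is not valid.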
The name "virtual text node" reflects the fact that a node wrapped only in such tags still behaves as text. The DivideDomtree algorithm for extracting visual blocks is shown in Figure 6. Several kinds of cues in this algorithm are used to form heuristic rules:

Tag cue: 1) Some tags, such as <HR>, are normally used to separate content on different topics visually. If a DOM node contains such tags, we tend to divide it further. 2) If an inline node has a line-break node among its children, we tend to divide it.

Color cue: if the background color of one of the child nodes differs from that of the current node, we tend to divide the current node. The child node with the different background color is not divided in this round; its segmentation is left to the next iteration.

Text cue: if most of the child nodes of the current node are text nodes or virtual text nodes, we tend not to divide it.

Size cue: a size threshold can be predefined for each node type (the size of the node relative to the size of the whole page). If the relative size of the node is smaller than the threshold, the node is not divided.

Based on these cues, we give a set of inference rules for deciding whether the current node should be divided. If a node is not to be divided, it is extracted as a page block, its DoC value is set, and it is saved to the page block pool. The inference rules are listed below:
Rule 1 If the current node is not a text node and has no valid child node, it cannot be divided and is removed from the node set.
Rule 2 If the current node has only one valid child node and that child is not a text node, the current node is divided.
Rule 3 If the current DOM node is the root node of the sub-DOM tree (corresponding to the page block), and only one sub-DOM tree corresponds to the current page block, the node is divided.
Rule 4 If all child nodes of the current node are text nodes or virtual text nodes, the node is not divided. If the font size and font weight of all these child nodes are the same, the DoC value of the page block is set to 10; otherwise it is set to 9.
Rule 5 If one of the child nodes of the current DOM node is a line-break node, the node is divided further.
Rule 6 If one of the child nodes of the current node is an <HR> node, the node is divided further.
Rule 7 If the background color of the current node differs from that of one of its child nodes, the node is divided. The child node with the different background color is not divided in this iteration; it is segmented in the next iteration. The DoC value of that child node is set to 6-8 according to its tag and size.
Rule 8 If a node has at least one text or virtual text child node and the relative size of the node is smaller than the threshold, the node is not divided, and its DoC value is set to 5-8.
Rule 9 If the largest of the child nodes of the current node is smaller than the threshold, the node is not divided, and its DoC value is set according to its HTML tag and size.
Rule 10 If the previous sibling node was not divided, the node is not divided.
Rule 11 Divide the node.
Rule 12 Do not divide the node, and set its DoC value according to its tag and size.
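To make the rule list more concrete, the following Python sketch encodes Rules 1, 2, 4, and 6 only. It is a simplified illustration under assumptions, not the DivideDomtree algorithm itself: the `Node` class, the `decide` function, and the action labels "cut", "divide", and "keep" are hypothetical, the rules are tried in list order for every node (the original applies different rule subsets to different tags), and the remaining rules are reduced to a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

INLINE_TAGS = {"b", "big", "em", "font", "i", "strong", "u"}


@dataclass
class Node:
    tag: Optional[str]                       # None marks a bare text node
    valid: bool = True
    font: Tuple[int, str] = (12, "normal")   # (font size, font weight)
    children: List["Node"] = field(default_factory=list)


def is_virtual_text(n: Node) -> bool:
    if n.tag is None:
        return True
    return (n.tag in INLINE_TAGS and bool(n.children)
            and all(is_virtual_text(c) for c in n.children))


def decide(n: Node) -> Tuple[str, Optional[int]]:
    """Return (action, doc); action is 'cut', 'divide' or 'keep'."""
    kids = [c for c in n.children if c.valid]
    if n.tag is not None and not kids:                   # Rule 1
        return "cut", None
    if len(kids) == 1 and kids[0].tag is not None:       # Rule 2
        return "divide", None
    if kids and all(is_virtual_text(c) for c in kids):   # Rule 4
        same_font = len({c.font for c in kids}) == 1
        return "keep", 10 if same_font else 9
    if any(c.tag == "hr" for c in kids):                 # Rule 6
        return "divide", None
    return "keep", None          # placeholder: the other rules are omitted here


empty_cell = Node("td")
table = Node("table", children=[Node("tr", children=[Node(None)])])
plain = Node("p", children=[Node(None), Node(None)])
mixed = Node("p", children=[Node(None),
                            Node("b", font=(12, "bold"), children=[Node(None)])])
ruled = Node("div", children=[Node(None), Node("hr"), Node(None)])
```

For the five sample nodes, the function cuts the empty cell, divides the table and the node containing <HR>, and keeps the two paragraphs with DoC 10 and 9 respectively.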
For different DOM nodes, we use different inference rules:

Consider the page in Figure 1. In the first round of page block extraction, VB1, VB2_1, VB2_2, VB2_3, VB3, and VB4 are extracted and placed in the page block pool. The extraction of VB2_1, VB2_2, and VB2_3 is described in detail below. Figure 7(b) shows a table that is part of the whole page; its DOM tree is shown on the left. During extraction, when the <TABLE> node is reached, it has only one valid child node, <TR>, so by Rule 2 we descend into <TR>. The <TR> node has five <TD> child nodes, of which only three are valid. The background color of the first child differs from that of its parent, so by Rule 7 the <TR> node is divided, and the first <TD> node is not divided further in this iteration but is saved to the page block pool. The second and fourth <TD> nodes are invalid and are removed. The third and fifth <TD> nodes are not divided in this iteration, by Rule 10. We therefore end up with three page blocks: VB2_1, VB2_2, and VB2_3.

4.2 Separator detection

After all page blocks have been extracted, they are stored in the page block pool for separator detection. In VIPS, a separator is a horizontal or vertical line in the page. Visually, separators are good indicators of the boundaries between different semantics within the page. A visual separator is described by a 2-tuple (Ps, Pe), where Ps is its start coordinate and Pe its end coordinate, both in pixels. From Ps and Pe, the width and height of the separator are easy to compute.

4.2.1 Separator detection

The separator detection algorithm works as follows:
1) Initialize the separator list. At the start the list contains a single separator (Pbe, Pee), whose start and end coordinates are those of the whole page.
2) For each page block in the pool, its relationship to each separator falls into one of three cases:
■ The block is contained in the separator. The separator is split at the edges of the block into multiple separators.
■ The block partly overlaps the separator. The parameters of the separator are adjusted to the boundary of the block.
■ The block covers the separator. The separator is removed.
3) Remove the four separators that lie along the edges of the page.

Figure 8 shows the detection process. For simplicity, only the detection of horizontal separators is shown. At the beginning there is one large separator whose start and end positions are those of the whole page. When the first page block is placed in the pool, it is contained in this separator, so the original separator is split into S1 and S2.
Similarly, after the second and third page blocks are placed in the pool, four separators, S1, S2, S3, and S4, have been detected. When the fourth page block is placed in the pool, it covers separator S3 and partly overlaps separator S2. S3 is therefore removed and S2 is adjusted; after the adjustment, S2 is visibly narrower.
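The three cases above (contained, partly overlapping, covering) can be sketched for horizontal separators by projecting everything onto the vertical axis. This is a simplified one-dimensional illustration, not the authors' implementation; the function name `detect_horizontal_separators`, the interval representation, and the sample coordinates are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]   # (start, end) in pixels along the vertical axis


def detect_horizontal_separators(page: Interval,
                                 blocks: List[Interval]) -> List[Interval]:
    separators = [page]                        # step 1: one separator spans the page
    for top, bottom in blocks:                 # step 2: place each block in turn
        updated: List[Interval] = []
        for start, end in separators:
            if bottom <= start or top >= end:
                updated.append((start, end))   # no contact: unchanged
            elif start < top and bottom < end:
                updated.append((start, top))   # block inside: split in two
                updated.append((bottom, end))
            elif top <= start and end <= bottom:
                continue                       # block covers the separator: remove
            elif top <= start:
                updated.append((bottom, end))  # overlap at the start: trim
            else:
                updated.append((start, top))   # overlap at the end: trim
        separators = updated
    # step 3: drop the separators that touch the page border
    return [(s, e) for s, e in separators if s != page[0] and e != page[1]]


page = (0, 100)
three_blocks = [(10, 30), (40, 60), (70, 90)]
after_three = detect_horizontal_separators(page, three_blocks)
# A fourth block that partly overlaps one separator and covers the next.
after_four = detect_horizontal_separators(page, three_blocks + [(35, 75)])
```

With the sample coordinates, the fourth block trims the separator at (30, 40) down to (30, 35) and removes the one at (60, 70), which is the same kind of adjustment as in the Figure 8 walkthrough. Vertical separators would be handled the same way on the horizontal axis.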

4.2.2 Setting separator weights

Separators are used to tell apart page blocks with different semantics. The weight of a separator can therefore be set according to the visual differences between the blocks on its two sides. The higher the weight of a separator, the more likely it is to mark a real semantic boundary. The following rules are used to set the weight:
1) The greater the distance between the page blocks on the two sides of the separator, the higher its weight.
2) If a separator is produced by an HTML tag such as <HR>, its weight is higher.
3) If the background colors of the page blocks on the two sides of the separator differ, its weight is increased.
4) For horizontal separators, if font properties such as font size and font weight differ between the page blocks on the two sides, the weight is increased. The weight is increased further if the font of the block above the separator is smaller than the font of the block below it.
5) For horizontal separators, if the structures of the page blocks on the two sides are very similar (for example, both are text), the weight is decreased.

Consider the third <TD> node in Figure 7. The sub-page corresponding to this node is shown in Figure 9(b), and its DOM tree in Figure 9(a). By our definition, many nodes in this DOM tree are invalid and cannot be displayed in the browser; they are ignored during page block extraction. After extraction, six page blocks are saved to the pool and five horizontal separators are detected. The weights of these separators are then set according to the five rules above.
In this example, the separator between page blocks 2 and 3 has a higher weight than the separator between page blocks 1 and 2, because the fonts differ. For the same reason, the separator between blocks 4 and 5 also has a higher weight. The final separators and their weights are shown in Figure 9(c).
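The five weighting rules can be combined into a single scoring function. The text does not give numeric increments, so every constant in the sketch below is an illustrative assumption, as are the `BlockStyle` class, its fields, and the function name `separator_weight`. Only the direction of each adjustment follows the rules.

```python
from dataclasses import dataclass


@dataclass
class BlockStyle:
    top: int
    bottom: int
    background: str = "#ffffff"
    font_size: int = 12
    font_weight: str = "normal"
    is_text: bool = True


def separator_weight(above: BlockStyle, below: BlockStyle,
                     from_hr_tag: bool = False) -> int:
    """Score a horizontal separator; all increments are made-up values."""
    weight = max(below.top - above.bottom, 0)   # rule 1: wider gap, higher weight
    if from_hr_tag:
        weight += 10                            # rule 2: produced by a tag such as <HR>
    if above.background != below.background:
        weight += 10                            # rule 3: background colors differ
    if (above.font_size, above.font_weight) != (below.font_size, below.font_weight):
        weight += 5                             # rule 4: font properties differ
        if above.font_size < below.font_size:
            weight += 5                         # rule 4: smaller font above a larger one
    if above.is_text and below.is_text:
        weight -= 3                             # rule 5: similar structure on both sides
    return weight


title = BlockStyle(0, 20, font_size=16, font_weight="bold")
body = BlockStyle(24, 80)
next_title = BlockStyle(90, 110, font_size=16, font_weight="bold")

w_title_body = separator_weight(title, body)
w_body_next = separator_weight(body, next_title)
```

In this toy layout the separator between a body block and the following title scores higher than the separator between a title and its own body, which is the same ordering as in the Figure 9 example.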

4.2.3 Content structure construction

After the separators have been detected and their weights set, content structure construction begins. Construction starts from the separators with the lowest weight: the page blocks on the two sides of such a separator are merged into a new page block. Merging continues until the separators with the highest weight are reached. A DoC value is set for each new semantic block. Once the page blocks have been merged into semantic blocks, the current iteration ends.

The DoC value of each semantic block is then compared with PDoC. If the DoC value is smaller than PDoC, a new round starts for that block: page block extraction, separator detection, and content structure construction. When the DoC values of all semantic blocks are no smaller than PDoC, the iteration stops and the content structure of the whole page is complete.

Take Figure 9 as an example. In the first iteration, the first, third, and fifth separators are selected, and page blocks 1 and 2 are merged into the new semantic block VB2_2_2_1. Page blocks 3 and 4 are merged in the same way into the new semantic block VB2_2_2_2, and page blocks 5 and 6 into VB2_2_2_3. The new semantic blocks VB2_2_2_1, VB2_2_2_2, and VB2_2_2_3 are child nodes of the semantic block VB2_2_2. For each leaf node, such as VB2_2_2_1_1, VB2_2_2_1_2, and VB2_2_2_2_1, the DoC value is checked against PDoC. In this way the final content structure is built.
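The merging step can be sketched as follows for one level of the hierarchy. This is an illustration under assumptions rather than the authors' procedure: the function name `merge_blocks` is hypothetical, blocks are taken to be in document order with exactly one separator between neighbors, and the weights in the example are invented to reproduce the pattern of the Figure 9 walkthrough.

```python
from typing import List


def merge_blocks(blocks: List[str], weights: List[int]) -> List[List[str]]:
    """Group adjacent blocks; weights[i] separates blocks i and i + 1.

    Separators are processed from the lowest weight upward and the blocks on
    both sides are merged; separators with the maximum weight are kept as the
    boundaries between the resulting groups.
    """
    if len(weights) != len(blocks) - 1:
        raise ValueError("need exactly one separator between adjacent blocks")
    groups = [[b] for b in blocks]
    owner = list(range(len(blocks)))       # index of the group holding each block
    top = max(weights, default=0)
    for i in sorted(range(len(weights)), key=lambda k: weights[k]):
        if weights[i] == top:
            break                          # highest-weight separators stay
        left, right = owner[i], owner[i + 1]
        groups[left].extend(groups[right])
        for j, g in enumerate(owner):
            if g == right:
                owner[j] = left
        groups[right] = []
    return [g for g in groups if g]


# Six page blocks and five separators; the weights are invented for illustration.
merged = merge_blocks(["1", "2", "3", "4", "5", "6"], [2, 8, 3, 8, 2])
```

With these weights, the first, third, and fifth separators are merged away and the second and fourth remain, giving three groups of two blocks each. Each group would then receive its own DoC value and be compared with PDoC.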