VIPS: A Vision-based Page Segmentation Algorithm
The main idea of this paper:
From a human perspective, a user looking at a Web page naturally perceives each semantic block as a single object and does not care how the internal structure of the page is described. Visual cues such as background color, font color and size, borders, and the spacing between logical blocks can therefore be used to tell semantic blocks apart. Combining these visual cues with the DOM tree gives better results when dividing a page into semantic blocks. The VIPS algorithm first extracts all suitable page blocks from the DOM tree, then detects all separators between these blocks, in both the horizontal and vertical directions. Finally, it rebuilds the semantic structure of the Web page from these separators. The algorithm can then be applied again to each semantic block to divide it into smaller semantic blocks.
VIPS algorithm flowchart
Node segmentation is based on:
1. Attributes of the DOM node itself, for example the tag of the current DOM node, its background color, and the size and shape of the page block it represents.
2. The child nodes of the current DOM node, for example their tags, the background color, foreground color, and size of the areas they represent, and the number of different types of children.
The node separation principles are as follows:
1. Tag cue
Some tags, such as <HR>, are usually used to visually separate content on different topics. Therefore, if a DOM node contains one of these tags, the node can be split further.
If an inline node has a child that is a line-break node, the inline node tends to be split.
2. Color cue
If one of the child nodes of the current node has a background color different from that of the current node, split the current node. The child node with the different background color is not split in this round; it is handled in the next iteration.
3. Text cue
If most of the child nodes of the current node are text nodes or virtual text nodes, the node is not split.
4. Size cue
A relative size threshold is predefined for each node type (the size of the node is compared with the size of the entire page). If the relative size of the node is smaller than the threshold, splitting stops.
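The four heuristics above (tag, color, text, size) can be combined into a single divisibility check. The following is only a minimal sketch: the `DomNode` class, the `should_split` function, the 0.1 threshold, the order in which the rules are tried, and the default of splitting when no rule applies are all assumptions made for illustration and are not specified in the text.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical tag set: the text only names <HR> as an example.
SEPARATOR_TAGS = {"hr"}


@dataclass
class DomNode:
    tag: str                     # lower-case tag name
    bgcolor: str = "#ffffff"     # rendered background color
    area: float = 0.0            # rendered area in square pixels
    inline: bool = False         # True for inline nodes, False for line-break nodes
    is_text: bool = False        # True for text or virtual text nodes
    children: List["DomNode"] = field(default_factory=list)


def should_split(node: DomNode, page_area: float, size_threshold: float = 0.1) -> bool:
    """Return True if the node should be divided further (rule order is assumed)."""
    kids = node.children
    if not kids:
        return False
    # Size heuristic: stop when the node is small relative to the whole page.
    if node.area / page_area < size_threshold:
        return False
    # Tag heuristic: a separating tag such as <hr> among the children.
    if any(k.tag in SEPARATOR_TAGS for k in kids):
        return True
    # Tag heuristic: an inline node with a line-break child.
    if node.inline and any(not k.inline for k in kids):
        return True
    # Color heuristic: a child whose background differs from the parent's.
    if any(k.bgcolor != node.bgcolor for k in kids):
        return True
    # Text heuristic: mostly text children means the node is kept whole.
    if sum(1 for k in kids if k.is_text) > len(kids) / 2:
        return False
    # Assumed default when no rule fires.
    return True
```

In a real implementation the colors and sizes would come from the rendered page rather than from hand-built objects, and the threshold would vary with the node type as the size rule describes.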
Separators are detected with the following steps:
1. Initialize the separator list. At first the list contains a single separator whose start and end coordinates (Pbe, Pee) correspond to the start and end coordinates of the entire Web page.
2. If a page block is contained in a separator, split the separator at the edges of the page block into multiple separators.
3. If a page block partially overlaps a separator, adjust the parameters of the separator according to the boundary of the page block.
4. If a page block covers a separator completely, remove the separator.
5. Remove the separators at the edges of the page.
Separator detection steps
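The detection steps can be sketched along a single axis, where each page block and each separator is an interval of pixel coordinates. This is an illustrative sketch only: the function name `detect_separators`, the `(start, end)` tuple representation, and the treatment of blocks that merely touch a separator as non-overlapping are assumptions, not details given in the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) pixel coordinates along one axis


def detect_separators(page: Interval, blocks: List[Interval]) -> List[Interval]:
    """Return the gaps between blocks along one axis, excluding page-edge gaps."""
    separators = [page]  # step 1: one separator spanning the whole page
    for b_start, b_end in blocks:
        updated = []
        for s_start, s_end in separators:
            if b_end <= s_start or b_start >= s_end:
                # Block does not overlap this separator: keep it unchanged.
                updated.append((s_start, s_end))
            elif s_start < b_start and b_end < s_end:
                # Step 2: block lies inside the separator, so split it in two.
                updated.append((s_start, b_start))
                updated.append((b_end, s_end))
            elif b_start <= s_start and b_end >= s_end:
                # Step 4: block covers the separator entirely, so drop it.
                continue
            elif b_start <= s_start:
                # Step 3: block overlaps the start of the separator.
                updated.append((b_end, s_end))
            else:
                # Step 3: block overlaps the end of the separator.
                updated.append((s_start, b_start))
        separators = updated
    # Step 5: discard separators that touch the page edges.
    return [s for s in separators if s[0] != page[0] and s[1] != page[1]]
```

Running the same routine once on the vertical extents of the blocks and once on their horizontal extents would give the horizontal and vertical separators respectively.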
Semantic block reconstruction process:
Starting from the separators with the lowest weight, the page blocks on both sides of each separator are merged to form a new page block. The merge process iterates until the separators with the highest weight are reached. For each new semantic block, the corresponding DoC (degree of coherence) is also set.
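The bottom-up merge can be illustrated for a single row of blocks. In this hedged sketch, `build_structure` is a hypothetical helper, `weights[i]` is assumed to be the weight of the separator between `blocks[i]` and `blocks[i + 1]`, and merged blocks are represented as tuples. The DoC assignment is left out because the text does not state how its value is computed.

```python
from typing import List, Sequence


def build_structure(blocks: Sequence, weights: Sequence[float]) -> List:
    """Merge neighbouring blocks, lowest-weight separators first.

    Merging stops when only separators of the highest weight remain,
    so those separators divide the top-level blocks that are returned.
    """
    if len(weights) != len(blocks) - 1:
        raise ValueError("need exactly one separator between each pair of blocks")
    nodes = list(blocks)
    seps = list(weights)
    if not seps:
        return nodes
    top = max(seps)
    for w in sorted(set(seps)):
        if w == top:
            break  # highest-weight separators are kept as top-level boundaries
        groups = [[nodes[0]]]
        remaining = []
        for sep, node in zip(seps, nodes[1:]):
            if sep == w:
                groups[-1].append(node)   # merge across this separator
            else:
                groups.append([node])     # separator survives this round
                remaining.append(sep)
        # A group of one block is left as it is; larger groups become a new block.
        nodes = [g[0] if len(g) == 1 else tuple(g) for g in groups]
        seps = remaining
    return nodes
```

For example, with blocks `a, b, c, d` and separator weights `1, 3, 2`, the blocks `a` and `b` are merged first, then `c` and `d`, and the separator of weight 3 remains as the boundary between the two resulting blocks.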
In addition, separators are used to distinguish page blocks with different semantics. Therefore, the weight of a separator is set according to the visual difference between the blocks on its two sides. The higher the weight of a separator, the more likely it is that the page blocks on its two sides belong to different semantic blocks. The following principles are used to set the weight of a separator:
1. The farther apart the page blocks on the two sides of a separator are, the higher the weight of the separator.
2. If a separator is obtained by detecting an HTML tag, such as <HR>, the weight of the separator is higher.
3. If the background colors of the page blocks on the two sides of a separator are different, the weight of the separator increases accordingly.
4. For horizontal separators, if the font attributes of the page blocks on the two sides, such as font size and font weight, are different, the weight of the separator increases. The weight also increases if the font size of the block above the separator is smaller than the font size of the block below it.
5. For horizontal separators, when the structures of the page blocks on the two sides are very similar, for example when both are text, the weight of the separator decreases.
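The five weighting principles can be turned into a simple scoring function. This is a sketch under loud assumptions: the text gives no numeric values, so every increment below (10, 5, 3, 2) is a made-up placeholder, and `PageBlock`, `separator_weight`, and the use of "both blocks are text" as the similarity test are hypothetical names and simplifications.

```python
from dataclasses import dataclass


@dataclass
class PageBlock:
    bgcolor: str
    font_size: float
    font_weight: int
    is_text: bool


def separator_weight(gap_px: float, from_hr_tag: bool, horizontal: bool,
                     first: PageBlock, second: PageBlock) -> float:
    """Score a separator; `first` is the block above (or left), `second` below (or right).

    All numeric increments are illustrative placeholders, not values from the source.
    """
    weight = float(gap_px)                 # principle 1: larger gap, higher weight
    if from_hr_tag:
        weight += 10                       # principle 2: separator comes from a tag like <hr>
    if first.bgcolor != second.bgcolor:
        weight += 5                        # principle 3: different background colors
    if horizontal:
        if (first.font_size != second.font_size
                or first.font_weight != second.font_weight):
            weight += 3                    # principle 4: font attributes differ
        if first.font_size < second.font_size:
            weight += 2                    # principle 4: smaller font above larger font
        if first.is_text and second.is_text:
            weight -= 2                    # principle 5: similar structure (both text)
    return weight
```

Only the relative ordering of the resulting weights matters for the merge step, so the placeholder values would need tuning against real pages.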