Web page segmentation and slicing for search engines: principle, implementation, and demo
I recently read the South China Kapo Information Retrieval PPT presented by Ou Jianwen, chief of information retrieval at South China Kapo, at the 2005 National Search Engine and Online Information Mining academic seminar. It was very enlightening.
So I decided to implement the idea myself, based on my own understanding of it.
Prerequisites:
1. The basic units of web page segmentation are the table and div tags in HTML (currently only table and div tags are supported).
2. Identifying the slices of a web page depends on comparing it with pages at similar URLs. For example, we assume that the pages at these two URLs have a similar HTML structure:
http://news.soufun.com/2005-11-26/580107.htm
http://news.soufun.com/2005-11-26/580175.htm
whereas the pages at the following two URLs have different structures:
http://news.soufun.com/subject/weekly051121/index.html
http://news.soufun.com/2005-11-26/580175.htm
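To make the second prerequisite concrete, here is a minimal sketch of one possible way to decide that two URLs are "similar". The rule used here (same host, and the same path once every run of digits is masked) is my own assumption for illustration, not something taken from the PPT, and the names url_pattern and looks_similar are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_pattern(url):
    """Reduce a URL to a coarse pattern: host plus path with digit runs masked."""
    parts = urlparse(url.strip())
    # Assumption: dates and article ids are the only things that vary between
    # pages built from the same template, so every run of digits becomes "N".
    masked_path = re.sub(r"\d+", "N", parts.path)
    return parts.netloc.lower() + masked_path


def looks_similar(url_a, url_b):
    """Treat two URLs as structurally similar when their patterns are equal."""
    return url_pattern(url_a) == url_pattern(url_b)


article_1 = "http://news.soufun.com/2005-11-26/580107.htm"
article_2 = "http://news.soufun.com/2005-11-26/580175.htm"
weekly = "http://news.soufun.com/subject/weekly051121/index.html"

print(url_pattern(article_1))             # news.soufun.com/N-N-N/N.htm
print(looks_similar(article_1, article_2))  # True
print(looks_similar(weekly, article_2))     # False
```

A rule this simple will of course misjudge some sites; it is only meant to show the kind of grouping that the slice comparison relies on.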
Purpose:
1. Identify whether a webpage is a topic page or a directory page by analyzing its structure;
2. Identify the subject content, related content, and noise content of the webpage by analyzing its structure.
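For the first goal, a rough sketch of one common heuristic is shown below: measure how much of the visible text sits inside links. A directory page is mostly link text, while a topic page is mostly running text. This heuristic, the 0.5 threshold, and the names LinkTextCounter and classify_page are assumptions of mine for illustration; the PPT may well use a different criterion.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count visible text characters, tracking separately those inside <a>."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # script and style bodies are not visible text
        count = len("".join(data.split()))  # ignore whitespace
        self.total_chars += count
        if self.anchor_depth:
            self.link_chars += count


def classify_page(html, threshold=0.5):
    """Return "directory", "topic", or "unknown" (no visible text at all)."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return "unknown"
    ratio = counter.link_chars / counter.total_chars
    # Assumed rule: mostly link text means a directory page.
    return "directory" if ratio > threshold else "topic"
```

For example, a page whose only text is a list of linked headlines comes out as "directory", while an article body followed by a single "more" link comes out as "topic".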
Three phases of implementation:
1. Segment the webpage structure into reasonable slices;
2. Compare the slice structure of similar web pages;
3. Analyze the slice data and draw a conclusion.
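The three phases can be sketched roughly as follows. This is only an illustration under several assumptions of mine, not the code behind the demo: table and div tags are assumed to be properly nested, script and style text is not filtered out, a slice is identified by its position among the table/div elements, a slice whose text is identical on both pages is treated as noise, and among the slices that change, the one with the most text is taken as the subject and the rest as related content. The names Slicer, slice_page, and label_slices are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = ("table", "div")


class Slicer(HTMLParser):
    """Phase 1: split a page into slices, one per <table> or <div> element."""

    def __init__(self):
        super().__init__()
        self.stack = []           # paths of the currently open table/div elements
        self.child_counts = [{}]  # per-level counters used to number children
        self.slices = {}          # path -> text chunks owned directly by the slice

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        parent = self.stack[-1] if self.stack else ""
        path = f"{parent}/{tag}[{index}]"
        self.stack.append(path)
        self.child_counts.append({})
        self.slices[path] = []

    def handle_endtag(self, tag):
        # Assumes well-nested markup: the closing tag ends the innermost slice.
        if tag in BLOCK_TAGS and self.stack:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack:
            # Text belongs to the innermost open table/div only.
            self.slices[self.stack[-1]].append(text)


def slice_page(html):
    """Return a dict mapping each slice path to its own text."""
    slicer = Slicer()
    slicer.feed(html)
    slicer.close()
    return {path: " ".join(chunks) for path, chunks in slicer.slices.items()}


def label_slices(page_html, similar_html):
    """Phases 2 and 3: compare slices of two similar pages and label them."""
    page = slice_page(page_html)
    other = slice_page(similar_html)
    labels = {}
    changing = []
    for path, text in page.items():
        if not text:
            continue  # pure container slice with no text of its own
        if other.get(path) == text:
            labels[path] = "noise"  # identical on both pages: template text
        else:
            changing.append(path)
    if changing:
        # Assumption: the longest changing slice holds the subject content.
        subject = max(changing, key=lambda p: len(page[p]))
        for path in changing:
            labels[path] = "subject" if path == subject else "related"
    return labels
```

Given two article pages from the same template, the navigation and copyright slices would come out as "noise", the article body as "subject", and a changing "see also" box as "related". Comparing against more than one similar page would make the noise decision more reliable.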
Demo address:
http://www.domolo.com:8090/domoloweb/html-page-slice.jsp