Some problems with the web page slicing algorithm

This post is a summary of the ideas from my study of web page slicing algorithms.

I have written before about search-engine-oriented web page chunking: the principles behind slicing, an implementation, and a demonstration. As the work has deepened, I have gradually run into the following problems:

Granularity of page slices:

The purpose of the web page slicing algorithm is not to pinpoint exactly the content that is needed, but to identify the various functional areas that make up a page: navigation areas, link areas, content areas, footer areas, and advertising areas.

Which pages the slicing algorithm should target:

Broadly speaking, the pages that make up the Web fall into two types: directory (catalog) pages and content pages. With the development of search engines, site structures have been trending toward flatter layouts (Che Dong has verified this with data), and as display resolutions keep increasing, pages that blend content and catalog appear to be on the rise.

The page slicing algorithm should therefore target content pages and content-catalog hybrids. There should be a recognition algorithm that tells the different kinds of pages apart; what criteria should it include?
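One candidate criterion, sketched below only as an assumption and not as a settled standard, is the ratio of anchor text to total text: catalog pages are link-dense, content pages are text-dense, and hybrids fall in between. The 0.7 and 0.3 thresholds are illustrative guesses.

```python
# Sketch only: classify a page as catalog, content, or hybrid by the
# share of text that sits inside <a> tags. Thresholds are assumptions.
from html.parser import HTMLParser


class LinkTextRatio(HTMLParser):
    """Accumulates total text length and the length of text inside links."""

    def __init__(self):
        super().__init__()
        self.depth_in_a = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth_in_a += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth_in_a > 0:
            self.depth_in_a -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth_in_a:
            self.link_chars += n


def classify_page(html: str) -> str:
    parser = LinkTextRatio()
    parser.feed(html)
    if parser.total_chars == 0:
        return "unknown"
    ratio = parser.link_chars / parser.total_chars
    if ratio > 0.7:      # almost all text is anchor text -> catalog page
        return "catalog"
    if ratio < 0.3:      # mostly plain text -> content page
        return "content"
    return "hybrid"      # content-catalog blend
```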

Identifying the full extent of the page content area:

The granularity described above implies that the content area should be cut out as a single block. By common web design conventions, a content area is generally laid out in one of two ways: 1) nested, where one container holds the whole article (such as a blog post); 2) parallel, where several sibling blocks sit side by side (such as BBS posts).
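One rough way to tell the two layouts apart, assuming the BeautifulSoup library and purely illustrative thresholds, is to look at how the text is distributed among a container's direct children: a single dominant child suggests the nested (blog-style) layout, while several siblings of comparable size suggest the parallel (BBS-style) layout.

```python
# Sketch only: decide whether a content container is "nested" (one child
# holds almost all the text, blog-style) or "parallel" (several siblings
# of comparable size, BBS-style). Thresholds are illustrative assumptions.
from bs4 import BeautifulSoup


def layout_type(container) -> str:
    children = container.find_all(recursive=False)
    sizes = [len(c.get_text(strip=True)) for c in children]
    total = sum(sizes)
    if total == 0:
        return "empty"
    substantial = [s for s in sizes if s > 0.15 * total]
    if len(substantial) == 1 and substantial[0] > 0.8 * total:
        return "nested"      # one dominant child, e.g. a blog post body
    if len(substantial) >= 3:
        return "parallel"    # several comparable blocks, e.g. forum posts
    return "mixed"


if __name__ == "__main__":
    html = "<div><div>post one text</div><div>post two text</div><div>post three text</div></div>"
    soup = BeautifulSoup(html, "html.parser")
    print(layout_type(soup.div))   # -> "parallel"
```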

Handling paginated content pages:

To improve the user experience, and to increase page views, most sites now split long content across several pages, so this case needs to be handled as well.
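A minimal sketch of handling this, assuming the requests and BeautifulSoup libraries and a hypothetical list of "next page" link texts: follow pagination links until they run out and collect all pages of the article before slicing.

```python
# Sketch only: stitch a paginated article back together by following
# "next page" links. The link-text patterns and limits are assumptions.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

NEXT_PATTERNS = ("next", "下一页", ">>")   # hypothetical pagination markers


def collect_pages(url: str, max_pages: int = 20) -> list:
    pages, seen = [], set()
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = requests.get(url, timeout=10).text
        pages.append(html)
        soup = BeautifulSoup(html, "html.parser")
        next_url = None
        for a in soup.find_all("a", href=True):
            text = a.get_text(strip=True).lower()
            if any(p in text for p in NEXT_PATTERNS):
                next_url = urljoin(url, a["href"])
                break
        url = next_url
    return pages
```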

I happened to come across VIPS: a Vision-based Page Segmentation Algorithm, which proves the feasibility of this approach in theory. But there are many obstacles to making it work in practice, as this comment points out:

Snail posted on 2006-02-21 12:40 AM (ip: 220.184.129.*):

I use floats instead of absolute positioning, and the layout is arranged dynamically by client-side JavaScript. The page objects themselves are generated and inserted dynamically by script.

That finishes it off. Let's see how it analyzes that.

The algorithm depends too heavily on the specifics of how a page is implemented, so it is hard to make it work well.

What's more, pages that rely on client-side scripting for dynamic presentation are gradually becoming popular, so this algorithm will have a hard time adapting to future trends.

Take the simplest case: I have a page styled like the Outlook toolbar, generated entirely by script. Let's see how it analyzes that!

Visual analysis has to be settled visually: only a static rendering of the page yields a correct segmentation. The segmentation itself is easy, a simple algorithm can do it; the hard part is assigning the content to the segmented blocks.

There is only one good way: simulate mouse clicks and read back the object that responds at each click point, which can be done in IE. That way you can find out which objects belong to each block after segmentation.

I think my simple algorithm is much better than the one in the paper.

Segmenting the screen into blocks visually is very simple: apply a whitespace expand-and-shrink pass so the blank gutters become distinct while the text blurs together, blur the result, use a brightness threshold to convert the picture into a binary image, and then vectorize it, keeping only the lines. These reduce to 0-degree and 90-degree orientations, yielding a segmented vector graph.
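A minimal sketch of the pipeline this comment describes, assuming OpenCV and a pre-captured screenshot of the rendered page; every kernel size and threshold here is an illustrative guess.

```python
# Sketch only: approximate the expand/shrink -> blur -> threshold ->
# vectorize pipeline with OpenCV on a screenshot of the rendered page.
# Kernel sizes and thresholds are illustrative, not tuned values.
import cv2
import numpy as np


def dividing_lines(screenshot_path: str):
    img = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
    # Morphological closing: text runs together, blank gutters stand out.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    closed = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)
    blurred = cv2.GaussianBlur(closed, (11, 11), 0)
    # Brightness threshold -> binary image (white gutters vs dark blocks).
    _, binary = cv2.threshold(blurred, 200, 255, cv2.THRESH_BINARY)
    # "Vectorize": detect straight edges, keep only ~0 and ~90 degree lines.
    edges = cv2.Canny(binary, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 80,
                            minLineLength=100, maxLineGap=10)
    axis_aligned = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1))) % 180
            if angle < 5 or abs(angle - 90) < 5:
                axis_aligned.append((int(x1), int(y1), int(x2), int(y2)))
    return axis_aligned
```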

Then, within each block, simulate mouse clicks at a suitable density to obtain the underlying objects. That completes the partitioning.
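A minimal sketch of the block-to-object mapping, assuming pywin32 on Windows: drive IE through COM and use the DOM's elementFromPoint, which returns the element that would receive a click at a given client-area coordinate, rather than firing real click events.

```python
# Sketch only: map a point inside a segmented block back to the page
# object that would respond there. Assumes Windows with pywin32; uses
# IE automation and the DOM's elementFromPoint (client-area coordinates)
# instead of firing real click events.
import time
import win32com.client


def object_at(url: str, x: int, y: int):
    ie = win32com.client.Dispatch("InternetExplorer.Application")
    ie.Visible = True
    ie.Navigate(url)
    while ie.Busy or ie.ReadyState != 4:   # 4 == READYSTATE_COMPLETE
        time.sleep(0.2)
    element = ie.Document.elementFromPoint(x, y)
    return element.tagName if element is not None else None


if __name__ == "__main__":
    # Probe a hypothetical block centre at client coordinates (400, 300).
    print(object_at("http://example.com/", 400, 300))
```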

Why parse HTML at all? There are so many cases that it simply cannot be analyzed that way.

My current progress: I can identify the navigation area, link areas, and the footer area.

Analyzing the content area is a hard problem; for my own needs, it is enough to find the largest content area.
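A minimal sketch of that shortcut, assuming BeautifulSoup and an illustrative scoring rule: score every block-level element by how much non-link text it contains and return the highest-scoring one, which tends to be the widest container around the main content.

```python
# Sketch only: find the "largest content area" by scoring block-level
# elements by the amount of non-link text they contain. The tag list
# and the penalty factor are illustrative assumptions.
from bs4 import BeautifulSoup

BLOCK_TAGS = ["div", "td", "article", "section"]


def largest_content_area(html: str):
    soup = BeautifulSoup(html, "html.parser")
    best, best_score = None, 0
    for el in soup.find_all(BLOCK_TAGS):
        text_len = len(el.get_text(strip=True))
        link_len = sum(len(a.get_text(strip=True)) for a in el.find_all("a"))
        score = text_len - 2 * link_len    # penalize link-heavy blocks
        if score > best_score:
            best, best_score = el, score
    return best
```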

One lesson from this work is that an algorithm is a solution to a specific problem, whereas most algorithms in textbooks describe only the most general, common methods.

For practical problems, any approach that works well has something to be said for it. Still, expressing the problem we are solving with a mathematical model remains a basic requirement for raising the level of our algorithms.


