Search engine research: web spider program algorithm
2. Processing and completing URLs
When you encounter links to relative pages, you must build complete links from their base URLs. The base URL may be defined explicitly in the page through the BASE tag, or it may be implied by the URL of the current page. The Java URL class provides constructors that solve this problem for you by building a complete URL from a base URL and a link.
URL(URL context, String spec) takes the link in the spec parameter and the base URL in the context parameter. If spec is a relative link, the constructor uses context to create a fully qualified URL object. The URL should follow the strict (Unix) format. Using a backslash, as in Microsoft Windows, instead of a forward slash produces an incorrect reference. If spec or context refers to a directory (containing index.html or default.html) rather than to an HTML file, it must end with a slash. The fixHref() method checks for these sloppy references and fixes them:
public static String fixHref(String href)
{
    String newhref = href.replace('\\', '/'); // fix sloppy web references (backslashes)
    int lastdot = newhref.lastIndexOf('.');
    int lastslash = newhref.lastIndexOf('/');
    if (lastslash > lastdot) // no file extension after the last slash: treat as a directory
    {
        if (newhref.charAt(newhref.length() - 1) != '/')
            newhref = newhref + "/"; // add the missing /
    }
    return newhref;
}
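To illustrate why the trailing slash matters, here is a minimal sketch that resolves links against a base URL with the URL(URL context, String spec) constructor. The class name ResolveLinkDemo, the helper resolve(), and the example.com addresses are assumptions made for illustration; they are not part of the sample program.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ResolveLinkDemo {
    // Same check as fixHref() above, repeated so that this sketch compiles on its own.
    static String fixHref(String href) {
        String newhref = href.replace('\\', '/');
        int lastdot = newhref.lastIndexOf('.');
        int lastslash = newhref.lastIndexOf('/');
        if (lastslash > lastdot && newhref.charAt(newhref.length() - 1) != '/') {
            newhref = newhref + "/";
        }
        return newhref;
    }

    // Hypothetical helper: resolves a link found on a page against the page's base URL.
    @SuppressWarnings("deprecation") // newer JDKs deprecate the URL constructors
    static String resolve(String base, String link) {
        try {
            URL context = new URL(fixHref(base));
            return new URL(context, fixHref(link)).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The base is a directory without a trailing slash; fixHref() adds it.
        System.out.println(resolve("http://www.example.com/docs", "page.html"));
        // Windows-style backslashes in the link are converted before resolving.
        System.out.println(resolve("http://www.example.com/docs/index.html", "..\\images\\logo.gif"));
    }
}
```

Without the added slash, the first link would resolve relative to the site root instead of the docs directory.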
3. Controlling recursion
searchWeb() is called first to search the starting web address specified by the user. It then calls itself whenever it encounters an HTML link. This forms the basis of a depth-first search and brings two problems. The first is memory/stack overflow caused by too many recursive calls, which happens when a circular reference occurs, that is, when one page links to another that links back to the first. Such references are common on the WWW. To prevent this, searchWeb() checks the search tree (using the urlHasBeenVisited() method) to determine whether the referenced page already exists. If it does, the link is ignored. If you choose to implement a spider without a search tree, you must still maintain a list of visited sites (in a Vector or an array) so that you can determine whether you are revisiting a site.
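The visited check can be sketched without a search tree as follows. This is only an illustration under stated assumptions: the in-memory LINKS map stands in for real pages, a HashSet replaces the Vector or array mentioned above, and the class and method signatures are hypothetical rather than taken from the sample program.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VisitedCheckDemo {
    // Hypothetical in-memory "web": page URL -> links found on that page.
    static final Map<String, List<String>> LINKS = Map.of(
        "a.html", List.of("b.html", "c.html"),
        "b.html", List.of("a.html"),   // circular reference back to a.html
        "c.html", List.of("b.html"));

    // Recursive depth-first search that ignores pages it has already seen.
    static void searchWeb(String url, Set<String> visited, List<String> order) {
        if (!visited.add(url)) {
            return; // add() returns false when the URL was visited before
        }
        order.add(url);
        for (String link : LINKS.getOrDefault(url, List.of())) {
            searchWeb(link, visited, order);
        }
    }

    // Returns the pages in the order in which they were visited.
    static List<String> crawl(String start) {
        List<String> order = new ArrayList<>();
        searchWeb(start, new HashSet<>(), order);
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a.html")); // prints [a.html, b.html, c.html]
    }
}
```

Without the visited set, the a.html and b.html pages would call each other until the stack overflowed.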
The second problem with recursion comes from the combination of depth-first search and the structure of the WWW. Depending on the chosen entry point, a depth-first search can make a large number of recursive calls before all the links on the initial page have been processed. This has two undesirable results: first, memory/stack overflow may occur, and second, the pages being searched may be far removed from the initial entry point, so the results become less meaningful. To control this, I added a maximum search depth setting to the spider. The user can select how many levels deep the search may go (link to link to link). As each link is encountered, the current depth is checked by calling the depthLimitExceeded() method. If the limit is reached, the link is ignored. The test only checks the level of the node in the JTree.
The sample program also adds a site limit, specified by the user. The search stops after the specified number of URLs has been checked, which guarantees that the program eventually stops! The site limit is controlled by a simple integer counter, sitesSearched, which is updated and checked on every call to searchWeb().
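The two limits can be combined in a short sketch. It assumes that the depth is passed as an integer argument instead of being read from the node level in the JTree, and it uses a hypothetical in-memory LINKS map in place of real pages; only the names searchWeb, depthLimitExceeded and sitesSearched come from the text above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LimitDemo {
    // Hypothetical in-memory "web": page -> links found on that page.
    static final Map<String, List<String>> LINKS = Map.of(
        "p0", List.of("p1", "p4"),
        "p1", List.of("p2"),
        "p2", List.of("p3"),
        "p3", List.of(),
        "p4", List.of());

    final int maxDepth;   // maximum link-to-link depth, chosen by the user
    final int maxSites;   // site limit, chosen by the user
    int sitesSearched = 0;
    final Set<String> visited = new LinkedHashSet<>();

    LimitDemo(int maxDepth, int maxSites) {
        this.maxDepth = maxDepth;
        this.maxSites = maxSites;
    }

    // Assumption: depth is tracked as an argument, not via the JTree node level.
    boolean depthLimitExceeded(int depth) {
        return depth > maxDepth;
    }

    void searchWeb(String url, int depth) {
        if (sitesSearched >= maxSites || depthLimitExceeded(depth) || !visited.add(url)) {
            return; // site limit reached, too deep, or already visited
        }
        sitesSearched++;
        for (String link : LINKS.getOrDefault(url, List.of())) {
            searchWeb(link, depth + 1);
        }
    }

    // Crawls from p0 and returns the visited pages in visiting order.
    static List<String> crawl(int maxDepth, int maxSites) {
        LimitDemo spider = new LimitDemo(maxDepth, maxSites);
        spider.searchWeb("p0", 0);
        return new ArrayList<>(spider.visited);
    }

    public static void main(String[] args) {
        System.out.println(crawl(1, 10)); // depth limit: prints [p0, p1, p4]
        System.out.println(crawl(5, 2));  // site limit: prints [p0, p1]
    }
}
```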
4. UrlTreeNode and UrlNodeRenderer
UrlTreeNode and UrlNodeRenderer are classes used to create custom tree nodes in the JTree on the SpiderControl user interface. UrlTreeNode contains the URL information and statistics for each searched site. Each UrlTreeNode is stored in the JTree as the user object attribute of a standard DefaultMutableTreeNode object. Its data includes the ability to track keyword occurrences in the node, the node's URL, the node's base URL, the number of links, the number of images, the number of characters, and whether the node meets the search criteria.
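Storing per-site data as a user object might look like the following sketch. The PageInfo class is a reduced, hypothetical stand-in for UrlTreeNode, and its fields and the example.com addresses are assumptions; only DefaultMutableTreeNode and its methods are standard Swing API.

```java
import java.util.Enumeration;
import javax.swing.tree.DefaultMutableTreeNode;

class UserObjectDemo {
    // Hypothetical, reduced stand-in for UrlTreeNode; the fields are assumptions.
    static class PageInfo {
        final String url;
        final int linkCount;
        final boolean match;

        PageInfo(String url, int linkCount, boolean match) {
            this.url = url;
            this.linkCount = linkCount;
            this.match = match;
        }

        boolean isMatch() {
            return match;
        }

        @Override
        public String toString() {
            return url; // a JTree shows the user object's toString() by default
        }
    }

    // Builds a two-level tree whose nodes carry PageInfo user objects.
    static DefaultMutableTreeNode buildTree() {
        DefaultMutableTreeNode root =
            new DefaultMutableTreeNode(new PageInfo("http://www.example.com/", 2, false));
        root.add(new DefaultMutableTreeNode(
            new PageInfo("http://www.example.com/news.html", 0, true)));
        root.add(new DefaultMutableTreeNode(
            new PageInfo("http://www.example.com/about.html", 0, false)));
        return root;
    }

    // Walks the tree and counts the nodes whose user object meets the search criteria.
    static int countMatches(DefaultMutableTreeNode root) {
        int matches = 0;
        Enumeration<?> nodes = root.depthFirstEnumeration();
        while (nodes.hasMoreElements()) {
            DefaultMutableTreeNode node = (DefaultMutableTreeNode) nodes.nextElement();
            PageInfo info = (PageInfo) node.getUserObject();
            if (info.isMatch()) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode root = buildTree();
        System.out.println(countMatches(root));
        System.out.println(((DefaultMutableTreeNode) root.getChildAt(0)).getLevel());
    }
}
```

The getLevel() call returns a node's distance from the root, which is the kind of node level a depth check on the tree can use.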
UrlTreeNodeRenderer is a subclass of the DefaultTreeCellRenderer class. It displays nodes that contain matching keywords in blue and also adds a custom icon to the JTree nodes. The custom display is implemented by overriding the getTreeCellRendererComponent() method (shown below). This method creates a Component object for a node in the tree. Most of the Component's attributes are set by the superclass; UrlTreeNodeRenderer changes the text color (foreground color) and the icons:
public Component getTreeCellRendererComponent(
    JTree tree,
    Object value,
    boolean sel,
    boolean expanded,
    boolean leaf,
    int row,
    boolean hasFocus) {
    // let the superclass set the default attributes first
    super.getTreeCellRendererComponent(
        tree, value, sel,
        expanded, leaf, row,
        hasFocus);
    UrlTreeNode node = (UrlTreeNode) (((DefaultMutableTreeNode) value).getUserObject());
    if (node.isMatch()) // set color
        setForeground(Color.blue);
    else
        setForeground(Color.black);
    if (icon != null) // set a custom icon
    {
        setOpenIcon(icon);
        setClosedIcon(icon);
        setLeafIcon(icon);
    }
    return this;
}
5. Summary
This article shows how to create a web spider and a user interface to control it. The user interface uses a JTree to track the spider's progress and record the visited sites. Alternatively, you could use a Vector to record the visited sites and a simple counter to display progress. Other enhancements could include a database interface for recording keywords and sites, the ability to search from multiple entry points, screening of sites by how much or how little text they contain, and synonym search capabilities for the search engine.
The spider class shown in this article searches by calling itself recursively. Alternatively, a new spider thread could be started for each link encountered. The advantage is that remote URLs are contacted concurrently, which speeds up the search. Remember, however, that the DefaultMutableTreeNode objects used by the JTree are not thread-safe, so programmers must implement their own synchronization.
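One possible shape for that synchronization is sketched below. It swaps in a fixed thread pool for the one-thread-per-link idea, guards the shared tree node with a synchronized block, and skips the actual page fetch; the class name, the method crawlConcurrently(), and the URL strings are assumptions made for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.swing.tree.DefaultMutableTreeNode;

class ConcurrentSpiderDemo {
    // Processes each URL on a worker thread and records it under a shared root node.
    static DefaultMutableTreeNode crawlConcurrently(List<String> urls) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("root");
        Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe visited set
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String url : urls) {
            pool.execute(() -> {
                if (!visited.add(url)) {
                    return; // another task already handled this URL
                }
                // A real spider would fetch and parse the page here.
                synchronized (root) { // DefaultMutableTreeNode is not thread-safe
                    root.add(new DefaultMutableTreeNode(url));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return root;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode root =
            crawlConcurrently(List.of("a.html", "b.html", "a.html", "c.html"));
        System.out.println(root.getChildCount()); // duplicates are recorded only once
    }
}
```

If the nodes belong to a JTree that is already on screen, the updates would also need to run on the Swing event dispatch thread, for example through SwingUtilities.invokeLater().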