Search engine research: web spider program algorithms, Part VI (5 parts in total)

Source: Internet
Author: User

Search engine research: web spider program algorithms

1. Parsing HTML files

There are two ways to parse an HTML file to find its A HREF links: a laborious one and a simple one.

If you choose the laborious method, you use Java's StreamTokenizer class to build your own parsing rules. To do so, you must tell the StreamTokenizer object which characters count as words and which as whitespace, then strip off the < and > symbols to find tags and attributes, and separate out the text between tags. That is a lot of work.
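To give a feel for the effort involved, here is a minimal sketch of the StreamTokenizer approach. The class name TokenizerHrefFinder and the method findHrefs are hypothetical and do not come from the sample code. The sketch only handles double-quoted href values and ignores comments, scripts, and single-quoted or unquoted attributes.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the article's spider.
class TokenizerHrefFinder {

    /** Returns the values of all double-quoted href attributes found inside tags. */
    static List<String> findHrefs(String html) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();            // start with every character "ordinary"
        st.wordChars('a', 'z');      // letters and digits form words
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');
        st.whitespaceChars(0, ' ');  // control characters and space separate tokens
        st.quoteChar('"');           // a "..." run is returned as one token in sval

        List<String> hrefs = new ArrayList<>();
        boolean inTag = false;
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == '<') {
                    inTag = true;
                } else if (type == '>') {
                    inTag = false;
                } else if (inTag && type == StreamTokenizer.TT_WORD
                        && st.sval.equalsIgnoreCase("href")) {
                    // Expect '=' followed by a quoted string.
                    if (st.nextToken() == '=' && st.nextToken() == '"') {
                        hrefs.add(st.sval);
                    } else {
                        st.pushBack(); // not an href assignment; re-read this token
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        System.out.println(findHrefs("<p>See <a href=\"a.html\">A</a></p>"));
    }
}
```

Even this small version has to track whether it is inside a tag and look ahead for the = sign and the quoted value, which is the kind of bookkeeping the simple method avoids.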

The simple method is to use the built-in ParserDelegator class, a subclass of the HTMLEditorKit.Parser abstract class. These classes are not well documented in the Java documentation. Using ParserDelegator takes three steps: first create an InputStreamReader object for your URL, then create an instance of a ParserCallback object, and finally create an instance of ParserDelegator and call its public method parse():

UrlTreeNode newnode = new UrlTreeNode(url); // create the data node
InputStream in = url.openStream(); // ask the URL object to create an input stream
InputStreamReader isr = new InputStreamReader(in); // convert the stream to a reader
DefaultMutableTreeNode treenode = addNode(parentnode, newnode);
SpiderParserCallback cb = new SpiderParserCallback(treenode); // create a callback object
ParserDelegator pd = new ParserDelegator(); // create the delegator
pd.parse(isr, cb, true); // parse the stream
isr.close(); // close the stream

parse() accepts a Reader (here an InputStreamReader), a ParserCallback instance, and a flag indicating whether the charset tag should be ignored. The parse() method then reads and decodes the HTML file. Each time it finishes decoding a tag or HTML element, it calls a method of the ParserCallback object.
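The three steps can also be tried without the spider's tree classes. The following is a minimal, self-contained sketch: the class name LinkLister and the method listLinks are hypothetical, and it reads from a string rather than from url.openStream() so that it runs without network access.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's spider.
class LinkLister {

    /** Returns the href value of every A tag in the given HTML text. */
    static List<String> listLinks(String html) {
        final List<String> links = new ArrayList<>();
        // Step 1: a Reader over the HTML (the sample code wraps url.openStream()).
        try (Reader reader = new StringReader(html)) {
            // Step 2: a callback that reacts to the tags we care about.
            HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                    if (t == HTML.Tag.A) {
                        Object href = a.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            links.add(href.toString());
                        }
                    }
                }
            };
            // Step 3: the delegator drives the callback while it parses.
            new ParserDelegator().parse(reader, cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(listLinks("<html><body><a href=\"a.html\">A</a></body></html>"));
    }
}
```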

In the sample code, I implemented ParserCallback as an inner class of the spider, so that the callback can access the spider's methods and fields. A class derived from ParserCallback can override the following methods:

■ handleStartTag(): called when a start HTML tag is encountered, for example <A>

■ handleEndTag(): called when an end HTML tag is encountered, for example </A>

■ handleSimpleTag(): called when an HTML tag that has no matching end tag is encountered

■ handleText(): called when text between tags is encountered
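To see which callback fires for which piece of markup, a small tracer can record each call. This is only an illustrative sketch: the class name CallbackTracer and its event strings are hypothetical. The parser may also report tags it infers on its own (such as an implied head or p), so the exact list can contain more entries than the markup suggests.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical tracer, not part of the article's spider.
class CallbackTracer extends HTMLEditorKit.ParserCallback {

    /** One entry per callback invocation, in the order the parser made them. */
    final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("start:" + t);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        events.add("end:" + t);
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("simple:" + t);
    }

    @Override
    public void handleText(char[] data, int pos) {
        events.add("text:" + new String(data));
    }

    /** Parses the given HTML text and returns the recorded events. */
    static List<String> trace(String html) {
        CallbackTracer tracer = new CallbackTracer();
        try {
            new ParserDelegator().parse(new StringReader(html), tracer, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tracer.events;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"x.html\">Hello</a><img src=\"p.png\"></body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```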

In the sample code, I override handleSimpleTag() so that my code can process the HTML BASE and IMG tags. The BASE tag gives the URL against which relative URL references are resolved. If no BASE tag is present, the current URL is used to resolve them. handleSimpleTag() accepts three parameters: an HTML.Tag object, a MutableAttributeSet containing all of the tag's attributes, and the tag's position in the file. My code checks the tag to determine whether it is a BASE tag. If so, the href attribute is extracted and saved in the page's data node. This attribute is used later to resolve the URL addresses of linked sites. Each time an IMG tag is encountered, the page's image count is updated.
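The effect of the BASE tag on relative references can be illustrated with java.net.URL alone. In this sketch the class name BaseResolver, the method resolve, and the example addresses are all hypothetical. It uses the two-argument URL constructor, which recent JDKs mark as deprecated but still provide.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of the article's spider.
class BaseResolver {

    /**
     * Resolves href against the BASE href if the page declared one,
     * otherwise against the URL of the page itself.
     */
    static String resolve(String pageUrl, String baseHref, String href) {
        try {
            URL base = new URL(baseHref != null ? baseHref : pageUrl);
            return new URL(base, href).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + href, e);
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";
        // No BASE tag: resolved relative to the page.
        System.out.println(resolve(page, null, "page2.html"));
        // BASE tag present: resolved relative to the BASE href.
        System.out.println(resolve(page, "http://cdn.example.org/site/", "img/a.png"));
    }
}
```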

I override handleStartTag() so that the program can process the HTML A and TITLE tags. The method checks whether the t parameter is in fact an A tag. If so, the href attribute is extracted.

fixHref() cleans up sloppy references (it changes backslashes to slashes and adds a missing trailing slash). The link's URL is then resolved by creating a URL object from the base URL and the reference, and searchWeb() is called recursively to process the link. If the method encounters a TITLE tag, it clears the variable that stores the last text encountered, so that the TITLE end tag sees the correct value (some web pages have nothing between their title tags).
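The body of fixHref() is not shown in this part, so the following is only a guess at what such a cleanup might look like. The class name HrefFixer is hypothetical, and the rule for when a trailing slash is missing (the last path segment contains no dot) is an assumption, not the author's stated rule.

```java
// Hypothetical sketch; the article's real fixHref() may differ.
class HrefFixer {

    /** Normalizes backslashes and appends a trailing slash to directory-like references. */
    static String fixHref(String href) {
        String fixed = href.replace('\\', '/'); // backslashes to slashes
        int lastDot = fixed.lastIndexOf('.');
        int lastSlash = fixed.lastIndexOf('/');
        // Assumption: no dot after the last slash means the reference names
        // a directory rather than a file, so it should end with a slash.
        if (lastSlash > lastDot && !fixed.endsWith("/")) {
            fixed = fixed + "/";
        }
        return fixed;
    }

    public static void main(String[] args) {
        System.out.println(fixHref("docs\\guide"));
        System.out.println(fixHref("a\\b.html"));
    }
}
```

A heuristic like this misjudges some references (for example a bare host name such as http://example.com, whose last dot comes after the last slash), so a real spider would need a more careful rule.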

I override handleEndTag() so that the HTML TITLE end tag can be processed. This end tag indicates that the preceding text (held in lastText) is the page's title text. The text is then stored in the page's data node. The nodeChanged() method must be called to update the tree.
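The interplay of the three callbacks for the title can be shown in isolation. This sketch keeps the lastText idea from the sample code but drops the tree and data node. The class name TitleExtractor and the method titleOf are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's spider.
class TitleExtractor extends HTMLEditorKit.ParserCallback {

    private String lastText = ""; // contents of the last text element seen
    private String title;         // set when the TITLE end tag arrives

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TITLE) {
            lastText = ""; // forget earlier text so an empty title stays empty
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        lastText = new String(data);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TITLE) {
            title = lastText.trim(); // the text just before </title> is the title
        }
    }

    /** Returns the page title, or null if the page has no TITLE element. */
    static String titleOf(String html) {
        TitleExtractor cb = new TitleExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb.title;
    }

    public static void main(String[] args) {
        System.out.println(titleOf("<html><head><title>My Page</title></head><body><p>Hi</p></body></html>"));
    }
}
```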

I override the handleText() method so that the text of the HTML page can be checked against any of the keywords or phrases being searched for. handleText() accepts as parameters an array of characters and the position of those characters in the file. handleText() first converts the character array into a String object, converting it to uppercase in the process. Then each keyword/phrase in the search list is checked using the String object's indexOf() method. If indexOf() returns a non-negative result, the keyword/phrase is present in the page's text. In that case, the match is recorded in the node's match list, and the statistics are updated:

/**
 * Inner class used to handle HTML parser callbacks.
 * The enclosing spider class needs imports for java.net.*,
 * javax.swing.text.MutableAttributeSet, javax.swing.text.html.*
 * and javax.swing.tree.*.
 */
public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {

    /** URL node being parsed */
    private UrlTreeNode node;

    /** Tree node */
    private DefaultMutableTreeNode treenode;

    /** Contents of last text element */
    private String lastText = "";

    /**
     * Creates a new instance of SpiderParserCallback
     * @param atreenode search tree node that is being parsed
     */
    public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
        treenode = atreenode;
        node = (UrlTreeNode) treenode.getUserObject();
    }

    /**
     * Handle HTML tags that don't have a start and end tag
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos position within file
     */
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // Count every image on the page
        if (t.equals(HTML.Tag.IMG)) {
            node.addImages(1);
            return;
        }
        // Remember the BASE href for resolving relative links later
        if (t.equals(HTML.Tag.BASE)) {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if (value != null)
                node.setBase(fixHref(value.toString()));
        }
    }

    /**
     * Take care of start tags
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos position within file
     */
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // A new title begins: forget any text seen so far
        if (t.equals(HTML.Tag.TITLE)) {
            lastText = "";
            return;
        }
        if (t.equals(HTML.Tag.A)) {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if (value != null) {
                node.addLinks(1);
                String href = value.toString();
                href = fixHref(href);
                try {
                    // Resolve the reference against the base URL, then follow the link
                    URL referencedURL = new URL(node.getBase(), href);
                    searchWeb(treenode, referencedURL.getProtocol() + "://" + referencedURL.getHost() + referencedURL.getPath());
                }
                catch (MalformedURLException e) {
                    messageArea.append("Bad URL encountered: " + href + "\n");
                    return;
                }
            }
        }
    }

    /**
     * Take care of end tags
     * @param t HTML tag
     * @param pos position within file
     */
    public void handleEndTag(HTML.Tag t, int pos) {
        // The text seen just before </title> is the page title
        if (t.equals(HTML.Tag.TITLE) && lastText != null) {
            node.setTitle(lastText.trim());
            DefaultTreeModel tm = (DefaultTreeModel) searchTree.getModel();
            tm.nodeChanged(treenode);
        }
    }

    /**
     * Take care of text between tags, check against keyword list for matches, if
     * match found, set the node match status to true
     * @param data text between tags
     * @param pos position of text within webpage
     */
    public void handleText(char[] data, int pos) {
        lastText = new String(data);
        node.addChars(lastText.length());
        String text = lastText.toUpperCase();
        for (int i = 0; i < keywordList.length; i++) {
            if (text.indexOf(keywordList[i]) >= 0) {
                // Count the site only the first time it matches
                if (!node.isMatch()) {
                    sitesFound++;
                    updateStats();
                }
                node.setMatch(keywordList[i]);
                return;
            }
        }
    }

}
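The keyword check performed in handleText() can be tried on its own. In this sketch the class name KeywordMatcher and the method firstMatch are hypothetical, and it assumes, as the uppercase conversion of the page text suggests, that the keywords in the list are already stored in uppercase.

```java
// Hypothetical helper, not part of the article's spider.
class KeywordMatcher {

    /**
     * Returns the first keyword found in the text, or null if none matches.
     * Assumes every entry of keywordList is already uppercase.
     */
    static String firstMatch(char[] data, String[] keywordList) {
        String text = new String(data).toUpperCase();
        for (String keyword : keywordList) {
            if (text.indexOf(keyword) >= 0) { // non-negative index means present
                return keyword;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] keywords = {"PYTHON", "SPIDER"};
        System.out.println(firstMatch("Java spiders crawl the web".toCharArray(), keywords));
    }
}
```

Because indexOf() does plain substring matching, the keyword SPIDER also matches inside "spiders"; matching whole words only would need extra checks.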
