Search engine research-network Spider Program Algorithm-related information Part VI (5 parts in total)

Last Update:2018-12-05 Source: Internet

Author: User

Search engine research --- network Spider Program Algorithm

1. Parsing HTML files

There are two methods for parsing HTML files to find A HREF links: a troublesome method and a simple method.

If you choose the troublesome method, you use Java's StreamTokenizer class to create your own parsing rules. To use this technique, you must specify the word and whitespace characters for the StreamTokenizer object, then strip off the < and > symbols to find the tags, the attributes, and the text between tags. That is a lot of work.
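To give a feel for the effort involved, here is a minimal sketch of the StreamTokenizer approach. The class name HrefTokenizer and the method findHrefs are hypothetical, and the sketch assumes that attribute values are double-quoted; it is not the article's code and ignores comments, scripts, and unquoted or single-quoted attributes.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds double-quoted HREF values inside A tags.
class HrefTokenizer {
    static List<String> findHrefs(String html) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();              // start from a blank syntax table
        st.wordChars('!', 255);        // every printable character is part of a word...
        st.whitespaceChars(0, ' ');    // ...control characters and spaces separate words
        st.ordinaryChar('<');          // ...and these three are returned one at a time
        st.ordinaryChar('>');
        st.ordinaryChar('=');
        st.quoteChar('"');             // double-quoted strings arrive as a single token

        List<String> hrefs = new ArrayList<>();
        boolean inAnchor = false;      // true while we are inside an <a ...> start tag
        try {
            int tok;
            while ((tok = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (tok == '<') {
                    // The word right after '<' is the tag name.
                    inAnchor = st.nextToken() == StreamTokenizer.TT_WORD
                            && "a".equalsIgnoreCase(st.sval);
                } else if (tok == '>') {
                    inAnchor = false;
                } else if (inAnchor && tok == StreamTokenizer.TT_WORD
                        && "href".equalsIgnoreCase(st.sval)) {
                    if (st.nextToken() != '=') {
                        st.pushBack(); // not an attribute assignment, re-read the token
                    } else if (st.nextToken() == '"') {
                        hrefs.add(st.sval);
                    } else {
                        st.pushBack();
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"a.html\">A</a> "
                + "<A HREF=\"b/c.html\">B</A> <img src=\"x.png\"></body></html>";
        System.out.println(findHrefs(html));
    }
}
```

Even this small version has to track tag state by hand, which is the kind of bookkeeping the built-in parser takes care of.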

The simple method is to use the built-in ParserDelegator class, a subclass of the HTMLEditorKit.Parser abstract class. These classes are not well documented in the Java documentation. Using ParserDelegator takes three steps: first create an InputStreamReader object for your URL, then create an instance of a ParserCallback object, and finally create an instance of the ParserDelegator object and call its public method parse():

UrlTreeNode newNode = new UrlTreeNode(url); // create the data node
InputStream in = url.openStream(); // ask the URL object to create an input stream
InputStreamReader isr = new InputStreamReader(in); // convert the stream to a reader
DefaultMutableTreeNode treeNode = addNode(parentNode, newNode);
SpiderParserCallback cb = new SpiderParserCallback(treeNode); // create a callback object
ParserDelegator pd = new ParserDelegator(); // create the delegator
pd.parse(isr, cb, true); // parse the stream
isr.close(); // close the stream

parse() accepts a reader (here an InputStreamReader), a ParserCallback object instance, and a flag indicating whether the charset tag should be ignored. The parse() method then reads and decodes the HTML file. Each time a tag or other HTML element is decoded, the corresponding method of the ParserCallback object is called.
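The three steps can be tried without a network connection. The following is a minimal sketch, not the article's spider: the class name LinkExtractor and the method extractLinks are hypothetical, and a StringReader stands in for the InputStreamReader so the example is self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the HREF value of every A tag in an HTML string.
class LinkExtractor {
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();

        // Step 2: a callback that only reacts to A start tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t.equals(HTML.Tag.A)) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        // Step 1 (a reader over the HTML) and step 3 (the delegator and parse()).
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true); // true: ignore charset
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"a.html\">A</a> "
                + "<a href=\"http://example.com/b\">B</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

To parse a live page instead, the StringReader would be replaced by an InputStreamReader over url.openStream(), as in the snippet above.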

In the sample code, I implemented ParserCallback as an inner class of the spider, so that the callback can access the spider's methods and variables. A class derived from ParserCallback can override the following methods:

■ handleStartTag(): called when a start HTML tag is encountered, for example, <A>

■ handleEndTag(): called when an end HTML tag is encountered, for example, </A>

■ handleSimpleTag(): called when an HTML tag that has no matching end tag is encountered

■ handleText(): called when text is encountered between tags
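To see when each of the four methods listed above fires, a callback can simply record every event. This is a sketch with hypothetical names (CallbackTracer, trace). Note that the Swing parser also reports start and end tags it implies itself, such as HTML, HEAD, and BODY, so the trace may contain more entries than the tags written in the input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records which ParserCallback method is called for each element.
class CallbackTracer {
    static List<String> trace(String html) {
        List<String> events = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("start:" + t);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                events.add("end:" + t);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                events.add("simple:" + t);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data).trim());
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"x.html\">Home</a><img src=\"i.png\"></body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```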

In the sample code, I override handleSimpleTag() so that my code can process the HTML BASE and IMG tags. The BASE tag specifies the URL against which relative URL references are resolved. If no BASE tag appears, the current URL is used to resolve references. handleSimpleTag() accepts three parameters: an HTML.Tag object, a MutableAttributeSet containing all of the tag's attributes, and the tag's position in the file. My code checks the tag to determine whether it is a BASE tag. If so, the HREF attribute is extracted and saved in the page's data node. This attribute is used later to resolve the URLs of linked sites. Each time an IMG tag is encountered, the page's image count is incremented.
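How a base URL resolves a reference can be checked in isolation with the two-argument java.net.URL constructor, the same constructor the callback code uses. The class name BaseUrlResolver and the example.com addresses are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper: resolves an HREF against a base URL, as the spider does.
class BaseUrlResolver {
    static String resolve(String base, String href) {
        try {
            // URL(URL context, String spec) interprets spec relative to context.
            return new URL(new URL(base), href).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + href, e);
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/index.html";
        System.out.println(resolve(base, "images/logo.png"));            // relative to the page's directory
        System.out.println(resolve(base, "/about.html"));                // relative to the site root
        System.out.println(resolve(base, "http://other.example.org/"));  // absolute reference wins
    }
}
```

A relative reference is resolved against the directory of the base, a reference starting with a slash against the host, and an absolute reference replaces the base entirely.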

I override handleStartTag() so that the program can process the HTML A and TITLE tags. The method checks whether the t parameter is an A tag. If so, the HREF attribute is extracted.

fixHref() cleans up sloppy references (changing backslashes to slashes and adding a missing trailing slash). The link's URL is resolved by creating a URL object from the base URL and the reference. Then searchWeb() is called recursively to process the link. If the method encounters a TITLE tag, it clears the variable that stores the last text encountered, so that the TITLE end tag sees the correct value (sometimes a web page has nothing between its TITLE tags).
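The body of fixHref() is not shown in this part of the article. The following is one possible sketch of the two cleanups it is described as doing. The rule used here for deciding when a trailing slash is missing (the last path segment contains no dot, and there is no query string or fragment) is my assumption, not the author's.

```java
// Hypothetical stand-in for the spider's fixHref(); the heuristics are assumptions.
class HrefFixer {
    static String fixHref(String href) {
        // Cleanup 1: backslashes become forward slashes.
        String fixed = href.replace('\\', '/');

        // Cleanup 2: add a trailing slash when the reference looks like a directory.
        String lastSegment = fixed.substring(fixed.lastIndexOf('/') + 1);
        boolean looksLikeDirectory = !lastSegment.isEmpty()
                && lastSegment.indexOf('.') < 0   // no file extension
                && fixed.indexOf('?') < 0         // no query string
                && fixed.indexOf('#') < 0;        // no fragment
        return looksLikeDirectory ? fixed + "/" : fixed;
    }

    public static void main(String[] args) {
        System.out.println(fixHref("docs\\guide"));
        System.out.println(fixHref("docs/index.html"));
        System.out.println(fixHref("docs/"));
    }
}
```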

I override handleEndTag() so that the HTML TITLE end tag can be processed. This end tag indicates that the preceding text (stored in lastText) is the page's title. The text is then stored in the page's data node. The nodeChanged() method must then be called so that the tree is updated.
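The interplay between the start tag, the text, and the end tag can be sketched on its own, without the tree or the data node. The names TitleExtractor and TitleCallback are hypothetical; only the lastText mechanism is taken from the description above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: captures the page title with the lastText technique.
class TitleExtractor {
    static class TitleCallback extends HTMLEditorKit.ParserCallback {
        private String lastText = ""; // most recent text element
        private String title;         // stays null unless a TITLE end tag is reported

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.TITLE)) {
                lastText = ""; // forget older text so an empty title stays empty
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            lastText = new String(data);
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            if (t.equals(HTML.Tag.TITLE)) {
                title = lastText.trim(); // the text just before </title> is the title
            }
        }

        String getTitle() {
            return title;
        }
    }

    static String extractTitle(String html) {
        TitleCallback callback = new TitleCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.getTitle();
    }

    public static void main(String[] args) {
        String html = "<html><head><title> My Page </title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractTitle(html));
    }
}
```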

I override the handleText() method so that the text of the HTML page can be checked against the keywords or phrases being searched for. handleText() accepts a character array and the position of the text in the file as parameters. It first converts the character array into a String object, converting it to uppercase in the process. Then each keyword or phrase in the search list is checked with the String object's indexOf() method. If indexOf() returns a non-negative result, the keyword or phrase appears in the page text. In that case, the match is recorded in the node's match list and the statistics are updated:

// Imports needed at the top of the enclosing spider source file:
import java.net.MalformedURLException;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;

/**
 * Inner class used to handle HTML parser callbacks
 */
public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
    /** URL node being parsed */
    private UrlTreeNode node;
    /** Tree node */
    private DefaultMutableTreeNode treeNode;
    /** Contents of last text element */
    private String lastText = "";

    /**
     * Creates a new instance of SpiderParserCallback
     * @param atreenode search tree node that is being parsed
     */
    public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
        treeNode = atreenode;
        node = (UrlTreeNode) treeNode.getUserObject();
    }

    /**
     * Handle HTML tags that don't have a start and end tag
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos position within file
     */
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if (t.equals(HTML.Tag.IMG))
        {
            node.addImages(1); // count the image
            return;
        }
        if (t.equals(HTML.Tag.BASE))
        {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if (value != null)
                node.setBase(fixHref(value.toString()));
        }
    }

    /**
     * Take care of start tags
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos position within file
     */
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
    {
        if (t.equals(HTML.Tag.TITLE))
        {
            lastText = ""; // forget earlier text so an empty title stays empty
            return;
        }
        if (t.equals(HTML.Tag.A))
        {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if (value != null)
            {
                node.addLinks(1);
                String href = value.toString();
                href = fixHref(href);
                try {
                    // resolve the reference against the page's base URL, then follow it
                    URL referencedURL = new URL(node.getBase(), href);
                    searchWeb(treeNode, referencedURL.getProtocol() + "://" + referencedURL.getHost() + referencedURL.getPath());
                }
                catch (MalformedURLException e)
                {
                    messageArea.append("Bad URL encountered: " + href + "\n");
                    return;
                }
            }
        }
    }

    /**
     * Take care of end tags
     * @param t HTML tag
     * @param pos position within file
     */
    public void handleEndTag(HTML.Tag t, int pos)
    {
        if (t.equals(HTML.Tag.TITLE) && lastText != null)
        {
            node.setTitle(lastText.trim());
            DefaultTreeModel tm = (DefaultTreeModel) searchTree.getModel();
            tm.nodeChanged(treeNode); // tell the tree that the node has changed
        }
    }

    /**
     * Take care of text between tags, check against keyword list for matches, if
     * match found, set the node match status to true
     * @param data text between tags
     * @param pos position of text within webpage
     */
    public void handleText(char[] data, int pos)
    {
        lastText = new String(data);
        node.addChars(lastText.length());
        String text = lastText.toUpperCase();
        for (int i = 0; i < keywordList.length; i++)
        {
            if (text.indexOf(keywordList[i]) >= 0)
            {
                if (!node.isMatch())
                {
                    sitesFound++;
                    updateStats();
                }
                node.setMatch(keywordList[i]);
                return;
            }
        }
    }
}
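The keyword check in handleText() can be tried in isolation. This is a sketch with hypothetical names (KeywordMatcher, firstMatch). It assumes the keywords in the search list are already stored in uppercase, which the uppercase conversion of the page text suggests but this part of the article does not show.

```java
import java.util.Locale;

// Hypothetical helper: case-insensitive keyword check via uppercase + indexOf().
class KeywordMatcher {
    /**
     * Returns the first keyword found in the text, or null if none matches.
     * Assumption: every entry of upperCaseKeywords is already uppercase.
     */
    static String firstMatch(String text, String[] upperCaseKeywords) {
        String upper = text.toUpperCase(Locale.ROOT);
        for (String keyword : upperCaseKeywords) {
            if (upper.indexOf(keyword) >= 0) { // non-negative index: keyword occurs
                return keyword;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] keywords = {"CRAWLER", "SPIDER"};
        System.out.println(firstMatch("Java spiders crawl the web", keywords));
        System.out.println(firstMatch("Nothing relevant here", keywords));
    }
}
```

Because indexOf() does a plain substring search, "SPIDER" also matches inside "spiders"; matching whole words only would need a different check.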

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
