Alas, I have not been here for a long time, which also means I have not posted anything for more than a year. Time to come back! (A brief look at a spider's HTML parser)

Last Update:2015-01-07 Source: Internet

Author: User


I no longer do ACM contests, and I was startled to see that my last article here was published on February 13. Since my internship started I have been doing engineering work, but I should not stop studying algorithms; there is a lot in Introduction to Algorithms that I never learned properly. My assignment during the internship is a web crawler, so I would like to describe how HTML parsing works in Java. I assume you already know how a web crawler works in general; if not, a quick search on Google turns up plenty of material, so I will not repeat it here. What I want to explain is how the parsing process itself works. See the following piece of code, or rather, two pieces of code:

    InputStream in = url.openStream();

    // Wraps the raw byte stream of the HTML page in a reader that decodes it with the page's character set.
    InputStreamReader isr = new InputStreamReader(in, this.charSet);

    parentNode = getParentNode(urlStr);
    DefaultMutableTreeNode treeNode = addNode(parentNode, newNode);

    // Declares the parser callback object. Each of its callback methods handles one kind of tag;
    // what each callback does is described in the second piece of code.
    SpiderParserCallback cb = new SpiderParserCallback(treeNode);

    // Declares the parser.
    ParserDelegator pd = new ParserDelegator();

    // Starts parsing. The first argument is the content to parse, the second is the callback object
    // whose methods are called as each tag is parsed, and the third (true) tells the parser to ignore
    // any character set declared inside the page.
    pd.parse(isr, cb, true);
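To try the setup above without a network connection, here is a minimal sketch of the same pipeline: an InputStream is wrapped in an InputStreamReader with an explicit charset and handed to ParserDelegator together with a callback. The class name LinkCollector and the method collect are my own names for illustration and do not come from the original spider; the caller supplies the stream, which stands in for url.openStream().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of the original spider): collects href values of <a> tags.
final class LinkCollector {

    static List<String> collect(InputStream in, Charset charset) {
        final List<String> links = new ArrayList<>();

        // The callback is invoked by the parser once per start tag it encounters.
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        // Decode the bytes with the given charset, then let the parser drive the callback.
        try (Reader reader = new InputStreamReader(in, charset)) {
            new ParserDelegator().parse(reader, cb, true); // true: ignore charset declared in the page
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

In a real crawler the stream would come from the URL being fetched; passing a ByteArrayInputStream built from an HTML string is enough to see which links the callback receives.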

    // All methods of this class perform the corresponding action for a tag recognized by the parser.
    // Note: this is an inner class of Spider (it refers to Spider.this).
    public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
        private UrlTreeNode node;
        private DefaultMutableTreeNode treeNode;
        private String lastText = "";

        public SpiderParserCallback(DefaultMutableTreeNode aTreeNode) {
            this.treeNode = aTreeNode;
            this.node = ((UrlTreeNode) this.treeNode.getUserObject());
        }

        // Handles simple tags, that is, tags without a matching end tag, such as <img> and <base>.
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.IMG)) {
                this.node.addImages(1);
                return;
            }
            if (t.equals(HTML.Tag.BASE)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null) {
                    this.node.setBase(Spider.fixHref(value.toString()));
                }
            }
        }

        // Handles tags that have a start tag and a matching end tag, such as <title> and <a>.
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.TITLE)) {
                this.lastText = "";
                return;
            }
            if (t.equals(HTML.Tag.A)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null) {
                    this.node.addLinks(1);
                    String href = value.toString();
                    href = Spider.fixHref(href);
                    if (href.contains("javascript:")) {
                        return;
                    }
                    try {
                        // Resolve the link against the page's base URL, then crawl it.
                        URL referencedURL = new URL(this.node.getBase(), href);
                        Spider.this.searchWeb(this.treeNode,
                                referencedURL.getProtocol() + "://"
                                + referencedURL.getHost()
                                + referencedURL.getPath());
                    } catch (MalformedURLException e) {
                        Spider.this.messageArea
                                .append("Bad URL encountered 2: " + href
                                        + "\n");
                        return;
                    }
                }
            }
        }
    }
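The link handling in handleStartTag comes down to three steps: skip javascript: links, resolve the href against the page's base URL, and rebuild the address from protocol, host and path only. The sketch below isolates those steps so they can be tried on their own. LinkResolver and resolve are hypothetical names of mine, and returning null for a rejected link is my simplification; the original appends a message to messageArea instead.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper (not part of the original spider): mirrors the URL handling of the callback.
final class LinkResolver {

    /**
     * Resolves href against base and returns "protocol://host/path".
     * Returns null for javascript: links and for URLs that cannot be parsed.
     */
    @SuppressWarnings("deprecation") // the URL constructors are deprecated in recent JDKs
    static String resolve(String base, String href) {
        if (href.contains("javascript:")) {
            return null; // script pseudo-links are not crawlable
        }
        try {
            // The two-argument constructor resolves a relative href against the base URL.
            URL resolved = new URL(new URL(base), href);
            // Only protocol, host and path are kept, so port, query string and fragment are dropped.
            return resolved.getProtocol() + "://" + resolved.getHost() + resolved.getPath();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/docs/index.html", "page2.html?x=1#top"));
    }
}
```

One consequence of rebuilding the address this way is that pages differing only in their query string collapse into a single address, which keeps the crawl small but may skip dynamically generated pages.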

The two pieces of code above show roughly how parsing works: the parser reads the HTML content and breaks it into tags, and for each tag it calls the corresponding method of the callback object, which is where the actual processing happens. This is only my shallow understanding; it is certainly not thorough and not very clearly expressed, so criticism and corrections are welcome. Let's make progress together!
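To make this flow concrete, here is a small self-contained sketch in which one callback gathers a page's title, link count and image count. The lastText field in the code above is only reset in the part shown; I am assuming it is filled in handleText and read back in handleEndTag, which is how this sketch does it. The class PageStats and its members are hypothetical names, not taken from the original spider.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback (not part of the original spider): one method per kind of parser event.
final class PageStats extends HTMLEditorKit.ParserCallback {
    String title = "";
    int links;
    int images;
    private String lastText = "";

    @Override
    public void handleText(char[] data, int pos) {
        lastText = new String(data); // remember the most recent run of text
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TITLE) {
            lastText = ""; // forget any text seen before the title
        } else if (t == HTML.Tag.A && a.getAttribute(HTML.Attribute.HREF) != null) {
            links++;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TITLE) {
            title = lastText; // the text between <title> and </title>
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.IMG) {
            images++; // <img> has no end tag, so it arrives as a simple tag
        }
    }

    static PageStats of(String html) {
        PageStats stats = new PageStats();
        try (StringReader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, stats, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stats;
    }
}
```

Calling PageStats.of on an HTML string and reading title, links and images afterwards shows which callback each kind of tag ends up in.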


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
