It has been a long time since I last posted here, which also means I have written nothing for more than a year. Time for a comeback! (A brief look at the spider's parser)

Source: Internet
Author: User

I no longer do ACM contests, and when I saw that my last post was dated February 13, I startled myself. My internship has started, so I am now doing engineering work, but I should not drop my algorithm studies: there is a lot in Introduction to Algorithms that I never learned properly. Since my assignment during the internship is a web crawler, I would like to describe how Java's HTML parser actually parses a page. Everyone probably knows the general operating principle of a web crawler; if not, a Google search turns up plenty of material, so I will not repeat it here. What I want to explain is how the parsing process itself works. See the following piece of code, no, two pieces of code:

    // Fragment of a method in the spider class. It needs java.io.InputStream,
    // java.io.InputStreamReader, javax.swing.tree.DefaultMutableTreeNode and
    // javax.swing.text.html.parser.ParserDelegator.
    InputStream in = url.openStream();

    // Reader over the entire content of the HTML page, decoded with the spider's charset.
    InputStreamReader isr = new InputStreamReader(in, this.charSet);

    parentNode = getParentNode(urlStr);
    DefaultMutableTreeNode treeNode = addNode(parentNode, newNode);
    // Create the parser callback object. Each of its callback methods handles
    // one kind of tag; the class itself is shown in the second piece of code.
    SpiderParserCallback cb = new SpiderParserCallback(treeNode);

    ParserDelegator pd = new ParserDelegator(); // Create the parser.
    // Start parsing. The first argument is the reader that supplies the content
    // to parse, the second is the callback whose methods are called whenever a
    // tag is recognized, and the third (true) tells the parser to ignore any
    // charset declared inside the document.
    pd.parse(isr, cb, true);

    // Inner class of Spider (it uses Spider.this). It needs java.net.URL,
    // java.net.MalformedURLException, javax.swing.text.MutableAttributeSet,
    // javax.swing.text.html.HTML and javax.swing.text.html.HTMLEditorKit.
    // Every method of this class handles one kind of tag recognized by the parser.
    public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
        private UrlTreeNode node;
        private DefaultMutableTreeNode treeNode;
        private String lastText = "";

        public SpiderParserCallback(DefaultMutableTreeNode aTreeNode) {
            this.treeNode = aTreeNode;
            this.node = ((UrlTreeNode) this.treeNode.getUserObject());
        }

        // Called for simple tags, that is, tags without a separate end tag.
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.IMG)) {
                this.node.addImages(1);
                return;
            }
            if (t.equals(HTML.Tag.BASE)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null) {
                    this.node.setBase(Spider.fixHref(value.toString()));
                }
            }
        }

        // Called for the start tag of elements that have both a start and an end tag.
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.TITLE)) {
                this.lastText = "";
                return;
            }
            if (t.equals(HTML.Tag.A)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null) {
                    this.node.addLinks(1);
                    String href = value.toString();
                    href = Spider.fixHref(href);
                    if (href.contains("javascript:")) {
                        return;
                    }
                    try {
                        // Resolve the link against the page's base URL, then crawl it.
                        URL referencedURL = new URL(this.node.getBase(), href);
                        Spider.this.searchWeb(this.treeNode,
                                referencedURL.getProtocol() + "://"
                                + referencedURL.getHost()
                                + referencedURL.getPath());
                    } catch (MalformedURLException e) {
                        Spider.this.messageArea
                                .append("Bad URL encountered 2:" + href
                                + "\n");
                        return;
                    }
                }
            }
        }
    }
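
The link handling in the callback above resolves each href against the page's base URL and keeps only the protocol, host and path. The following is a minimal, standalone sketch of that one step, not the spider's own code: the class name LinkNormalizer and the method normalize are hypothetical, and returning null for rejected links is an assumption made for illustration. It uses the same `new URL(base, href)` constructor as the code above, which newer JDKs mark as deprecated.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, written only to illustrate the URL handling step.
public class LinkNormalizer {

    /**
     * Resolves href against baseUrl and rebuilds it as protocol://host/path,
     * which drops the query string, the fragment and the port.
     * Returns null for javascript: links and for malformed URLs.
     */
    public static String normalize(String baseUrl, String href) {
        if (href.contains("javascript:")) {
            return null; // script pseudo-links are not crawlable
        }
        try {
            URL base = new URL(baseUrl);
            URL resolved = new URL(base, href); // relative links are resolved here
            return resolved.getProtocol() + "://"
                    + resolved.getHost()
                    + resolved.getPath();
        } catch (MalformedURLException e) {
            return null; // the spider above logs "Bad URL encountered" instead
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/index.html";
        System.out.println(normalize(base, "page2.html"));
        System.out.println(normalize(base, "../img/a.png?x=1#top"));
        System.out.println(normalize(base, "javascript:void(0)"));
    }
}
```

Dropping the query string and fragment means that links differing only in those parts collapse to the same address, which presumably keeps the crawler from visiting one page many times, at the cost of missing pages that are selected by query parameters.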

The two pieces of code above show roughly how parsing works: the parser reads the HTML content and breaks it into tags, and each time it recognizes a tag it calls back the corresponding method of the callback object. Together, those callbacks carry out the whole parsing workflow. This is only my shallow understanding; it is certainly not thorough, and I have not expressed it very clearly, so criticism and corrections are very welcome. Let us improve together!
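
To see this callback flow without the rest of the spider, here is a small self-contained sketch that parses an HTML string with ParserDelegator. The class LinkCollector and its methods are hypothetical names chosen for illustration. Like the spider's callback, it records anchor links in handleStartTag and counts images in handleSimpleTag, but it stores them in plain fields instead of a tree node.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that records what the parser reports.
public class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();
    private int imageCount = 0;

    // Called for tags without a separate end tag, such as <img>.
    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.IMG) {
            imageCount++;
        }
    }

    // Called for the start tag of elements that also have an end tag, such as <a>.
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    public int getImageCount() {
        return imageCount;
    }

    // Parses the given HTML text and returns the filled-in callback.
    public static LinkCollector collect(String html) {
        LinkCollector cb = new LinkCollector();
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head><body>"
                + "<a href=\"/a.html\">A</a><img src=\"x.png\">"
                + "<a href=\"http://example.com/b\">B</a></body></html>";
        LinkCollector cb = collect(html);
        System.out.println("links:  " + cb.getLinks());
        System.out.println("images: " + cb.getImageCount());
    }
}
```

The parse call runs synchronously, so by the time it returns the callback has already seen every tag in the input and its fields can be read directly.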
