I no longer do ACM, and when I saw that my last article here was published on February 13, I was honestly startled. Since my internship began I have been doing engineering work, but studying algorithms should not stop, and there is still a lot in Introduction to Algorithms that I have not learned well. Since my assignment during the internship is to build a crawler, I would like to talk about how HTML parsing is actually done in Java. Everyone should already know the general operating principle of a web crawler; if you do not, a Google search turns up plenty of material, so I will not repeat it here. What I want to cover is how the parsing process itself works. See the following piece of code, or rather, two pieces of code:
    InputStream in = url.openStream();

    // Wrap the page's byte stream in a reader that decodes the HTML with the configured charset.
    InputStreamReader isr = new InputStreamReader(in, this.charSet);

    parentNode = getParentNode(urlStr);
    DefaultMutableTreeNode treeNode = addNode(parentNode, newNode);
    // Declare the parser callback object. Each of its callback methods handles
    // one kind of tag; see the second piece of code for the details.
    SpiderParserCallback cb = new SpiderParserCallback(treeNode);

    // Declare the parser.
    ParserDelegator pd = new ParserDelegator();
    // Start parsing. The first argument is the reader that supplies the content to parse,
    // the second is the callback whose methods are invoked as each tag is parsed,
    // and the third (true) tells the parser to ignore any charset declared inside the page.
    pd.parse(isr, cb, true);
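To try the parse call above without a live URL, here is a minimal, self-contained sketch. It assumes UTF-8 and feeds the parser from an in-memory ByteArrayInputStream in place of url.openStream(); the LinkCollector class and its collect method are names made up for this illustration and are not part of the original spider.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback for illustration: records the href of every <a> tag.
class LinkCollector extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t.equals(HTML.Tag.A)) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Decodes the stream as UTF-8 (an assumption) and runs the Swing HTML parser over it.
    static List<String> collect(InputStream in) throws IOException {
        LinkCollector cb = new LinkCollector();
        try (InputStreamReader isr = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            // true = ignore any charset declared inside the page itself
            new ParserDelegator().parse(isr, cb, true);
        }
        return cb.links;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><a href=\"a.html\">A</a>"
                + "<a href=\"http://example.com/b\">B</a></body></html>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.println(collect(in));
    }
}
```

The callback here only overrides handleStartTag; any callback method that is not overridden falls back to the empty default in HTMLEditorKit.ParserCallback.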
    // Imports needed at the top of the enclosing Spider source file
    // (the class below is an inner class of Spider, since it uses Spider.this):
    import java.net.MalformedURLException;
    import java.net.URL;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.tree.DefaultMutableTreeNode;

    // Each method of this class performs the action for one kind of tag reported by the parser.
    public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
        private UrlTreeNode node;
        private DefaultMutableTreeNode treeNode;
        private String lastText = "";

        public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
            this.treeNode = atreenode;
            this.node = ((UrlTreeNode) this.treeNode.getUserObject());
        }

        // Handles simple tags, i.e. tags without an end tag, such as <img> and <base>.
        public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.IMG)) {
                this.node.addImages(1);
                return;
            }
            if (t.equals(HTML.Tag.BASE)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null) {
                    this.node.setBase(Spider.fixHref(value.toString()));
                }
            }
        }

        // Handles tags that have both a start tag and an end tag, such as <title> and <a>.
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t.equals(HTML.Tag.TITLE)) {
                this.lastText = "";
                return;
            }
            if (t.equals(HTML.Tag.A)) {
                Object value = a.getAttribute(HTML.Attribute.HREF);
                if (value != null) {
                    this.node.addLinks(1);
                    String href = value.toString();
                    href = Spider.fixHref(href);
                    if (href.contains("javascript:")) {
                        return;
                    }
                    try {
                        // Resolve the link against the page's base URL, then crawl it.
                        URL referencedURL = new URL(this.node.getBase(), href);
                        Spider.this.searchWeb(this.treeNode,
                                referencedURL.getProtocol() + "://"
                                + referencedURL.getHost()
                                + referencedURL.getPath());
                    } catch (MalformedURLException e) {
                        Spider.this.messageArea
                                .append("Bad URL encountered 2:" + href
                                + "\n");
                        return;
                    }
                }
            }
        }
        // The excerpt ends here; the rest of the class is not shown.
    }
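The link handling in handleStartTag resolves each href against the page's base URL and then keeps only the protocol, host and path. The sketch below illustrates that step in isolation. Note that it swaps in java.net.URI.resolve instead of the new URL(base, href) constructor used in the excerpt, and that HrefResolver and normalize are hypothetical names introduced only for this example.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original spider.
class HrefResolver {

    // Resolves href against base, then rebuilds the address from scheme, host
    // and path only, so the query string and the fragment are dropped.
    static String normalize(String base, String href) {
        URI resolved = URI.create(base).resolve(href);
        return resolved.getScheme() + "://" + resolved.getHost() + resolved.getPath();
    }

    public static void main(String[] args) {
        // prints http://example.com/img/logo.png
        System.out.println(
                normalize("http://example.com/docs/index.html", "../img/logo.png?v=2#top"));
    }
}
```

Dropping the query string and fragment means that pages which differ only in those parts collapse to one address, which may or may not be what a given crawler wants.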
The two pieces of code above show roughly how parsing works: the parser (ParserDelegator) reads the HTML content and breaks it into tags, and for each tag it calls the corresponding method of the callback object (handleSimpleTag, handleStartTag, and so on), which is where the crawler does its actual work. This is only my shallow understanding; it is certainly not thorough and not very clearly expressed, so criticism and corrections are welcome, and let us improve together!
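To watch this dispatch happen, a small tracing callback can record which method the parser calls for each piece of the page. This is only a sketch: TraceCallback and trace are hypothetical names, and the exact event list is not shown here because the Swing parser may also report tags it infers on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback for illustration: logs every callback the parser makes.
class TraceCallback extends HTMLEditorKit.ParserCallback {
    final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("start " + t);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        events.add("end " + t);
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        events.add("simple " + t);
    }

    @Override
    public void handleText(char[] data, int pos) {
        events.add("text " + new String(data));
    }

    // Parses the given HTML string and returns the callbacks in the order they fired.
    static List<String> trace(String html) throws IOException {
        TraceCallback cb = new TraceCallback();
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return cb.events;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <a href=\"next.html\">next</a></p>"
                + "<img src=\"x.png\"></body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```

Running it should show the <a> tag arriving through handleStartTag and handleEndTag, while <img> arrives through handleSimpleTag, which matches how the spider's callback splits its work.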
Alas, I have not been here for a long time, which also means I have not done anything for more than a year. It is time to come back! (A brief talk about the spider's parser)