Handling IFRAME Tags with HTML Parser

Source: Internet
Author: User

 

 

Recently I have been studying web crawlers and used HTML Parser along the way. I will skip everything else, since the basics are covered all over the Internet. This post is mainly about my experience with custom tags.

 

1. Brief Introduction to HTML Parser

In the words of Baidu Encyclopedia:

htmlparser is a library written in pure Java for parsing HTML. It does not depend on other Java libraries and is mainly used to transform or extract HTML. It parses HTML quickly and without errors.
To put it bluntly, htmlparser is currently the best tool for parsing and analyzing HTML.
Whether you want to scrape web page data or modify HTML content, once you have used htmlparser you cannot help but praise it.

 

In its own words:

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.

 

2. Strangely, HTML Parser has no tag class for a tag as common as IFRAME (I am using the latest htmlparser-2.0.jar). I searched online for a long time and found no good solution, so I read its Javadoc and worked it out myself.

First, create a new Java project, create the directory structure org/htmlparser/tags, and add the htmlparser-2.0.jar downloaded from the official website to the project. Then decompile FrameTag.class and use it as a reference to write IFrameTag.java. The complete content is as follows:

    package org.htmlparser.tags;

    import org.htmlparser.lexer.Page;
    import org.htmlparser.nodes.TagNode;

    public class IFrameTag extends TagNode
    {
        // Tag names this class handles.
        private static final String[] mIds = {"IFRAME"};

        public String[] getIds()
        {
            return mIds;
        }

        // Returns the src attribute, made absolute when the page is known.
        public String getFrameLocation()
        {
            String ret = super.getAttribute("src");
            if (null == ret)
                ret = "";
            else if (null != super.getPage())
            {
                ret = super.getPage().getAbsoluteURL(ret);
            }
            return ret;
        }

        public void setFrameLocation(String url)
        {
            super.setAttribute("src", url);
        }

        public String getFrameName()
        {
            return super.getAttribute("name");
        }

        public String toString()
        {
            return "IFRAME tag: iframe " + getFrameName() + " at " + getFrameLocation() + "; begins at: " + super.getStartPosition() + "; ends at: " + super.getEndPosition();
        }
    }
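To make the role of getFrameLocation more concrete: it returns an empty string when src is missing, and otherwise resolves src against the URL of the page. The sketch below imitates that behavior using only java.net.URI. The class SrcResolver and its resolve method are hypothetical names for illustration, not part of HTML Parser, and the library's own getAbsoluteURL may treat edge cases (for example malformed URLs) differently.

```java
import java.net.URI;

/** Illustration only: imitates the null and relative-URL handling of getFrameLocation. */
final class SrcResolver {

    /**
     * @param pageUrl URL of the page containing the iframe (may be null)
     * @param src     raw value of the src attribute (may be null)
     */
    static String resolve(String pageUrl, String src) {
        if (src == null) {
            return "";      // no src attribute: same fallback as the tag class
        }
        if (pageUrl == null) {
            return src;     // no page context: return the raw value unchanged
        }
        // URI.resolve throws IllegalArgumentException if src is not a valid URI.
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/news/index.html"; // made-up page URL
        System.out.println(resolve(page, "ads/banner.html")); // http://example.com/news/ads/banner.html
        System.out.println(resolve(page, "/top.html"));       // http://example.com/top.html
        System.out.println(resolve(null, "frame.html"));      // frame.html
    }
}
```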

 

Then compile this class to get IFrameTag.class. Open the official jar with WinRAR or 7-Zip and put the compiled IFrameTag.class into the org/htmlparser/tags directory inside the jar. This gives you a complete jar. At this point you can delete the custom org/htmlparser/tags directory structure from the project.
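The manual WinRAR or 7-Zip step could probably also be scripted. As a rough sketch, the JDK's built-in zip file system can copy a compiled class into an existing jar. The file names come from the text above; the class JarPatcher, its addEntry method, and the assumption that both files sit in the working directory are mine.

```java
import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Illustration only: adds one file to an existing jar via the zip file system. */
final class JarPatcher {

    /** Copies classFile into the jar under entryName, replacing any existing entry. */
    static void addEntry(Path jar, Path classFile, String entryName) throws IOException {
        // Open the jar as a file system; closing it writes the change back to disk.
        try (FileSystem fs = FileSystems.newFileSystem(jar, (ClassLoader) null)) {
            Path target = fs.getPath("/", entryName);
            if (target.getParent() != null) {
                Files.createDirectories(target.getParent());
            }
            Files.copy(classFile, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path jar = Paths.get("htmlparser-2.0.jar");
        Path cls = Paths.get("IFrameTag.class");
        if (!Files.exists(jar) || !Files.exists(cls)) {
            System.err.println("Expected htmlparser-2.0.jar and IFrameTag.class in the working directory");
            return;
        }
        addEntry(jar, cls, "org/htmlparser/tags/IFrameTag.class");
    }
}
```

It is worth keeping a backup copy of the original jar before patching it, whichever way the class is added.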

 

The next step is how to use it. Generally, you combine filters with OrFilter and extract nodes through extractAllNodesThatMatch. Before that, however, one more change is needed, otherwise the custom tag still does not take effect. After creating the parser with

    org.htmlparser.Parser hpHtmlParser = org.htmlparser.Parser.createParser(sHtmlContent, sCharset);

register the custom tag with the node factory (this is the key step):

    PrototypicalNodeFactory pnfPrototypicalNodeFactory = new PrototypicalNodeFactory();
    pnfPrototypicalNodeFactory.registerTag(new IFrameTag());
    hpHtmlParser.setNodeFactory(pnfPrototypicalNodeFactory);

Only then can the tag be extracted correctly through extractAllNodesThatMatch.
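Why does registration matter? As far as I understand it, PrototypicalNodeFactory keeps a map from tag name to a prototype tag object and copies the prototype whenever the parser meets that name. A name with no prototype ends up as a generic tag node, which a filter looking for the custom class will not match. The sketch below imitates that lookup in plain Java. Every class in it (SimpleTag, SimpleIFrameTag, SimpleNodeFactory, RegistrationDemo) is a hypothetical stand-in and not an HTML Parser class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Shows that the specialised class is only produced after registration. */
final class RegistrationDemo {
    public static void main(String[] args) {
        SimpleNodeFactory factory = new SimpleNodeFactory();
        System.out.println(factory.createTag("iframe") instanceof SimpleIFrameTag); // false
        factory.registerTag(new SimpleIFrameTag());
        System.out.println(factory.createTag("iframe") instanceof SimpleIFrameTag); // true
    }
}

/** Generic fallback node, standing in for a plain tag node. */
class SimpleTag {
    String name = "";
    String[] getIds() { return new String[0]; }
    SimpleTag newInstance() { return new SimpleTag(); }
}

/** Specialised node, standing in for the custom IFRAME tag class. */
class SimpleIFrameTag extends SimpleTag {
    @Override String[] getIds() { return new String[] {"IFRAME"}; }
    @Override SimpleTag newInstance() { return new SimpleIFrameTag(); }
}

/** Maps upper-cased tag names to prototypes and copies them on demand. */
class SimpleNodeFactory {
    private final Map<String, SimpleTag> prototypes = new HashMap<>();

    void registerTag(SimpleTag prototype) {
        for (String id : prototype.getIds()) {
            prototypes.put(id.toUpperCase(Locale.ROOT), prototype);
        }
    }

    SimpleTag createTag(String tagName) {
        SimpleTag prototype = prototypes.get(tagName.toUpperCase(Locale.ROOT));
        // Unregistered names fall back to the generic node type.
        SimpleTag tag = (prototype == null) ? new SimpleTag() : prototype.newInstance();
        tag.name = tagName;
        return tag;
    }
}
```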

Good luck!

 

 

 

References:

1. HTML Parser official homepage: http://htmlparser.sourceforge.net/

2. HTML Parser official documents: http://htmlparser.sourceforge.net/javadoc/index.html

3. HTML Parser on Baidu Encyclopedia: http://baike.baidu.com/view/1174491.htm

 
