Processing IFRAME by HTML Parser

Last Update:2018-12-05 Source: Internet

Author: User


Recently I have been working on web crawlers and used HTML Parser. I won't cover the basics, since most of that is already available online; this article is about my experience with custom tags.

1. Brief Introduction to HTML Parser

In the words of Baidu Encyclopedia:

htmlparser is a pure Java library for parsing HTML. It does not depend on any other Java libraries and is mainly used to transform or extract HTML. It can parse HTML very quickly and without errors. To put it bluntly, htmlparser is currently the best tool for HTML parsing and analysis. Whether you want to scrape web page data or modify HTML content, once you have used htmlparser you will not be able to stop praising it.

In its own words:

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.

2. Strangely, HTML Parser has no built-in tag class for a tag as common as IFRAME (I am using the latest htmlparser-2.0.jar). I searched online for a while and found no good solution, so I read its Javadoc and worked one out myself.

First, create a new Java project with the directory structure org/htmlparser/tags and add the htmlparser-2.0.jar downloaded from the official website to the project. Decompile FrameTag.class and use it as a reference to write IframeTag.java. The complete content is as follows:

package org.htmlparser.tags;

import org.htmlparser.lexer.Page;
import org.htmlparser.nodes.TagNode;

public class IframeTag extends TagNode {
    // Tag names this class is registered for.
    private static final String[] mIds = {"IFRAME"};

    public String[] getIds() {
        return mIds;
    }

    // Returns the src attribute, made absolute when the page is known.
    public String getFrameLocation() {
        String ret = super.getAttribute("src");
        if (null == ret)
            ret = "";
        else if (null != super.getPage()) {
            ret = super.getPage().getAbsoluteURL(ret);
        }
        return ret;
    }

    public void setFrameLocation(String url) {
        super.setAttribute("src", url);
    }

    public String getFrameName() {
        return super.getAttribute("name");
    }

    public String toString() {
        return "IFRAME tag: iframe " + getFrameName() + " at " + getFrameLocation()
            + "; begins at: " + super.getStartPosition()
            + "; ends at: " + super.getEndPosition();
    }
}
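The getFrameLocation method above turns the src attribute into an absolute URL when the page is known. As a rough illustration of that idea using only the JDK, the sketch below resolves a relative src against a page URL with java.net.URI. The class FrameUrlDemo and the method toAbsolute are hypothetical names made up for this example, and HTML Parser's own getAbsoluteURL may handle edge cases differently.

```java
import java.net.URI;

// Hypothetical helper, not part of HTML Parser: resolves an iframe "src"
// value against the URL of the page that contains the iframe.
class FrameUrlDemo {

    static String toAbsolute(String pageUrl, String src) {
        if (src == null) {
            return "";      // same fallback as getFrameLocation(): no src, empty string
        }
        if (pageUrl == null) {
            return src;     // no page context, so return the value unchanged
        }
        // URI.resolve applies the standard relative-reference rules.
        // It throws IllegalArgumentException if src is not a valid URI.
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/news/index.html";
        System.out.println(toAbsolute(page, "ads/frame.html"));
        System.out.println(toAbsolute(page, "/top.html"));
        System.out.println(toAbsolute(null, "frame.html"));
    }
}
```

A crawler usually needs the absolute form, because the iframe target is fetched in a separate request and a relative src on its own is not enough to build that request.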

Then compile this class to get IframeTag.class. Open the official jar with WinRAR or 7-Zip and put the compiled IframeTag.class into the org/htmlparser/tags directory inside the jar. This gives you a complete jar that includes the new tag. At this point you can delete the custom org/htmlparser/tags directory from the project.

The next step is how to use it. Generally, you combine filters with OrFilter and extract the nodes with extractAllNodesThatMatch. Before that, however, one more change is needed; otherwise the custom tag still does not work. After creating the parser with

org.htmlparser.Parser hpHtmlParser = org.htmlparser.Parser.createParser(sHtmlContent, sCharset);

you need to register the tag with a node factory (this is the key step):

PrototypicalNodeFactory pnfPrototypicalNodeFactory = new PrototypicalNodeFactory();
pnfPrototypicalNodeFactory.registerTag(new IframeTag());
hpHtmlParser.setNodeFactory(pnfPrototypicalNodeFactory);

Only then can the tag be extracted correctly through extractAllNodesThatMatch.
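To see why the registration step matters, it may help to picture a prototype-based factory in miniature. The sketch below is a simplified model written for this article, not HTML Parser's code: GenericTag, IframeNode and Factory are hypothetical stand-ins. It assumes the factory keeps a map from tag name to a prototype object and falls back to a generic node for any name it does not know, which is why a filter that looks for the custom class would find nothing until the prototype is registered.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Simplified model only: none of these types are HTML Parser classes.
class PrototypeRegistryDemo {

    // Generic node, handed out for every tag name the factory does not know.
    static class GenericTag {
        String name;
        GenericTag newInstance() { return new GenericTag(); }
        String[] ids() { return new String[0]; }
    }

    // Stand-in for a custom tag class such as the iframe tag.
    static class IframeNode extends GenericTag {
        @Override GenericTag newInstance() { return new IframeNode(); }
        @Override String[] ids() { return new String[] {"IFRAME"}; }
    }

    // Keeps one prototype per tag name and creates nodes from it.
    static class Factory {
        private final Map<String, GenericTag> prototypes = new HashMap<>();

        void register(GenericTag prototype) {
            for (String id : prototype.ids()) {
                prototypes.put(id.toUpperCase(Locale.ENGLISH), prototype);
            }
        }

        GenericTag create(String tagName) {
            GenericTag proto = prototypes.get(tagName.toUpperCase(Locale.ENGLISH));
            GenericTag node = (proto == null) ? new GenericTag() : proto.newInstance();
            node.name = tagName;
            return node;
        }
    }

    public static void main(String[] args) {
        Factory factory = new Factory();
        // Before registration the factory only knows the generic node type.
        System.out.println(factory.create("iframe") instanceof IframeNode);
        factory.register(new IframeNode());
        // After registration the same tag name yields the custom class.
        System.out.println(factory.create("iframe") instanceof IframeNode);
    }
}
```

In this model the first check prints false and the second prints true, which mirrors the behavior described above: the custom class exists in the jar, but the parser only produces it once the factory has been told about it.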

Good luck!

References:

1. HTML Parser official homepage: http://htmlparser.sourceforge.net/

2. HTML Parser official documents: http://htmlparser.sourceforge.net/javadoc/index.html

3. HTML Parser Baidu Encyclopedia: http://baike.baidu.com/view/1174491.htm

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
