Previous article: Making a simple web crawler using the String class
http://blog.csdn.net/gfd54gd5f46/article/details/54729874
That article used the String class's substring() method to cut the desired content out of the page source.
That approach works when you only need to extract a few simple values.
But if you want to pull out a specific kind of data that may occur thousands of times, the String methods become cumbersome and the amount of code grows quickly.
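To see why, here is a minimal sketch of the substring approach (the HTML snippet and method name are made up for illustration): it works for one field, but every new piece of data means more indexOf/substring bookkeeping.

```java
public class SubstringDemo {

    // Extract the first href value that starts with "ftp" from raw HTML,
    // using only indexOf() and substring()
    public static String firstFtpLink(String html) {
        int start = html.indexOf("href=\"ftp");
        if (start == -1) {
            return null; // no ftp link on the page
        }
        start += "href=\"".length();        // skip past the attribute prefix
        int end = html.indexOf('"', start); // closing quote of the URL
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<a href=\"ftp://example.com/movie.mkv\">download</a>";
        System.out.println(firstFtpLink(html)); // ftp://example.com/movie.mkv
    }
}
```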
Now we will bring in a web-page parsing library to help us parse pages more easily.
Downloading the HtmlParser library
Official site: http://htmlparser.sourceforge.net/
Online API documentation: http://htmlparser.sourceforge.net/javadoc/index.html
Downloads: https://sourceforge.net/projects/htmlparser/files/
On the downloads page, get HtmlParser version 1.6.
Unzip the archive once the download completes.
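Alternatively, if your project uses Maven, the same library is published on Maven Central under the `org.htmlparser:htmlparser` coordinates (verify the version against the downloads page), so the manual jar steps below can be skipped:

```xml
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.6</version>
</dependency>
```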
Importing Jar Packages
Right-click the project and create a folder named lib.
Copy htmlparser.jar into it.
Then right-click the project -> Properties -> Java Build Path.
Add the jar to the build path.
Finally, click OK.
Now create a Parser object in your code to check that everything is set up correctly.
With the downloaded jar on the build path, you can start parsing pages.
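As a quick smoke test that does not need the network, you can feed the Parser an inline HTML string instead of a URL (the snippet below is a hypothetical page fragment; this assumes htmlparser 1.6 is on the classpath):

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.LinkStringFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class ParserSmokeTest {
    public static void main(String[] args) throws ParserException {
        // Hypothetical fragment standing in for a downloaded movie page
        String html = "<html><body>"
                + "<a href=\"ftp://example.com/movie.mkv\">download</a>"
                + "</body></html>";
        // createParser builds a Parser from an in-memory string
        Parser parser = Parser.createParser(html, "UTF-8");
        // Keep only links whose URL contains "ftp"
        NodeList links = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
        for (int i = 0; i < links.size(); i++) {
            System.out.println(((LinkTag) links.elementAt(i)).getLink());
        }
    }
}
```

If the jar is wired up correctly, this prints the single ftp link from the snippet.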
Parsing Web pages
Let's find a movie site to test with:
http://www.dytt8.net
1. Get the download link for a single movie
- Go into the Japan/Korea movies section.
Then open a movie title; I'll pick the first one here.
- It opens onto the movie's introduction page.
/20170129/53099.html
Press F12 to enter the browser's debug mode.
Searching the entire page, we find only one link containing ftp:
ftp://ygdy8:[email protected]:9239/[阳光电影www.ygdy8.com].蜡笔小新:梦境世界大突击.BD.720p.日国粤三语中字.mkv
Now that we've spotted the pattern, can't we simply extract the links that contain ftp?
```java
/**
 * Get the movie download link
 */
public static void test1() {
    try {
        Parser parser = new Parser("http://www.dytt8.net/html/gndy/dyzz/20170208/53190.html");
        // LinkStringFilter("ftp") is a link-string filter that keeps
        // every link whose URL contains "ftp"
        NodeList nodeList = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
        // Traverse the node list
        for (int i = 0; i < nodeList.size(); i++) {
            // Cast the i-th element of the node list to a link tag
            LinkTag tag = (LinkTag) nodeList.elementAt(i);
            // Print the link
            System.out.println(tag.getLink());
        }
    } catch (ParserException e1) {
        e1.printStackTrace();
    }
}
```
Running it, we really do get the link.
2. Get all the movie introduction page addresses from a single list page
We know that each movie's introduction is a separate page.
For example, the page we just opened has the suffix /html/gndy/dyzz/20170129/53099.html. Analyzing the source of the list page http://www.dytt8.net/html/gndy/dyzz/index.html, we find that the list sits inside a table.
Each row holds various pieces of information about one movie.
Can we get the introduction address of every movie, like "/html/gndy/dyzz/20170129/53099.html"?
As before, we look for a pattern first. Searching the content, we find that inside this table every a tag has a class attribute of "ulink".
Once we've found the pattern, we'll implement it in code.
```java
/**
 * http://www.ygdy8.net/html/gndy/dyzz/index.html
 * Get the introduction addresses of the 25 movies on the page
 */
public static void test2() {
    try {
        Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/index.html");
        // Filter all tags whose class attribute is "ulink"
        NodeList nodeList = parser.extractAllNodesThatMatch(
                new HasAttributeFilter("class", "ulink"));
        System.out.println("Found: " + nodeList.size() + " records.");
        for (int i = 0; i < nodeList.size(); i++) {
            // Get the link tag of the node
            LinkTag tag = (LinkTag) nodeList.elementAt(i);
            System.out.println(tag.getLink());
        }
    } catch (ParserException e1) {
        e1.printStackTrace();
    }
}
```
This gives us the link to every introduction page.
3. Get all the list pages
In step 2 we obtained every introduction page on a single list page; next we need the addresses of all the list pages themselves.
As before, analyze the page source to find the pattern.
Searching for option tags finds 161 matches.
Filtering again for option tags with a value attribute leaves 159; since the first two are not list links, 157 of them actually point at list pages.
Once we know the pattern, we can fetch them:
```java
/**
 * http://www.ygdy8.net/html/gndy/dyzz/index.html
 * Get the 157 list page addresses from the page
 */
public static void test3() {
    try {
        Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/index.html");
        // Get all option tags that have a value attribute
        NodeList nodeList = parser.extractAllNodesThatMatch(new TagNameFilter("option"))
                .extractAllNodesThatMatch(new HasAttributeFilter("value"));
        // System.out.println("Found: " + nodeList.size() + " records.");
        for (int i = 0; i < nodeList.size(); i++) {
            OptionTag tag = (OptionTag) nodeList.elementAt(i);
            // Keep only values that point at a list page
            if (tag.getAttribute("value").contains("list")) {
                System.out.println(tag.getAttribute("value"));
            }
        }
    } catch (ParserException e1) {
        e1.printStackTrace();
    }
}
```
This gives us all the list pages.
Putting it all together
Now I'll combine these three methods:
Full source
```java
package com.lingdu.htmlparser;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.LinkStringFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.OptionTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserDemo {

    /**
     * Save one movie download address to a file.
     * @param i        record number
     * @param content  the download link
     * @param pathName the file to append to
     */
    public static void saveMovieDownloadAddress(int i, String content, String pathName) {
        if (!pathName.equals("")) {
            File fileName = new File(pathName);
            try {
                PrintWriter pw = new PrintWriter(new BufferedWriter(
                        new OutputStreamWriter(new FileOutputStream(fileName, true), "UTF-8")));
                pw.println(i + ":" + content);
                pw.flush();
                pw.close();
                System.out.println("Save ------> " + content + " Success!");
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * 3. Get the movie download links from an introduction page.
     */
    public static List<String> getDownloadUrl(String movieUrl) {
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser("http://www.ygdy8.net" + movieUrl);
            // LinkStringFilter("ftp") keeps every link whose URL contains "ftp"
            NodeList nodeList = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
            // Traverse the node list and collect the links
            for (int i = 0; i < nodeList.size(); i++) {
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                list.add(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * 2. Get the introduction page address of every movie on one list page.
     */
    public static List<String> getAllMoviePageFromOneList(String oneListUrl) {
        // Holds the introduction addresses of all movies on this list page
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/" + oneListUrl);
            NodeList nodeList = parser.extractAllNodesThatMatch(
                    new HasAttributeFilter("class", "ulink"));
            System.out.println("Found: " + nodeList.size() + " records.");
            for (int i = 0; i < nodeList.size(); i++) {
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                list.add(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * 1. Get the address of every list page from the index page.
     */
    public static List<String> getAllListFromUrl(String url) {
        // Holds all the list pages
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser(url);
            // Get all option tags that have a value attribute
            NodeList nodeList = parser.extractAllNodesThatMatch(new TagNameFilter("option"))
                    .extractAllNodesThatMatch(new HasAttributeFilter("value"));
            for (int i = 0; i < nodeList.size(); i++) {
                OptionTag tag = (OptionTag) nodeList.elementAt(i);
                // Keep only values that point at a list page, e.g. list_23_1.html
                if (tag.getAttribute("value").contains("list")) {
                    list.add(tag.getAttribute("value"));
                }
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * Integration logic: run all three steps together.
     */
    public static void logicIntegration() {
        // All pages of the paginated list
        List<String> allList = getAllListFromUrl("http://www.ygdy8.net/html/gndy/dyzz/index.html");
        // All movie introduction page addresses
        List<String> allMoviePageUrl = new ArrayList<String>();
        // All movie download links
        List<String> allDownloadUrl = new ArrayList<String>();
        // Running record count
        int i = 0;
        for (String str1 : allList) {
            System.out.println("\nPage: --------------------> " + str1 + " --------------------");
            allMoviePageUrl = getAllMoviePageFromOneList(str1);
            for (String str2 : allMoviePageUrl) {
                allDownloadUrl = getDownloadUrl(str2);
                for (String str3 : allDownloadUrl) {
                    i += 1;
                    // "movie DownLoad Address.txt" is a file name you can choose yourself
                    saveMovieDownloadAddress(i, str3, "movie DownLoad Address.txt");
                }
            }
        }
    }

    public static void main(String[] args) {
        logicIntegration();
    }
}
```
All the addresses are saved!