Parsing Web Pages Using the HtmlParser Class



Previous article: Making a simple web crawler using the String class

http://blog.csdn.net/gfd54gd5f46/article/details/54729874

    • That article used the substring() method of the String class to cut the desired content out of the raw page source.

    • That approach is fine when you only need to grab a few simple pieces of data.

    • But to extract specific data at scale (possibly thousands of entries), the String-class approach gets cumbersome and the amount of code grows quickly; a sketch of that style follows below.
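To recall what that looks like, here is a minimal sketch of the substring() style (not from the original article; the HTML string is a made-up stand-in, not real page source):

    public class SubstringDemo {
        public static void main(String[] args) {
            // Stand-in HTML; a real page would be fetched over the network
            String html = "<a href=\"ftp://host/movie.mkv\">download</a>";
            int start = html.indexOf("ftp://");             // locate where the link starts
            int end = html.indexOf("\"", start);            // the closing quote of href
            System.out.println(html.substring(start, end)); // ftp://host/movie.mkv
        }
    }

Every extra field you want to pull out means another round of indexOf/substring bookkeeping, which is exactly what becomes unwieldy at scale.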




Now we will bring in a web page parsing utility class to help us parse pages more easily.

Download the HtmlParser Library

Official site: http://htmlparser.sourceforge.net/

Online API documentation: http://htmlparser.sourceforge.net/javadoc/index.html

Downloads: https://sourceforge.net/projects/htmlparser/files/


On the downloads page, open the htmlparser folder and download version 1.6.


Unzip the archive once the download completes.

Importing the Jar Package


Right-click the project.


Create a folder named lib.


Copy htmlparser.jar into it.


Right-click the project -> Properties -> Java Build Path.


Add the jar package.


Finally, click OK.

In your code, try creating a Parser object to verify that everything works.
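For instance, a minimal smoke test like the following (any reachable URL will do as the target) should compile and run once the jar is on the build path; here getEncoding() is used only as a quick sanity check that the parser connected to the page:

    import org.htmlparser.Parser;
    import org.htmlparser.util.ParserException;

    public class ParserSmokeTest {
        public static void main(String[] args) {
            try {
                // If htmlparser.jar is on the build path, this compiles;
                // if the URL is reachable, it prints the page's character set.
                Parser parser = new Parser("http://www.dytt8.net");
                System.out.println(parser.getEncoding());
            } catch (ParserException e) {
                e.printStackTrace();
            }
        }
    }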



The project now references the downloaded jar, so we can start parsing pages.

Parsing Web pages



Let's find a movie site to test against.

http://www.dytt8.net


1. Get the download link for a single movie

    • Go into the Japanese/Korean movies section here


Then open a movie title; I'll pick the first one here.

    • Opening it, we find that this page is the film's introduction page, with the suffix:
/20170129/53099.html


Press F12 to enter the browser's developer tools.

    • Analysis


Searching the whole page, we find only one FTP link:

ftp://ygdy8:[email protected]:9239/[阳光电影www.ygdy8.com].蜡笔小新:梦境世界大突击.BD.720p.日国粤三语中字.mkv


Having spotted this pattern, can't we simply extract the links whose URL contains "ftp"?

    • Code implementation
    /**
     * Get the movie download link
     */
    public static void test1() {
        try {
            Parser parser = new Parser("http://www.dytt8.net/html/gndy/dyzz/20170208/53190.html");
            // Extract all matching nodes into a node list.
            // LinkStringFilter("ftp") is a link-string filter that matches
            // every link whose URL contains "ftp".
            NodeList nodeList = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
            // Traverse the node list
            for (int i = 0; i < nodeList.size(); i++) {
                // Cast the i-th element of the node list to a link tag
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                // Print the link
                System.out.println(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
    }


After running it, we really do get the link.

2. Get the introduction page addresses of all movies on a single list page


We know that each film's introduction page is a separate page.

For example, the page we just visited has the suffix /html/gndy/dyzz/20170129/53099.html. Analyzing the source of the list page http://www.dytt8.net/html/gndy/dyzz/index.html, we find that the list sits inside a table.


Each row holds various pieces of information about one movie.


Can we grab the introduction address ("/html/gndy/dyzz/20170129/53099.html") for every movie on the page?

As before, we look for a pattern. Inspecting the content, we find that inside this table every <a> tag carries a class attribute of "ulink".


Once we've found the pattern, we'll implement it in code.

    • Code implementation
    /**
     * http://www.ygdy8.net/html/gndy/dyzz/index.html
     * Get the introduction addresses of the 25 movies on this list page
     */
    public static void test2() {
        try {
            Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/index.html");
            // Match every tag whose class attribute is "ulink"
            NodeList nodeList = parser.extractAllNodesThatMatch(
                    new HasAttributeFilter("class", "ulink"));
            System.out.println("Found: " + nodeList.size() + " entries.");
            for (int i = 0; i < nodeList.size(); i++) {
                // Cast the node to a link tag
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                System.out.println(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
    }


So you get a link to each introduction page.
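One design note: HasAttributeFilter("class", "ulink") matches any tag carrying that class, not just <a> tags. On this page that happens to be enough, but if you wanted a stricter match, the library's AndFilter can combine conditions; a sketch, assuming the same page structure:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.AndFilter;
    import org.htmlparser.filters.HasAttributeFilter;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class UlinkFilterDemo {
        public static void main(String[] args) throws ParserException {
            Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/index.html");
            // Match only <a> tags whose class attribute is "ulink"
            NodeList nodes = parser.extractAllNodesThatMatch(
                    new AndFilter(new TagNameFilter("a"),
                                  new HasAttributeFilter("class", "ulink")));
            System.out.println("Found: " + nodes.size() + " links");
        }
    }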

3. Get all the list pages


In step 2 we obtained all the introduction pages of a single list page; next we need all the list pages themselves.



As before, analyze the page source and look for a pattern.


Searching for option tags turns up 161 matches.


Filtering again for option tags that carry a value attribute leaves 159; the first two are not list entries, so 157 list pages remain.


Once we know the pattern, we can write the code to fetch them.

    • Code implementation
    /**
     * http://www.ygdy8.net/html/gndy/dyzz/index.html
     * Get the addresses of the 157 list pages from the web page
     */
    public static void test3() {
        try {
            Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/index.html");
            // Get all option tags that carry a value attribute
            NodeList nodeList = parser.extractAllNodesThatMatch(new TagNameFilter("option"))
                    .extractAllNodesThatMatch(new HasAttributeFilter("value"));
            // System.out.println("Found: " + nodeList.size() + " entries.");
            for (int i = 0; i < nodeList.size(); i++) {
                // Cast the node to an option tag
                OptionTag tag = (OptionTag) nodeList.elementAt(i);
                // Keep only the tags whose value attribute contains "list"
                if (tag.getAttribute("value").contains("list")) {
                    System.out.println(tag.getAttribute("value"));
                }
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
    }


And with that, we have all the list pages.

Putting It All Together


Now I'm going to chain these three methods together:

    • 1. Get all the list pages

    • 2. Traverse the introduction pages on each list page

    • 3. Extract the download links from each introduction page



Full source

package com.lingdu.htmlparser;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.LinkStringFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.OptionTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserDemo {

    /**
     * Save one movie download address to a file
     * @param i        running count of saved entries
     * @param content  the download address
     * @param pathName save path
     */
    public static void saveMovieDownloadAddress(int i, String content, String pathName) {
        if (!pathName.equals("")) {
            File fileName = new File(pathName);
            try {
                // Open in append mode with UTF-8 encoding
                PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(fileName, true), "UTF-8")));
                pw.println(i + ":" + content);
                pw.flush();
                pw.close();
                System.out.println("Save ------> " + content + " success!");
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * 3. Get the movie download links from an introduction page
     */
    public static List<String> getDownloadUrl(String movieUrl) {
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser("http://www.ygdy8.net" + movieUrl);
            // Extract all matching nodes into a node list.
            // LinkStringFilter("ftp") matches every link whose URL contains "ftp".
            NodeList nodeList = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
            // Traverse the node list
            for (int i = 0; i < nodeList.size(); i++) {
                // Cast the node to a link tag
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                // System.out.println(tag.getLink());
                list.add(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * 2. Get the introduction addresses of all movies on one list page
     */
    public static List<String> getAllMoviePageFromOneList(String oneListUrl) {
        // Holds the introduction addresses of all movies on this list page
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/" + oneListUrl);
            NodeList nodeList = parser.extractAllNodesThatMatch(
                    new HasAttributeFilter("class", "ulink"));
            System.out.println("Found: " + nodeList.size() + " entries.");
            for (int i = 0; i < nodeList.size(); i++) {
                // Cast the node to a link tag
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                // System.out.println(tag.getLink());
                list.add(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * 1. Get the addresses of all list pages from the index page
     */
    public static List<String> getAllListFromUrl(String url) {
        // Holds all the list pages
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser(url);
            // Get all option tags that carry a value attribute
            NodeList nodeList = parser.extractAllNodesThatMatch(new TagNameFilter("option"))
                    .extractAllNodesThatMatch(new HasAttributeFilter("value"));
            // System.out.println("Found: " + nodeList.size() + " entries.");
            for (int i = 0; i < nodeList.size(); i++) {
                // Cast the node to an option tag
                OptionTag tag = (OptionTag) nodeList.elementAt(i);
                // Keep only values that contain "list", e.g. list_23_1.html
                if (tag.getAttribute("value").contains("list")) {
                    // System.out.println(tag.getAttribute("value"));
                    list.add(tag.getAttribute("value"));
                }
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * Integration logic: runs the three methods together
     */
    public static void logicIntegration() {
        // All list pages
        List<String> allList = getAllListFromUrl("http://www.ygdy8.net/html/gndy/dyzz/index.html");
        // All movie introduction page addresses
        List<String> allMoviePageUrl = new ArrayList<String>();
        // All movie download addresses
        List<String> allDownloadUrl = new ArrayList<String>();
        // Running count
        int i = 0;
        for (String str1 : allList) {
            System.out.println("\nPage: --------------------> " + str1 + " --------------------");
            allMoviePageUrl = getAllMoviePageFromOneList(str1);
            for (String str2 : allMoviePageUrl) {
                allDownloadUrl = getDownloadUrl(str2);
                for (String str3 : allDownloadUrl) {
                    i += 1;
                    // "movie DownLoad Address.txt" is a file name you can choose yourself
                    saveMovieDownloadAddress(i, str3, "movie DownLoad Address.txt");
                    // System.out.println(str3);
                }
            }
        }
    }

    public static void main(String[] args) {
        logicIntegration();
    }
}



All the download addresses are saved!
