Previous article: Making a simple web crawler using the String class
http://blog.csdn.net/gfd54gd5f46/article/details/54729874
That article used the String class's substring() method to cut the desired content out of the page source.
That approach works when you only need to extract a few simple values.
But if you want to pull out a specific kind of data that may occur thousands of times, the String methods become cumbersome and the amount of code grows quickly.
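To see why, here is a minimal sketch of the substring approach (the HTML snippet and method name are made up for illustration): it works for one field, but every new piece of data means more indexOf/substring bookkeeping.

```java
public class SubstringDemo {

    // Extract the first href value that starts with "ftp" from raw HTML,
    // using only indexOf() and substring()
    public static String firstFtpLink(String html) {
        int start = html.indexOf("href=\"ftp");
        if (start == -1) {
            return null; // no ftp link on the page
        }
        start += "href=\"".length();        // skip past the attribute prefix
        int end = html.indexOf('"', start); // closing quote of the URL
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<a href=\"ftp://example.com/movie.mkv\">download</a>";
        System.out.println(firstFtpLink(html)); // ftp://example.com/movie.mkv
    }
}
```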
Now we will bring in a web-page parsing library to help us parse pages more easily.
Downloading the HtmlParser library
Official site: http://htmlparser.sourceforge.net/
Online API documentation: http://htmlparser.sourceforge.net/javadoc/index.html
Downloads: https://sourceforge.net/projects/htmlparser/files/
On the downloads page, get HtmlParser version 1.6.
Unzip the archive once the download completes.
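Alternatively, if your project uses Maven, the same library is published on Maven Central under the `org.htmlparser:htmlparser` coordinates (verify the version against the downloads page), so the manual jar steps below can be skipped:

```xml
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.6</version>
</dependency>
```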
Importing Jar Packages
Right-click the project and create a folder named lib.
Copy htmlparser.jar into it.
Then right-click the project -> Properties -> Java Build Path.
Add the jar to the build path.
Finally, click OK.
Now create a Parser object in your code to check that everything is set up correctly.
With the downloaded jar on the build path, you can start parsing pages.
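As a quick smoke test that does not need the network, you can feed the Parser an inline HTML string instead of a URL (the snippet below is a hypothetical page fragment; this assumes htmlparser 1.6 is on the classpath):

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.LinkStringFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class ParserSmokeTest {
    public static void main(String[] args) throws ParserException {
        // Hypothetical fragment standing in for a downloaded movie page
        String html = "<html><body>"
                + "<a href=\"ftp://example.com/movie.mkv\">download</a>"
                + "</body></html>";
        // createParser builds a Parser from an in-memory string
        Parser parser = Parser.createParser(html, "UTF-8");
        // Keep only links whose URL contains "ftp"
        NodeList links = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
        for (int i = 0; i < links.size(); i++) {
            System.out.println(((LinkTag) links.elementAt(i)).getLink());
        }
    }
}
```

If the jar is wired up correctly, this prints the single ftp link from the snippet.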
Parsing Web pages
Let's find a movie site to test with:
http://www.dytt8.net
1. Get the download link for a single movie
- Go into the Japan/Korea movies section.
Then open a movie title; I'll pick the first one here.
- It opens onto the movie's introduction page.
/20170129/53099.html
Press F12 to enter the browser's debug mode.
Searching the entire page, we find only one link containing ftp:
ftp://ygdy8:[email protected]:9239/[阳光电影www.ygdy8.com].蜡笔小新:梦境世界大突击.BD.720p.日国粤三语中字.mkv
Now that we've spotted the pattern, can't we simply extract the links that contain ftp?
```java
/**
 * Get the movie download link
 */
public static void test1() {
    try {
        Parser parser = new Parser("http://www.dytt8.net/html/gndy/dyzz/20170208/53190.html");
        // LinkStringFilter("ftp") is a link-string filter that keeps
        // every link whose URL contains "ftp"
        NodeList nodeList = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
        // Traverse the node list
        for (int i = 0; i < nodeList.size(); i++) {
            // Cast the i-th element of the node list to a link tag
            LinkTag tag = (LinkTag) nodeList.elementAt(i);
            // Print the link
            System.out.println(tag.getLink());
        }
    } catch (ParserException e1) {
        e1.printStackTrace();
    }
}
```
Running it, we really do get the link.
2. Get all the movie introduction page addresses from a single list page
We know that each movie's introduction is a separate page.
For example, the page we just opened has the suffix /html/gndy/dyzz/20170129/53099.html. Analyzing the source of the list page http://www.dytt8.net/html/gndy/dyzz/index.html, we find that the list sits inside a table.
Each row holds various pieces of information about one movie.
Can we get the introduction address of every movie, like "/html/gndy/dyzz/20170129/53099.html"?
As before, we look for a pattern first. Searching the content, we find that inside this table every a tag has a class attribute of "ulink".
Once we've found the pattern, we'll implement it in code.
```java
/**
 * http://www.ygdy8.net/html/gndy/dyzz/index.html
 * Get the introduction addresses of the 25 movies on the page
 */
public static void test2() {
    try {
        Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/index.html");
        // Filter all tags whose class attribute is "ulink"
        NodeList nodeList = parser.extractAllNodesThatMatch(
                new HasAttributeFilter("class", "ulink"));
        System.out.println("Found: " + nodeList.size() + " records.");
        for (int i = 0; i < nodeList.size(); i++) {
            // Get the link tag of the node
            LinkTag tag = (LinkTag) nodeList.elementAt(i);
            System.out.println(tag.getLink());
        }
    } catch (ParserException e1) {
        e1.printStackTrace();
    }
}
```
This gives us the link to every introduction page.
3. Get all the list pages
In step 2 we obtained every introduction page on a single list page; next we need the addresses of all the list pages themselves.
As before, analyze the page source to find the pattern.
Searching for option tags finds 161 matches.
Filtering again for option tags with a value attribute leaves 159; since the first two are not list links, 157 of them actually point at list pages.
Once we know the pattern, we can fetch them:
```java
/**
 * http://www.ygdy8.net/html/gndy/dyzz/index.html
 * Get the 157 list page addresses from the page
 */
public static void test3() {
    try {
        Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/index.html");
        // Get all option tags that have a value attribute
        NodeList nodeList = parser.extractAllNodesThatMatch(new TagNameFilter("option"))
                .extractAllNodesThatMatch(new HasAttributeFilter("value"));
        // System.out.println("Found: " + nodeList.size() + " records.");
        for (int i = 0; i < nodeList.size(); i++) {
            OptionTag tag = (OptionTag) nodeList.elementAt(i);
            // Keep only values that point at a list page
            if (tag.getAttribute("value").contains("list")) {
                System.out.println(tag.getAttribute("value"));
            }
        }
    } catch (ParserException e1) {
        e1.printStackTrace();
    }
}
```
This gives us all the list pages.
Putting it all together
Now I'll combine these three methods:
Full source
```java
package com.lingdu.htmlparser;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.LinkStringFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.OptionTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlParserDemo {

    /**
     * Save one movie download address to a file.
     * @param i        record number
     * @param content  the download link
     * @param pathName the file to append to
     */
    public static void saveMovieDownloadAddress(int i, String content, String pathName) {
        if (!pathName.equals("")) {
            File fileName = new File(pathName);
            try {
                PrintWriter pw = new PrintWriter(new BufferedWriter(
                        new OutputStreamWriter(new FileOutputStream(fileName, true), "UTF-8")));
                pw.println(i + ":" + content);
                pw.flush();
                pw.close();
                System.out.println("Save ------> " + content + " Success!");
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * 3. Get the movie download links from an introduction page.
     */
    public static List<String> getDownloadUrl(String movieUrl) {
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser("http://www.ygdy8.net" + movieUrl);
            // LinkStringFilter("ftp") keeps every link whose URL contains "ftp"
            NodeList nodeList = parser.extractAllNodesThatMatch(new LinkStringFilter("ftp"));
            // Traverse the node list and collect the links
            for (int i = 0; i < nodeList.size(); i++) {
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                list.add(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * 2. Get the introduction page address of every movie on one list page.
     */
    public static List<String> getAllMoviePageFromOneList(String oneListUrl) {
        // Holds the introduction addresses of all movies on this list page
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser("http://www.ygdy8.net/html/gndy/dyzz/" + oneListUrl);
            NodeList nodeList = parser.extractAllNodesThatMatch(
                    new HasAttributeFilter("class", "ulink"));
            System.out.println("Found: " + nodeList.size() + " records.");
            for (int i = 0; i < nodeList.size(); i++) {
                LinkTag tag = (LinkTag) nodeList.elementAt(i);
                list.add(tag.getLink());
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * 1. Get the address of every list page from the index page.
     */
    public static List<String> getAllListFromUrl(String url) {
        // Holds all the list pages
        List<String> list = new ArrayList<String>();
        try {
            Parser parser = new Parser(url);
            // Get all option tags that have a value attribute
            NodeList nodeList = parser.extractAllNodesThatMatch(new TagNameFilter("option"))
                    .extractAllNodesThatMatch(new HasAttributeFilter("value"));
            for (int i = 0; i < nodeList.size(); i++) {
                OptionTag tag = (OptionTag) nodeList.elementAt(i);
                // Keep only values that point at a list page, e.g. list_23_1.html
                if (tag.getAttribute("value").contains("list")) {
                    list.add(tag.getAttribute("value"));
                }
            }
        } catch (ParserException e1) {
            e1.printStackTrace();
        }
        return list;
    }

    /**
     * Integration logic: run all three steps together.
     */
    public static void logicIntegration() {
        // All pages of the paginated list
        List<String> allList = getAllListFromUrl("http://www.ygdy8.net/html/gndy/dyzz/index.html");
        // All movie introduction page addresses
        List<String> allMoviePageUrl = new ArrayList<String>();
        // All movie download links
        List<String> allDownloadUrl = new ArrayList<String>();
        // Running record count
        int i = 0;
        for (String str1 : allList) {
            System.out.println("\nPage: --------------------> " + str1 + " --------------------");
            allMoviePageUrl = getAllMoviePageFromOneList(str1);
            for (String str2 : allMoviePageUrl) {
                allDownloadUrl = getDownloadUrl(str2);
                for (String str3 : allDownloadUrl) {
                    i += 1;
                    // "movie DownLoad Address.txt" is a file name you can choose yourself
                    saveMovieDownloadAddress(i, str3, "movie DownLoad Address.txt");
                }
            }
        }
    }

    public static void main(String[] args) {
        logicIntegration();
    }
}
```
All the addresses are saved!