Java Crawler in Action (II): Crawling All 2015 Movie Download Links from a Video Site

Source: Internet
Author: User

Preface: This is the second article in my Java crawler series. The first article only crawled the links on a target site; this one raises the difficulty a step by grabbing the content we need from the target pages and saving it to a database. As the test case I chose the movie download site I use most often (http://www.80s.la/). I originally wanted to crawl the download links of every movie on the site, but that would have taken too long, so I settled for crawling the download links of the 2015 movies.

Note: At the end of the article is the complete list of download links I crawled (including movie names and Thunder download links).

1. A Brief Introduction to the Principle

The principle is similar to the first article's. The difference is that this site has far too many links; if we did not filter them by type, the crawl would take an unimaginable amount of time:


Category links and tag links are not needed, so we do not use them to reach other pages; we get the movie lists only through the pagination links at the bottom of the "all movies" listing. Likewise, on a movie detail page we grab only the movie title and the Thunder download link, and do not crawl any deeper, ignoring the links to recommended movies on that page.
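In other words, the two kinds of pages we do want to follow can be recognized purely by the shape of their URLs. Below is a minimal sketch of that filtering idea; the full code in section 2 does the same thing in its checkUrl and isMoviePage methods, and the tag URL in the test is made up for illustration:

import java.util.regex.Pattern;

public class UrlFilter {
    // 2015 movie list pages, e.g. http://www.80s.la/movie/list/-2015----p2
    private static final Pattern LIST_PAGE =
            Pattern.compile("http://www.80s.la/movie/list/-2015----p\\d*");
    // Movie detail pages, e.g. http://www.80s.la/movie/12345
    private static final Pattern MOVIE_PAGE =
            Pattern.compile("http://www.80s.la/movie/\\d+");

    public static boolean shouldFollow(String url) {
        // Follow a link only if it is a list page or a detail page;
        // category and tag links match neither pattern and are dropped
        return LIST_PAGE.matcher(url).find() || MOVIE_PAGE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(shouldFollow("http://www.80s.la/movie/list/-2015----p3")); // true
        System.out.println(shouldFollow("http://www.80s.la/movie/7943"));             // true
        System.out.println(shouldFollow("http://www.80s.la/tag/action"));             // false (hypothetical tag link)
    }
}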


Finally, the download links of all captured movies are saved in the videoLinkMap collection, and the data is written to MySQL by traversing this collection.

Note: If the principle is still not clear enough, I recommend reading my previous article: http://www.zifangsky.cn/2015/12/java crawler Combat (a): Crawl all the links on a website/

2. Code Implementation

The implementation principle has been stated above, and the code carries detailed comments, so there is not much more to say. The code is as follows:

package action;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkGrab {

    public static void main(String[] args) {
        VideoLinkGrab videoLinkGrab = new VideoLinkGrab();
        videoLinkGrab.saveData("http://www.80s.la/movie/list/-2015----p");
    }

    /**
     * Crawl the site and save the captured data to the database
     *
     * @param baseUrl the crawler's starting point
     */
    public void saveData(String baseUrl) {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>(); // link -> traversed yet?
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // video download links
        String oldLinkHost = ""; // host
        Pattern p = Pattern.compile("(https?://)?[^/\\s]*"); // e.g. http://www.zifangsky.cn
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }
        oldMap.put(baseUrl, false);
        videoLinkMap = crawlLinks(oldLinkHost, oldMap);

        // Traverse the collection and save the data to the database
        try {
            Connection connection = JDBCDemo.getConnection();
            for (Map.Entry<String, String> mapping : videoLinkMap.entrySet()) {
                PreparedStatement pStatement = connection
                        .prepareStatement("insert into movie(MovieName,MovieLink) values(?,?)");
                pStatement.setString(1, mapping.getKey());
                pStatement.setString(2, mapping.getValue());
                pStatement.executeUpdate();
                pStatement.close();
                // System.out.println(mapping.getKey() + " : " + mapping.getValue());
            }
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    /**
     * Crawl all reachable links of the site breadth-first: keep issuing GET
     * requests to links that have not been traversed yet, until no new link
     * can be found in the whole collection, which means the task is finished.
     *
     * For each requested page, search the HTML with regular expressions for
     * the video links we need and save them into videoLinkMap.
     *
     * @param oldLinkHost the host, e.g. http://www.zifangsky.cn
     * @param oldMap      the collection of links to traverse
     * @return the collection of all captured video download links
     */
    private Map<String, String> crawlLinks(String oldLinkHost, Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>(); // new links found in this pass
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // video download links
        String oldLink = "";

        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            // System.out.println("link: " + mapping.getKey() + " -------- checked: " + mapping.getValue());
            // Only handle links that have not been traversed yet
            if (!mapping.getValue()) {
                oldLink = mapping.getKey();
                // Issue a GET request
                try {
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2500);
                    connection.setReadTimeout(2500);

                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(
                                new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = null;
                        Matcher matcher = null;

                        // Movie detail page: take out the video download link, do not crawl further
                        if (isMoviePage(oldLink)) {
                            boolean checkTitle = false;
                            String title = "";
                            while ((line = reader.readLine()) != null) {
                                // Take the video title out of the page
                                if (!checkTitle) {
                                    pattern = Pattern.compile("([^\\s]+).*?</title>");
                                    matcher = pattern.matcher(line);
                                    if (matcher.find()) {
                                        title = matcher.group(1);
                                        checkTitle = true;
                                        continue;
                                    }
                                }
                                // Take the video download link out of the page
                                pattern = Pattern.compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
                                matcher = pattern.matcher(line);
                                if (matcher.find()) {
                                    videoLinkMap.put(title, matcher.group(1));
                                    System.out.println("Video name: " + title
                                            + "  ------  video link: " + matcher.group(1));
                                    break; // this page is done
                                }
                            }
                        }
                        // Movie list page: collect further links
                        else if (checkUrl(oldLink)) {
                            while ((line = reader.readLine()) != null) {
                                pattern = Pattern.compile("<a href=\"([^\"\\s]*)\"");
                                matcher = pattern.matcher(line);
                                while (matcher.find()) {
                                    String newLink = matcher.group(1).trim(); // the link
                                    // If the link does not start with http, complete it with the host
                                    if (!newLink.startsWith("http")) {
                                        if (newLink.startsWith("/"))
                                            newLink = oldLinkHost + newLink;
                                        else
                                            newLink = oldLinkHost + "/" + newLink;
                                    }
                                    // Remove a trailing "/" from the link
                                    if (newLink.endsWith("/"))
                                        newLink = newLink.substring(0, newLink.length() - 1);
                                    // De-duplicate, and discard links to other sites
                                    if (!oldMap.containsKey(newLink)
                                            && !newMap.containsKey(newLink)
                                            && (checkUrl(newLink) || isMoviePage(newLink))) {
                                        System.out.println("temp: " + newLink);
                                        newMap.put(newLink, false);
                                    }
                                }
                            }
                        }

                        reader.close();
                        inputStream.close();
                    }
                    connection.disconnect();
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }

                try {
                    Thread.sleep(1000); // pause between requests
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }
        // If there are new links, continue traversing
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            videoLinkMap.putAll(crawlLinks(oldLinkHost, oldMap));
            // the map's unique keys prevent duplicate key-value pairs
        }
        return videoLinkMap;
    }

    /**
     * Check whether a URL is a 2015 movie list page
     *
     * @param url the URL to check
     * @return status
     */
    public boolean checkUrl(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/list/-2015----p\\d*");
        Matcher matcher = pattern.matcher(url);
        if (matcher.find())
            return true; // a 2015 list page
        else
            return false;
    }

    /**
     * Check whether a page is a movie detail page
     *
     * @param url the page link
     * @return status
     */
    public boolean isMoviePage(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/\\d+");
        Matcher matcher = pattern.matcher(url);
        if (matcher.find())
            return true; // a movie detail page
        else
            return false;
    }
}
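One caveat: the listing above calls JDBCDemo.getConnection(), but that helper class is not included in this article. Below is a minimal sketch of what it might look like, assuming MySQL Connector/J 5.x and a local database named test; the database name, user, password, and the movie table schema are my assumptions, not from the source:

package action;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class JDBCDemo {
    /*
     * Assumed table the crawler inserts into:
     *   CREATE TABLE movie (
     *     Id INT AUTO_INCREMENT PRIMARY KEY,
     *     MovieName VARCHAR(255),
     *     MovieLink TEXT
     *   );
     */
    private static final String URL =
            "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8";
    private static final String USER = "root";     // assumption
    private static final String PASSWORD = "root"; // assumption

    public static Connection getConnection() throws SQLException {
        try {
            Class.forName("com.mysql.jdbc.Driver"); // Connector/J 5.x driver class
        } catch (ClassNotFoundException e) {
            throw new SQLException("MySQL driver not found", e);
        }
        return DriverManager.getConnection(URL, USER, PASSWORD);
    }
}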

Note: If you want to crawl specified content from some other site in the same way, you need to adapt some of these regular expressions appropriately to the actual situation of that site.
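As a concrete illustration, here is how the Thunder-link regex from the listing behaves against a line in the shape the crawler expects; the HTML fragment below is invented for this test, not real page content:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        // Made-up fragment in the expected shape (not real page content)
        String line = "<a href=\"thunder://QUFodHRwOi8vZXhhbXBsZS5jb20vbW92aWUubXA0Wlo=\" "
                + "thunderrestitle=\"Sample.Movie.2015\">";
        Pattern p = Pattern.compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
        Matcher m = p.matcher(line);
        if (m.find()) {
            // Group 1 captures the thunder:// download link itself
            System.out.println(m.group(1));
        }
    }
}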

3. Test Results

[Figure: Test 1]

[Figure: Test 2]

Attachment: I have exported the data to a Word document; download it if needed. Baidu Cloud link: http://pan.baidu.com/s/1nuupHMx

This article is from "Zifangsky's personal blog"; please keep this source when reposting: http://983836259.blog.51cto.com/7311475/1730243
