Preface: This is the second article in my Java crawler series. The first article only crawled the links on a target site; this one raises the difficulty a step: grab the content we need from the target pages and save it to a database. As the test case I chose the movie download site I use most often (http://www.80s.la/). I originally wanted to crawl the download links for every movie on the site, but that would have taken too long, so I narrowed it to the download links for 2015 movies.
Note: At the end of the article is the complete list of download links I crawled (movie names and Thunder download links).
1. A brief introduction to the principle
The principle is much the same as in the first article. The difference is that this site has a very large number of list pages; if we did not restrict which links to follow, the crawl would take an unimaginably long time:
[Screenshot: http://s4.51cto.com/wyfs02/M01/78/E0/wKiom1aEkt2Q8pl6AAIbgtJguRY014.png]
So we do not use the category links or the tag links, and we do not crawl other pages through them; new movie list pages are reached only through the pagination links for all movies at the bottom of the page. Likewise, on a movie detail page we grab only the movie title and the Thunder download link and do not crawl any deeper, because the recommended movies on each detail page link to one another endlessly.
[Screenshot: http://s4.51cto.com/wyfs02/M01/78/E1/wKiom1aEkynSTrpPAAIy-OrLs0k412.png]
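The link filtering described above comes down to two regular-expression checks, which the crawler below implements as checkUrl() and isMoviePage(). Here is a minimal, self-contained sketch of just that filtering step (the class and method names here are my own; only the two URL patterns come from the crawler code):

```java
import java.util.regex.Pattern;

// Minimal sketch of the crawler's link filtering: a URL is worth requesting
// only if it is a 2015 movie list page or a movie detail page.
public class UrlFilter {
    // 2015 movie list pages, e.g. http://www.80s.la/movie/list/-2015----p2
    private static final Pattern LIST_PAGE =
            Pattern.compile("http://www.80s.la/movie/list/-2015----p\\d*");
    // movie detail pages, e.g. http://www.80s.la/movie/1173
    private static final Pattern MOVIE_PAGE =
            Pattern.compile("http://www.80s.la/movie/\\d+");

    public static boolean shouldCrawl(String url) {
        return LIST_PAGE.matcher(url).find() || MOVIE_PAGE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(shouldCrawl("http://www.80s.la/movie/list/-2015----p2")); // list page: crawl
        System.out.println(shouldCrawl("http://www.80s.la/movie/1173"));             // detail page: crawl
        System.out.println(shouldCrawl("http://www.80s.la/tag/juqing"));             // tag page: skip
    }
}
```

Category and tag URLs (the third example) fail both checks, so they are never added to the traversal queue.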
Finally, every captured movie download link is stored in the videoLinkMap collection, and the data is saved to MySQL by traversing that collection.
Note: If the principle is still not clear enough, I recommend reading my previous article: http://www.zifangsky.cn/2015/12/java crawler Combat (a): Crawl all the links on a website/
2. Code implementation
The implementation principle was stated above, and the code has detailed comments, so there is not much more to say. The code is as follows:
package action;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkGrab {

    public static void main(String[] args) {
        VideoLinkGrab videoLinkGrab = new VideoLinkGrab();
        videoLinkGrab.saveData("http://www.80s.la/movie/list/-2015----p");
    }

    /**
     * Crawl the site and save the data to the database
     *
     * @param baseUrl crawler start page
     */
    public void saveData(String baseUrl) {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>(); // link -> whether it has been traversed
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // video download links
        String oldLinkHost = ""; // host
        Pattern p = Pattern.compile("(https?://)?[^/\\s]*"); // e.g. http://www.zifangsky.cn
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }
        oldMap.put(baseUrl, false);
        videoLinkMap = crawlLinks(oldLinkHost, oldMap);
        // traverse the collection and save the data to the database
        try {
            Connection connection = JDBCDemo.getConnection(); // JDBC helper class from the previous article
            for (Map.Entry<String, String> mapping : videoLinkMap.entrySet()) {
                PreparedStatement pStatement = connection
                        .prepareStatement("insert into movie(MovieName,MovieLink) values(?,?)");
                pStatement.setString(1, mapping.getKey());
                pStatement.setString(2, mapping.getValue());
                pStatement.executeUpdate();
                pStatement.close();
                // System.out.println(mapping.getKey() + " : " + mapping.getValue());
            }
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    /**
     * Crawl all crawlable links on the site. The idea is breadth-first search:
     * keep issuing GET requests to links that have not been traversed yet until
     * no new link can be found anywhere in the collection, which means the task
     * is finished.
     *
     * For each requested page, search the HTML with regular expressions for the
     * video links we need and save them in the collection videoLinkMap.
     *
     * @param oldLinkHost the domain, e.g. http://www.zifangsky.cn
     * @param oldMap      the collection of links to traverse
     * @return the collection of all captured video download links
     */
    private Map<String, String> crawlLinks(String oldLinkHost, Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>(); // new links found in this pass
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // video download links
        String oldLink = "";
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            // if this link has not been traversed yet
            if (!mapping.getValue()) {
                oldLink = mapping.getKey();
                // issue a GET request
                try {
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2500);
                    connection.setReadTimeout(2500);
                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(
                                new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = null;
                        Matcher matcher = null;
                        // movie detail page: take the video download link, do not crawl further
                        if (isMoviePage(oldLink)) {
                            boolean checkTitle = false;
                            String title = "";
                            while ((line = reader.readLine()) != null) {
                                // take the video title from the page
                                if (!checkTitle) {
                                    pattern = Pattern.compile("([^\\s]+) .*?</title>");
                                    matcher = pattern.matcher(line);
                                    if (matcher.find()) {
                                        title = matcher.group(1);
                                        checkTitle = true;
                                        continue;
                                    }
                                }
                                // take the video download link from the page
                                pattern = Pattern
                                        .compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
                                matcher = pattern.matcher(line);
                                if (matcher.find()) {
                                    videoLinkMap.put(title, matcher.group(1));
                                    System.out.println("Video name: " + title
                                            + " ------ video link: " + matcher.group(1));
                                    break; // the current page has been handled
                                }
                            }
                        }
                        // movie list page
                        else if (checkUrl(oldLink)) {
                            while ((line = reader.readLine()) != null) {
                                pattern = Pattern.compile("<a href=\"([^\"\\s]*)\"");
                                matcher = pattern.matcher(line);
                                while (matcher.find()) {
                                    String newLink = matcher.group(1).trim(); // link
                                    // make sure the obtained link starts with http
                                    if (!newLink.startsWith("http")) {
                                        if (newLink.startsWith("/"))
                                            newLink = oldLinkHost + newLink;
                                        else
                                            newLink = oldLinkHost + "/" + newLink;
                                    }
                                    // remove a trailing / from the link
                                    if (newLink.endsWith("/"))
                                        newLink = newLink.substring(0, newLink.length() - 1);
                                    // deduplicate, and discard links to other sites
                                    if (!oldMap.containsKey(newLink)
                                            && !newMap.containsKey(newLink)
                                            && (checkUrl(newLink) || isMoviePage(newLink))) {
                                        System.out.println("temp: " + newLink);
                                        newMap.put(newLink, false);
                                    }
                                }
                            }
                        }
                        reader.close();
                        inputStream.close();
                    }
                    connection.disconnect();
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }
        // if there are new links, continue traversing
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            videoLinkMap.putAll(crawlLinks(oldLinkHost, oldMap)); // thanks to the Map, this will not produce duplicate key-value pairs
        }
        return videoLinkMap;
    }

    /**
     * Check whether a URL is a 2015 movie list page
     *
     * @param url the URL to check
     * @return status
     */
    public boolean checkUrl(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/list/-2015----p\\d*");
        Matcher matcher = pattern.matcher(url);
        return matcher.find(); // 2015 list pages
    }

    /**
     * Check whether a URL is a movie detail page
     *
     * @param url the page link
     * @return status
     */
    public boolean isMoviePage(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/\\d+");
        Matcher matcher = pattern.matcher(url);
        return matcher.find(); // movie page
    }
}
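The heart of the detail-page handling above is the thunder-link regular expression. To see it in isolation, here is a small sketch that runs the same pattern against a simplified, made-up HTML line (the real pages are longer, but the thunder link and the thunderrestitle attribute appear in this form):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates the crawler's thunder-link regex on a simplified, invented
// sample line; only the pattern itself is taken from the crawler code.
public class ThunderRegexDemo {
    public static String extractThunderLink(String line) {
        Pattern p = Pattern.compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null; // group 1 is the thunder:// URL itself
    }

    public static void main(String[] args) {
        String line = "<a href=\"thunder://QUFodHRwOi8vZXhhbXBsZQ==\" "
                + "thunderrestitle=\"Example.Movie.2015\">";
        System.out.println(extractThunderLink(line)); // thunder://QUFodHRwOi8vZXhhbXBsZQ==
        System.out.println(extractThunderLink("<a href=\"/movie/1173\">")); // null: not a download line
    }
}
```

The character classes [rR] and [tT] make the attribute-name match case-insensitive at those positions, since the pages are not consistent about capitalization.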
Note: If you want to crawl specific content from some other site, you need to modify these regular expressions sensibly according to the actual situation.
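As a concrete illustration of that kind of modification: switching from 2015 to another year only means editing the year embedded in the list-page pattern. The sketch below parameterizes it (this assumes the URL scheme for other years follows the same format as the 2015 one, which I have not verified):

```java
import java.util.regex.Pattern;

// Sketch: the hard-coded 2015 in checkUrl() becomes a parameter, so the same
// crawler can target another year's list pages with a one-line change.
public class YearListFilter {
    public static boolean isListPage(String url, int year) {
        Pattern p = Pattern.compile("http://www.80s.la/movie/list/-" + year + "----p\\d*");
        return p.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isListPage("http://www.80s.la/movie/list/-2015----p3", 2015)); // true
        System.out.println(isListPage("http://www.80s.la/movie/list/-2015----p3", 2016)); // false
    }
}
```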
3. Test results
[Screenshot "Test 1": http://s3.51cto.com/wyfs02/M00/78/DF/wKioL1aEk5LiQA3pAAFZpcGgbdU210.png]
[Screenshot "Test 2": http://s3.51cto.com/wyfs02/M01/78/DF/wKioL1aEk5Pz7FDYAAFQcrEAnqY514.png]
Attached: I have exported the data to a Word document; please download it if you need it. Baidu Cloud link: http://pan.baidu.com/s/1nuupHMx
This article is from the blog "zifangsky's personal blog"; please keep this source when reposting: http://983836259.blog.51cto.com/7311475/1730243
Java crawler in action (2): crawl all 2015 movie download links from a video site