This article raises the difficulty one step further: we crawl the links of a target website, grab the content we need from each target page, and save it in a database. The test case here is a movie download website I use often (http://www.80s.la/). I originally wanted to crawl the download links of every movie on the site, but that would have taken far too long, so I changed the goal to crawling the download links of the movies from 2015.
1. Principle
The principle is much the same as in the first article. The difference is that this site has a very large number of category and tag list pages, and if we did not filter those out, the crawl would take an unimaginably long time.
Category links and tag links are therefore not needed: we do not follow them to reach other pages, and instead get further movie list pages only through the pagination at the bottom of the all-movies list page. Likewise, on a movie detail page we only grab the movie title and the Thunder (thunder://) download link; we do not crawl any deeper, even though the detail page contains links to recommended movies and other pages.
Finally, all download links are collected into the videoLinkMap collection, and the data is saved to MySQL by traversing that collection.
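To make the extraction step concrete, here is a small, self-contained sketch of how a Thunder link can be pulled out of a single line of HTML. The sample line is invented for the example (the real page markup may differ), but the regular expression is the same one used in the full listing in the next section:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the detail-page extraction idea; the HTML line is made up for the example.
public class ThunderLinkDemo {
    public static void main(String[] args) {
        String line = "<a href=\"thunder://QUFodHRwOi8vZXhhbXBsZS5jb20vbW92aWUubWt2Wlo=\""
                + " thunderrestitle=\"Some.Movie.2015.mkv\">thunder download</a>";
        // Capture the thunder:// URI on the line that also carries a thunder(r)es(t)itle attribute.
        Pattern pattern = Pattern.compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            System.out.println("download link: " + matcher.group(1));
        }
    }
}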
2. Code implementation
The idea of the implementation has already been explained above, and the code contains detailed comments, so there is not much more to say here. The code is as follows:
package action;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkGrab {
    public static void main(String[] args) {
        VideoLinkGrab videoLinkGrab = new VideoLinkGrab();
        videoLinkGrab.saveData("http://www.80s.la/movie/list/-2015----p");
    }

    /** Save the acquired data in the database; baseUrl is the URL the crawler starts from. */
    public void saveData(String baseUrl) {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>();     // link -> already traversed?
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // video name -> download link
        String oldLinkHost = "";                                                // host, e.g. http://www.zifangsky.cn
        Pattern p = Pattern.compile("(https?://)?[^/\\s]*");
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }
        oldMap.put(baseUrl, false);
        videoLinkMap = crawlLinks(oldLinkHost, oldMap);
        // traverse the collection and save the data in the database
        try {
            Connection connection = JdbcDemo.getConnection();
            for (Map.Entry<String, String> mapping : videoLinkMap.entrySet()) {
                PreparedStatement pStatement = connection.prepareStatement("insert into movie(MovieName,MovieLink) values(?,?)");
                pStatement.setString(1, mapping.getKey());
                pStatement.setString(2, mapping.getValue());
                pStatement.executeUpdate();
                pStatement.close();
                System.out.println(mapping.getKey() + " : " + mapping.getValue());
            }
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    /**
     * Crawl all reachable links of the site breadth-first: keep issuing GET requests for links
     * that have not been traversed yet; when a full pass over the collection finds no new link,
     * the task is over. Each fetched page is searched for the video links we need, which are
     * collected into videoLinkMap. oldLinkHost is the domain (e.g. http://www.zifangsky.cn),
     * oldMap the collection of links to be traversed; returns all video download links found.
     */
    private Map<String, String> crawlLinks(String oldLinkHost, Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>();     // newly discovered links
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>(); // video download links
        String oldLink = "";
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            // System.out.println("link: " + mapping.getKey() + " -------- check: " + mapping.getValue());
            if (!mapping.getValue()) {      // not traversed yet
                oldLink = mapping.getKey();
                try {                       // issue a GET request
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2500);
                    connection.setReadTimeout(2500);
                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = null;
                        Matcher matcher = null;
                        if (isMoviePage(oldLink)) {
                            // movie detail page: take out the video download link, do not crawl deeper
                            boolean checkTitle = false;
                            String title = "";
                            while ((line = reader.readLine()) != null) {
                                if (!checkTitle) {  // take the video title from the page
                                    pattern = Pattern.compile("([^\\s]+) .*?</title>");
                                    matcher = pattern.matcher(line);
                                    if (matcher.find()) {
                                        title = matcher.group(1);
                                        checkTitle = true;
                                        continue;
                                    }
                                }
                                // take the video download link from the page
                                pattern = Pattern.compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
                                matcher = pattern.matcher(line);
                                if (matcher.find()) {
                                    videoLinkMap.put(title, matcher.group(1));
                                    System.out.println("video name: " + title + " ------ video link: " + matcher.group(1));
                                    break;  // this page is done
                                }
                            }
                        } else if (checkUrl(oldLink)) {
                            // movie list page: collect further links
                            while ((line = reader.readLine()) != null) {
                                pattern = Pattern.compile("<a href=\"([^\"\\s]*)\"");
                                matcher = pattern.matcher(line);
                                while (matcher.find()) {
                                    String newLink = matcher.group(1).trim();
                                    if (!newLink.startsWith("http")) {  // turn relative links into absolute ones
                                        if (newLink.startsWith("/"))
                                            newLink = oldLinkHost + newLink;
                                        else
                                            newLink = oldLinkHost + "/" + newLink;
                                    }
                                    if (newLink.endsWith("/"))          // strip a trailing /
                                        newLink = newLink.substring(0, newLink.length() - 1);
                                    // deduplicate and discard links that point to other sites
                                    if (!oldMap.containsKey(newLink) && !newMap.containsKey(newLink)
                                            && (checkUrl(newLink) || isMoviePage(newLink))) {
                                        System.out.println("temp: " + newLink);
                                        newMap.put(newLink, false);
                                    }
                                }
                            }
                        }
                        reader.close();
                        inputStream.close();
                    }
                    connection.disconnect();
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }
        // if new links were found, keep traversing; Map semantics prevent duplicate key-value pairs
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            videoLinkMap.putAll(crawlLinks(oldLinkHost, oldMap));
        }
        return videoLinkMap;
    }

    /** Check whether the URL is a 2015 movie list page */
    public boolean checkUrl(String url) {
        return Pattern.compile("http://www.80s.la/movie/list/-2015----p\\d*").matcher(url).find();
    }

    /** Check whether the URL is a movie detail page */
    public boolean isMoviePage(String url) {
        return Pattern.compile("http://www.80s.la/movie/\\d+").matcher(url).find();
    }
}
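The listing above relies on a helper class JdbcDemo that is not shown in this article. Below is a minimal sketch of what it might look like, assuming a local MySQL instance, the MySQL Connector/J driver on the classpath, and a movie table with MovieName and MovieLink columns; the database name, user and password are assumptions and should be adjusted to your own environment.

package action;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Minimal sketch of the JdbcDemo helper used above; all connection settings are assumptions.
public class JdbcDemo {
    private static final String URL = "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8";
    private static final String USER = "root";
    private static final String PASSWORD = "root";

    public static Connection getConnection() throws SQLException {
        // The table the crawler writes into could be created like this:
        //   CREATE TABLE movie (
        //     Id INT PRIMARY KEY AUTO_INCREMENT,
        //     MovieName VARCHAR(255),
        //     MovieLink VARCHAR(1000)
        //   );
        return DriverManager.getConnection(URL, USER, PASSWORD);
    }
}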
Note: if you want to grab specific content from some other website, you need to adapt the regular expressions above to the actual structure of that site.
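For example, if another site paginated its lists at a different URL and published magnet links instead of Thunder links, the page checks and the link extraction could be changed along these lines. Everything here is purely illustrative: the example.com URL scheme and the HTML line are invented, not taken from any real site.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Purely illustrative adaptation for a hypothetical site with /films/2015?page=N lists and magnet links.
public class OtherSitePatterns {
    static boolean isListPage(String url) {
        return Pattern.compile("http://www\\.example\\.com/films/2015\\?page=\\d*").matcher(url).find();
    }

    static boolean isDetailPage(String url) {
        return Pattern.compile("http://www\\.example\\.com/film/\\d+").matcher(url).find();
    }

    public static void main(String[] args) {
        String line = "<a class=\"download\" href=\"magnet:?xt=urn:btih:0123456789abcdef\">magnet</a>";
        Matcher m = Pattern.compile("href=\"(magnet:\\?xt=[^\"]+)\"").matcher(line);
        if (m.find()) {
            System.out.println("download link: " + m.group(1));
        }
        System.out.println(isListPage("http://www.example.com/films/2015?page=3"));   // true
        System.out.println(isDetailPage("http://www.example.com/film/12345"));        // true
    }
}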
3. Test results
That is the entire content of this article. I hope it helps with your learning, and I also hope you will keep supporting the Cloud Habitat Community.