Java crawler: crawling a video site's download links


Building on the previous article, this one raises the difficulty a step: it crawls the links of the target website, grabs the content we need from the target pages, and saves it in a database. The test case here uses a movie download website (http://www.80s.la/) that I use often. Originally I wanted to crawl the download links of all the movies on the site, but that would take too long, so I changed it to crawl only the download links of movies released in 2015.

1. Introduction to the Principle

The principle is much the same as in the first article. The difference is that this site has so many category and tag lists that, if we did not filter those links out, the crawl time would be unimaginably long.

Category links and tag links are therefore not needed, and we do not use them to crawl other pages; the other pages of the movie list are reached only through the pagination at the bottom of the all-movies list page. Likewise, for a movie details page we only grab the movie title and the Thunder (Xunlei) download link, and we do not crawl deeper into the recommended movies and other links found on that page. A short illustration follows.
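Concretely, only two URL shapes survive the filtering: the paginated 2015 list pages and the movie details pages. As a quick illustration of the two filter methods defined in the code below (the movie ID and the page number in these sample URLs are made up for the example):

    VideoLinkGrab grab = new VideoLinkGrab();
    grab.checkUrl("http://www.80s.la/movie/list/-2015----p3"); // true: a 2015 list page, follow it
    grab.isMoviePage("http://www.80s.la/movie/12345");         // true: a details page, take title + link
    grab.checkUrl("http://www.80s.la/movie/12345");            // false: not a list page, so not paginated further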

Finally, every download link found is stored in the videoLinkMap collection, and the data is saved to MySQL by traversing that collection at the end.

2. Code Implementation

The implementation principle has already been stated above, and the code carries detailed comments, so there is nothing more to add here. The code is as follows:

package action;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoLinkGrab {

    public static void main(String[] args) {
        VideoLinkGrab videoLinkGrab = new VideoLinkGrab();
        videoLinkGrab.saveData("http://www.80s.la/movie/list/-2015----p");
    }

    /**
     * Crawl starting from baseUrl and save the acquired data in the database.
     *
     * @param baseUrl the crawler's starting point
     */
    public void saveData(String baseUrl) {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>();      // link -> already traversed?
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>();  // movie title -> download link
        String oldLinkHost = ""; // host part, e.g. http://www.zifangsky.cn

        Pattern p = Pattern.compile("(https?://)?[^/\\s]*");
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }

        oldMap.put(baseUrl, false);
        videoLinkMap = crawlLinks(oldLinkHost, oldMap);

        // traverse the collection and save the data in the database
        try {
            Connection connection = JDBCDemo.getConnection();
            for (Map.Entry<String, String> mapping : videoLinkMap.entrySet()) {
                PreparedStatement pStatement = connection
                        .prepareStatement("insert into movie(moviename, movielink) values(?, ?)");
                pStatement.setString(1, mapping.getKey());
                pStatement.setString(2, mapping.getValue());
                pStatement.executeUpdate();
                pStatement.close();
                System.out.println(mapping.getKey() + " : " + mapping.getValue());
            }
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    /**
     * Crawl all reachable pages of the site breadth-first: keep issuing GET requests
     * for links that have not been traversed yet; when a full pass finds no new link,
     * the task is finished. Each movie details page is searched for the video link we
     * need, which is collected into videoLinkMap.
     *
     * @param oldLinkHost the domain, e.g. http://www.zifangsky.cn
     * @param oldMap      the collection of links to be traversed
     * @return the collection of all crawled video download links
     */
    private Map<String, String> crawlLinks(String oldLinkHost, Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>();      // newly discovered links
        Map<String, String> videoLinkMap = new LinkedHashMap<String, String>();  // video download links
        String oldLink = "";

        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            if (!mapping.getValue()) { // not traversed yet
                oldLink = mapping.getKey();
                try {
                    // issue a GET request
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2500);
                    connection.setReadTimeout(2500);

                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(
                                new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = null;
                        Matcher matcher = null;

                        if (isMoviePage(oldLink)) {
                            // movie details page: take the title and download link, do not crawl deeper
                            boolean checkTitle = false;
                            String title = "";
                            while ((line = reader.readLine()) != null) {
                                if (!checkTitle) { // take the video title from the page
                                    pattern = Pattern.compile("([^\\s]+).*?</title>");
                                    matcher = pattern.matcher(line);
                                    if (matcher.find()) {
                                        title = matcher.group(1);
                                        checkTitle = true;
                                        continue;
                                    }
                                }
                                // take the Thunder download link from the page
                                pattern = Pattern.compile("(thunder:[^\"]+).*thunder[rR]es[tT]itle=\"[^\"]*\"");
                                matcher = pattern.matcher(line);
                                if (matcher.find()) {
                                    videoLinkMap.put(title, matcher.group(1));
                                    System.out.println("video title: " + title
                                            + " ------ video link: " + matcher.group(1));
                                    break; // this page is done
                                }
                            }
                        } else if (checkUrl(oldLink)) {
                            // movie list page: collect further links
                            while ((line = reader.readLine()) != null) {
                                pattern = Pattern.compile("<a href=\"([^\"\\s]*)\"");
                                matcher = pattern.matcher(line);
                                while (matcher.find()) {
                                    String newLink = matcher.group(1).trim();
                                    // turn relative links into absolute ones
                                    if (!newLink.startsWith("http")) {
                                        if (newLink.startsWith("/"))
                                            newLink = oldLinkHost + newLink;
                                        else
                                            newLink = oldLinkHost + "/" + newLink;
                                    }
                                    // strip a trailing "/" from the link
                                    if (newLink.endsWith("/"))
                                        newLink = newLink.substring(0, newLink.length() - 1);
                                    // deduplicate, and discard links to other sites
                                    if (!oldMap.containsKey(newLink) && !newMap.containsKey(newLink)
                                            && (checkUrl(newLink) || isMoviePage(newLink))) {
                                        System.out.println("temp: " + newLink);
                                        newMap.put(newLink, false);
                                    }
                                }
                            }
                        }
                        reader.close();
                        inputStream.close();
                    }
                    connection.disconnect();
                } catch (IOException e) { // also covers MalformedURLException
                    e.printStackTrace();
                }

                try {
                    Thread.sleep(1000); // pause between requests
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }
        // if new links were found, keep traversing; thanks to the properties of Map,
        // putAll does not create duplicate key-value pairs
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            videoLinkMap.putAll(crawlLinks(oldLinkHost, oldMap));
        }
        return videoLinkMap;
    }

    /** Judge whether a URL is a 2015 movie list page. */
    public boolean checkUrl(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/list/-2015----p\\d*");
        return pattern.matcher(url).find();
    }

    /** Judge whether a URL is a movie details page. */
    public boolean isMoviePage(String url) {
        Pattern pattern = Pattern.compile("http://www.80s.la/movie/\\d+");
        return pattern.matcher(url).find();
    }
}
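One note on the listing: it calls JDBCDemo.getConnection(), a small helper class that the article does not show. Below is a minimal sketch of what such a helper might look like; the database name, credentials, and the movie table definition are assumptions inferred from the INSERT statement, not taken from the original:

    package action;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class JDBCDemo {
        // Assumed table matching the INSERT in saveData():
        //   CREATE TABLE movie (moviename VARCHAR(255), movielink VARCHAR(1024));
        private static final String URL =
                "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8";
        private static final String USER = "root";       // assumption
        private static final String PASSWORD = "123456"; // assumption

        public static Connection getConnection() throws SQLException {
            return DriverManager.getConnection(URL, USER, PASSWORD);
        }
    }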

Note: If you want to grab specific content from some other website, you need to adapt some of the regular expressions used here to the actual situation.
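For example, pointing the crawler at a different site would mean rewriting the two URL filters and the download-link pattern. The sketch below is purely illustrative: the site www.example.com, its URL layout, and the magnet-style link format are invented, not taken from any real site:

    // hypothetical list-page filter for an imaginary site
    public boolean checkUrl(String url) {
        return Pattern.compile("http://www.example.com/films/2015\\?page=\\d*").matcher(url).find();
    }

    // hypothetical details-page filter
    public boolean isMoviePage(String url) {
        return Pattern.compile("http://www.example.com/film/\\d+").matcher(url).find();
    }

    // and inside crawlLinks, a pattern matching that site's download anchors, e.g.:
    // pattern = Pattern.compile("(magnet:[^\"]+)\"");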

3. Test Results

That is the entire content of this article. I hope it is helpful for your study, and I also hope you will support the Yunqi Community.
