Node + Express crawler for Movie Paradise (a node.js crawler)

Source: Internet
Author: User
Tags: emit, error handling

Last week I wrote a beginner's node + express crawler. Today I continue the study and write a crawler 2.0.

This time we won't crawl the blog park again; let's try something new and crawl Movie Paradise, since every weekend I download a movie from Movie Paradise to watch.

Talk is cheap, show me the code!

Analysis of the pages to crawl

Our goals:

1. Crawl the Movie Paradise home page and grab the 169 latest-movie links in the left column.

2. Crawl the Thunder (Xunlei) download link for each of those 169 new movies, fetching the detail pages concurrently and asynchronously.

The specific analysis is as follows:

1. We don't need every Thunder link on the site, only the most recently released movies, such as those in the left column below. There are 170 links in total; excluding the first one (which points to a list of 200 movies), that leaves 169 movies.

2. Besides the first page, we also want to follow each movie link and crawl the Thunder download link on its detail page.

Environment setup

1. Prerequisites: a Node environment, Express, and Cheerio. These three were covered in the previous article, so I won't introduce them again: click to view.

2. New packages to install:

superagent

Purpose: similar to request; we can use it to issue GET/POST and other requests and to set request headers. Compared with the built-in http module, it is much simpler.

Usage:

var superagent = require('superagent');
superagent
  .get('/some-url')
  .end(function (err, res) {
    // do something
  });

superagent-charset

Purpose: solves the encoding problem. Movie Paradise pages are encoded in gb2312, so the Chinese text we crawl down would otherwise be garbled.

Usage:

var superagent = require('superagent');
var charset = require('superagent-charset');
charset(superagent);

superagent
  .get('/some-url')
  .charset('gb2312') // set the encoding here
  .end(function (err, res) {
    // do something
  });

async

Purpose: async is a flow-control toolkit that provides straightforward, powerful asynchronous helpers. Here it is used to handle concurrency.

Usage: what we need here is async.mapLimit(arr, limit, iterator, callback).

mapLimit keeps up to limit asynchronous operations in flight at once; as each one finishes, the next is launched, until the whole array has been processed.

arr is an array and limit is the concurrency: each element of arr is handed to iterator in turn, and the collected results are passed to the final callback.
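
Here is a minimal sketch of how mapLimit behaves; the URLs and the setTimeout stand-in for a real request are made up for illustration:

var async = require('async');

var urls = ['/a', '/b', '/c', '/d', '/e'];

// at most 2 of these iterators run at the same time
async.mapLimit(urls, 2, function (url, callback) {
  // setTimeout stands in for a real asynchronous request
  setTimeout(function () {
    callback(null, url + ' done');
  }, 100);
}, function (err, results) {
  // results has one entry per url, in the original order
  console.log(results);
});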

eventproxy

Purpose: eventproxy acts as a counter: it keeps track of whether your asynchronous operations have completed, and once they all have, it automatically calls the handler you provided, passing the crawled data along as arguments.

For example, I first crawl the sidebar links on the Movie Paradise front page, and only then crawl the content behind those links. For details, click here.

Usage:

var fs = require('fs');
var eventproxy = require('eventproxy');

var ep = new eventproxy();
ep.after('got_file', files.length, function (list) {
  // executed after all the files have been read asynchronously;
  // the contents of all the files are in the list array
});
for (var i = 0; i < files.length; i++) {
  fs.readFile(files[i], 'utf-8', function (err, content) {
    // trigger the result event
    ep.emit('got_file', content);
  });
}
// note: the two 'got_file' event names must match

Start the crawler

The main program lives in app.js, so when reading the code you can focus mainly on app.js.

1. First define some global variables and require the libraries we need:

var cheerio = require('cheerio');            // lets us manipulate the page like jQuery
var charset = require('superagent-charset'); // solves the garbled-encoding problem
var superagent = require('superagent');      // issues requests
charset(superagent);
var async = require('async');                // asynchronous crawling
var express = require('express');
var eventproxy = require('eventproxy');      // flow control
var ep = new eventproxy();
var app = express();

var baseUrl = 'http://www.dytt8.net';        // Movie Paradise home page link
var newMovieLinkArr = [];                    // stores the URLs of the new movies
var errLength = [];                          // counts the links that errored
var highScoreMovieArr = [];                  // high-score movies

2. Start by crawling the Movie Paradise home page:

// first crawl the home page
(function (page) {
  superagent
    .get(page)
    .charset('gb2312')
    .end(function (err, sres) {
      // general error handling
      if (err) {
        console.log('error crawling ' + page);
        return next(err);
      }
      var $ = cheerio.load(sres.text);
      // 170 movie links; note the two handlers below
      getAllMovieLink($);
      highScoreMovie($);
      /*
       * flow-control statement:
       * once the links in the left column of the first page have been crawled,
       * we start crawling the detail pages
       */
      ep.emit('get_topic_html', 'get ' + page + ' successful');
    });
})(baseUrl);

Here we first crawl the first page and hand the crawled content to the two functions getAllMovieLink and highScoreMovie.

getAllMovieLink grabs the 169 movie links in the left column, excluding the first one.

highScoreMovie handles that first link in the left column, which points to the movies with relatively high scores.

In the code above we have a counter; when it completes, the flow registered under the 'get_topic_html' name is executed, which guarantees that crawling the first page has finished before the second stage begins.

ep.emit('get_topic_html', 'get ' + page + ' successful');

The highScoreMovie method is shown below. It isn't really essential here; I just counted the information on the high-score page and was too lazy to crawl any further.

// more than 200 movies score above 8! this is just a statistic, we don't crawl further
function highScoreMovie($) {
  var url = 'http://www.dytt8.net' + $('.co_content2 ul a').eq(0).attr('href');
  console.log(url);
  superagent
    .get(url)
    .charset('gb2312')
    .end(function (err, sres) {
      // general error handling
      if (err) {
        console.log('error crawling ' + url);
      }
      var $ = cheerio.load(sres.text);
      var elemP = $('#Zoom p');
      var elemA = $('#Zoom a');
      for (var k = 1; k < elemP.length; k++) {
        var hurl = elemP.eq(k).find('a').text();
        if (highScoreMovieArr.indexOf(hurl) === -1) {
          highScoreMovieArr.push(hurl);
        }
      }
    });
}

3. Extract the information in the left column.

As the figure below shows, on the home page all the detail-page links live under $('.co_content2 ul a').

So we iterate over the detail-page links in the left column and save them into the newMovieLinkArr array.

The getAllMovieLink method is as follows:

// get all the links in the left column of the first page
function getAllMovieLink($) {
  var linkElem = $('.co_content2 ul a');
  for (var i = 1; i < 170; i++) {
    var url = 'http://www.dytt8.net' + linkElem.eq(i).attr('href');
    // note: deduplicate the URLs
    if (newMovieLinkArr.indexOf(url) === -1) {
      newMovieLinkArr.push(url);
    }
  }
}

4. Crawl each movie detail page and extract the useful information, such as the movie download link, which is what we actually care about.

// ep listens for the 'get_topic_html' event and runs the handler once it has fired
ep.after('get_topic_html', 1, function (eps) {
  var concurrencyCount = 0;
  var num = -4; // since the concurrency is 5, subtract 4

  // use the callback to return each result; the final callback receives the whole result array
  var fetchUrl = function (myurl, callback) {
    var fetchStart = new Date().getTime();
    concurrencyCount++;
    num += 1;
    console.log('current concurrency:', concurrencyCount, ', now crawling', myurl);
    superagent
      .get(myurl)
      .charset('gb2312') // solve the encoding problem
      .end(function (err, ssres) {
        if (err) {
          callback(err, myurl + ' error happened!');
          errLength.push(myurl);
          return next(err);
        }
        var time = new Date().getTime() - fetchStart;
        console.log('crawl ' + myurl + ' successful, took ' + time + ' milliseconds');

        concurrencyCount--;

        var $ = cheerio.load(ssres.text);
        // hand the crawled page to the processing function
        // (res and next come from the enclosing Express route handler in app.js)
        getDownloadLink($, function (obj) {
          res.write('<br/>');
          res.write(num + ', movie name --> ' + obj.movieName);
          res.write('<br/>');
          res.write('Thunder download link --> ' + obj.downLink);
          res.write('<br/>');
          res.write("detail link --> <a href='" + myurl + "' target='_blank'>" + myurl + '</a>');
          res.write('<br/>');
          res.write('<br/>');
        });
        var result = { movieLink: myurl };
        callback(null, result);
      });
  };

  // limit the maximum concurrency to 5; the final callback receives the whole result array
  // mapLimit(arr, limit, iterator, [callback])
  async.mapLimit(newMovieLinkArr, 5, function (myurl, callback) {
    fetchUrl(myurl, callback);
  }, function (err, result) {
    // callback after the crawl finishes; we can print some statistics here
    console.log('crawl finished, fetched --> ' + newMovieLinkArr.length + ' items');
    console.log('errors --> ' + errLength.length + ' items');
    console.log('high-score movies: ==> ' + highScoreMovieArr.length);
    return false;
  });
});

First, async.mapLimit crawls all the detail pages with a concurrency of 5. Crawling a detail page works the same way as crawling the home page, so I won't go into much detail here; the useful information is then printed to the page.

5. After running the program, the result looks like the figure below:

Browser interface:

With that, the slightly upgraded version of our crawler is complete. The article may not explain everything clearly, but I have uploaded the code to GitHub, and running it yourself should make things easier to understand. If I have time later I may build another upgraded version, for example storing the crawled information in MongoDB and displaying it on another page, and adding a timer to the crawler so that it crawls on a schedule.
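
As a rough sketch of the timer idea (my assumption, not code from the repo; crawlHomePage is a hypothetical wrapper around the self-invoking function from step 2):

var ONE_DAY = 24 * 60 * 60 * 1000;

setInterval(function () {
  // reset the global state between runs
  newMovieLinkArr = [];
  errLength = [];
  crawlHomePage(baseUrl); // hypothetical wrapper around the home-page crawl in step 2
}, ONE_DAY);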

Note: if the Chinese text comes out garbled in the browser, you can set Chrome's encoding to UTF-8 to fix it.
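
A server-side alternative (my suggestion, not part of the original code) is to declare the charset explicitly so the browser doesn't have to guess:

app.get('/', function (req, res) {
  res.set('Content-Type', 'text/html; charset=utf-8');
  // ... the res.write(...) calls from step 4 go here ...
  res.end();
});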

Code address: https://github.com/xianyulaodi/mySpider2

If anything here is wrong, please point it out.
