Node.js Jianshu crawler


Background: My girlfriend is taking part in a Jianshu writing training camp and serves as the class monitor. Every weekend she has to tally the number of articles and the word counts each student wrote during the previous week. To keep her happy, I promised to write a simple Jianshu crawler to collect the statistics. After all, this is my line of work, and anything that can be solved by a program should never be done as repetitive manual labor.
Below is how I went about writing this crawler.

First, preliminary analysis:

Let's first analyze how the crawl will work:

Step One:

We need to visit each student's Jianshu user center page, which looks like this: http://www.jianshu.com/u/eeb221fb7dac

These user-center links have to be collected by my girlfriend from every student in her class. With the links in hand, our first step is to crawl each user center page.

Step Two:

When crawling a user center we need to collect some data. For example, since we are counting last week's numbers, we need the username and the collection of article-detail links that fall within a given time period.

How do we get this data? For the username, we take the text of $('.nickname'). For the article-detail links within a time period, we traverse each item in $('.note-list li') and read its $('.time') and $('.title').
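As a small illustration of those selectors, here is a minimal sketch with cheerio. The HTML fragment is a made-up stand-in for a fetched user-center page, and the class names are assumptions based on Jianshu's markup at the time of writing:

// Minimal cheerio sketch; the HTML below is illustrative, not a real page.
const cheerio = require('cheerio');

const html = `
  <a class="nickname">糕小糕</a>
  <ul class="note-list">
    <li>
      <span class="time" data-shared-at="2017-08-20T10:00:00+08:00"></span>
      <a class="title" href="/p/aeaa1f2a0196">Some article</a>
    </li>
  </ul>`;

const $ = cheerio.load(html);
console.log($('.nickname').text());                        // the username
$('.note-list li').each((i, li) => {
  console.log($(li).find('.time').attr('data-shared-at'),  // publish time
              $(li).find('.title').attr('href'));          // relative article-detail link
});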

Step Three:

Using the article-detail links obtained in step two, crawl the content of each article detail page. An article link looks like this: http://www.jianshu.com/p/aeaa1f2a0196

On the article detail page, grab the data we need, such as the title, word count, and number of reads.

Step Four:

Generate an Excel sheet from the collected data, so the girlfriend will admire you even more.

Second, preparation:

With the analysis above in mind, let's prepare the tools the crawler needs:

Cheerio: lets you manipulate the crawled pages the way you would with jQuery

Superagent-charset: fixes character-encoding problems in the crawled pages

Superagent: used to issue the HTTP requests

Async: for asynchronous flow and concurrency control

Ejsexcel: used to generate the Excel sheet

Express, Node >= 6.0.0

The specific usage of these modules can be found at https://www.npmjs.com ; some of it is also covered below.
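As a rough sketch of how these modules might be wired together at the top of the crawler (the variable names and the port are mine, and superagent-charset is used in what I understand to be its documented pattern of patching superagent so that .charset() becomes available):

// Setup sketch; names and port are illustrative, not the original author's code.
const cheerio = require('cheerio');
const superagent = require('superagent');
require('superagent-charset')(superagent); // patch superagent so .charset('utf-8') works
const async = require('async');
const ejsExcel = require('ejsexcel');
const express = require('express');
const fs = require('fs');

const app = express();
app.listen(3000, () => console.log('crawler server listening on port 3000'));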

Third, start the crawler

1. We put the Jianshu user-center links of all the students into the config file config.js, and define the storage path of the generated Excel file:

const path = require('path');

module.exports = {
  excelFile: {
    path: path.join(__dirname, 'public/excel/')
  },
  data: [
    { name: "糕小糕", url: "http://www.jianshu.com/u/eeb221fb7dac" },
  ]
};

To protect other people's privacy, only one entry is listed here; you can find other user centers on Jianshu and add more data yourself.

2. We first define a few global variables: the base URL, the current concurrency count, and a collection of URLs that failed to crawl.

"http://www.jianshu.com",    _currentCount = 0,    _errorUrls = [];

3. Encapsulate a few helper functions:

// Encapsulate superagent into a fetch function
const fetchUrl = (url, callback) => {
  let fetchStart = new Date().getTime();
  superagent
    .get(url)
    .charset('utf-8')
    .end((err, ssres) => {
      if (err) {
        _errorUrls.push(url);
        console.log('crawl ' + url + ' error');
        return false;
      }
      let spendTime = new Date().getTime() - fetchStart;
      console.log('crawl: ' + url + ' succeeded, took: ' + spendTime + ' ms, current concurrency: ' + _currentCount);
      _currentCount--;
      callback(ssres.text);
    });
};

// Deduplicate the result array by article title
const removeSame = (arr) => {
  const newArr = [];
  const obj = {};
  arr.forEach((item) => {
    if (!obj[item.title]) {
      newArr.push(item);
      obj[item.title] = item.title;
    }
  });
  return newArr;
};

4. Start by crawling the user centers to collect the article-detail links within the given time period:

// Crawl the user centers and collect article-detail links for a given time period
const crawlUserCenter = (res, startTime, endTime) => {  // startTime and endTime come from the front-end ajax request
  const centerUrlArr = config.data;
  async.mapLimit(centerUrlArr, 5, (elem, callback) => {
    _currentCount++;
    fetchUrl(elem.url, (html) => {
      const $ = cheerio.load(html);
      const detailUrlArr = getDetailUrlCollections($, startTime, endTime);
      callback(null, detailUrlArr);  // the callback is required
    });
  }, (err, detailUrlArr) => { // result once all concurrent tasks finish; needs flattening: [[abc,def],[hij,xxx]] => [abc,def,hij,xxx]
    _currentCount = 0;
    crawArticleDetail(detailUrlArr, res);
    return false;
  });
};

Here the user-center links come from the config file, and async.mapLimit is used to control the concurrency of the crawl; the maximum concurrency is 5.

async.mapLimit usage: mapLimit(arr, limit, iterator, callback);
arr: the array to iterate over
limit: the maximum number of concurrent tasks
iterator: the worker function; here it crawls one user center. Its callback must be called so that the result is stored.
callback: called once everything completes; the combined results of all the user-center crawls come back in this callback (see the small example below).
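To make the pattern concrete, here is a tiny standalone async.mapLimit example, unrelated to the crawler itself; the values are made up:

const async = require('async');

// Process at most 2 items at a time; each iterator call must invoke its callback
// with (err, result), and the final callback receives all results in input order.
async.mapLimit([1, 2, 3, 4, 5], 2, (num, callback) => {
  setTimeout(() => callback(null, num * 10), 100); // simulate an async request
}, (err, results) => {
  console.log(results); // [10, 20, 30, 40, 50]
});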

Getting the collection of article-detail links within a time period from a user center page:

// Get the collection of article links within a time period
const getDetailUrlCollections = ($, startTime, endTime) => {
  let articleList = $('#list-container .note-list li'),
      detailUrlCollections = [];
  for (let i = 0, len = articleList.length; i < len; i++) {
    let createdAt = articleList.eq(i).find('.author .time').attr('data-shared-at');
    let createTime = new Date(createdAt).getTime();
    if (createTime >= startTime && createTime <= endTime) {
      let articleUrl = articleList.eq(i).find('.title').attr('href');
      let url = _baseUrl + articleUrl;
      detailUrlCollections.push(url);
    }
  }
  return detailUrlCollections;
};

5. Step 4 gives us all of the article-detail links, so now let's crawl the content of each article detail page. The procedure is similar to step 4:

// Crawl the article detail pages
const crawArticleDetail = (detailUrls, res) => {
  const detailUrlArr = spreadDetailUrl(detailUrls);
  async.mapLimit(detailUrlArr, 5, (url, callback) => {
    _currentCount++;
    fetchUrl(url, (html) => {
      const $ = cheerio.load(html, { decodeEntities: false });
      const data = {
        title: $('.article .title').html(),
        wordage: $('.article .wordage').html(),
        publishTime: $('.article .publish-time').html(),
        author: $('.author .name a').html()
      };
      callback(null, data);
    });
  }, (err, resData) => {
    let result = removeSame(resData);
    const sumUpData = sumUpResult(result);
    res.json({
      data: result,
      sumUpData: sumUpData
    });
    createExcel(result, sumUpData);
    console.info('Crawl complete, ' + result.length + ' articles crawled in total; error count: ' + _errorUrls.length);
    if (_errorUrls.length > 0) {
      console.info('Failed URLs: ' + _errorUrls.join(','));
    }
    return false;
  });
};

// Flatten the nested array: [[abc,def],[hij,xxx]] => [abc,def,hij,xxx]
const spreadDetailUrl = (urls) => {
  const urlCollections = [];
  urls.forEach((item) => {
    item.forEach((url) => {
      urlCollections.push(url);
    });
  });
  return urlCollections;
};

From the article detail page we grab the title, word count, and publish time; of course you can grab whatever other information the page has, but this is enough for my purpose. The data obtained here contains duplicates, so the array has to be deduplicated.
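Note that the code above also calls a sumUpResult helper that is not shown in this write-up. Purely as a guess at what it does, a sketch that totals articles and words per author might look like this; the wordage parsing is my own assumption, since wordage is scraped as HTML text:

// Hypothetical sketch of sumUpResult: aggregate article count and word count per author.
// The original helper is not shown in the article, so this is only an assumption.
const sumUpResult = (result) => {
  const summary = {};
  result.forEach((item) => {
    const author = item.author;
    // wordage comes back as HTML text, so keep only the digits
    const words = parseInt(String(item.wordage).replace(/\D/g, ''), 10) || 0;
    if (!summary[author]) {
      summary[author] = { author: author, articleCount: 0, totalWords: 0 };
    }
    summary[author].articleCount++;
    summary[author].totalWords += words;
  });
  return Object.keys(summary).map((key) => summary[key]);
};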

6. At this point we have all the data we need; the last step is to generate the Excel sheet.

To generate an Excel sheet with Node, I looked around and found that ejsexcel is well reviewed. See: https://www.npmjs.com/package/ejsexcel

It works like this: we need an Excel template, and inside the template we can use EJS syntax to have the table generated the way we want.

"/report.xlsx");  //数据源  const data = [ [{"table_name":"7班简书统计表","date": formatTime()}], dataArr, sumUpData ];  //用数据源(对象)data渲染Excel模板  ejsExcel.renderExcel(exlBuf, data)  .then(function(exlBuf2) {      fs.writeFileSync(config.excelFile.path + "/report2.xlsx", exlBuf2);      console.log("生成excel表成功");  }).catch(function(err) {      console.error(‘生成excel表失败‘);  });}

First read the template, then render our data into it and write out a new Excel file.

Excel templates

7. Because the time range of the crawl is not fixed, the whole crawl is triggered by a front-end ajax request, which passes the two values startTime and endTime. The front-end interface is fairly simple.
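As a rough sketch of that entry point (the route path and query handling are my own assumptions, not the original code), an Express route could look like this, reusing the app from the setup sketch earlier and assuming the front end sends start and end dates as strings:

// Hypothetical route: the front end requests /crawl?startTime=2017-08-14&endTime=2017-08-20
// and crawlUserCenter eventually answers via res.json() once the crawl finishes.
app.get('/crawl', (req, res) => {
  const startTime = new Date(req.query.startTime).getTime();
  const endTime = new Date(req.query.endTime).getTime();
  crawlUserCenter(res, startTime, endTime);
});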

Front-End Interface

Generated Excel table diagram:


The crawl process is as follows:

At this point our Jianshu crawler is complete. It is fairly simple.

Things to note:

Because only the first page of a user center can be crawled, it is best to choose a time range within the last week. Also, make sure the target Excel file is closed while the sheet is being generated, otherwise generation will fail.

The code is on GitHub; you're welcome to take a look (I hope Jianshu won't blacklist me): https://github.com/xianyulaodi/jianshu_spider

The crawler's name: Bumblebee.

Http://www.aibbt.com/a/19374.html
