Background: My girlfriend is taking part in a Jianshu writing training camp and serves as class leader; every weekend she has to count the number of articles and the word counts each student wrote during the previous week. To make her happy, I promised to write a simple Jianshu crawler to gather these statistics. After all, this is my line of work, and anything a program can solve shouldn't be done as repetitive manual labor.
Below is how I went about writing this crawler.
First, preliminary analysis:
Let's analyze how we are going to crawl:
Step One:
We need to visit the user center of each learner, via a link like this: http://www.jianshu.com/u/eeb221fb7dac
These links have to be collected by my girlfriend, who gathers the user-center link of each student in her class. With these user-center links, our first step is to crawl each user center.
Step Two:
When crawling a user center, we need to extract some data. For example, since we are counting the previous week's data, we need the user name and the collection of article-detail links within a given time period.
How do we get this data? As shown in the figure: to get the user name, we need the text of $('.nickname'). To get the article-detail links for a time period, we traverse $('.time') and $('.title') inside $('.note-list li').
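As a rough illustration, here is a minimal cheerio sketch of pulling these two pieces of data out of a fetched user-center page. The selectors are the ones mentioned above; whether the markup really exposes the time via a data-shared-at attribute is an assumption about Jianshu's page, not a guarantee.

const cheerio = require('cheerio');

// Sketch only: `html` is the already-fetched user-center page.
const parseUserCenter = (html) => {
  const $ = cheerio.load(html);
  const nickname = $('.nickname').text();                  // the user name
  const articles = [];
  $('.note-list li').each((i, li) => {
    articles.push({
      time: $(li).find('.time').attr('data-shared-at'),    // publish time of the article
      href: $(li).find('.title').attr('href')              // relative link to the article detail page
    });
  });
  return { nickname: nickname, articles: articles };
};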
Step Three:
Using the article-detail links obtained in step two, crawl the content of each article detail page. An article link looks like this: http://www.jianshu.com/p/aeaa1f2a0196
On the article detail page, get the data we need, such as the title, word count, number of reads, and so on. As shown in the figure.
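For reference, a minimal sketch of what extracting these fields with cheerio could look like; the selectors are the same ones used in crawArticleDetail further down, and whether they still match Jianshu's current markup is an assumption.

const cheerio = require('cheerio');

const parseArticleDetail = (html) => {
  const $ = cheerio.load(html, { decodeEntities: false });
  return {
    title: $('.article .title').html(),              // article title
    wordage: $('.article .wordage').html(),          // word-count text
    publishTime: $('.article .publish-time').html(), // publish time
    author: $('.author .name a').html()              // author name
    // a reads count would need its own selector, not shown here
  };
};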
Step Four:
Generate an Excel table from the data you get, so your girlfriend will admire you even more.
Second, preparation:
Based on the analysis above, let's prepare the tools the crawler needs:
Cheerio: lets you manipulate the crawled pages as you would with jQuery
Superagent-charset: fixes encoding problems in crawled pages
Superagent: used to make the requests
Async: used for asynchronous flow and concurrency control
Ejsexcel: used to generate the Excel table
Express, Node >= 6.0.0
The specific usage of these modules can be found at https://www.npmjs.com; some of it is also covered below, and a sketch of how they are pulled in follows this list.
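Here is a sketch of how these modules are typically required at the top of the crawler file. The one non-obvious bit is that superagent-charset is usually applied as a patch on top of superagent so that .charset() becomes chainable; treat the exact wiring as my assumption and check the module's README.

const express = require('express');
const fs = require('fs');
const cheerio = require('cheerio');
const async = require('async');
const ejsExcel = require('ejsexcel');
const superagent = require('superagent');
require('superagent-charset')(superagent);   // patches superagent so .charset('utf-8') can be chained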
Third, writing the crawler
1. We put the Jianshu user-center links of all learners into the config file config.js and define the storage path for the generated Excel table:
const path = require('path');

module.exports = {
  excelFile: {
    path: path.join(__dirname, 'public/excel/')
  },
  data: [
    { name: "糕小糕", url: "http://www.jianshu.com/u/eeb221fb7dac" },
  ]
};
To protect other people's information, only one entry is listed here; you can go to Jianshu, find other user centers, and add more data.
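For completeness, a tiny sketch of how this config file is consumed by the rest of the crawler (the relative path is my assumption about the project layout):

const config = require('./config');

console.log(config.data[0].url);       // a student's user-center link
console.log(config.excelFile.path);    // folder the generated Excel file is written to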
2. We first define some global variables: the base link, the current concurrency count, and the collection of URLs that failed to crawl.
let _baseUrl = "http://www.jianshu.com",
    _currentCount = 0,
    _errorUrls = [];
3. Encapsulate some functions:
// Wrap the superagent request
const fetchUrl = (url, callback) => {
  let fetchStart = new Date().getTime();
  superagent
    .get(url)
    .charset('utf-8')
    .end((err, ssres) => {
      if (err) {
        _errorUrls.push(url);
        console.log('crawl ' + url + ' error');
        return false;
      }
      let spendTime = new Date().getTime() - fetchStart;
      console.log('crawl: ' + url + ' success, time taken: ' + spendTime + ' ms, current concurrency: ' + _currentCount);
      _currentCount--;
      callback(ssres.text);
    });
};

// Remove duplicates from the result array (keyed by title)
const removeSame = (arr) => {
  const newArr = [];
  const obj = {};
  arr.forEach((item) => {
    if (!obj[item.title]) {
      newArr.push(item);
      obj[item.title] = item.title;
    }
  });
  return newArr;
};
4. Start crawling the user centers to get the article-detail links for a given time period
// Crawl the user centers and collect article-detail links for a given time period
const crawlUserCenter = (res, startTime, endTime) => {  // startTime and endTime come from the front-end ajax request
  const centerUrlArr = config.data;
  async.mapLimit(centerUrlArr, 5, (elem, callback) => {
    _currentCount++;
    fetchUrl(elem.url, (html) => {
      const $ = cheerio.load(html);
      const detailUrlArr = getDetailUrlCollections($, startTime, endTime);
      callback(null, detailUrlArr);  // the callback is required
    });
  }, (err, detailUrlArr) => {
    // result after all concurrent tasks finish; still nested, [[abc,def],[hij,xxx]] => [abc,def,hij,xxx]
    _currentCount = 0;
    crawArticleDetail(detailUrlArr, res);
    return false;
  });
};
Here, the user-center links come from the config file. We use async.mapLimit to control the concurrency of our crawl; the maximum concurrency is 5.
async.mapLimit usage: mapLimit(arr, limit, iterator, callback); a minimal standalone sketch follows the list below.
arr: the array to iterate over
limit: the maximum number of concurrent tasks
iterator: the processing function; here it crawls a single user center. Its callback must be called so that its result gets stored.
callback: called after everything is done. The combined results of crawling all the user centers come back inside this callback.
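To make the flow clearer, here's a minimal, self-contained sketch of async.mapLimit that has nothing to do with Jianshu: at most two of the five tasks run at the same time, each task hands its result to its callback, and the final callback receives all results in the original order.

const async = require('async');

async.mapLimit([1, 2, 3, 4, 5], 2, (num, callback) => {
  // pretend this is a crawl; report the result through the callback
  setTimeout(() => {
    callback(null, num * 10);    // first argument: error (null here), second: result
  }, 100);
}, (err, results) => {
  console.log(results);          // [ 10, 20, 30, 40, 50 ]
});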
Get the collection of article-detail links for a time period from the user center:
// Get the collection of article links within a given time period
const getDetailUrlCollections = ($, startTime, endTime) => {
  let articleList = $('#list-container .note-list li'),
      detailUrlCollections = [];
  for (let i = 0, len = articleList.length; i < len; i++) {
    let createAt = articleList.eq(i).find('.author .time').attr('data-shared-at');
    let createTime = new Date(createAt).getTime();
    if (createTime >= startTime && createTime <= endTime) {
      let articleUrl = articleList.eq(i).find('.title').attr('href');
      let url = _baseUrl + articleUrl;
      detailUrlCollections.push(url);
    }
  }
  return detailUrlCollections;
};
5. From step 4 we get all the article-detail links, so now let's crawl the content of the article detail pages. The procedure is similar to step 4.
// Crawl the article details
const crawArticleDetail = (detailUrls, res) => {
  const detailUrlArr = spreadDetailUrl(detailUrls);
  async.mapLimit(detailUrlArr, 5, (url, callback) => {
    _currentCount++;
    fetchUrl(url, (html) => {
      const $ = cheerio.load(html, { decodeEntities: false });
      const data = {
        title: $('.article .title').html(),
        wordage: $('.article .wordage').html(),
        publishTime: $('.article .publish-time').html(),
        author: $('.author .name a').html()
      };
      callback(null, data);
    });
  }, (err, resData) => {
    let result = removeSame(resData);
    const sumUpData = sumUpResult(result);
    res.json({
      data: result,
      sumUpData: sumUpData
    });
    createExcel(result, sumUpData);
    console.info('Crawl complete, total articles crawled: ' + result.length + ', errors: ' + _errorUrls.length);
    if (_errorUrls.length > 0) {
      console.info('Failed URLs: ' + _errorUrls.join(','));
    }
    return false;
  });
};

// Flatten [[abc,def],[hij,xxx]] into [abc,def,hij,xxx]
const spreadDetailUrl = (urls) => {
  const urlCollections = [];
  urls.forEach((item) => {
    item.forEach((url) => {
      urlCollections.push(url);
    });
  });
  return urlCollections;
};
From the article details we get the title, word count, and publish time; of course, you can grab whatever other information you want from the page, but this is enough for me. The data obtained here may contain duplicates, so the array has to be de-duplicated.
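One thing the code above relies on but doesn't show is sumUpResult. Here is a hypothetical sketch of what such a summary helper might look like, assuming it groups the articles by author and totals the article count and word count per author; the wordage parsing assumes the field contains text like "字数 1234", which is my guess, not something confirmed by the original code.

// Hypothetical sumUpResult: per-author article count and total word count.
const sumUpResult = (result) => {
  const summary = {};
  result.forEach((item) => {
    // strip everything that isn't a digit from the wordage text, e.g. "字数 1234" -> 1234
    const words = parseInt(String(item.wordage).replace(/[^0-9]/g, ''), 10) || 0;
    if (!summary[item.author]) {
      summary[item.author] = { author: item.author, articleCount: 0, totalWords: 0 };
    }
    summary[item.author].articleCount++;
    summary[item.author].totalWords += words;
  });
  return Object.keys(summary).map((author) => summary[author]);
};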
6. At this point we have all the crawled data, so on to the last step: generating the Excel table.
To generate an Excel table with Node, I looked around and found that the ejsExcel module has fairly good reviews. See: https://www.npmjs.com/package/ejsexcel
It is used like this: we need an Excel template, and within the template we can use EJS syntax to lay out the table the way we want.
"/report.xlsx"); //数据源 const data = [ [{"table_name":"7班简书统计表","date": formatTime()}], dataArr, sumUpData ]; //用数据源(对象)data渲染Excel模板 ejsExcel.renderExcel(exlBuf, data) .then(function(exlBuf2) { fs.writeFileSync(config.excelFile.path + "/report2.xlsx", exlBuf2); console.log("生成excel表成功"); }).catch(function(err) { console.error(‘生成excel表失败‘); });}
First we read our template, then write our data into it and generate a new Excel file.
Excel template (figure)
7. Because the time range of the crawl is not fixed, we trigger the whole crawl from the front end via an Ajax request, passing the two values startTime and endTime. The front-end interface is fairly simple.
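A sketch of what that entry point could look like on the server side; the route path, port, and query-parameter names are assumptions for illustration, not the repo's actual API. The crawl answers the request through res.json inside crawArticleDetail, which is why crawlUserCenter receives res.

const express = require('express');
const app = express();

// hypothetical route: /crawl?startTime=...&endTime=...
app.get('/crawl', (req, res) => {
  const startTime = new Date(req.query.startTime).getTime();
  const endTime = new Date(req.query.endTime).getTime();
  crawlUserCenter(res, startTime, endTime);   // kicks off the whole crawl pipeline
});

app.listen(3000, () => {
  console.log('crawler server listening on port 3000');
});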
Front-end interface (figure)
Generated Excel table (figure)
The crawl process is as follows:
At this point, our Jianshu crawler is complete; it's fairly simple.
Points to note:
Because we only crawl the first page of data in each user center, it's best to choose a time range within roughly the last week. Also, when the Excel file is being generated, make sure the file is closed in Excel, otherwise the table will not be generated successfully.
The code is on GitHub, feedback welcome (I just hope Jianshu won't blacklist me): https://github.com/xianyulaodi/jianshu_spider