A powerful crawler based on Node.js that can directly publish the articles it captures

The source code of this crawler is released under the WTFPL license; more on that in the tips at the end.

I. Environment Configuration

1) A server. Any Linux server will work; I use CentOS 6.5.

2) Install a MySQL database, version 5.5 or 5.6. You can install it directly with an lnmp or lamp one-click package; the bundled web server also lets you view the crawler's log in a browser later on.

3) Install a Node.js environment. I used 0.12.7 and have not tried later versions.

4) Run npm -g install forever to install forever, so that the crawler can run in the background.

5) Clone all the code to the server (the whole repository, i.e. git clone).

6) Run npm install in the project directory to install the dependencies.

7) Create two empty folders, json and avatar, in the project directory.

8) Create an empty MySQL database and a user with full permissions on it. Then execute setup.sql and startusers.sql from the code to create the database structure and import the initial set of users.

9) Edit config.js. Items marked (required) must be filled in or modified; the remaining items can be left unchanged for now:

exports.jsonPath = "./json/";        // path where the generated JSON files are stored
exports.avatarPath = "./avatar/";    // path where avatar files are saved
exports.dbconfig = {
    host: 'localhost',        // database server (required)
    user: 'dbuser',           // database user name (required)
    password: 'dbpassword',   // database password (required)
    database: 'dbname',       // database name (required)
    port: 3306,               // database server port
    poolSize: 20,
    acquireTimeout: 30000
};
exports.urlpre = "http://www.jb51.net/";                            // site URL
exports.urlzhuanlanpre = "http://www.jb51.net/list/index_96.htm/";  // column list URL
exports.WPurl = "www.xxx.com";              // WordPress site address
exports.WPusername = "publishuser";         // user name used to publish articles
exports.WPpassword = "publishpassword";     // password of the publishing user
exports.WPurlavatarpre = "http://www.xxx.com/avatar/";  // URL prefix that replaces the original avatar addresses in published articles
exports.mailservice = "QQ";         // mail notification service type; Gmail also works if you can access it (required)
exports.mailuser = "12345@qq.com";  // mailbox user name (required)
exports.mailpass = "qqpassword";    // mailbox password (required)
exports.mailfrom = "12345@qq.com";  // sender address (required, usually the same as the mailbox above)
exports.mailto = "12345@qq.com";    // address that receives notification mails (required)

Save and go to the next step.
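
For orientation, here is a minimal sketch of how these settings could be consumed, assuming the mysql and nodemailer packages (the article does not say which libraries the project actually uses, so treat the calls below as illustrative only):

var mysql = require('mysql');            // assumed database driver
var nodemailer = require('nodemailer');  // assumed mail library for the notification mails
var config = require('./config');

// Database pool built from the dbconfig block above.
var pool = mysql.createPool({
    host: config.dbconfig.host,
    user: config.dbconfig.user,
    password: config.dbconfig.password,
    database: config.dbconfig.database,
    port: config.dbconfig.port,
    connectionLimit: config.dbconfig.poolSize,     // mysql's name for the pool size
    acquireTimeout: config.dbconfig.acquireTimeout
});

// Mail transport built from the mail* settings ("QQ" is a well-known service
// name in nodemailer; Gmail works the same way).
var transporter = nodemailer.createTransport({
    service: config.mailservice,
    auth: { user: config.mailuser, pass: config.mailpass }
});

transporter.sendMail({
    from: config.mailfrom,
    to: config.mailto,
    subject: 'zhihuspider: cookie expired',
    text: 'Please log in manually and update the cookie field in the cookies table.'
}, function (err) {
    if (err) console.error('mail error:', err);
});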

II. Crawler Users

The crawler works by simulating a real Zhihu user clicking around the site and collecting data, so it needs a real Zhihu account. You can use your own account for testing, but in the long run it is better to register a dedicated spare account; the crawler currently supports only one. Our simulation does not have to log in from the home page the way a real user does; it borrows the cookie value directly:

Register an account, activate it, and log in to your home page. Then open the Zhihu cookies in any browser using developer mode or a cookie plug-in. You will see a long, complicated string, but we only need one part of it: "z_c0". Copy the z_c0 portion of your cookie, keeping the equal sign, quotation marks, and semicolon. The final value looks roughly like this:

z_c0="LA8kJIJFdDSOA883wkUGJIRE8jVNKSOQfB9430=|1420113988|a6ea18bc1b23ea469e3b5fb2e33c2828439cb";

Insert a row into the cookies table of the MySQL database, with the following field values:

  • email: the crawler account's login email address
  • password: the crawler account's password
  • name: the crawler account's user name
  • hash: the crawler account's hash (a unique, unmodifiable identifier for each user; it is not used here and can be left blank for now)
  • cookie: the cookie you just copied

Then you can start the crawler. If the cookie becomes invalid or the user is blocked, simply update the cookie field of that record.
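
As an illustration, the record could also be inserted programmatically. The table and column names below follow the field list above and are assumptions; check setup.sql for the exact schema:

var mysql = require('mysql');
var config = require('./config');

var pool = mysql.createPool(config.dbconfig);

// Field names follow the list above (email, password, name, hash, cookie).
pool.query(
    'INSERT INTO cookies (email, password, name, hash, cookie) VALUES (?, ?, ?, ?, ?)',
    ['spider@example.com', 'spiderpassword', 'spider-account', '', 'z_c0="...";'],
    function (err) {
        if (err) console.error('insert failed:', err);
        else console.log('cookie row inserted');
        pool.end();
    }
);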

III. Running

We recommend running the crawler with forever: it not only handles background execution and logging, but also restarts the process automatically after a crash. Example:

forever -l /var/www/log.txt index.js

Here, the path after -l is where the log is written; if it is placed under the web server's directory, you can view the log directly in a browser at http://www.xxx.com/log.txt. Add parameters (separated by spaces) after index.js to run different crawler commands:
1. -i: execute immediately. Without this parameter, the crawler waits for the next scheduled time by default, e.g. once a day;
2. -ng: skip the new-user capture phase, i.e. getnewuser;
3. -ns: skip the snapshot phase, i.e. usersnapshot;
4. -nf: skip the data-file generation phase, i.e. saveviewfile;
5. -db: print debug logs.
The role of each stage is described in the next section. To make this easier to run, you can wrap the command in a shell script, for example:

#!/bin/bash
cd /usr/zhihuspider
rm -f /var/www/log.txt
forever -l /var/www/log.txt start index.js $*

Replace the paths with your own. After that you can start the crawler with ./zhihuspider.sh plus parameters: for example, ./zhihuspider.sh -i -ng -nf starts the task immediately, skips the new-user stage, and skips the file-generation stage. To stop the crawler, run forever stopall (or forever stop with the process serial number).
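
As a rough illustration of how index.js could read these switches (a sketch of the idea, not the project's actual parsing code):

// Minimal sketch of interpreting the command-line switches described above.
var args = process.argv.slice(2);

var options = {
    runNow: args.indexOf('-i') !== -1,        // execute immediately instead of waiting for the schedule
    skipNewUser: args.indexOf('-ng') !== -1,  // skip getnewuser
    skipSnapshot: args.indexOf('-ns') !== -1, // skip usersnapshot
    skipSaveFile: args.indexOf('-nf') !== -1, // skip saveviewfile
    debug: args.indexOf('-db') !== -1         // print debug logs
};

console.log('crawler options:', options);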

IV. Principle Overview

The crawler's entry file is index.js. It runs in a loop, executing the crawler tasks at a specified time every day. Three tasks are executed sequentially each day:

1) getnewuser.js: captures new user information by comparing the user data already in the database; with this mechanism, new users worth following are automatically added to the database;

2) usersnapshot.js: iterates over the users in the current database, captures their information and answers, and saves them as a daily snapshot;

3) saveviewfile.js: generates a user analysis list based on the latest snapshot, filters out the best answers of yesterday, of recent days, and of all time, and publishes them to the website.

After the above three tasks are completed, the main thread refreshes the Zhihu home page every few minutes to check whether the current cookie is still valid. If it is invalid (for example, the crawler lands on the logged-out page), a notification email is sent to the specified mailbox, prompting you to change the cookie in time. Changing the cookie works the same way as during initialization: log in manually once and copy out the cookie value. If you are interested in the concrete implementation, take a closer look at the comments in the code, adjust some configurations, or even try to refactor the whole crawler yourself.
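
To make the flow concrete, here is a simplified sketch of that daily cycle. The module names follow the files mentioned above, but the run(callback) interface and the five-minute interval are assumptions, not the project's actual API:

// Simplified sketch of index.js's daily cycle (illustrative only).
var getnewuser = require('./getnewuser');     // assumed to export run(callback)
var usersnapshot = require('./usersnapshot'); // assumed to export run(callback)
var saveviewfile = require('./saveviewfile'); // assumed to export run(callback)

function runDailyTasks(done) {
    // The three stages run strictly one after another.
    getnewuser.run(function () {
        usersnapshot.run(function () {
            saveviewfile.run(done);
        });
    });
}

function checkCookieLoop() {
    // After the daily tasks, the main thread periodically refreshes the
    // Zhihu home page; if the cookie is rejected, a notification mail is sent.
    setInterval(function () {
        /* request the home page and send mail on failure */
    }, 5 * 60 * 1000); // "every few minutes"; the exact interval is an assumption
}

runDailyTasks(checkCookieLoop);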

Tips

1) getnewuser works by comparing users' follower counts across two consecutive daily snapshots, so at least two snapshots must exist before it can do anything; until then it is skipped automatically even if executed.

2) A partially completed snapshot can be resumed. If the program crashes, stop it with forever stop, then restart it with the parameters -i -ng to execute immediately and skip the new-user stage; the crawler will continue the half-finished snapshot.

3) Do not casually increase the number of (pseudo) threads used during snapshot capture, i.e. the maxthreadcount property in usersnapshots. Too many threads will trigger 429 errors, and the large amount of captured data may not be written to the database in time, causing memory overflow. Therefore, unless your database runs on an SSD, do not exceed 10 threads.

4) saveviewfile needs at least seven days of snapshots to generate analysis results. If there are fewer than seven days of snapshot data, it reports an error and is skipped. Earlier analysis can be done by querying the database manually.

5) Since most people do not need to clone such a site, the entry point of the function that automatically publishes WordPress articles has been commented out. If you have set up WordPress, remember to enable XML-RPC, set up a user dedicated to publishing articles, configure the relevant parameters in config.js, and uncomment the relevant code in saveviewfile.

6) Because Zhihu applies anti-leech protection to avatars, we also download each avatar while capturing user information, save it locally, and use the local avatar address when publishing articles. You need to point a URL path on the HTTP server to the folder where the avatars are saved, or place that folder directly under the website directory.
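
A rough sketch of this avatar handling, assuming the request module as the HTTP client (the project's real implementation may differ):

var fs = require('fs');
var path = require('path');
var request = require('request');   // assumed HTTP client; swap in whatever the project uses
var config = require('./config');

// Download an avatar into config.avatarPath and return the URL that should be
// used in the published article instead of the protected Zhihu address.
function mirrorAvatar(avatarUrl, fileName, callback) {
    var localFile = path.join(config.avatarPath, fileName);
    request(avatarUrl)
        .pipe(fs.createWriteStream(localFile))
        .on('finish', function () {
            // e.g. http://www.xxx.com/avatar/<fileName>
            callback(null, config.WPurlavatarpre + fileName);
        })
        .on('error', callback);
}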

7) The code may not be easy to read. Apart from Node.js's messy callback structure, part of the reason is that I had just started learning Node.js when I wrote the program, and unfamiliarity led to structural confusion I never got around to fixing; the other part is that many ugly conditional checks and retry rules accumulated through repeated patching. If they were all removed, the code size might shrink by two thirds, but there is no way around it: they are needed to keep the system running stably.

8) The crawler's source code is released under the WTFPL license and imposes no restrictions on modification or redistribution.

That is all for this article. I hope it helps you.
