A powerful Node.js-based crawler that can publish crawled articles directly


A powerful Node.js-based crawler that can publish crawled articles directly. The source code is released under the WTFPL license, so anyone who is interested is free to use it as a reference.

I. Environment configuration

1) Get a server. Any Linux distribution will do; I use CentOS 6.5.

2) Install a MySQL database, version 5.5 or 5.6. To save trouble you can install it as part of an LNMP or LAMP stack, which also lets you view the log directly in a browser later.

3) Set up a Node.js environment. I use 0.12.7; I have not tried newer versions.

4) Run npm install -g forever to install forever, so that the crawler can run in the background.

5) Get all the code onto the server (i.e. git clone the whole repository).

6) Run npm install in the project directory to install the dependencies.

7) Create two empty folders, json and avatar, in the project directory.

8) Create an empty MySQL database and a user with full privileges on it, then run setup.sql and startusers.sql from the code to create the database structure and import the initial seed users.

9) Edit config.js. The items marked (required) must be filled in or modified; the remaining items can be left unchanged for now:

exports.jsonpath = "./json/"; // path where the JSON files are generated
exports.avatarpath = "./avatar/"; // path where the avatar files are saved
exports.dbconfig = {
  host: 'localhost', // database server (required)
  user: 'dbuser', // database user name (required)
  password: 'dbpassword', // database password (required)
  database: 'dbname', // database name (required)
  port: 3306, // database server port
  poolsize: 20,
  acquiretimeout: 30000
};
exports.urlpre = "http://www.jb51.net/"; // script URL
exports.urlzhuanlanpre = "http://www.jb51.net/list/index_96.htm/"; // script URL
exports.wpurl = "www.xxx.com"; // address of the WordPress site where articles are published
exports.wpusername = "publishuser"; // user name that publishes the articles
exports.wppassword = "publishpassword"; // password of the publishing user
exports.wpurlavatarpre = "http://www.xxx.com/avatar/"; // URL prefix that replaces the original avatar addresses in published articles
exports.mailservice = "QQ"; // e-mail notification service type; Gmail also works if you can access it (required)
exports.mailuser = "[email protected]"; // mailbox user name (required)
exports.mailpass = "qqpassword"; // mailbox password (required)
exports.mailfrom = "[email protected]"; // sender address (required, usually the same mailbox as the user name)
exports.mailto = "[email protected]"; // address that receives notification mail (required)

Save and then go to the next step.
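For reference, steps 4 through 8 boil down to roughly the following commands (a sketch only; the repository URL, database name and credentials are placeholders, so substitute your own):

npm install -g forever
git clone <repository-url> zhihuspider
cd zhihuspider
npm install
mkdir json avatar
mysql -u dbuser -p dbname < setup.sql
mysql -u dbuser -p dbname < startusers.sql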

II. The crawler user

The crawler works by simulating a real Zhihu user clicking around the site and collecting data, so it needs a real Zhihu account. You can use your own account for testing, but in the long run it is better to register a dedicated one; a single account is enough, and the current crawler only supports one. Instead of logging in from the home page the way a real user would, our simulation borrows the cookie value directly:

After registering, activating and logging in, go to your home page and use any browser with developer tools or a cookie-viewing plugin to open your own cookie. It may be long and complicated, but we only need one part of it, namely z_c0. Copy the z_c0 portion of your cookie, including the equals sign, quotation marks and semicolon; the final format looks basically like this:

Z_c0= "LA8KJIJFDDSOA883WKUGJIRE8JVNKSOQFB9430=|1420113988|A6EA18BC1B23EA469E3B5FB2E33C2828439CB";

Insert a row into the cookies table of the MySQL database, with the field values as follows:

    • email: the crawler user's login e-mail
    • password: the crawler user's password
    • name: the crawler user's name
    • hash: the crawler user's hash (a unique identifier that the user cannot modify; it is not actually used here, so it can be left blank for now)
    • cookie: the cookie you just copied

Then you can officially start running. If the cookie expires or the user is blocked, just update the cookie field of this record.
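For reference, the insert looks roughly like this (a sketch only; the e-mail, password and name are placeholders, and the exact table and column names should match what setup.sql created):

INSERT INTO cookies (email, password, name, hash, cookie)
VALUES ('crawler@example.com', 'crawlerpassword', 'crawleruser', '', 'z_c0="...the value you copied...";');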

III. Running the crawler

It is recommended to run the crawler with forever, which not only makes it easy to run in the background and write logs, but also restarts it automatically after a crash. Example:

forever -l /var/www/log.txt index.js

The path after -l is where the log is written; if you place it under a web server directory, you can view the log directly in a browser at http://www.xxx.com/log.txt. Adding parameters (separated by spaces) after index.js runs different crawler commands:
1) -i: execute immediately; without this parameter the crawler starts at the next scheduled time by default, e.g. 0:05 every day;
2) -ng: skip the new-user crawling stage, i.e. getnewuser;
3) -ns: skip the snapshot stage, i.e. usersnapshot;
4) -nf: skip the data-file generation stage, i.e. saveviewfile;
5) -db: show debug logs.
The function of each stage is described in the next section. For convenience, you can wrap this command in a shell script, for example:

#!/bin/bash
cd /usr/zhihuspider
rm -f /var/www/log.txt
forever -l /var/www/log.txt start index.js $*

Substitute your own paths. You can then start the crawler with ./zhihuspider.sh plus parameters: for example, ./zhihuspider.sh -i -ng -nf starts the task immediately, skips the new-user stage and skips the data-file stage. To stop the crawler, use forever stopall (or forever stop with the process number).
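To find the process number and stop the crawler, the usual forever commands apply:

forever list      # show the processes managed by forever and their index numbers
forever stop 0    # stop a single process by index (or uid/pid)
forever stopall   # stop everything managed by forever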

IV. Overview of the principles

The crawler's entry file is index.js. It runs the crawler tasks in a loop at a specified time each day. Three tasks are executed in sequence every day (a rough sketch of the scheduling loop follows the list), namely:

1) getnewuser.js: compares the follow lists of the users currently in the database and grabs information on new users; this mechanism automatically discovers newcomers worth following and adds them to the database;

2) usersnapshot.js: loops through the data and answer lists of the users currently in the database and saves them as a daily snapshot;

3) saveviewfile.js: based on the most recent snapshots, generates user analysis lists and filters out yesterday's, recent and all-time highlight answers for publication to the reader-facing site.
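As a rough illustration of that daily loop (not the project's actual code; it assumes each stage module exports a function that takes a completion callback):

var tasks = [
  require('./getnewuser'),   // stage 1
  require('./usersnapshot'), // stage 2
  require('./saveviewfile')  // stage 3
];

// milliseconds until the next occurrence of hour:minute
function msUntil(hour, minute) {
  var next = new Date();
  next.setHours(hour, minute, 0, 0);
  if (next <= new Date()) next.setDate(next.getDate() + 1);
  return next - new Date();
}

// run the stages one after another, then call done
function runSequence(list, done) {
  if (list.length === 0) return done();
  list[0](function () { runSequence(list.slice(1), done); });
}

function scheduleDaily() {
  setTimeout(function () {
    runSequence(tasks, scheduleDaily); // run all three stages, then schedule tomorrow
  }, msUntil(0, 5));                   // 0:05 every day, as mentioned above
}

scheduleDaily();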

After the three tasks have completed, the main thread refreshes the Zhihu home page every few minutes to verify that the current cookie is still valid. If it has become invalid (for example, the request is redirected to the login page), a notification e-mail is sent to the configured mailbox reminding you to replace the cookie in time. Replacing the cookie works the same way as during initialization: just log in once and copy out the cookie value. If you are interested in the concrete implementation, read the comments in the code, tweak some of the configuration, or even try to rebuild the whole crawler yourself.
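A minimal sketch of such a cookie check, assuming the request module and a hypothetical notifyByMail() helper (the real project wires this into index.js together with the mail settings from config.js):

var request = require('request');

function checkCookie(cookie, notifyByMail) {
  request({
    url: 'http://www.zhihu.com/',
    headers: { Cookie: cookie },
    followRedirect: false // a redirect means we are being sent to the login page
  }, function (err, res) {
    if (err) return console.error('cookie check failed:', err);
    if (res.statusCode !== 200) {
      notifyByMail('The Zhihu cookie seems to have expired, please replace it.');
    }
  });
}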

Tips

1) The principle behind getnewuser is to decide what to crawl by comparing users' follow counts in the snapshots of two consecutive days, so there must be at least two snapshots before it can run; if it is executed earlier, it is skipped automatically.

2) A snapshot that was interrupted halfway through can be resumed. If the program crashes, stop it with forever stop, then restart it with the parameters -i -ng to execute immediately and skip the new-user stage, so it continues from the half-finished snapshot.

3) Do not casually increase the number of (pseudo) threads used for snapshot crawling, i.e. the maxThreadCount attribute in usersnapshots. Too many threads cause 429 errors, and the large amount of data fetched back may not be written to the database in time, causing a memory overflow. So, unless your database sits on an SSD, do not go above 10 threads.

4) saveviewfile needs at least 7 days of snapshots to generate its analysis; if there are fewer than 7 days of snapshot content, it reports an error and is skipped. Analysis for earlier periods can be done by querying the database manually.

5) Since most people do not need to run their own copy of the reader site, the entry point to the function that automatically publishes WordPress articles has been commented out. If you have set up WordPress, remember to enable XML-RPC, create a user dedicated to publishing articles, configure the corresponding parameters in config.js, and uncomment the relevant code in saveviewfile.
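Publishing goes through the standard WordPress XML-RPC interface; below is a minimal sketch using the npm xmlrpc package (an assumption: the project may use a different client, and the host, credentials and post content are placeholders taken from config.js):

var xmlrpc = require('xmlrpc');

var client = xmlrpc.createClient({
  host: 'www.xxx.com',  // config wpurl
  port: 80,
  path: '/xmlrpc.php'
});

// wp.newPost is the standard WordPress XML-RPC method for creating a post
client.methodCall('wp.newPost', [
  0,                    // blog id (required by the API, ignored by WordPress)
  'publishuser',        // config wpusername
  'publishpassword',    // config wppassword
  {
    post_type: 'post',
    post_status: 'publish',
    post_title: 'Daily highlights',
    post_content: '<p>Generated article body...</p>'
  }
], function (err, postId) {
  if (err) return console.error('publish failed:', err);
  console.log('published post', postId);
});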

6) Because Zhihu applies hotlink protection to avatars, the crawler also downloads avatars while fetching user information and saves them locally, and published articles use the local avatar addresses. You need to point a URL path on your HTTP server to the folder where the avatars are saved, or put that folder directly inside the website directory.
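A minimal sketch of mirroring one avatar locally (plain http and fs are assumptions; the real project saves into the avatar folder created in step 7 and then prefixes the file name with wpurlavatarpre when publishing):

var http = require('http');
var fs = require('fs');
var path = require('path');

function saveAvatar(url, avatarDir, done) {
  var filename = path.basename(url.split('?')[0]); // drop any query string
  var dest = path.join(avatarDir, filename);
  http.get(url, function (res) {
    res.pipe(fs.createWriteStream(dest))
      .on('finish', function () { done(null, filename); }) // local file is ready
      .on('error', done);
  }).on('error', done);
}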

7) The code may not be easy to read. Apart from the inherent messiness of Node.js's callback structure, part of the reason is that I had only just started learning Node.js when I wrote the program, so there are many unfamiliar areas that led to a confusing structure I never got around to fixing; the other part is the many ugly checks and retry rules patched in over time, and removing them all would cut the code by two thirds. But there is no way around it: they are needed to keep the system running stably.

8) The crawler source code is released under the WTFPL license and places no restrictions on modification or redistribution.

That is the whole of this article; I hope it has been helpful for your study.
