First, the environment configuration
1) Get a server. Any Linux distribution will do; I use CentOS 6.5;
2) Install a MySQL database, version 5.5 or 5.6 both work. To save trouble you can install it directly as part of an LNMP or LAMP stack, which also lets you view the log in a browser later;
3) Install a Node.js environment. I use 0.12.7 and have not tried later versions;
4) Run npm -g install forever to install forever, so the crawler can run in the background;
5) Get the entire code onto the machine (entire = git clone);
6) Run npm install in the project directory to install the dependency libraries;
7) Create two empty folders, json and avatar, in the project directory;
8) Create an empty MySQL database and a user with full privileges, then execute setup.sql and startusers.sql from the code in turn to create the database structure and import the initial seed users;
9) Edit config.js; the configuration items marked (must) have to be filled in or modified, while the remaining items can be left unchanged for now:
exports.jsonpath = "./json/"; // path where generated JSON files are saved
exports.avatarpath = "./avatar/"; // path where avatar files are saved
exports.dbconfig = {
  host: 'localhost', // database server (must)
  user: 'dbuser', // database username (must)
  password: 'dbpassword', // database password (must)
  db: 'dbname', // database name (must)
  port: 3306, // database server port
  poolsize: 20,
  acquiretimeout: 30000
};
exports.urlpre = "http://www.jb51.net/"; // base URL of the site to crawl
exports.urlzhuanlanpre = "http://www.jb51.net/list/index_96.htm/"; // base URL of the column (zhuanlan) pages to crawl
exports.wpurl = "www.xxx.com"; // address of the WordPress site that articles are published to
exports.wpusername = "publishuser"; // username of the user who publishes the articles
exports.wppassword = "publishpassword"; // password of the publishing user
exports.wpurlavatarpre = "http://www.xxx.com/avatar/"; // URL prefix that replaces the original avatar addresses in published articles
exports.mailservice = "QQ"; // mail notification service type; Gmail also works if you can access it (must)
exports.mailuser = "12345@qq.com"; // mailbox username (must)
exports.mailpass = "qqpassword"; // mailbox password (must)
exports.mailfrom = "12345@qq.com"; // sender address (must, usually the same as the mailbox username)
exports.mailto = "12345@qq.com"; // address that receives the notification mail (must)
Save the file, then continue with the next step.
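Optionally, before moving on, you can sanity-check the database part of the configuration. The snippet below is only a throwaway sketch, not part of the project: it assumes the mysql npm package and the field names used in config.js above, and checkdb.js is a made-up filename.
// checkdb.js - throwaway check of the database settings in config.js (illustration only)
var mysql = require('mysql');      // assumption: the mysql npm package is available
var config = require('./config');
var conn = mysql.createConnection({
  host: config.dbconfig.host,
  user: config.dbconfig.user,
  password: config.dbconfig.password,
  database: config.dbconfig.db,    // config.js calls this field "db"
  port: config.dbconfig.port
});
conn.connect(function (err) {
  if (err) throw err;              // wrong credentials or host show up here
  console.log('database connection OK');
  conn.end();
});
Run it once with node checkdb.js and delete it afterwards.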
Second, the crawler user
The crawler works by simulating a real user clicking around the site and collecting data, so it needs an account that is genuinely logged in. For testing you can use your own account, but in the long run it is better to register a separate one; a single account is enough, as the crawler currently supports only one. Our simulation does not need to log in from the homepage the way a real user would; instead it borrows the cookie value directly:
After registering, activating, and logging in, go to your home page and look at your cookies using any browser with developer tools or a cookie-viewing plug-in. The list may be long and complicated, but we only need one part of it, namely "z_c0". Copy the z_c0 portion of your cookie, including the equals sign, quotes, and semicolon; the final format looks roughly like this:
z_c0="LA8KJIJFDDSOA883WKUGJIRE8JVNKSOQFB9430=|1420113988|A6EA18BC1B23EA469E3B5FB2E33C2828439CB";
Insert a row into the Cookies table of the MySQL database (a script sketch follows the field list below), with the field values as follows:
- Email: the crawler user's login email
- Password: the crawler user's password
- Name: the crawler user's name
- Hash: the crawler user's hash (a unique identifier that a user cannot modify; it is not actually used here and can be left blank for now)
- Cookie: the cookie you just copied
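If you would rather not insert the row by hand with a MySQL client, a one-off script along the following lines can do it. This is only a sketch: it assumes the mysql npm package, that setup.sql created the Cookies table with exactly the five columns listed above (adjust the names or casing if it did not), and every value shown is a placeholder.
// insertcookie.js - one-off helper, not part of the project (illustration only)
var mysql = require('mysql');
var config = require('./config');
var conn = mysql.createConnection({
  host: config.dbconfig.host,
  user: config.dbconfig.user,
  password: config.dbconfig.password,
  database: config.dbconfig.db,
  port: config.dbconfig.port
});
conn.query(
  'INSERT INTO Cookies (Email, Password, Name, Hash, Cookie) VALUES (?, ?, ?, ?, ?)',
  // placeholders: replace with the crawler account's real values and the z_c0 cookie you copied
  ['crawler@example.com', 'crawlerpassword', 'crawlername', '', 'z_c0="...the value you copied...";'],
  function (err) {
    if (err) throw err;
    console.log('cookie row inserted');
    conn.end();
  }
);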
After that, the crawler can be started for real. If the cookie expires or the user gets blocked, simply update the Cookie field of this row.
Third, running the crawler
Running it with forever is recommended: it not only handles background execution and logging, but also restarts the process automatically after a crash. Example:
forever -l /var/www/log.txt index.js
The address after -l is where the log is written; if it is placed under a web server directory, it can be viewed directly in a browser at http://www.xxx.com/log.txt. Parameters after index.js (separated by spaces) issue different crawler commands (a parsing sketch follows this list):
1) -i: execute immediately; without this parameter, execution starts at the next scheduled time, e.g. 0:05 every morning;
2) -ng: skip the new-user fetching phase, i.e. getnewuser;
3) -ns: skip the snapshot phase, i.e. usersnapshot;
4) -nf: skip the data-file generation phase, i.e. saveviewfile;
5) -db: print debug logs.
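Purely as an illustration of how index.js might read these flags (the actual implementation may differ), the standard Node.js way is to inspect process.argv:
// Illustration only: reading the flags described above from the command line.
var args = process.argv.slice(2);               // everything after "index.js"
var runNow       = args.indexOf('-i')  !== -1;  // run immediately
var skipNewUser  = args.indexOf('-ng') !== -1;  // skip getnewuser
var skipSnapshot = args.indexOf('-ns') !== -1;  // skip usersnapshot
var skipSaveFile = args.indexOf('-nf') !== -1;  // skip saveviewfile
var debugLog     = args.indexOf('-db') !== -1;  // verbose debug logging
console.log({ runNow: runNow, skipNewUser: skipNewUser,
              skipSnapshot: skipSnapshot, skipSaveFile: skipSaveFile, debugLog: debugLog });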
What each phase does is described in the next section. To make running easier, you can write this command line as an sh script, for example:
#!/bin/bash
cd /usr/zhihuspider
rm -f /var/www/log.txt
forever -l /var/www/log.txt start index.js $*
Replace the paths with your own. After that you can start the crawler with ./zhihuspider.sh plus parameters: for example, ./zhihuspider.sh -i -ng -nf starts the task immediately while skipping the new-user and file-saving phases. To stop the crawler, run forever stopall (or forever stop with the process serial number).
Fourth, overview of the principles
The crawler's entry file is index.js. It runs the crawler tasks in a loop at a specified time every day. Three tasks are executed in sequence each day (a scheduling sketch follows this list):
1) getnewuser.js: compares the follow lists of the users already in the database and crawls the information of new users; relying on this mechanism, newly followed people worth attention are brought into the database automatically;
2) usersnapshot.js: iterates over the user data and answer lists currently in the database and saves them as a daily snapshot;
3) saveviewfile.js: based on the most recent snapshots, generates a user analysis list and filters out yesterday's, recent, and all-time best answers for publishing to the "see Zhihu" site.
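As a rough sketch of that daily cycle, assuming for illustration that the three modules each export a function taking a completion callback (the real index.js may be organized differently):
// Illustration of the daily cycle described above; not the actual index.js.
var getnewuser   = require('./getnewuser');   // assumed entry points of the three task modules
var usersnapshot = require('./usersnapshot');
var saveviewfile = require('./saveviewfile');

function msUntil(hour, minute) {              // milliseconds until the next e.g. 0:05
  var now = new Date(), next = new Date(now);
  next.setHours(hour, minute, 0, 0);
  if (next <= now) next.setDate(next.getDate() + 1);
  return next - now;
}

function runOnce(done) {                      // the three tasks, strictly in order
  getnewuser(function () {
    usersnapshot(function () {
      saveviewfile(done);
    });
  });
}

(function schedule() {
  setTimeout(function () {
    runOnce(schedule);                        // reschedule after the run finishes
  }, msUntil(0, 5));                          // 0:05 every day, as mentioned above
})();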
After these three tasks complete, the main thread refreshes the Zhihu home page every few minutes to verify that the current cookie is still valid. If it has become invalid (i.e. it gets redirected to the login page), a notification email is sent to the configured mailbox, reminding you to replace the cookie in time. Replacing it works the same way as during initialization: just log in manually once and copy out the cookie value. If you are interested in the concrete implementation, read the comments in the code, adjust some of the configuration, or even try refactoring the whole crawler yourself.
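The notification step can be pictured roughly as below. This is a sketch only, assuming the nodemailer package and the mail fields from config.js; it is not necessarily how the project actually sends mail.
// Sketch of the "cookie expired" notification (assumes the nodemailer package).
var nodemailer = require('nodemailer');
var config = require('./config');

var transporter = nodemailer.createTransport({
  service: config.mailservice,                 // e.g. "QQ" or "Gmail"
  auth: { user: config.mailuser, pass: config.mailpass }
});

// Call this when the home-page check gets redirected to the login page.
function notifyCookieExpired(callback) {
  transporter.sendMail({
    from: config.mailfrom,
    to: config.mailto,
    subject: 'zhihuspider: cookie expired',
    text: 'The crawler cookie is no longer valid. Please log in manually and update the Cookies table.'
  }, callback);
}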
Tips
1) getnewuser works by comparing users' follow counts between the snapshots of the two most recent days to decide what to crawl, so at least two snapshots must exist before it can do anything; if it is executed earlier, it is skipped automatically.
2) A snapshot that was interrupted halfway through can be resumed. If the program crashes, stop it with forever stop, then restart it with the parameters -i -ng (execute immediately and skip the new-user phase) to continue from the half-finished snapshot.
3) Do not casually increase the number of (pseudo) threads used during snapshot capture, i.e. the maxthreadcount attribute in usersnapshot. Too many threads cause 429 errors, and the large amount of data fetched back may not be written to the database in time, leading to memory overflow. So, unless your database sits on an SSD, do not go above 10 threads.
4) saveviewfile needs at least 7 days of snapshots to generate its analysis results; if there are fewer than 7 days of snapshots, it reports an error and is skipped. Before that point, analysis can be done manually by querying the database.
5) Since most people do not need to run their own copy of "see Zhihu", the entry point of the function that automatically publishes WordPress articles has been commented out. If you have set up WordPress, remember to enable XML-RPC, then create a user dedicated to publishing articles, configure the corresponding parameters in config.js, and uncomment the related code in saveviewfile.
6) Because Zhihu applies hotlink protection to avatars, we also download users' avatars while crawling their information and save them locally; published articles then use the local avatar addresses. You need to point a URL path on your HTTP server at the folder where the avatars are saved, or put that folder directly inside the website directory (an alternative sketch appears after this list).
7) The code may not be easy to read. Besides the fact that Node.js's callback structure is itself rather convoluted, part of the reason is that when I first wrote the program I had only just started working with Node.js, and unfamiliarity in many places led to a structural mess I never found time to clean up; the other part is that repeated patching accumulated a lot of ugly conditionals and retry rules which, if removed, might cut the amount of code by two thirds. But there is no way around it: they are needed to keep the system running stably.
8) The crawler's source code is released under the WTFPL license, which places no restrictions on modification and distribution.
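Regarding tip 6: if you would rather not touch the web server configuration, the avatar folder could also be exposed with a few lines of Express. This is only an alternative sketch under that assumption (Express and port 8080 are not part of the project); with it, wpurlavatarpre in config.js would point at http://your-host:8080/avatar/.
// Alternative way to serve the avatar folder over HTTP (assumes the express package).
var express = require('express');
var path = require('path');
var config = require('./config');

var app = express();
// Map http://your-host:8080/avatar/... to the local avatar folder from config.js.
app.use('/avatar', express.static(path.resolve(config.avatarpath)));
app.listen(8080);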
The above is the entire content of this article; I hope it is helpful for your learning.