Today I wanted to crawl the backend data from a website. I ran into a lot of obstacles along the way and spent 2 hours just getting the request to return data, so I am summarizing the experience here.
First, here is the request address I crawled: http://api.chuchujie.com/api/?v=1.0
Let's start crawling the data.
I. Writing a crawler with Node.js
1. Import the required modules
We need the http module (which Node.js uses to send HTTP requests to a server) and the querystring module (which converts the parameter object passed in from the front end into a URL-encoded string):
var http = require("http"); // HTTP requests
// var https = require("https"); // HTTPS requests
var querystring = require("querystring");
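To make the role of querystring concrete, here is a minimal sketch of what it does to a parameter object. The field values are assumptions modeled on the commented-out sample parameters later in this post, not values confirmed by the API.

```javascript
const querystring = require("querystring");

// Sample parameters; the exact field values are assumptions for illustration.
const param = {
  query: JSON.stringify({ function: "newest", module: "zdm" }),
  client: JSON.stringify({ gender: "0" }),
  page: 1
};

// stringify() percent-encodes each value and joins the pairs with "&",
// which is the format expected for application/x-www-form-urlencoded bodies.
const body = querystring.stringify(param);
console.log(body);

// parse() reverses the encoding; note that every value comes back as a string.
const decoded = querystring.parse(body);
console.log(decoded.page, decoded.client);
```

The nested objects are serialized with JSON.stringify first because querystring only handles flat key/value pairs.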
2. Configure the options parameter of http.request(options, fn)
In the configuration, the emphasis is on simulating the browser's request headers. You typically need to simulate Cookie, User-Agent (the client's browser and operating system), and Content-Type; some sites require more. This API does not need a cookie, so we do not send one.
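A minimal sketch of such an options object, using the host and headers from this post. The helper name buildOptions is hypothetical, and computing Content-Length from the encoded body with Buffer.byteLength (instead of hard-coding a number) is my assumption about a more robust way to set that header.

```javascript
const querystring = require("querystring");

// Hypothetical helper: builds the options object passed to http.request().
function buildOptions(path, param) {
  // The body is encoded here only to measure its size in bytes.
  const body = querystring.stringify(param);
  return {
    hostname: "api.chuchujie.com",
    port: 80, // HTTP default port
    path: path,
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
      // Must match the byte length of the body that is actually written.
      "Content-Length": Buffer.byteLength(body),
      "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    }
  };
}

const options = buildOptions("/api/?v=1.0", { page: 1 });
console.log(options.headers);
```

If a site did require a cookie, it would go into the same headers object under the "Cookie" key.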
3. Send the HTTP POST request to the target backend to get the data
var req = http.request(options, function (res) {
    var json = ""; // accumulates the data received from the server
    console.log(res.statusCode);
    // res.on listens for events while the response comes back. The "data" event
    // fires each time a piece of the response arrives; chunk is one such piece.
    res.on("data", function (chunk) {
        json += chunk; // json is the concatenation of all chunks
    });
    // "end" fires once the whole response has been received; callback(json)
    // hands the backend's result to the caller, which returns it to the front end.
    res.on("end", function () {
        callback(json);
    });
});
req.on("error", function () {
    console.log("error");
});
// This is the shape of the parameters sent by the front end. Here param is passed in by
// the backend routing module, which in turn receives it from the front end.
// var obj = {
//     query: '{"function":"newest","module":"zdm"}',
//     client: '{"gender":"0"}',
//     page: 1
// }
req.write(querystring.stringify(param)); // the POST request body
req.end(); // required: signals that the request is complete
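The chunk handling above can also be sketched as a small reusable helper. This is a variation under my own assumptions, not the code used in this post: collectJson is a hypothetical name, and it keeps the chunks as Buffers and decodes them once at the end, so that a multi-byte character split across two chunks is not corrupted. It also parses the result and reports invalid JSON through an error-first callback.

```javascript
// Hypothetical helper: collects a response stream and parses the body as JSON.
function collectJson(res, callback) {
  const chunks = [];
  res.on("data", function (chunk) {
    // Keep raw bytes; decoding happens once, after all chunks have arrived.
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  });
  res.on("end", function () {
    const text = Buffer.concat(chunks).toString("utf8");
    let data;
    try {
      data = JSON.parse(text);
    } catch (err) {
      callback(err); // the body was not valid JSON
      return;
    }
    callback(null, data);
  });
}

// Usage inside the response handler:
// http.request(options, function (res) {
//   collectJson(res, function (err, data) { /* use data */ });
// });
```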
4. Export the function as a module
Complete crawler code
/**
 * Created by Administrator on 2017/2/12.
 */
var http = require("http"); // HTTP requests
// var https = require("https"); // HTTPS requests
var querystring = require("querystring");

function request(path, param, callback) {
    var postData = querystring.stringify(param); // URL-encoded POST body
    var options = {
        hostname: "api.chuchujie.com",
        port: 80, // port number: HTTPS defaults to 443, HTTP defaults to 80
        path: path,
        method: "POST",
        headers: {
            "Connection": "keep-alive",
            // Computed from the body instead of hard-coded (originally 111),
            // so the header stays correct when param changes.
            "Content-Length": Buffer.byteLength(postData),
            "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
        } // simulated browser request headers
    };
    var req = http.request(options, function (res) {
        var json = ""; // accumulates the data received from the server
        console.log(res.statusCode);
        // The "data" event fires each time a piece of the response arrives;
        // chunk is one such piece.
        res.on("data", function (chunk) {
            json += chunk; // json is the concatenation of all chunks
        });
        // "end" fires once the whole response has been received; callback(json)
        // hands the backend's result to the caller, which returns it to the front end.
        res.on("end", function () {
            callback(json);
        });
    });
    req.on("error", function () {
        console.log("error");
    });
    // This is the shape of the parameters sent by the front end. Here param is passed in by
    // the backend routing module, which in turn receives it from the front end.
    // var obj = {
    //     query: '{"function":"newest","module":"zdm"}',
    //     client: '{"gender":"0"}',
    //     page: 1
    // }
    req.write(postData); // the POST request body
    req.end(); // required: signals that the request is complete
}

module.exports = request;
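As a rough sketch of how a caller (for example the routing module mentioned in the comments) might use the exported function: fetchNewest is a hypothetical name, the parameter values are assumptions based on the commented-out sample, and the request function is passed in as an argument so the snippet does not depend on a particular file name.

```javascript
// Hypothetical caller: builds the parameters for one page and delegates to a
// request(path, param, callback) function such as the one exported above.
function fetchNewest(requestFn, page, done) {
  const param = {
    query: JSON.stringify({ function: "newest", module: "zdm" }),
    client: JSON.stringify({ gender: "0" }),
    page: page
  };
  requestFn("/api/?v=1.0", param, function (json) {
    done(json); // json is the raw response text from the server
  });
}

// Usage, assuming the crawler was saved as ./request.js (file name is an assumption):
// fetchNewest(require("./request"), 1, function (json) {
//   console.log(json);
// });
```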
Crawling JSON data with Node.js by simulating a browser POST request