Using Node.js to crawl the front-end skills ranking on 51job ("Qianchengwuyou", literally "no worries about the future").
I am preparing to look for a new job, so I need to update my skill tree and gather targeted statistics on what recruiters actually require. I had learned some Node.js before, so I wrote a crawler to collect the data.
Procedure:
1. Use fiddler to analyze the headers and bodies the requests need.
2. Use superagent to construct that data and send the client requests.
3. Use cheerio to parse the returned data (a minimal sketch of steps 2 and 3 follows this list).
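Before the full script, here is the minimal shape of steps 2 and 3 wired together: request a page with superagent and walk it with cheerio's jQuery-style selectors. This is only an illustrative sketch; the real 51job flow needs the login cookie, proxy, and gbk handling shown in the full script below, and the `.el` selector is the row class I target there.

```js
// Minimal sketch of steps 2-3: fetch a page, then parse it with cheerio.
// Without the gbk charset handling from the full script, 51job's
// response text would come back garbled; this is just the skeleton.
var request = require('superagent');
var cheerio = require('cheerio');

request
  .get('http://search.51job.com/jobsearch/search_result.php')
  .end(function (err, res) {
    if (err) return console.error(err);
    var $ = cheerio.load(res.text);   // res.text is the response body
    $('.el').each(function (i, el) {  // iterate the job rows
      console.log($(el).text().trim());
    });
  });
```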
After a few evenings I only have the skeleton below; I'll keep building it as time allows.
```js
/* If you use fiddler to capture packets, you must configure it as a LAN proxy
   and set the following environment variables */
process.env.https_proxy = "http://127.0.0.1:8888";
process.env.http_proxy = "http://127.0.0.1:8888";
process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

/* modules used */
var request = require('superagent');    // sends the client requests
require('superagent-proxy')(request);   // lets requests go through a proxy
var cheerio = require('cheerio');       // jQuery-like operations on the returned markup, no regex needed
require('superagent-charset')(request); // node does not support gbk/gb2312; this adds Request.prototype.charset
var async = require('async');           // asynchronous flow-control module
var fs = require('fs');

/* request parameters, copied from the packets captured with fiddler */
var ws = fs.createWriteStream('res.html', {flags: 'w+'}); // a+ appends, w+ truncates
var loginUrl = "https://login.51job.com/login.php";
var searchUrl = "http://search.51job.com/jobsearch/search_result.php";
var queryStrings = "fromJs=1&jobarea=020000&keyword=%E5%89%8D%E7%AB%AF%E5%BC%80%E5%8F%91&keywordtype=2&lang=c&stype=2&postchannel=0000&fromType=1&confirmdate=9";
var loginForms = {
  lang: 'c',
  action: 'save',
  from_domain: 'i',
  loginname: '***', // your username
  password: '***',  // your password
  verifycode: '',
  isread: 'on'
};
var searchForms = {
  lang: 'c',
  stype: '2',
  postchannel: '0000',
  fromType: '1',
  line: '',
  confirmdate: '9',
  from: '',
  keywordtype: '2',
  keyword: '%E5%89%8D%E7%AB%AF%E5%BC%80%E5%8F%91',
  jobarea: '020000',
  industrytype: '',
  funtype: ''
};
var searchFormsString = 'lang=c&stype=2&postchannel=0000&fromType=1&line=&confirmdate=9&from=&keywordtype=2&keyword=%C7%B0%B6%CB%BF%AA%B7%A2&jobarea=020000&industrytype=&funtype=';

var proxy0 = process.env.https_proxy;
var proxy = process.env.http_proxy;

const agent = request.agent(); // instances created by agent() keep cookies for subsequent requests
agent
  .post(loginUrl)
  .proxy(proxy0)
  .send(loginForms)
  .end(function (err, res0) {
    agent
      .post(searchUrl)
      .proxy(proxy) // proxy() must be called right after the verb method, otherwise fiddler cannot capture the packet
      .type('application/x-www-form-urlencoded')
      .query(queryStrings)      // the query string, passed in string form
      .send(searchFormsString)
      .charset('gbk')           // the response is gbk-encoded; without charset() the text comes back garbled
      .end(function (err, res) {
        /* logic for processing the returned data */
        var $ = cheerio.load(res.text); // res.text is the body of the response
        async.each($('.el.title').nextAll('.el'), function (v, callback) {
          // drop the unneeded columns; keep the position and company links
          $(v).prepend($(v).find('.t1 a'));
          $(v).find('.t1').remove();
          ws.write($.html(v), 'utf8');
          callback(); // signal completion so async.each's final callback can fire
        }, function (err) {
          console.log(err);
        });
        console.log('successful');
      });
  });

// Unlike jQuery, cheerio has no built-in document root; the markup must be
// passed in through load(). After that, selectors find elements as usual.
// $.html(el) is a static method that returns the outerHtml of el.
// TODO
// 1. Only one page of data is requested so far; build the request list for all pages.
// 2. Request the job link of each entry to extract the skill keywords and store them in a file.
// 3. Node io is asynchronous and there is no lock concept: how do I write to the same file concurrently?
```
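One possible answer to TODO 3 is to funnel every write through a single `async.queue` with concurrency 1, so chunks reach the file in order even when many requests finish at once; the same module's `eachLimit` also covers TODO 1's "request all pages" with bounded concurrency. This is a sketch under my own assumptions, not the finished crawler: it reuses `agent` and `proxy` from the script above, and `pageUrls` is a hypothetical placeholder for the list built from the total page count on page 1.

```js
var async = require('async');
var fs = require('fs');

var out = fs.createWriteStream('res.html', {flags: 'a+'});

// A queue with concurrency 1 serializes writes: no locks needed,
// because only one task touches the stream at a time.
var writeQueue = async.queue(function (chunk, done) {
  out.write(chunk, 'utf8', done);
}, 1);

// Hypothetical page list for TODO 1: fetch every result page,
// at most 2 in flight, pushing each body into the write queue.
var pageUrls = [/* built from the page count parsed on page 1 */];
async.eachLimit(pageUrls, 2, function (url, callback) {
  agent.get(url).proxy(proxy).charset('gbk').end(function (err, res) {
    if (err) return callback(err);
    writeQueue.push(res.text);
    callback();
  });
}, function (err) {
  if (err) console.log(err);
});
```

The queue decouples fetching from writing: requests may complete in any order and with any concurrency, but the file only ever sees one write at a time.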
The result is as follows:
That wraps up this introduction to crawling the 51job front-end skills ranking with Node.js. I hope it helps; if you have any questions, leave a comment and I'll reply as soon as I can!