Source code: https://github.com/nnngu/LagouSpider
Effect Preview
Ideas
1. First, we open the Lagou website and search for "Java". The job listings that appear are our target.
2. Next we need to determine how to extract the information.
Viewing the page source, we find that it contains no job-related information, which proves that Lagou loads its job listings asynchronously, a very common technique.
To analyze asynchronously loaded data we need the Chrome developer tools, which can be opened as follows:
- Click the Network tab to enter the network analysis view. It is blank at first; refresh the page and a series of network requests appears.
As mentioned above, the job information is loaded asynchronously, so among this series of network requests there must be one whose response from the server carries the job data.
Under normal circumstances we can ignore CSS, image, and similar requests and focus on requests of the XHR type. There are four XHR requests in total; opening them one by one and clicking Preview to compare their responses, we find that the first one is the request we are looking for.
Click Headers to view the request parameters. Here we can confirm that the city parameter is the city, the pn parameter is the page number, and the kd parameter is the search keyword.
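To make this concrete, here is a minimal sketch of sending that request with the requests library. The endpoint URL is an assumption based on what DevTools shows for this kind of page, and the parameter values are illustrative, not taken from the project code:

```python
# Minimal sketch of the AJAX request observed in DevTools.
# The URL and parameter values are assumptions for illustration.
import requests

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
para = {
    'city': '广州',  # city (Guangzhou)
    'pn': '1',       # page number
    'kd': 'Java',    # search keyword
}
resp = requests.post(url, data=para, timeout=5)
print(resp.json())
```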
Then we start writing the code.
Code
The code is divided into four parts to facilitate later maintenance.
1. Basic HTTPS request: https.py
This part wraps some methods of the requests library; part of the code is shown below:
```python
# -*- coding: utf-8 -*-
from src.setting import IP, UA
import requests, random
import logging


class Http:
    ''' HTTP request-related operations '''

    def __init__(self):
        pass

    def get(self, url, headers=None, cookies=None, proxy=None, timeOut=5, timeOutRetry=5):
        '''
        Get web page source code
        url: web link
        headers: headers
        cookies: cookies
        proxy: proxy
        timeOut: request timeout
        timeOutRetry: timeout retry count
        return: source code
        '''
        if not url:
            logging.error('getError url not exit')
            return 'None'
        # Only part of the code is shown here
        # Full code uploaded to GitHub
```
Only part of the code is shown here, and the full code is uploaded to GitHub
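The retry logic itself is elided above. As a minimal sketch, assuming the requests library, a timeOut/timeOutRetry loop could look like this (my illustration of the pattern, not the repository's exact code):

```python
import logging
import requests

def get_with_retry(url, headers=None, cookies=None, proxy=None,
                   timeOut=5, timeOutRetry=5):
    # Hypothetical helper illustrating the timeOut/timeOutRetry idea
    # from Http.get above; the real implementation is on GitHub.
    for attempt in range(timeOutRetry):
        try:
            response = requests.get(url, headers=headers, cookies=cookies,
                                    proxies=proxy, timeout=timeOut)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            logging.warning('get failed (%d/%d): %s', attempt + 1, timeOutRetry, e)
    return 'None'  # the original returns the string 'None' on failure
```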
2. Main logic of the code: main.py
This part of the program logic is as follows:
```python
def getInfo(url, para):
    """ Get information """
    generalHttp = Http()
    htmlCode = generalHttp.post(url, para=para, headers=headers, cookies=cookies)
    generalParse = Parse(htmlCode)
    pageCount = generalParse.parsePage()
    info = []
    for i in range(1, 3):  # only pages 1-2 are fetched here
        print('Page %s' % i)
        para['pn'] = str(i)
        htmlCode = generalHttp.post(url, para=para, headers=headers, cookies=cookies)
        generalParse = Parse(htmlCode)
        info = info + getInfoDetail(generalParse)
        time.sleep(2)
    return info
```
```python
def processInfo(info, para):
    """ Store the information """
    logging.error('process start')
    try:
        title = ('Company name\tCompany type\tFinancing stage\tLabel\t'
                 'Company size\tCompany location\tJob type\tEducation\t'
                 'Benefits\tSalary\tWork experience\n')
        file = codecs.open('%sposition.xls' % para['city'], 'w', 'utf-8')
        file.write(title)
        for p in info:
            line = str(p['companyName']) + '\t' + str(p['companyType']) + '\t' + \
                   str(p['companyStage']) + '\t' + str(p['companyLabel']) + '\t' + \
                   str(p['companySize']) + '\t' + str(p['companyDistrict']) + '\t' + \
                   str(p['positionType']) + '\t' + str(p['positionEducation']) + '\t' + \
                   str(p['positionAdvantage']) + '\t' + str(p['positionSalary']) + '\t' + \
                   str(p['positionWorkYear']) + '\n'
            file.write(line)
        file.close()
        return True
    except Exception as e:
        print(e)
        return None
```
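For orientation, the entry point presumably wires these functions together roughly as follows. This is a sketch of mine, with an assumed endpoint URL; the actual main.py is on GitHub:

```python
# Hypothetical entry point tying getInfo and processInfo together.
# The endpoint URL is an assumption; see main.py on GitHub for the real one.
if __name__ == '__main__':
    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    para = {'city': '广州', 'pn': '1', 'kd': 'Java'}
    info = getInfo(url, para)
    if processInfo(info, para):
        print('Saved %d positions' % len(info))
```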
3. Information parsing section: parse.py
This part parses the job information returned by the server according to its characteristics, as follows:
```python
class Parse:
    ''' Parse web page information '''

    def __init__(self, htmlCode):
        self.htmlCode = htmlCode
        self.json = demjson.decode(htmlCode)
        pass

    def parseTool(self, content):
        ''' Clear HTML tags '''
        if type(content) != str:
            return content
        sublist = ['<p.*?>', '</p.*?>', '<b.*?>', '</b.*?>',
                   '<div.*?>', '</div.*?>', '</br>', '<br/>',
                   '<ul>', '</ul>', '<li>', '</li>',
                   '<strong>', '</strong>',
                   '<table.*?>', '<tr.*?>', '</tr>', '<td.*?>', '</td>',
                   '\r', '\n', '&.*?;', '&', '#.*?;',
                   '<em>', '</em>']
        try:
            for substring in [re.compile(string, re.S) for string in sublist]:
                content = re.sub(substring, '', content).strip()
        except:
            raise Exception('Error ' + str(substring.pattern))
        return content
    # Only part of the code is shown here
    # Full code uploaded to GitHub
```
Only part of the code is shown here, and the full code is uploaded to GitHub
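For illustration, here is how parseTool behaves on a typical job-description fragment. The input HTML is invented for the example:

```python
# Invented example input; parseTool strips the HTML tags listed in sublist.
p = Parse('{}')  # minimal valid JSON so __init__ succeeds
raw = '<p>Responsibilities:</p><ul><li>Develop <strong>Java</strong> services</li></ul>'
print(p.parseTool(raw))
# -> Responsibilities:Develop Java services
```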
4. Configuration section: setting.py
Cookies are included in this part to cope with Lagou's anti-crawling measures. For long-term use this needs improvement, namely fetching cookies dynamically.
```python
# -*- coding: utf-8 -*-

# headers
headers = {
    'Host': 'www.lagou.com',
    'Connection': 'keep-alive',
    'Content-Length': 'All',
    'Origin': 'https://www.lagou.com',
    'X-Anit-Forge-Code': '0',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/63.0.3239.132 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'X-Anit-Forge-Token': 'None',
    'Referer': 'https://www.lagou.com/jobs/list_java?city=%E5%B9%BF%E5%B7%9E&cl=false&fromSearch=true&labelWords=&suginput=',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
}
```
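As noted, hard-coded cookies eventually go stale. One possible improvement is sketched below, assuming a requests.Session that collects fresh cookies from the search page before the AJAX request; this is my suggestion, not code from the repository:

```python
import requests

def get_fresh_cookies():
    # Sketch of dynamic cookie acquisition: visit the search page so the
    # server sets session cookies, then reuse them for the AJAX request.
    # The URL mirrors the Referer in headers above; details are assumptions.
    session = requests.Session()
    session.headers.update({'User-Agent': headers['User-Agent']})
    session.get('https://www.lagou.com/jobs/list_java')
    return session.cookies
```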
Test
Running result:
After the crawl finishes, the scraped data can be found in the src directory.
With this, the crawl of Lagou job information is complete. The full code has been uploaded to my GitHub.