An easy-to-read walkthrough of implementing a small crawler in Python to scrape job listings from Lagou

Source: Internet
Author: User

Source code: https://github.com/nnngu/LagouSpider

Effect Preview

Ideas

1. First, open Lagou and search for "Java". The job listings that appear are our target data.

2. Next we need to determine how to extract the information.

    • View the page source. The job information cannot be found in it, which shows that Lagou loads the job data asynchronously, a very common technique.

    • To analyze asynchronously loaded data we use Chrome's developer tools. Open them as follows:

    • Click the Network tab to open the network analysis panel. It is blank at first; refresh the page and a series of network requests appears.

    • As noted above, the job information is loaded asynchronously, so among these network requests there must be one whose response contains the job data.

    • Normally we can ignore CSS, image and other request types and focus on the XHR requests.

There are 4 XHR requests in total. Open them one by one and click Preview to compare what each one returns.

The first request turns out to be the one we are looking for.

Click Headers to view the request parameters:

Here we can confirm that the city parameter is the city, the pn parameter is the page number, and the kd parameter is the search keyword.
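To make this concrete, here is a minimal sketch of reproducing that XHR request ourselves with the requests library. The endpoint URL and parameter values below are assumptions taken from what the Network panel shows for a Guangzhou/Java search, not a documented API:

# -*- coding: utf-8 -*-
# Minimal sketch of replaying the XHR request found in the Network panel.
# The endpoint URL and parameter values are assumptions, not a documented API.
import requests

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
para = {
    'city': '广州',   # city to search in (Guangzhou)
    'pn': '1',        # page number
    'kd': 'Java',     # search keyword
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_java',
    'X-Requested-With': 'XMLHttpRequest',
}

response = requests.post(url, data=para, headers=headers, timeout=5)
print(response.json())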

Then we start writing the code.

Code

The code is divided into four parts to facilitate later maintenance.

1. Basic HTTP requests: https.py

This module is a wrapper around the requests library; part of the code is shown below:

# -*- coding: utf-8 -*-
from src.setting import IP, UA
import requests, random
import logging


class Http:
    ''' HTTP request-related operations '''

    def __init__(self):
        pass

    def get(self, url, headers=None, cookies=None, proxy=None, timeOut=5, timeOutRetry=5):
        '''
        Get the page source code
        url: page URL
        headers: request headers
        cookies: cookies
        proxy: proxy
        timeOut: request timeout in seconds
        timeOutRetry: number of retries on timeout
        return: page source code
        '''
        if not url:
            logging.error('getError url not exist')
            return 'None'

        # Only part of the code is shown here
        # Full code uploaded to GitHub

Only part of the code is shown here, and the full code is uploaded to GitHub
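The body of get() is elided above. As a rough idea of what such a wrapper usually does, here is a hypothetical sketch (not the author's actual code) that rotates User-Agents and proxies from setting.py and retries on failure:

    # Hypothetical sketch of the elided get() body, NOT the author's actual code.
    # It assumes UA is a list of User-Agent strings and IP a list of proxy URLs,
    # both defined in src/setting.py.
    def get(self, url, headers=None, cookies=None, proxy=None, timeOut=5, timeOutRetry=5):
        if not url:
            logging.error('getError url not exist')
            return 'None'
        if headers is None:
            headers = {'User-Agent': random.choice(UA)}   # pick a random User-Agent
        if proxy is None and IP:
            proxy = {'http': random.choice(IP)}           # pick a random proxy, if any
        try:
            response = requests.get(url, headers=headers, cookies=cookies,
                                    proxies=proxy, timeout=timeOut)
            htmlCode = response.text
        except requests.exceptions.RequestException as e:
            logging.error('getError %s' % e)
            if timeOutRetry > 0:
                # retry with one fewer attempt remaining
                htmlCode = self.get(url, headers, cookies, proxy, timeOut, timeOutRetry - 1)
            else:
                htmlCode = 'None'
        return htmlCode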

2. Main logic: main.py

This part of the program logic is as follows:

    • Get the job information

def getInfo(url, para):
    """ Get the job information """
    generalHttp = Http()
    htmlCode = generalHttp.post(url, para=para, headers=headers, cookies=cookies)
    generalParse = Parse(htmlCode)
    pageCount = generalParse.parsePage()
    info = []
    for i in range(1, 3):
        print('Page %s' % i)
        para['pn'] = str(i)
        htmlCode = generalHttp.post(url, para=para, headers=headers, cookies=cookies)
        generalParse = Parse(htmlCode)
        info = info + getInfoDetail(generalParse)
        time.sleep(2)
    return info
    • Store the information

def processInfo(info, para):
    """ Store the information """
    logging.error('process start')
    try:
        title = 'Company name\tCompany type\tFinancing stage\tLabel\tCompany size\tCompany location\t' \
                'Job type\tEducation requirement\tBenefits\tSalary\tWork experience\n'
        file = codecs.open('%sPosition.xls' % para['city'], 'w', 'utf-8')
        file.write(title)
        for p in info:
            line = str(p['companyName']) + '\t' + str(p['companyType']) + '\t' + str(p['companyStage']) + '\t' + \
                   str(p['companyLabel']) + '\t' + str(p['companySize']) + '\t' + str(p['companyDistrict']) + '\t' + \
                   str(p['positionType']) + '\t' + str(p['positionEducation']) + '\t' + str(p['positionAdvantage']) + '\t' + \
                   str(p['positionSalary']) + '\t' + str(p['positionWorkYear']) + '\n'
            file.write(line)
        file.close()
        return True
    except Exception as e:
        print(e)
        return None
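For completeness, a hypothetical entry point tying the two functions above together might look like this (the endpoint URL is an assumption; the real main.py is on GitHub):

# Hypothetical glue code; see the full main.py on GitHub for the real version.
if __name__ == '__main__':
    # The endpoint URL is an assumption based on the XHR request analyzed earlier
    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    para = {'city': '广州', 'pn': '1', 'kd': 'Java'}
    info = getInfo(url, para)
    if info:
        processInfo(info, para)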

3. Parsing: parse.py

This module parses the job information returned by the server; part of the code is shown below:

class Parse:
    ''' Parse the web page information '''

    def __init__(self, htmlCode):
        self.htmlCode = htmlCode
        self.json = demjson.decode(htmlCode)

    def parseTool(self, content):
        ''' Strip HTML tags from a field '''
        if type(content) != str:
            return content
        sublist = ['<p.*?>', '</p.*?>', '<b.*?>', '</b.*?>', '<div.*?>', '</div.*?>',
                   '</br>', '<br/>', '<ul>', '</ul>', '<li>', '</li>',
                   '<strong>', '</strong>', '<table.*?>', '<tr.*?>', '</tr>',
                   '<td.*?>', '</td>', '\r', '\n', '&.*?;', '&', '#.*?;',
                   '<em>', '</em>']
        try:
            for substring in [re.compile(string, re.S) for string in sublist]:
                content = re.sub(substring, "", content).strip()
        except:
            raise Exception('Error ' + str(substring.pattern))
        return content

    # Only part of the code is shown here
    # Full code uploaded to GitHub

Only part of the code is shown here, and the full code is uploaded to GitHub
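parsePage() and the per-field extraction methods are elided above. A hypothetical sketch of how they might read the decoded JSON follows; the field names are assumptions about Lagou's response format and may change:

    # Hypothetical sketch of the elided parsing methods; the JSON field names
    # are assumptions about Lagou's response format, not confirmed by the source.
    def parsePage(self):
        ''' Number of result pages, based on total results and page size. '''
        totalCount = self.json['content']['positionResult']['totalCount']
        resultSize = self.json['content']['positionResult']['resultSize']
        return totalCount // resultSize + (1 if totalCount % resultSize else 0)

    def parseCompanyName(self):
        ''' Company name of every job on the current page. '''
        return [self.parseTool(job['companyFullName'])
                for job in self.json['content']['positionResult']['result']]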

4. Configuration: setting.py

Cookies are included here to cope with Lagou's anti-crawling measures. For long-term use this should be improved so that cookies are obtained dynamically.

# -*- coding: utf-8 -*-

# headers
headers = {
    'Host': 'www.lagou.com',
    'Connection': 'keep-alive',
    'Content-Length': 'All',
    'Origin': 'https://www.lagou.com',
    'X-Anit-Forge-Code': '0',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'X-Anit-Forge-Token': 'None',
    'Referer': 'https://www.lagou.com/jobs/list_java?city=%E5%B9%BF%E5%B7%9E&cl=false&fromSearch=true&labelWords=&suginput=',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
}
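https.py also imports IP and UA from this module, and main.py passes cookies with every request. A minimal sketch of how those might look (the values are placeholders, not working proxies or a real session):

# Placeholder sketch of the remaining settings; real proxies and cookies must be
# supplied by the user, and the cookies have to be refreshed regularly because
# Lagou's anti-crawling checks them.

# Pool of User-Agent strings to rotate through
UA = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
]

# Pool of HTTP proxies; an empty list means requests go out directly
IP = []

# Cookies copied from a logged-in browser session (values elided here)
cookies = {
    'user_trace_token': '...',
    'JSESSIONID': '...',
}
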
Test

Run result:

After the crawl finishes, the data scraped by the crawler can be found in the src directory.

With that, the Lagou job information crawl is complete. The full code has been uploaded to my GitHub.
