Python3 crawls Lagou recruitment data with a Python3 crawler




Using Python to crawl Lagou recruitment data
Step 1: Download the required modules
requests: open a cmd window, enter pip install requests, and press Enter to download and install it automatically.
xlwt: likewise, run pip install xlwt and press Enter to download and install it automatically.
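
If you want to confirm both installs succeeded, here is a quick sanity check (a minimal sketch; run it with the same Python 3 interpreter that pip installed into):

# Minimal sanity check that both modules installed correctly.
import requests
import xlwt

print('requests', requests.__version__)  # requests exposes its version as __version__
print('xlwt imported successfully')      # a clean import means xlwt is installed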
Step 2: Find the web page you want to crawl (I am crawling Lagou)
Choose a browser for packet capture (Firefox or Chrome); I use Chrome.
Choose a coding tool (IDEA or PyCharm); I use IDEA.

import requests  # import the requests module downloaded in Step 1
import xlwt      # import the xlwt module downloaded in Step 1

# Use Chrome to open the target page and press F12 to open the inspector;
# the headers below can be found under Network > XHR > Headers.
headers = {
    # The first header identifies your computer and browser to the server.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    # The second is the Lagou page you came from; without it the site treats you as a robot.
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    # The third identifies you to the site (not always required). The cookie value in the
    # source was garbled, so paste your own from the browser's inspector.
    'Cookie': '<paste your own cookie here>',
}

# data corresponds to one page of results; pn=1 means the first page.
def getJobList(page):
    data = {'first': 'false', 'pn': page, 'kd': 'python'}
    # Initiate a POST request to the Ajax endpoint behind the current page.
    res = requests.post('https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0',
                        data=data, headers=headers)
    result = res.json()  # the data comes back as JSON, i.e. (key, value) pairs
    jobs = result['content']['positionResult']['result']  # the jobs matching the query
    return jobs  # return the results

excelTabel = xlwt.Workbook()  # create an Excel object
sheet1 = excelTabel.add_sheet('lagou', cell_overwrite_ok=True)
sheet1.write(0, 0, 'Company name')
sheet1.write(0, 1, 'City')
sheet1.write(0, 2, 'District')
sheet1.write(0, 3, 'Full-time/Part-time')
sheet1.write(0, 4, 'Salary')
sheet1.write(0, 5, 'Position')
sheet1.write(0, 6, 'Years of experience')
sheet1.write(0, 7, 'Company size')
sheet1.write(0, 8, 'Education')

n = 1
for page in range(1, 31):  # loop over each results page (the page count was garbled in the source; 30 is an assumption)
    for job in getJobList(page=page):
        # The if filter below is optional. Note: Lagou returns Chinese field values,
        # so these translated strings may need replacing with the originals.
        if '1-3' in job['workYear'] and 'backend developer' in job['secondType'] and 'bachelor' in job['education']:
            # and 'Chaoyang District' in job['district']
            sheet1.write(n, 0, job['companyFullName'])  # company name
            sheet1.write(n, 1, job['city'])             # city
            sheet1.write(n, 2, job['district'])         # district
            sheet1.write(n, 3, job['jobNature'])        # full-time/part-time
            sheet1.write(n, 4, job['salary'])           # salary
            sheet1.write(n, 5, job['secondType'])       # position
            sheet1.write(n, 6, job['workYear'])         # years of experience
            sheet1.write(n, 7, job['companySize'])      # company size
            sheet1.write(n, 8, job['education'])        # education level
            n += 1

excelTabel.save('lagou.xls')  # save the workbook (the save call was cut off in the source; the filename is an assumption)

In fact, I don't know how to insert the screenshots here.

However, you can copy the code above to crawl the data and then study it at your own pace (adjust the headers to match your own browser and account).

Python3:

Input and Output

str(): returns a human-readable representation of the value.

str.format(): fills the {} placeholders in an output statement, concatenating the values with the rest of the string.

repr(): returns a representation that the interpreter can read back.

The repr() function can escape special characters in a string.

The argument to repr() can be any Python object.
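
A short example showing the difference between str(), repr(), and str.format() (a minimal sketch you can paste into a Python 3 shell):

# str() is for humans; repr() is for the interpreter.
s = 'hello\nworld'
print(str(s))    # prints the string with an actual line break
print(repr(s))   # prints 'hello\nworld' with the escape left visible

# str.format() fills the {} placeholders.
print('{} costs {} yuan'.format('apple', 5))  # apple costs 5 yuan

# The argument to repr() can be any Python object.
print(repr([1, 'two', 3.0]))  # [1, 'two', 3.0]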

Read and Write files

open(filename, mode) returns a file object.

filename: a string containing the name of the file you want to access.

mode: determines how the file is opened; the default mode is read-only ('r').

f = open('C:\\foo.txt', 'r')  # the mode was garbled in the source; 'r' matches the read below

str = f.read()

print(str)

f.close(): closes the open file.

f.readline(): reads a single line from the file.

f.readlines(): returns a list of all the lines in the file.

f.write('aaa'): writes 'aaa' to the file and returns the number of characters written.

f.tell(): returns the file object's current position.

f.seek(): changes the current position within the file.
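
A small end-to-end sketch tying these methods together (it writes and reads back a file named demo.txt; the filename is an assumption):

# Write a few lines, then read them back with the methods above.
f = open('demo.txt', 'w')
print(f.write('aaa\nbbb\nccc\n'))  # prints 12, the number of characters written
f.close()

f = open('demo.txt', 'r')
print(f.readline())   # reads a single line: 'aaa\n'
print(f.tell())       # the current position after that first line
f.seek(0)             # move back to the start of the file
print(f.readlines())  # all lines as a list: ['aaa\n', 'bbb\n', 'ccc\n']
f.close()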

        
