Environment: Python 3, PyCharm
Modules: requests, xlwt, urllib.request, re
The usual three steps:
1. Get the source code
2. Match the source code to get the target data
3. Save to a file
Straight to the code. It lists two ways to get the source code and three ways to store the data in a file. You can choose freely.
The first red part (marked 1 in the code) is the site URL inside the quotation marks; it is too long to post here. To find it: search Baidu for the HRM official website, search for python on the site, click on page 2, and paste the address from the address bar between the single quotation marks. Then find the 2.html part of the address and replace the 2 with {}.
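As a small illustration of the {} replacement, the sketch below fills a page number into a URL template with str.format. The address is a made-up placeholder on example.com, not the real site URL, and build_url is a hypothetical helper name.

```python
# Hypothetical placeholder address; the real one is copied from the browser's address bar,
# with the page number before .html replaced by {}.
URL_TEMPLATE = 'https://example.com/jobs/python/{}.html'


def build_url(page):
    """Return the listing URL for the given page number."""
    return URL_TEMPLATE.format(page)


print(build_url(1))
print(build_url(2))
```

Each call produces the same address with only the page number changed, which is what lets the main loop walk through the result pages.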
The second red part (marked 2 in the code) sets which pages to fetch data from; fill it in according to your needs.
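One detail worth keeping in mind when setting the pages: range stops before its second argument, so range(1, 3) covers pages 1 and 2 only. A minimal sketch, where page_numbers is a hypothetical helper that takes the first and last page inclusively:

```python
def page_numbers(first, last):
    """Return the page numbers from first to last, both inclusive."""
    # range() stops before its second argument, hence the + 1.
    return list(range(first, last + 1))


print(list(range(1, 3)))   # the pages visited by range(1, 3)
print(page_numbers(1, 5))  # pages 1 through 5
```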
# import requests
import csv
import re  # for regular-expression matching
# import xlwt  # needed for Excel tables
import urllib.request


# 1. Get the HTML source with the requests module
# def get_content(page):
#     url = ''.format(page)
#     html = requests.get(url).content.decode('gbk')
#     return html


# 1. Get the HTML source with the urllib module
def get_content(page):
    url = ''.format(page)  # ------------1 (paste the site URL between the quotes)
    html = urllib.request.urlopen(url).read().decode('gbk')
    return html


# 2. Get position, company name, company address, salary, date
def get_data(html):
    reg = re.compile(r'class="t1".*?<a target="_blank" title="(.*?)".*?<span class="t2"><a target="_blank" '
                     r'title="(.*?)".*?<span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*?'
                     r'<span class="t5">(.*?)</span>', re.S)
    items = re.findall(reg, html)
    return items


# 3. Store in a .csv file
def save_file_csv(items):
    # newline='' keeps the csv module from writing blank rows on Windows
    with open('Job.csv', 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(('Position name', 'Company Name', 'Company Address', 'Salary', 'Date'))
        for item in items:
            writer.writerow(item)


# 3. Store in an Excel table
# def save_file_excel(items):
#     newtable = 'Jobs.xls'
#     wb = xlwt.Workbook(encoding='utf-8')  # create the Excel file
#     ws = wb.add_sheet('job')  # create the sheet
#     headdata = ['Job name', 'Company name', 'Company address', 'salary', 'date']
#     index = 1
#     for colnum in range(5):
#         ws.write(0, colnum, headdata[colnum], xlwt.easyxf('font: bold on'))
#     for item in items:
#         for j in range(len(item)):
#             ws.write(index, j, item[j])
#         index += 1
#     wb.save(newtable)


# 3. Store in a TXT file
# def save_file_txt(items):
#     with open('Job.txt', 'w') as f:
#         for item in items:
#             for j in range(len(item)):
#                 f.write(item[j])
#                 f.write(' ')
#             f.write('\n')


if __name__ == '__main__':
    all_items = []
    for i in range(1, 3):  # ---------------2 (pages to fetch)
        html = get_content(i)
        all_items.extend(get_data(html))
    # Save once after the loop; opening the file with 'w' for every page
    # would overwrite the rows from the previous page.
    save_file_csv(all_items)
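To see what the regular expression in get_data captures, it can be run against a small hand-written fragment without touching the network. The HTML below is invented for illustration and only mimics the t1 to t5 class layout that the pattern expects; the real page markup may differ. JOB_PATTERN and extract_jobs are hypothetical names.

```python
import re

# Invented sample markup that imitates the t1..t5 layout; not copied from the real site.
SAMPLE_HTML = '''
<div class="el">
  <p class="t1"><a target="_blank" title="Python Developer" href="#">Python Developer</a></p>
  <span class="t2"><a target="_blank" title="Example Co." href="#">Example Co.</a></span>
  <span class="t3">Shanghai</span>
  <span class="t4">10-15k/month</span>
  <span class="t5">05-20</span>
</div>
<div class="el">
  <p class="t1"><a target="_blank" title="Data Engineer" href="#">Data Engineer</a></p>
  <span class="t2"><a target="_blank" title="Sample Ltd." href="#">Sample Ltd.</a></span>
  <span class="t3">Beijing</span>
  <span class="t4">15-20k/month</span>
  <span class="t5">05-21</span>
</div>
'''

# re.S lets '.' match newlines, so one match can span several lines of HTML.
# Each (.*?) is a non-greedy group that stops at the first closing delimiter.
JOB_PATTERN = re.compile(
    r'class="t1".*?<a target="_blank" title="(.*?)".*?'
    r'<span class="t2"><a target="_blank" title="(.*?)".*?'
    r'<span class="t3">(.*?)</span>.*?'
    r'<span class="t4">(.*?)</span>.*?'
    r'<span class="t5">(.*?)</span>',
    re.S,
)


def extract_jobs(html):
    """Return a list of (position, company, address, salary, date) tuples."""
    return JOB_PATTERN.findall(html)


for job in extract_jobs(SAMPLE_HTML):
    print(job)
```

Because the pattern has five groups, findall returns one five-element tuple per job, which is the row shape that csv.writer's writerow accepts directly.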
Part 4: crawling Python-related jobs on HRM