Capturing job-listing information from a recruitment website using Python
This is the information captured from the Zhaopin recruitment website after searching for "data analyst".
Python version: Python 3.5.
The main packages used are BeautifulSoup + Requests + csv.
In addition, I also captured the brief description attached to each job posting.
After the results are written to a csv file, the file appears garbled when opened in Excel, although it opens fine in a text editor (such as Notepad++).
To make it display correctly when opened in Excel, I used pandas to convert the file and add the column names; after the conversion it displays correctly. For the pandas conversion, refer to my blog:
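A likely cause, as far as I can tell, is that Excel only auto-detects UTF-8 in a csv file when the file begins with a byte order mark (BOM). As an aside, and not part of the original script, writing with Python's 'utf-8-sig' codec would add that mark so Excel can open the file directly:

import csv

# Variant of the write step using 'utf-8-sig', which prepends a byte order
# mark so that Excel recognizes the file as UTF-8 (an alternative to the
# pandas round trip described above; my suggestion, not the author's method).
def write_data(data, name):
    with open(name, 'a', newline='', encoding='utf-8-sig') as f:
        csv.writer(f).writerows(data)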
Because the job descriptions are long, I saved the csv file as an Excel file and adjusted the formatting for easier viewing (a programmatic sketch of this step follows the pandas code below).
The final effect is as follows:
The implementation code is as follows:
# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __Author: "LEMON"

from bs4 import BeautifulSoup
import requests
import csv


def download(url):
    """Fetch a search-result page and return its HTML."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    req = requests.get(url, headers=headers)
    return req.text


def get_content(html):
    """Parse one result page and return a list of rows, one per job posting."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    data_main = body.find('div', {'class': 'newlist_list_content'})
    tables = data_main.find_all('table')

    zw_list = []
    for i, table in enumerate(tables):
        if i == 0:  # skip the first table (the list header)
            continue
        temp = []
        tds = table.find('tr').find_all('td')
        zwmc = tds[0].find('a').get_text()       # position name
        zw_link = tds[0].find('a').get('href')   # link to the posting
        fkl = tds[1].find('span').get_text()     # feedback rate
        gsmc = tds[2].find('a').get_text()       # company name
        zwyx = tds[3].get_text()                 # monthly salary
        gzdd = tds[4].get_text()                 # workplace
        gbsj = tds[5].find('span').get_text()    # release date

        tr_brief = table.find('tr', {'class': 'newlist_tr_detail'})
        brief = tr_brief.find('li', {'class': 'newlist_deatil_last'}).get_text()

        temp.append(zwmc)
        temp.append(fkl)
        temp.append(gsmc)
        temp.append(zwyx)
        temp.append(gzdd)
        temp.append(gbsj)
        temp.append(brief)
        temp.append(zw_link)

        zw_list.append(temp)
    return zw_list


def write_data(data, name):
    """Append the rows for one page to the csv file."""
    with open(name, 'a', newline='', encoding='utf-8') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)


if __name__ == '__main__':

    basic_url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%85%A8%E5%9B%BD&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88&sm=0&p='
    filename = 'zhilian_DA.csv'

    number_list = list(range(90))  # total number of pages is 90
    for number in number_list:
        num = number + 1
        url = basic_url + str(num)
        html = download(url)
        # print(html)
        data = get_content(html)
        # print(data)
        print('start saving page:', num)
        write_data(data, filename)
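One caveat: download() assumes every request succeeds, so a failed or blocked response would make get_content() fail on unexpected HTML. A small hedged extension (my addition, not in the original) that checks the status code, sets a timeout, and backs off between retries:

import time
import requests

def download(url, retries=3):
    """Fetch a page, retrying on failure rather than assuming success."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    for _ in range(retries):
        try:
            req = requests.get(url, headers=headers, timeout=10)
            req.raise_for_status()  # raise on 4xx/5xx responses
            return req.text
        except requests.RequestException:
            time.sleep(2)  # back off briefly before retrying
    return ''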
The code for the pandas conversion is as follows:
# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __Author: "LEMON"

import pandas as pd

df = pd.read_csv('zhilian_DA.csv', header=None)

df.columns = ['position name', 'feedback rate', 'company name', 'monthly salary',
              'workplace', 'release date', 'recruitment introduction', 'webpage link']

# output the adjusted dataframe to a new csv file
df.to_csv('zhilian_DA_update.csv', index=False)
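For the save-as-Excel step mentioned earlier, a minimal sketch of doing it in code rather than by hand (my addition, not part of the original post; it assumes the openpyxl package is installed for .xlsx output):

import pandas as pd

# Read the converted csv and export it as an Excel workbook.
df = pd.read_csv('zhilian_DA_update.csv')
df.to_excel('zhilian_DA.xlsx', index=False)

Column widths and wrapping for the long description column would still need adjusting in Excel itself, as described above.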