This article describes how to use Python to crawl job listings from a job site.
Specifically, it crawls the results returned by searching for "data analyst" on the Zhaopin website.
Python version: Python 3.5.
The main packages used are BeautifulSoup (bs4), requests, and the standard-library csv module, with lxml as the HTML parser.
In addition to the basic fields, I also grabbed a brief description of each job posting.
When the data is exported to a CSV file, it appears garbled when opened in Excel, although it opens fine in a text editor such as Notepad++.
To make the file display correctly when opened in Excel, I converted it with pandas and added the column names mentioned above; after the conversion, it displays correctly. For more on the conversion with pandas, you can refer to my blog.
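The underlying cause is that Excel does not assume UTF-8 for a CSV file unless a byte-order mark (BOM) is present. As an aside, a minimal sketch of an alternative fix, assuming the file names used later in this article (the output file name here is illustrative): pandas can rewrite the file with the 'utf-8-sig' codec, which prepends the BOM so Excel detects the encoding.

import pandas as pd

# Rewrite the crawler's CSV with a UTF-8 BOM so Excel opens it correctly.
# File names are assumptions based on the code shown later in this article.
df = pd.read_csv('zhilian_DA.csv', header=None, encoding='utf-8')
df.to_csv('zhilian_DA_excel.csv', index=False, header=False, encoding='utf-8-sig')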
Because the job descriptions are fairly long, I finally saved the CSV file as an Excel file and adjusted the formatting for easier viewing (a sketch of this step appears at the end of this article).
The final effect is as follows:
The code for crawling the information is as follows:
# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __author: 'LEMON'

from bs4 import BeautifulSoup
import requests
import csv


def download(url):
    """Fetch the page at url and return its HTML text."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) '
                             'Gecko/20100101 Firefox/51.0'}
    req = requests.get(url, headers=headers)
    return req.text


def get_content(html):
    """Parse one search-result page and return a list of job rows."""
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    data_main = body.find('div', {'class': 'newlist_list_content'})
    tables = data_main.find_all('table')

    zw_list = []
    for i, table in enumerate(tables):
        if i == 0:  # the first table is the header row, skip it
            continue
        temp = []
        tds = table.find('tr').find_all('td')
        zwmc = tds[0].find('a').get_text()      # job name
        zw_link = tds[0].find('a').get('href')  # link to the job posting
        fkl = tds[1].find('span').get_text()    # feedback rate
        gsmc = tds[2].find('a').get_text()      # company name
        zwyx = tds[3].get_text()                # monthly salary
        gzdd = tds[4].get_text()                # work place
        gbsj = tds[5].find('span').get_text()   # release date

        tr_brief = table.find('tr', {'class': 'newlist_tr_detail'})
        # brief description of the recruitment content
        brief = tr_brief.find('li', {'class': 'newlist_deatil_last'}).get_text()

        temp.append(zwmc)
        temp.append(fkl)
        temp.append(gsmc)
        temp.append(zwyx)
        temp.append(gzdd)
        temp.append(gbsj)
        temp.append(brief)
        temp.append(zw_link)
        zw_list.append(temp)
    return zw_list


def write_data(data, name):
    """Append the rows in data to the CSV file name."""
    filename = name
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)


if __name__ == '__main__':
    basic_url = ('http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%85%A8%E5%9B%BD'
                 '&kw=%e6%95%b0%e6%8d%ae%e5%88%86%e6%9e%90%e5%b8%88&sm=0&p=')

    number_list = list(range(90))  # total number of result pages
    for number in number_list:
        num = number + 1
        url = basic_url + str(num)
        filename = 'zhilian_DA.csv'

        html = download(url)
        # print(html)
        data = get_content(html)
        # print(data)
        print('start saving page:', num)
        write_data(data, filename)
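Before running the full loop, it can be worth sanity-checking the selectors against a single results page, since the class names above stop matching if Zhaopin changes its markup. A small, hypothetical snippet reusing download, get_content, and basic_url from the script above:

# Hypothetical sanity check: fetch only page 1 and inspect the first parsed row.
html = download(basic_url + '1')
rows = get_content(html)
print(rows[0])  # [job name, feedback rate, company name, salary, place, date, brief, link]

Note also that write_data opens the file in append mode ('a'), so re-running the script will append duplicate rows to zhilian_DA.csv; delete the old file before starting a fresh crawl.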
The code for conversion with pandas is as follows:
# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __author: 'LEMON'

import pandas as pd

df = pd.read_csv('zhilian_DA.csv', header=None)
df.columns = ['job name', 'feedback rate', 'company name', 'monthly salary',
              'work place', 'release date', 'recruitment profile', 'web link']

# output the adjusted DataFrame to a new CSV file
df.to_csv('zhilian_DA_update.csv', index=False)
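The final save-as-Excel step mentioned earlier is not shown in the original code. A hedged sketch of one way to do it with pandas and openpyxl; the output file name, sheet name, and column width are assumptions, and column G holds the recruitment profile given the column order above.

import pandas as pd
from openpyxl.styles import Alignment

# Sketch only: export the cleaned CSV to .xlsx and make the long profile column readable.
df = pd.read_csv('zhilian_DA_update.csv')
with pd.ExcelWriter('zhilian_DA.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='jobs')
    ws = writer.sheets['jobs']
    ws.column_dimensions['G'].width = 80            # 'recruitment profile' column
    for cell in ws['G']:
        cell.alignment = Alignment(wrap_text=True)  # wrap the long descriptions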