Capture job-seeking website information using Python

Source: Internet
Author: User


This is the information captured from the Zhaopin recruitment website after searching for "data analyst".

 

Python version: 3.5.

The main packages I used are BeautifulSoup + Requests + csv.

In addition, I captured the brief description of each recruitment post.

 

After the output is written to a csv file, it turns out to be garbled when opened in Excel, but it opens fine in a text editor (such as Notepad++).
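The garbling is an encoding issue: Excel does not assume UTF-8 for a csv file unless it starts with a byte-order mark. An alternative fix (not the one used in this post) is to write the file with Python's 'utf-8-sig' encoding, which prepends that mark; the sample rows below are hypothetical stand-ins for the scraped data:

```python
import csv

# Hypothetical sample rows standing in for the scraped data.
rows = [['position name', 'company name'],
        ['data analyst', 'ACME Ltd']]

# 'utf-8-sig' prepends a BOM, so Excel opens the file as UTF-8.
with open('demo_bom.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)
```

With this encoding the pandas conversion step below is only needed for adding column names, not for fixing the display.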

To make it display correctly in Excel, I used pandas to convert the file and add column names; after the conversion it displays correctly. For the pandas conversion, refer to my blog:

Because the recruitment descriptions are long, I saved the csv file as an Excel file and adjusted the formatting for easier viewing.

 

The final effect is as follows:

 

The implementation code is as follows:

# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __Author: "LEMON"


from bs4 import BeautifulSoup
import requests
import csv


def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    req = requests.get(url, headers=headers)
    return req.text


def get_content(html):
    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    data_main = body.find('div', {'class': 'newlist_list_content'})
    tables = data_main.find_all('table')

    zw_list = []
    for i, table in enumerate(tables):
        if i == 0:  # skip the first table
            continue
        temp = []
        tds = table.find('tr').find_all('td')
        zwmc = tds[0].find('a').get_text()      # position name
        zw_link = tds[0].find('a').get('href')  # webpage link
        fkl = tds[1].find('span').get_text()    # feedback rate
        gsmc = tds[2].find('a').get_text()      # company name
        zwyx = tds[3].get_text()                # monthly salary
        gzdd = tds[4].get_text()                # workplace
        gbsj = tds[5].find('span').get_text()   # release date

        tr_brief = table.find('tr', {'class': 'newlist_tr_detail'})
        brief = tr_brief.find('li', {'class': 'newlist_deatil_last'}).get_text()

        temp.append(zwmc)
        temp.append(fkl)
        temp.append(gsmc)
        temp.append(zwyx)
        temp.append(gzdd)
        temp.append(gbsj)
        temp.append(brief)
        temp.append(zw_link)

        zw_list.append(temp)
    return zw_list


def write_data(data, name):
    filename = name
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)


if __name__ == '__main__':

    basic_url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%85%A8%E5%9B%BD&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88&sm=0&p='

    number_list = list(range(90))  # total number of pages is 90
    for number in number_list:
        num = number + 1
        url = basic_url + str(num)
        filename = 'zhilian_DA.csv'
        html = download(url)
        # print(html)
        data = get_content(html)
        # print(data)
        print('start saving page:', num)
        write_data(data, filename)
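The parsing above assumes every row contains the tags it looks for; if Zhaopin changes its markup, a call like tds[1].find('span') returns None and the subsequent .get_text() raises AttributeError. A small guard (my own sketch, not part of the original code) keeps one malformed row from aborting the whole page:

```python
def safe_text(node, default=''):
    # Return the tag's text if find() located it, otherwise a default,
    # so a missing <span> or <a> does not crash the scrape.
    return node.get_text().strip() if node is not None else default
```

Each extraction line can then be written as, for example, fkl = safe_text(tds[1].find('span')).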

 

The pandas conversion code is as follows:

# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __Author: "LEMON"

import pandas as pd

df = pd.read_csv('zhilian_DA.csv', header=None)

df.columns = ['position name', 'feedback rate', 'company name', 'monthly salary',
              'workplace', 'release date', 'recruitment introduction', 'webpage link']

# output the adjusted dataframe to the new csv file
df.to_csv('zhilian_DA_update.csv', index=False)
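The same read-then-rename pattern on a tiny in-memory example (column names shortened for illustration), assuming pandas is installed:

```python
import io

import pandas as pd

# A headerless csv, like the scraper's output file.
raw = io.StringIO('data analyst,95%,ACME Ltd\nBI analyst,80%,Foo Inc\n')

df = pd.read_csv(raw, header=None)
df.columns = ['position name', 'feedback rate', 'company name']

# The new file now carries a header row that Excel and pandas can use.
df.to_csv('demo_named.csv', index=False)
```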

 
