Create a crawler using python and save the crawled results to excel.

This article records how to use Python to write a crawler that scrapes recruitment information from Lagou.com and saves the results to an Excel file. The complete source code is included at the end, so you can refer to it while learning Python. By now you have probably picked up a bit of the theory; today we put it into practice by writing a small crawler that pulls Lagou's salary survey data.

Step 1: analyze the website request process

When we browse recruitment information on Lagou and search for positions such as Python or PHP, we are actually sending a request to the server. The server responds dynamically, and the browser parses the returned content and presents it to us.

Looking at the request, the kd parameter in FormData tells the server that we are asking for recruitment information matching the keyword Python.
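
As a concrete illustration, the body of that POST request can be reproduced with urllib.parse (a minimal sketch; the field names first, pn, and kd are taken from the FormData shown in the developer tools):

from urllib import parse

# FormData observed in the developer tools: first marks whether this is the first page,
# pn is the page number, and kd is the search keyword
form_data = parse.urlencode([('first', 'true'), ('pn', 1), ('kd', 'Python')])
print(form_data)  # first=true&pn=1&kd=Python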

For analyzing complex request and response traffic, Fiddler is recommended; it is a heavyweight tool for dissecting websites. For simpler cases, the developer tools built into the browser are enough, for example Firebug in Firefox: just press F12 and all the request information is displayed in front of you.

By analyzing the site's request and response flow, we can see that Lagou's recruitment information is loaded dynamically through XHR.

We find two requests sent via POST: companyAjax.json and positionAjax.json, which respectively control the company information and the recruitment (position) information shown on the current page.

The information we need is contained in positionAjax.json under content -> result, which also carries other fields such as the total number of pages (totalPageCount) and the total number of postings (totalCount).
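
To make that structure concrete, here is a minimal sketch of how those parts of the positionAjax.json response could be read once the raw text is in hand (the keys content, result, totalPageCount, and totalCount come from the analysis above; everything else about the shape is assumed):

import json

def inspect_response(page_text):
    # page_text is the raw JSON string returned by positionAjax.json
    data = json.loads(page_text)
    content = data['content']
    print('total pages:', content['totalPageCount'])  # total number of result pages
    print('total postings:', content['totalCount'])   # total number of recruitment postings
    return content['result']                          # the postings shown on the current page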

Step 2: send a request to obtain the page

Knowing where the information we want to capture lives is the most important part. Once we know its location, we can think about how to simulate the browser in Python to obtain it.

def read_page(url, page_num, keyword):  # Imitate a browser POST request and read the returned page
    page_headers = {'Host': 'www.lagou.com',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
                    'Connection': 'Keep-Alive'}
    if page_num == 1:
        boo = 'true'
    else:
        boo = 'false'
    page_data = parse.urlencode([  # Page analysis shows the FormData submitted by the browser contains these parameters
        ('first', boo),
        ('pn', page_num),
        ('kd', keyword)
    ])
    req = request.Request(url, headers=page_headers)
    page = request.urlopen(req, data=page_data.encode('utf-8')).read()
    page = page.decode('utf-8')
    return page

The key step is to package our own request in the form of a browser POST request.

The request includes the URL of the page to be crawled and the headers used for disguise; the data parameter of urlopen carries the FormData fields first, pn, and kd.

Once packaged this way, we can access Lagou and obtain page data just like a browser.
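
Under these assumptions, calling the function looks like this (a quick sketch; url is the positionAjax.json address defined in the full source code at the end):

# Fetch the first page of results for the keyword "Python"
page = read_page(url, 1, 'Python')
print(page[:200])  # Print the first part of the returned JSON to confirm the request worked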

Step 3: obtain required data

After obtaining the page, we can start the most important step in crawling: extracting the data we need.

There are many ways to extract data in Python 3, such as the re module (regular expressions), lxml's etree, the json module, and bs4's BeautifulSoup. You can use one or more of them depending on the situation.

def read_tag(page, tag):
    page_json = json.loads(page)
    page_json = page_json['content']['result']  # Analysis shows the postings sit in the returned result, alongside many other fields
    page_result = [num for num in range(15)]  # Placeholder list of length 15, used to build the two-dimensional array below
    for i in range(15):
        page_result[i] = []  # Build the two-dimensional array
        for page_tag in tag:
            page_result[i].append(page_json[i].get(page_tag))  # Collect each requested field into the same row
        page_result[i][8] = ','.join(page_result[i][8])  # The ninth field (companyLabelList) is itself a list; join it into one string
    return page_result  # Return the recruitment information on the current page
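
Putting the two steps together, a quick check could look like this (a sketch; url and tag follow the definitions in the full source code at the end):

# Download page 1 for the keyword "Python" and extract the chosen fields
page = read_page(url, 1, 'Python')
page_result = read_tag(page, tag)
for row in page_result[:3]:  # Show the first three postings
    print(row)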

Step 4: store the captured information in excel

After obtaining the raw data, we structure and organize it and store it in Excel, which makes further sorting, analysis, and visualization easier.

Here I tried two different libraries: the older xlwt (Workbook) and xlsxwriter.

from xlwt import Workbook

def save_excel(fin_result, tag_name, file_name):
    book = Workbook(encoding='utf-8')
    tmp = book.add_sheet('sheet')
    times = len(fin_result) + 1
    for i in range(times):  # i indexes the row; row 0 holds the header, the following rows hold the data
        if i == 0:
            for tag_name_i in tag_name:
                tmp.write(i, tag_name.index(tag_name_i), tag_name_i)
        else:
            for tag_list in range(len(tag_name)):
                tmp.write(i, tag_list, str(fin_result[i - 1][tag_list]))
    book.save(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)

The first is xlwt. For some reason, once xlwt stores more than about 100 records, the data is not written completely, and Excel also complains that the file contains errors that need to be repaired. I checked it many times and initially thought the capture itself was incomplete, but a breakpoint check showed the data was complete. Later I processed the same data locally and there was no problem. My mood at the time can be imagined.

I still haven't figured out why; if anyone knows the cause, please let me know.

def save_excel(fin_result, tag_name, file_name):  # Store the collected recruitment information in Excel
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # Saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_result)
    for i in range(1, row_num + 2):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)  # The first row holds the header
        else:
            con_pos = 'A%s' % i
            content = fin_result[i - 2]  # -2 because the header occupies the first row and Excel rows are 1-based
            tmp.write_row(con_pos, content)
    book.close()

This is the version that stores the data with xlsxwriter, and it works normally.
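
A call to this xlsxwriter version could then look like the following sketch, mirroring the main loop of the full source code (the file name python_salary is only an example; the file lands on the desktop path hard-coded in save_excel):

fin_result = []
for page_num in range(1, 4):  # Crawl the first three pages as a quick test
    page = read_page(url, page_num, 'Python')
    fin_result.extend(read_tag(page, tag))
save_excel(fin_result, tag_name, 'python_salary')  # Writes python_salary.xls to the desktop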

With that, a small crawler that captures Lagou's recruitment information is born.

Attached: the complete source code

# -*- coding: utf-8 -*-
from urllib import request, parse
from bs4 import BeautifulSoup as BS
import json
import datetime
import xlsxwriter

starttime = datetime.datetime.now()

url = r'http://www.lagou.com/jobs/positionAjax.json?city=%E5%8C%97%E4%BA%AC'
# Lagou's recruitment information is loaded dynamically, so the JSON has to be requested via POST; the default city is Beijing

tag = ['companyName', 'companyShortName', 'positionName', 'education', 'salary',
       'financeStage', 'companySize', 'industryField', 'companyLabelList']
# The fields to be crawled, including the company name, educational requirements, salary, and so on
tag_name = ['Company name', 'Company abbreviation', 'Position name', 'Required degree', 'Salary',
            'Company qualification', 'Company scale', 'Category', 'Company introduction']


def read_page(url, page_num, keyword):  # Imitate a browser POST request and read the returned page
    page_headers = {'Host': 'www.lagou.com',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
                    'Connection': 'Keep-Alive'}
    if page_num == 1:
        boo = 'true'
    else:
        boo = 'false'
    page_data = parse.urlencode([  # Page analysis shows the FormData submitted by the browser contains these parameters
        ('first', boo),
        ('pn', page_num),
        ('kd', keyword)
    ])
    req = request.Request(url, headers=page_headers)
    page = request.urlopen(req, data=page_data.encode('utf-8')).read()
    page = page.decode('utf-8')
    return page


def read_tag(page, tag):
    page_json = json.loads(page)
    page_json = page_json['content']['result']  # The postings sit in the returned result, alongside many other fields
    page_result = [num for num in range(15)]  # Placeholder list of length 15, used to build the two-dimensional array below
    for i in range(15):
        page_result[i] = []  # Build the two-dimensional array
        for page_tag in tag:
            page_result[i].append(page_json[i].get(page_tag))  # Collect each requested field into the same row
        page_result[i][8] = ','.join(page_result[i][8])  # companyLabelList is itself a list; join it into one string
    return page_result  # Return the recruitment information on the current page


def read_max_page(page):  # Get the maximum number of pages for the current keyword; pages beyond 30 are hidden, so at most 30 pages can be crawled
    page_json = json.loads(page)
    max_page_num = page_json['content']['totalPageCount']
    if max_page_num > 30:
        max_page_num = 30
    return max_page_num


def save_excel(fin_result, tag_name, file_name):  # Store the collected recruitment information in Excel
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # Saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_result)
    for i in range(1, row_num + 2):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)  # The first row holds the header
        else:
            con_pos = 'A%s' % i
            content = fin_result[i - 2]  # -2 because the header occupies the first row and Excel rows are 1-based
            tmp.write_row(con_pos, content)
    book.close()


if __name__ == '__main__':
    print('**********************************  About to start crawling  **********************************')
    keyword = input('Enter the language type you want to search: ')
    fin_result = []  # Combine the recruitment information from every page into one final list
    max_page_num = read_max_page(read_page(url, 1, keyword))
    for page_num in range(1, max_page_num):
        print('********************************  Downloading page %s  ********************************' % page_num)
        page = read_page(url, page_num, keyword)
        page_result = read_tag(page, tag)
        fin_result.extend(page_result)
    file_name = input('Capture completed; enter a file name to save: ')
    save_excel(fin_result, tag_name, file_name)
    endtime = datetime.datetime.now()
    time = (endtime - starttime).seconds
    print('Total time: %s s' % time)

There are many features that could still be added. For example, you can modify the city parameter to view recruitment information for different cities, and you can develop it further on your own. It is presented here only for discussion and learning.
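
For instance, since the city parameter in the URL is just a URL-encoded city name (the default %E5%8C%97%E4%BA%AC decodes to Beijing), a different city can be queried by rebuilding the query string (a minimal sketch; Shanghai is only an example, and the rest of the code stays unchanged):

from urllib import parse

city = '上海'  # Shanghai, as an example; any city supported by Lagou should work
url = 'http://www.lagou.com/jobs/positionAjax.json?' + parse.urlencode({'city': city})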
