Create a crawler in Python and save the captured results to Excel

Source: Internet
Author: User

I have been learning Python for a while and have picked up the basics. Today I am going to put it into practice: write a small crawler in Python to scrape recruitment information from Lagou (www.lagou.com).

Step 1: analyze the website request process

When we browse recruitment information on Lagou and search for jobs such as Python or PHP, we are actually sending a request to the server; the server responds dynamically, and the browser parses the content we need and presents it to us.

We can see that the kd parameter in the request's FormData is how the search keyword (here, Python) is sent to the server.

For analyzing complex request and response traffic, I recommend Fiddler, which is a real killer tool for dissecting websites. For simpler cases, though, the developer tools built into the browser are enough, such as Firebug for Firefox: just press F12 and all the request information is displayed in front of you.
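For reference, the form data shown in the developer tools for a Lagou search looks roughly like this (a sketch with illustrative values; the parameter names match the ones submitted by the code later):

# Form data observed for the positionAjax.json POST request (illustrative values)
form_data = {
    'first': 'true',   # whether this is the first page of results
    'pn': 1,           # page number
    'kd': 'Python',    # search keyword
}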

By analyzing Lagou's request and response process, we can see that its recruitment information is transmitted dynamically via XHR.

We find two requests sent as POST, companyAjax.json and positionAjax.json, which control the currently displayed page and the recruitment information contained in that page, respectively.

We can see that the information we need is contained in positionAjax.json under content -> result, which also carries some other parameters, including the total number of pages (totalPageCount), the total number of postings (totalCount), and other related information.
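As a minimal sketch, assuming the response really nests these keys under content as described above, reaching the fields with the json module looks like this:

import json

# 'page' stands for the raw text of the positionAjax.json response
data = json.loads(page)
postings = data['content']['result']             # the list of job postings on this page
total_pages = data['content']['totalPageCount']  # total number of result pages
total_count = data['content']['totalCount']      # total number of postings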

Step 2: send a request to obtain the page

Knowing where the information we want lives is the most important part. Once we know where it is, the next question is how to simulate the browser with Python and fetch it.

from urllib import request, parse

def read_page(url, page_num, keyword):
    # Imitate a browser POST request and read the returned page
    page_headers = {
        'Host': 'www.lagou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Connection': 'Keep-Alive'
    }
    if page_num == 1:
        boo = 'true'
    else:
        boo = 'false'
    page_data = parse.urlencode([   # page analysis shows the browser submits FormData with these parameters
        ('first', boo),
        ('pn', page_num),
        ('kd', keyword)
    ])
    req = request.Request(url, headers=page_headers)
    page = request.urlopen(req, data=page_data.encode('utf-8')).read()
    page = page.decode('utf-8')
    return page

The key step is to package our own request in the form of a browser POST request.

The Request includes the url of the page to be crawled and the headers used for disguise; the data parameter of urlopen carries the FormData parameters first, pn, and kd.

Once packaged, we can access Lagou like a browser and obtain the page data.
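For example, once read_page is defined (and url points to the positionAjax.json endpoint shown in the full source below), fetching the first page of Python postings might look like this:

# A quick check of read_page (the keyword is just an example)
page = read_page(url, 1, 'Python')   # POST the form data for page 1, keyword 'Python'
print(page[:200])                    # peek at the start of the returned json text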

Step 3: obtain required data

After obtaining the page, we can start the most important step of the whole crawl: extracting the data we need.

There are many ways to extract data in Python 3, such as regular expressions (re), lxml's etree, the json module, and bs4's BeautifulSoup. You can use one or several of them depending on your situation.

import json

def read_tag(page, tag):
    page_json = json.loads(page)
    page_json = page_json['content']['result']   # analysis of the json shows the postings sit inside result, alongside many other parameters
    page_result = [num for num in range(15)]     # placeholder list of length 15, used to build the two-dimensional array below
    for i in range(15):
        page_result[i] = []                      # build the two-dimensional array
        for page_tag in tag:
            page_result[i].append(page_json[i].get(page_tag))   # collect the wanted fields of one posting into the same list
        page_result[i][8] = ','.join(page_result[i][8])         # the field at index 8 is itself a list, so join it into one string
    return page_result                           # return the postings on the current page
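Putting steps 2 and 3 together, a typical call might look like this (tag is the list of json field names defined in the full source below):

# Fetch one page and extract the wanted fields from it
page = read_page(url, 1, 'Python')   # raw json text from step 2
page_result = read_tag(page, tag)    # one list of fields per posting
print(page_result[0])                # the first posting on the page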

Step 4: store the captured information in Excel

After obtaining the raw data, we want to structure and organize it and store it in Excel, so it can be sorted and analyzed further and is easier to visualize.

Here I tried two different libraries: the older xlwt (Workbook) and xlsxwriter.

from xlwt import Workbook

def save_excel(fin_result, tag_name, file_name):
    book = Workbook(encoding='utf-8')
    tmp = book.add_sheet('sheet')
    times = len(fin_result) + 1
    for i in range(times):                       # i is the row index; row 0 holds the header
        if i == 0:
            for tag_name_i in tag_name:
                tmp.write(i, tag_name.index(tag_name_i), tag_name_i)
        else:
            for tag_list in range(len(tag_name)):
                tmp.write(i, tag_list, str(fin_result[i - 1][tag_list]))
    book.save(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)

The first is xlwt. For some reason, once xlwt stores more than about 100 records, the data is not saved completely, and opening the Excel file shows a message along the lines of "some content is wrong and needs to be repaired". I checked many times and at first thought the problem was incomplete data from the crawl, but after stepping through with breakpoints I found the data was complete. Later I processed the same data locally and there was no problem at all. My mood at that point was hard to describe.

I still haven't figured it out; if anyone knows why, please let me know.

import xlsxwriter

def save_excel(fin_result, tag_name, file_name):
    # Store the collected recruitment information in Excel
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)   # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_result)
    for i in range(1, row_num):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)     # write the header row
        else:
            con_pos = 'A%s' % i
            content = fin_result[i - 1]          # -1 because the header occupies the first row
            tmp.write_row(con_pos, content)
    book.close()

This is the version that stores the data with xlsxwriter, and it works normally.
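A call then looks like this (fin_result and tag_name are the variables defined in the full source below; the file name is just an example):

# Write the collected rows to an .xls file on the desktop
save_excel(fin_result, tag_name, 'python_jobs')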

With that, a small crawler that captures Lagou's recruitment information is born.

The full source code is attached below:

# -*- coding: utf-8 -*-
from urllib import request, parse
from bs4 import BeautifulSoup as BS   # imported in the original source, though not used below
import json
import datetime
import xlsxwriter

starttime = datetime.datetime.now()

url = r'http://www.lagou.com/jobs/positionAjax.json?city=%E5%8C%97%E4%BA%AC'
# Lagou's recruitment information is loaded dynamically, so the json has to be requested via POST; the default city is Beijing

tag = ['companyName', 'companyShortName', 'positionName', 'education', 'salary',
       'financeStage', 'companySize', 'industryField', 'companyLabelList']
# the json fields to crawl: company name, education requirement, salary and so on
tag_name = ['Company name', 'Company abbreviation', 'Position name', 'Degree required', 'Salary',
            'Company qualification', 'Company size', 'Category', 'Company introduction']


def read_page(url, page_num, keyword):
    # Imitate a browser POST request and read the returned page
    page_headers = {
        'Host': 'www.lagou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Connection': 'Keep-Alive'
    }
    if page_num == 1:
        boo = 'true'
    else:
        boo = 'false'
    page_data = parse.urlencode([   # page analysis shows the browser submits FormData with these parameters
        ('first', boo),
        ('pn', page_num),
        ('kd', keyword)
    ])
    req = request.Request(url, headers=page_headers)
    page = request.urlopen(req, data=page_data.encode('utf-8')).read()
    page = page.decode('utf-8')
    return page


def read_tag(page, tag):
    page_json = json.loads(page)
    page_json = page_json['content']['result']   # the postings sit inside result, alongside many other parameters
    page_result = [num for num in range(15)]     # placeholder list of length 15, used to build the two-dimensional array below
    for i in range(15):
        page_result[i] = []                      # build the two-dimensional array
        for page_tag in tag:
            page_result[i].append(page_json[i].get(page_tag))   # collect the wanted fields of one posting
        page_result[i][8] = ','.join(page_result[i][8])         # the field at index 8 is itself a list, join it into one string
    return page_result                           # return the postings on the current page


def read_max_page(page):
    # Get the maximum number of result pages for the current keyword; anything beyond 30 pages is masked,
    # so at most 30 pages of postings can be crawled
    page_json = json.loads(page)
    max_page_num = page_json['content']['totalPageCount']
    if max_page_num > 30:
        max_page_num = 30
    return max_page_num


def save_excel(fin_result, tag_name, file_name):
    # Store the collected recruitment information in Excel
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)   # saved to the desktop by default
    tmp = book.add_worksheet()
    row_num = len(fin_result)
    for i in range(1, row_num):
        if i == 1:
            tag_pos = 'A%s' % i
            tmp.write_row(tag_pos, tag_name)
        else:
            con_pos = 'A%s' % i
            content = fin_result[i - 1]          # -1 because the header occupies the first row
            tmp.write_row(con_pos, content)
    book.close()


if __name__ == '__main__':
    print('********************************** About to start crawling **********************************')
    keyword = input('Enter the language type you want to search for: ')
    fin_result = []   # combine the recruitment information from every page into one final list
    max_page_num = read_max_page(read_page(url, 1, keyword))
    for page_num in range(1, max_page_num):
        print('********************************** Downloading page %s **********************************' % page_num)
        page = read_page(url, page_num, keyword)
        page_result = read_tag(page, tag)
        fin_result.extend(page_result)
    file_name = input('Crawling finished, enter a file name to save as: ')
    save_excel(fin_result, tag_name, file_name)
    endtime = datetime.datetime.now()
    time = (endtime - starttime).seconds
    print('Total time: %s' % time)

There are many more features that could be added. For example, you can modify the city parameter to view recruitment information for different cities, as sketched below; feel free to extend it yourself. This is only meant as a starting point for discussion.
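For instance, switching the query to another city only requires url-encoding the city name into the city parameter. The url used above encodes the Chinese name for Beijing, so presumably other cities work the same way (a sketch; the city here is just an example):

from urllib import parse

city = parse.quote('上海')   # url-encode the city name (Shanghai here, as an example)
url = 'http://www.lagou.com/jobs/positionAjax.json?city=%s' % city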
