Summary of the first Python web crawler

Last Update:2014-12-15 Source: Internet

Author: User

Tags python web crawler

This program essentially imitates what a user does when browsing a website.
It gets the top-level categories from the home page, then walks through the subcategories under each one to reach the product lists. It then visits each product page and extracts the useful information from it.
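The traversal described above can be sketched roughly as follows. This is a minimal sketch, not the author's code: the function name crawl and all five callables passed to it are hypothetical placeholders, and each parse_* function is assumed to turn a fetched page into a list of URLs (or, for parse_product, into a dict of product details).

```python
def crawl(home_url, fetch, parse_categories, parse_subcategories,
          parse_product_links, parse_product):
    """Walk home page -> categories -> subcategories -> product pages.

    All five callables are hypothetical and supplied by the caller.
    fetch(url) is assumed to return the page, or None on failure.
    """
    products = []
    home = fetch(home_url)
    if home is None:
        return products
    for cat_url in parse_categories(home):
        cat_page = fetch(cat_url)
        if cat_page is None:
            continue
        for sub_url in parse_subcategories(cat_page):
            list_page = fetch(sub_url)
            if list_page is None:
                continue
            for prod_url in parse_product_links(list_page):
                prod_page = fetch(prod_url)
                # Keep the record even when the page could not be fetched;
                # detail stays None so a later step can skip it.
                detail = parse_product(prod_page) if prod_page is not None else None
                products.append({'url': prod_url, 'detail': detail})
    return products
```

Passing the fetch and parse steps in as arguments keeps the traversal itself independent of any particular site layout, which also makes it easy to try out against a fake in-memory site.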

Here I summarize some of the key points so that I can review them later.

One, how to access the Web page?

import urllib2

# Get the body of the web page at the given URL
def get_webpage(url):
    # Headers copied from a normal browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Accept': 'text/html',
        'Connection': 'keep-alive'}
    try:
        request = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(request, timeout=120)
        webpage = response.read()
        response.close()
        return webpage
    #except urllib2.HTTPError as e:
    #    print('HTTPError: ' + str(e.code))
    #except urllib2.URLError as e:
    #    print('URLError: ' + str(e.reason))
    except Exception as e:
        # On any failure, report it; the function then returns None
        print('Exception: ' + str(e))

The function above uses urllib2.urlopen() to fetch the content of the page at the given URL. You could also skip urllib2.Request() and pass the URL straight to urllib2.urlopen(). The Request object with custom headers is used here to imitate a normal browser visit.
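For readers on Python 3, where urllib2 no longer exists, a roughly equivalent sketch uses urllib.request. This is an adaptation, not the author's code: the names BROWSER_HEADERS and build_request are introduced here for illustration, and the header values are simply the ones from the function above.

```python
import urllib.request

# Browser-like headers, taken from the original get_webpage() function
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Accept': 'text/html',
    'Connection': 'keep-alive',
}

def build_request(url):
    """Return a Request carrying the browser-like headers (no network access yet)."""
    return urllib.request.Request(url, None, BROWSER_HEADERS)

def get_webpage(url, timeout=120):
    """Fetch the page body as bytes, or return None on any error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except Exception as e:  # broad catch, mirroring the original function
        print('Exception: ' + str(e))
        return None
```

Note that in Python 3 the result is a bytes object, so it has to be decoded (for example with webpage.decode('utf-8'), if the page is UTF-8) before it is treated as text.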

Two, data saving
Data is best saved in XLS format. Failing that, it can be saved as CSV text or as plain TXT text.
Ideally the format is detected automatically from the suffix of the file name that the user enters.

(1) First define the functions save_as_csv(), save_as_txt(), and save_as_xls() to save in the CSV, TXT, and XLS formats.

def save_as_csv(prod_list, filename):
    if len(prod_list) == 0:
        return False

    # Columns: category, product, price, contact, mobile, company, landline, fax, address, company URL, source page
    line_fmt = '"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\n'
    lines = []
    head_line = line_fmt % ('Category', 'Product', 'Price', 'Contact', 'Mobile', 'Company',
                            'Phone', 'Fax', 'Company address', 'Company URL', 'Source page')
    lines.append(head_line)
    for item in prod_list:
        info = item['detail']
        if info is None:    # skip products whose information is incomplete
            continue
        prod_line = line_fmt % (item['path'], info['name'], info['price'],
                                info['contact'], info['cell-phone'], info['company'],
                                info['tel-phone'], info['fax'], info['address'],
                                info['website'], item['url'])
        lines.append(prod_line)

    wfile = open(filename, 'w')
    wfile.writelines(lines)
    wfile.close()
    return True

def save_as_txt(prod_list, filename):
    if len(prod_list) == 0:
        return False

    # Same columns as save_as_csv(), but without quotes around the fields
    line_fmt = '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n'
    lines = []
    head_line = line_fmt % ('Category', 'Product', 'Price', 'Contact', 'Mobile', 'Company',
                            'Phone', 'Fax', 'Company address', 'Company URL', 'Source page')
    lines.append(head_line)
    for item in prod_list:
        info = item['detail']
        if info is None:    # skip products whose information is incomplete
            continue
        prod_line = line_fmt % (item['path'], info['name'], info['price'],
                                info['contact'], info['cell-phone'], info['company'],
                                info['tel-phone'], info['fax'], info['address'],
                                info['website'], item['url'])
        lines.append(prod_line)

    wfile = open(filename, 'w')
    wfile.writelines(lines)
    wfile.close()
    return True

# Save the data to an XLS file, with each top-level category on its own worksheet.
# xlwt is imported by the try/except block shown in (3).
def save_as_xls(prod_list, filename):
    if len(prod_list) == 0:
        return False

    workbook = xlwt.Workbook(encoding='utf-8')   # the encoding must be given, otherwise saving fails
    curr_category = ''
    worksheet = None
    row_index = 0
    for prod_item in prod_list:
        path = prod_item['path']
        this_category = path.split('/')[0]
        # If this product's category differs from the previous one, create a new worksheet.
        # This assumes prod_list is grouped by category.
        if this_category != curr_category:
            worksheet = workbook.add_sheet(this_category)
            curr_category = this_category
            # Fill in the header row
            header_cells = ('Category', 'Product', 'Price', 'Contact', 'Mobile', 'Company',
                            'Phone', 'Fax', 'Company address', 'Company URL', 'Source page')
            for column_index, cell in enumerate(header_cells):
                worksheet.write(0, column_index, cell)
            # In a new worksheet, data starts from the second row
            row_index = 1

        # Write the data into row row_index of the worksheet
        prod_info = prod_item['detail']
        # Skip products whose information is incomplete
        if prod_info is None:
            continue
        prod_cells = (path, prod_info['name'], prod_info['price'], prod_info['contact'],
                      prod_info['cell-phone'], prod_info['company'], prod_info['tel-phone'],
                      prod_info['fax'], prod_info['address'], prod_info['website'],
                      prod_item['url'])
        for column_index, cell in enumerate(prod_cells):
            worksheet.write(row_index, column_index, cell)
        row_index += 1

    workbook.save(filename)
    return True
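The text-format functions above build each line with a fixed format string, which would produce a malformed row if a field ever contained a tab, a double quote, or a newline. As an alternative, here is a minimal Python 3 sketch using the standard csv module, which handles the quoting itself. The names FIELDS and write_products are made up for this example, the dictionary keys are assumed to be the ones used in the functions above, and the header row simply reuses the key names.

```python
import csv

# Keys of the per-product detail dict, in column order (assumed from the code above)
FIELDS = ('name', 'price', 'contact', 'cell-phone', 'company',
          'tel-phone', 'fax', 'address', 'website')

def write_products(prod_list, stream):
    """Write products as tab-delimited, fully quoted rows; return the number written."""
    writer = csv.writer(stream, delimiter='\t', quoting=csv.QUOTE_ALL)
    writer.writerow(('category',) + FIELDS + ('source page',))
    count = 0
    for item in prod_list:
        info = item['detail']
        if info is None:  # skip products whose information is incomplete
            continue
        writer.writerow([item['path']] + [info[key] for key in FIELDS] + [item['url']])
        count += 1
    return count
```

When writing to a real file in Python 3, open it with open(filename, 'w', newline='') so that the csv module controls the line endings.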

(2) Define the DataSaver class to provide a single, unified save function. Its case_dict maps each file suffix to the corresponding save function.

import os

# Return the part after the last dot of the file name, or None if there is no dot
def get_filename_postfix(filename):
    basename = os.path.basename(filename)
    temp = basename.split('.')
    if len(temp) >= 2:
        return temp[-1]

class DataSaver:
    # Mapping from file suffix to save function
    case_dict = {'csv': save_as_csv,
                 'txt': save_as_txt}
    if xlwt_enabled:
        case_dict['xls'] = save_as_xls

    # Default, so that save_as() works even if feed() was never called
    product_list = None

    # "Feed" the product list data to the DataSaver
    def feed(self, data):
        self.product_list = data

    def save_as(self, filename):
        if self.product_list is None or len(self.product_list) == 0:
            print('Warning: no records, nothing to save')
            return

        print('Saving ...')
        while True:
            postfix = get_filename_postfix(filename)
            try:
                if self.case_dict[postfix](self.product_list, filename):
                    print('Saved to: ' + filename)
                else:
                    print('Save failed!')
                break
            except KeyError:
                # Unknown suffix: ask the user for another file name and retry
                print('Warning: %s file format not supported.' % (postfix))
                print('Supported file formats: ' + ', '.join(self.case_dict.keys()))
                try:
                    filename = raw_input('Please enter a new file name: ')
                except KeyboardInterrupt:
                    print('User cancelled the save')
                    break
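The suffix lookup at the heart of this class can also be tried out on its own, without the interactive retry loop. The following is a small Python 3 sketch under a few assumptions: get_suffix and pick_saver are hypothetical names, the suffix is lowercased (which the code above does not do), and os.path.splitext() replaces the manual split('.').

```python
import os

def get_suffix(filename):
    """Return the lowercase suffix without the dot, or '' if there is none."""
    return os.path.splitext(os.path.basename(filename))[1].lstrip('.').lower()

def pick_saver(filename, savers):
    """Look up the save function for filename; return None if the suffix is unsupported."""
    return savers.get(get_suffix(filename))
```

A caller would pass a dictionary such as {'txt': save_as_txt} as savers and fall back to prompting for a new name when pick_saver() returns None.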

(3) If xlwt is not installed, saving to XLS files cannot be supported.
The approach here is to add the XLS save function to case_dict only if import xlwt succeeds.
If the file format is not supported, the user is prompted to save under a different file name.

# If xlwt is not installed, saving as an XLS file is not available
xlwt_enabled = True
try:
    import xlwt
except ImportError:
    xlwt_enabled = False

See how the xls suffix is handled in the DataSaver class and its save_as() function.

Three, problems encountered and their solutions

(1) The Python program cannot contain Chinese characters.
Previously, whenever a Python program contained Chinese characters, no matter where in the file, it would not run. It turns out that the Python interpreter treats source files as ASCII by default, so Chinese characters naturally cause an error. The solution is to tell the interpreter explicitly which encoding the file uses.

#!/usr/bin/env python
# -*- coding=utf-8 -*-

That is all it takes.

(2) Installing xlwt3 failed.
I downloaded xlwt3 from the web to install it. Running python setup.py install failed, reporting that print() does not support the print("XXXX", file=f) form. I checked, and Python 2.6 does not have this feature. I then downloaded xlwt-0.7.5.tar.gz and installed that instead, which worked.
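For context, print("XXXX", file=f) is the function form of print that Python 3 uses. Python 2.6 and 2.7 can enable the same form per module with a __future__ import, as the comment in the sketch below notes. This only illustrates the syntax: it is not a claim that xlwt3 could be made to install on Python 2.6 this way, and log_line is a name made up for the example.

```python
# On Python 2.6 or 2.7, this module would need the following as its first statement:
#     from __future__ import print_function
# On Python 3 the function form of print is always available.

def log_line(message, stream):
    """Write one line to any file-like object using the function form of print."""
    print(message, file=stream)
```

For example, log_line('saved', wfile) writes the text followed by a newline to an already opened file object wfile.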

(3) Garbled text on Windows.
I have not solved this problem yet.
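The article does not say where the garbled text shows up, so the following is only a guess at one common cause: a UTF-8 text file opened by a Windows program that assumes the local code page. If that is the situation, writing the file with a UTF-8 byte order mark lets many Windows programs detect the encoding. A minimal Python 3 sketch, with save_text_with_bom as a hypothetical helper name:

```python
def save_text_with_bom(lines, filename):
    """Write lines as UTF-8 with a leading byte order mark (the 'utf-8-sig' codec)."""
    with open(filename, 'w', encoding='utf-8-sig') as wfile:
        wfile.writelines(lines)
```

If the garbling happens in the console output instead, this will not help, and the console's own encoding would be the thing to look at.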

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
