This program imitates the way a user browses a website.
It gets the top-level categories from the home page, then traverses the subcategories level by level until it reaches the product lists, and finally visits each product page to extract the useful information.
Here I summarize the key points so that I can review them later.
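The traversal described above can be sketched roughly as follows. This is only an illustration, not the original crawler: SITE and walk_categories are hypothetical names, the "site" is an in-memory structure instead of real pages, and the record keys 'path', 'url' and 'detail' are borrowed from the save functions shown further down. It is written for Python 3.

```python
# Hypothetical in-memory "site": nested dicts are categories,
# lists hold product page URLs. A real crawler would fetch and
# parse pages instead of reading this structure.
SITE = {
    'Tools': {
        'Hand tools': ['http://example.com/p/1', 'http://example.com/p/2'],
        'Power tools': ['http://example.com/p/3'],
    },
    'Paint': ['http://example.com/p/4'],
}


def walk_categories(node, path=()):
    """Yield one record per product, depth-first through the categories."""
    if isinstance(node, dict):
        # Category level: descend into each subcategory.
        for name, child in node.items():
            for record in walk_categories(child, path + (name,)):
                yield record
    else:
        # Product list level: emit a record for each product URL.
        # 'detail' stays None until the product page has been parsed.
        for url in node:
            yield {'path': '/'.join(path), 'url': url, 'detail': None}


prod_list = list(walk_categories(SITE))
for item in prod_list:
    print(item['path'] + ' -> ' + item['url'])
```

Joining the category names with '/' matches how save_as_xls() later recovers the top-level category with path.split('/')[0].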
One, how to access a web page
import urllib2

# Get the web page body for the given URL
def get_webpage(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Accept': 'text/html',
        'Connection': 'keep-alive'
    }
    try:
        request = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(request, timeout=120)
        webpage = response.read()
        response.close()
        return webpage
    #except urllib2.HTTPError as e:
    #    print('HTTPError: ' + str(e.code))
    #except urllib2.URLError as e:
    #    print('URLError: ' + str(e.reason))
    except Exception as e:
        # Any failure is reported and the function returns None
        print('Exception: ' + str(e))
The function above uses urllib2.urlopen() to fetch the content of the page at the given URL. You could also skip urllib2.Request() and pass the URL straight to urllib2.urlopen(). The Request object with browser-like headers is used here to mimic a normal browser visit.
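For readers on Python 3, where urllib2 no longer exists, the same idea might look like the sketch below. It uses urllib.request and urllib.error from the standard library. The helper build_request() is a name introduced here for illustration and is not part of the original program.

```python
import urllib.error
import urllib.request

# Browser-like headers, same idea as in get_webpage() above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Accept': 'text/html',
    'Connection': 'keep-alive',
}


def build_request(url):
    """Create a Request carrying the browser-like headers (hypothetical helper)."""
    return urllib.request.Request(url, None, HEADERS)


def get_webpage(url, timeout=120):
    """Return the page body as bytes, or None on an HTTP or URL error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTPError: ' + str(e.code))
    except urllib.error.URLError as e:
        # The server could not be reached at all.
        print('URLError: ' + str(e.reason))
    return None
```

Note that in Python 3 the body comes back as bytes, so it usually needs a call such as webpage.decode('utf-8') before any text processing. The right encoding depends on the site.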
Two, saving the data
The data is best saved in XLS format. If that is not possible, it can be saved as csw text or as plain txt text instead.
Ideally the format is recognized automatically from the extension of the file name the user enters.
(1) First define the functions save_as_csw(), save_as_txt() and save_as_xls() to save the data in csw, txt and xls format.
# Column titles shared by all three formats
HEADER_CELLS = ('Category', 'Product', 'Price', 'Contact', 'Mobile', 'Company',
                'Phone', 'Fax', 'Company address', 'Company website', 'Source page')

def save_as_csw(prod_list, filename):
    if len(prod_list) == 0:
        return False

    # Eleven quoted fields per line, separated by tabs
    line_fmt = '"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\n'
    lines = []
    lines.append(line_fmt % HEADER_CELLS)

    for item in prod_list:
        info = item['detail']
        if info is None:    # incomplete record, skip it
            continue
        prod_line = line_fmt % (item['path'], info['name'], info['price'],
                                info['contact'], info['cell-phone'], info['company'],
                                info['tel-phone'], info['fax'], info['address'],
                                info['website'], item['url'])
        lines.append(prod_line)

    wfile = open(filename, 'w')
    wfile.writelines(lines)
    wfile.close()
    return True

def save_as_txt(prod_list, filename):
    if len(prod_list) == 0:
        return False

    # Eleven unquoted fields per line, separated by tabs
    line_fmt = '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n'
    lines = []
    lines.append(line_fmt % HEADER_CELLS)

    for item in prod_list:
        info = item['detail']
        if info is None:    # incomplete record, skip it
            continue
        prod_line = line_fmt % (item['path'], info['name'], info['price'],
                                info['contact'], info['cell-phone'], info['company'],
                                info['tel-phone'], info['fax'], info['address'],
                                info['website'], item['url'])
        lines.append(prod_line)

    wfile = open(filename, 'w')
    wfile.writelines(lines)
    wfile.close()
    return True

# Save the data to an XLS file, with each top-level category on its own worksheet.
# This relies on prod_list being grouped by category.
def save_as_xls(prod_list, filename):
    if len(prod_list) == 0:
        return False

    workbook = xlwt.Workbook(encoding='utf-8')    # the encoding must be given, otherwise saving fails
    curr_category = ''
    worksheet = None
    row_index = 0
    for prod_item in prod_list:
        path = prod_item['path']
        this_category = path.split('/')[0]
        # If this product's category differs from the previous one, create a new worksheet
        if this_category != curr_category:
            worksheet = workbook.add_sheet(this_category)
            curr_category = this_category
            # Fill in the header row
            for column_index, cell in enumerate(HEADER_CELLS):
                worksheet.write(0, column_index, cell)
            # In a new worksheet the data starts from the second row
            row_index = 1

        # Write the data to row row_index of the worksheet
        prod_info = prod_item['detail']
        if prod_info is None:    # incomplete record, skip it
            continue
        prod_cells = (path, prod_info['name'], prod_info['price'], prod_info['contact'],
                      prod_info['cell-phone'], prod_info['company'], prod_info['tel-phone'],
                      prod_info['fax'], prod_info['address'], prod_info['website'],
                      prod_item['url'])
        for column_index, cell in enumerate(prod_cells):
            worksheet.write(row_index, column_index, cell)
        row_index += 1

    workbook.save(filename)
    return True
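One weakness of building quoted lines with a format string is that a field containing a double quote or a tab corrupts the row. As a possible alternative, and not what the original program does, the standard csv module can produce the same tab-separated, fully quoted layout and escape such characters itself. The sketch below is written for Python 3. write_rows(), FIELDS and the sample data are hypothetical, while the dictionary keys are the ones used by the save functions above.

```python
import csv
import io

# Keys read from item['detail'], in the column order used above.
FIELDS = ('name', 'price', 'contact', 'cell-phone', 'company',
          'tel-phone', 'fax', 'address', 'website')


def write_rows(prod_list, stream):
    """Write products as tab-separated, fully quoted rows; return the row count."""
    writer = csv.writer(stream, delimiter='\t', quoting=csv.QUOTE_ALL,
                        lineterminator='\n')
    count = 0
    for item in prod_list:
        info = item['detail']
        if info is None:
            # Incomplete record: skip it, as the original functions do.
            continue
        writer.writerow([item['path']] + [info[k] for k in FIELDS] + [item['url']])
        count += 1
    return count


# Made-up sample data: one complete record and one incomplete record.
sample = [
    {'path': 'Tools', 'url': 'http://example.com/p/1',
     'detail': dict.fromkeys(FIELDS, 'n/a')},
    {'path': 'Tools', 'url': 'http://example.com/p/2', 'detail': None},
]
sample[0]['detail']['name'] = 'Pipe 2" steel'

buf = io.StringIO()
written = write_rows(sample, buf)
print(buf.getvalue())
```

The embedded quote in the product name is doubled by the writer, which is the convention most spreadsheet programs expect when reading quoted fields.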
(2) Define the DataSaver class to provide a single entry point for saving. Its case_dict maps each file extension to the matching save function.
import os

def get_filename_postfix(filename):
    basename = os.path.basename(filename)
    temp = basename.split('.')
    if len(temp) >= 2:
        return temp[-1]
    # No extension: falls through and returns None

class DataSaver:
    # Map from file extension to save function
    case_dict = {'csw': save_as_csw, 'txt': save_as_txt}
    if xlwt_enabled:
        case_dict['xls'] = save_as_xls

    # Default, so that save_as() is safe to call before feed()
    product_list = None

    # "Feed" the product list data to the DataSaver
    def feed(self, data):
        self.product_list = data

    def save_as(self, filename):
        if self.product_list is None or len(self.product_list) == 0:
            print('Warning: no records, nothing saved')
            return

        print('Saving ...')
        while True:
            postfix = get_filename_postfix(filename)
            try:
                if self.case_dict[postfix](self.product_list, filename):
                    print('Saved to: ' + filename)
                else:
                    print('Save failed!')
                break
            except KeyError:
                # Unknown extension: ask the user for another file name
                print('Warning: %s file format not supported.' % (postfix))
                print('Supported file formats: ' + ', '.join(self.case_dict.keys()))
                try:
                    filename = raw_input('Please enter a new file name: ')
                except KeyboardInterrupt:
                    print('User cancelled the save')
                    break
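One detail of the lookup above is that the try block also wraps the call to the save function, so a KeyError raised inside a save function (for example a missing key in a product record) would be reported as an unsupported format. A possible variation, sketched here for Python 3, looks up the function with dict.get() first and calls it afterwards. SAVERS, pick_saver() and the stand-in save_as_txt() are hypothetical names for this example only.

```python
import os


def save_as_txt(prod_list, filename):
    # Stand-in for the real save function; only reports what it would do.
    print('would write %d records to %s' % (len(prod_list), filename))
    return True


# Extension-to-function table, same idea as DataSaver.case_dict.
SAVERS = {'txt': save_as_txt}


def pick_saver(filename):
    """Return the save function for filename's extension, or None."""
    ext = os.path.splitext(filename)[1]      # e.g. '.txt', or '' if none
    return SAVERS.get(ext[1:].lower())       # drop the dot, ignore case


saver = pick_saver('products.TXT')
if saver is None:
    print('format not supported')
else:
    saver([{'path': 'Tools'}], 'products.TXT')
```

Using os.path.splitext() also avoids treating a dot in a directory name as the start of an extension, and lowercasing the extension lets 'products.TXT' and 'products.txt' select the same function.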
(3) If xlwt is not installed, saving XLS files cannot be supported.
The approach here is to add the XLS save function to case_dict only if import xlwt succeeds.
If the file format is not supported, the user is prompted to enter a different file name.
# If xlwt is not installed, saving as XLS is not available
xlwt_enabled = True
try:
    import xlwt
except ImportError:
    xlwt_enabled = False
See how the xls extension is handled in DataSaver: case_dict only contains 'xls' when xlwt_enabled is true, so otherwise save_as() reports the format as unsupported.
Three, problems encountered and solutions
(1) Chinese text is not allowed in the Python program.
Previously, whenever a Python source file contained Chinese, no matter where it appeared, the program would not run. It turns out that the Python parser treats the file as ASCII-encoded by default, so Chinese characters naturally cause an error. The solution is to tell the parser explicitly which encoding our file uses.
#!/usr/bin/env python
# -*- coding=utf-8 -*-
That is all it takes.
(2) Installing xlwt3 failed.
I downloaded xlwt3 from the web to install it. Running python setup.py install failed with an error saying that print() does not support the print("XXXX", file=f) form. I checked, and Python 2.6 does not have this feature (it is the Python 3 print function). I then downloaded xlwt-0.7.5.tar.gz and installed that instead, and it worked.
(3) Garbled text on Windows.
I haven't solved the problem yet.
Summary of the first Python web crawler