[Python Data Analysis] Solving and Optimizing Some Problems with Excel in Python 3 (2)
Following the previous article, "[Python Data Analysis] Excel in Python 3: Crawling the Douban Books Top 250 as an Example", some problems were still unsolved, so here is a follow-up discussion. Thanks to "One Nima" for the help and inspiration!
The problems left over from the previous article were:
1. Writing to the Excel file could not continue past a certain point.
2. The output was correct in Python IDLE, but garbled characters were written into the Excel file.
These two problems prompted me to change the Excel-handling module: xlwt reportedly supports only the old Excel 2003 format, which may be the source of the trouble.
Although "One Nima" provided a Validate function, it only removes characters that are invalid in Windows file names; it has nothing to do with the garbled characters written into Excel, so replacing the module was still worth considering.
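For reference, this is roughly what that helper does (it also appears in the final code below):

import re

def validate(title):
    # Strip the characters Windows forbids in file names: \ / : * ? " < > |
    rstr = r'[\/\\\:\*\?\"\<\>\|]'
    return re.sub(rstr, "", title)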
Switching to the xlsxwriter module
This time I switched to the xlsxwriter module (https://pypi.python.org/pypi/XlsxWriter). It can also be installed automatically with pip3 install xlsxwriter: simple and easy. Some usage examples:
import xlsxwriter

# Create a new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()

# Widen the first column to make the text clearer.
worksheet.set_column('A:A', 20)

# Add a bold format to use to highlight cells.
bold = workbook.add_format({'bold': True})

# Write some simple text.
worksheet.write('A1', 'Hello')

# Text with formatting.
worksheet.write('A2', 'World', bold)

# Write some numbers, with row/column notation.
worksheet.write(2, 0, 123)
worksheet.write(3, 0, 123.456)

# Insert an image.
worksheet.insert_image('B5', 'logo.png')

workbook.close()
After replacing the Excel-writing code with xlsxwriter, the effect is as follows:
Sure enough, everything is in its proper place this time: titles look like titles, links look like links. No matter what the characters are, they can be written correctly; after all, it is all Unicode.
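In Python 3 every str is already Unicode, and xlsxwriter accepts str directly. A minimal check of that claim (the file name and sample strings here are made up):

import xlsxwriter

workbook = xlsxwriter.Workbook('unicode_demo.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write('A1', '追风筝的人')                # a CJK book title round-trips intact
worksheet.write('A2', 'http://book.douban.com/')  # so do plain links
workbook.close()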
So: selecting the right module is very important! Selecting the right module is very important! Selecting the right module is very important! (Important things get said three times.)
If the content to be crawled is anything other than fairly standard strings and numbers, I would no longer use xlwt.
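As an aside I have not verified against the old script: xlwt's Workbook constructor defaults to encoding='ascii', so passing an explicit encoding is one commonly suggested workaround for garbled non-ASCII output, if you must stay on xlwt:

import xlwt

# xlwt.Workbook defaults to encoding='ascii'; an explicit utf-8 encoding is
# one possible workaround for mojibake (untested here, since I switched modules).
workbook = xlwt.Workbook(encoding='utf-8')
sheet = workbook.add_sheet('top250')
sheet.write(0, 0, '追风筝的人')
workbook.save('demo_xlwt.xls')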
Here is a comparison of four Python modules for writing Excel files: http://ju.outofmemory.cn/entry/56671
I clipped a comparison chart from it (screenshot below). For details, refer to the article above; it is very thorough!
While we're at it
Since things are going so smoothly, and the module can also write images, why not give that a try?
Goal: replace the image-link column with the actual cover image!
This is actually very simple: we already have the local path where each image is stored, so we just insert the image at that path.
the_img = "I:\\douban\\image\\" + bookName + ".jpg"
writelist = [i+j, bookName, nickname, rating, nums, the_img, bookurl, notion, tag]
for k in range(0, 9):
    if k == 5:  # column 5 holds the cover, so insert the image file itself
        worksheet.insert_image(i+j, k, the_img)
    else:
        worksheet.write(i+j, k, writelist[k])
The result is as follows. Obviously, it is not pretty; we should adjust the height of each row and center the contents:
According to the xlsxwriter documentation, row heights, column widths, and alignment can be set like this (of course, all of this can be done directly in Excel, probably faster than writing code, but I wanted to exercise the module a bit more):
format = workbookx.add_format()
format.set_align('justify')
format.set_align('center')    # horizontal alignment: the later call wins, so 'center' applies
format.set_align('vjustify')
format.set_align('vcenter')   # vertical alignment: likewise 'vcenter' applies
format.set_text_wrap()
worksheet.set_row(0, 12, format)   # header row
for i in range(1, 251):
    worksheet.set_row(i, 70)       # tall rows so the covers fit
worksheet.set_column('A:A', 3, format)
worksheet.set_column('B:C', 17, format)
worksheet.set_column('D:D', 4, format)
worksheet.set_column('E:E', 7, format)
worksheet.set_column('F:F', 10, format)
worksheet.set_column('G:G', 19, format)
worksheet.set_column('H:I', 40, format)
At this point the Excel writing is done, but setting up the formatting is fiddly and takes constant tweaking of sizes and spacing, so it really is easier to do in Excel itself.
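One more knob worth knowing: insert_image also accepts an options dict, which could shrink the covers to fit their cells instead of enlarging every row. The 0.5 scale factors here are an untuned guess:

# Optional: scale the inserted cover down so it fits a normal-height cell.
# (x_scale / y_scale are standard xlsxwriter options; 0.5 is a guess to tune.)
worksheet.insert_image(i+j, 5, the_img, {'x_scale': 0.5, 'y_scale': 0.5})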
Final code:
# -*- coding: utf-8 -*-
import requests
import re
import xlsxwriter
from bs4 import BeautifulSoup
from datetime import datetime
import codecs

now = datetime.now()  # start timing
print(now)

def validate(title):  # from One Nima: strip characters that are invalid in Windows file names
    rstr = r'[\/\\\:\*\?\"\<\>\|]'  # / \ : * ? " < > |
    new_title = re.sub(rstr, "", title)
    return new_title

txtfile = codecs.open("top2501.txt", 'w', 'utf-8')
url = "http://book.douban.com/top250?"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.13 Safari/537.36",
          "Referer": "http://book.douban.com/"}
image_dir = "I:\\douban\\image\\"

def download_img(imageurl, imageName="xxx.jpg"):  # download a cover image
    rsp = requests.get(imageurl, stream=True)
    image = rsp.content
    path = image_dir + imageName + '.jpg'
    # print(path)
    with open(path, 'wb') as f:
        f.write(image)

# create the Excel workbook and set row heights, column widths, and alignment
workbookx = xlsxwriter.Workbook('I:\\douban\\btop250.xlsx')
worksheet = workbookx.add_worksheet()
format = workbookx.add_format()
format.set_align('justify')
format.set_align('center')
format.set_align('vjustify')
format.set_align('vcenter')
format.set_text_wrap()
worksheet.set_row(0, 12, format)
for i in range(1, 251):
    worksheet.set_row(i, 70)
worksheet.set_column('A:A', 3, format)
worksheet.set_column('B:C', 17, format)
worksheet.set_column('D:D', 4, format)
worksheet.set_column('E:E', 7, format)
worksheet.set_column('F:F', 10, format)
worksheet.set_column('G:G', 19, format)
worksheet.set_column('H:I', 40, format)
item = ['Title', 'Alias', 'Rating', 'Number of ratings', 'Cover', 'Book link', 'Publication info', 'Tags']
for i in range(1, 9):
    worksheet.write(0, i, item[i-1])  # header row; column 0 holds the index

s = requests.Session()  # create a session
s.get(url, headers=header)
for i in range(0, 250, 25):
    geturl = url + "/start=" + str(i)  # address of the page to fetch
    print("Now to get " + geturl)
    postData = {"start": i}  # POST data
    res = s.post(url, data=postData, headers=header)
    soup = BeautifulSoup(res.content.decode(), "html.parser")  # parse the list page
    table = soup.findAll('table', {"width": "100%"})  # find the book tables
    sz = len(table)  # sz = 25, books listed per page
    for j in range(1, sz + 1):  # j = 1..25
        sp = BeautifulSoup(str(table[j-1]), "html.parser")  # parse one book entry
        imageurl = sp.img['src']  # cover image link
        bookurl = sp.a['href']  # book page link
        bookName = sp.div.a['title']
        nickname = sp.div.span  # alias
        if nickname:  # if an alias exists, store it; otherwise store ''
            nickname = nickname.string.strip()
        else:
            nickname = ""
        notion = str(sp.find('p', {"class": "pl"}).string)  # publication info; note that .string is not really str type
        rating = str(sp.find('span', {"class": "rating_nums"}).string)  # rating
        nums = sp.find('span', {"class": "pl"}).string  # number of ratings
        nums = nums.replace('(', '').replace(')', '').replace('\n', '').strip()
        nums = re.findall(r'(\d+)人评价', nums)[0]  # extract the count from e.g. '12345人评价'
        download_img(imageurl, bookName)  # download the cover image
        book = requests.get(bookurl)  # open the book's own page
        sp3 = BeautifulSoup(book.content, "html.parser")
        taglist = sp3.find_all('a', {"class": "tag"})  # tags
        lis = []
        for tagurl in taglist:
            sp4 = BeautifulSoup(str(tagurl), "html.parser")  # parse each tag
            lis.append(str(sp4.a.string))
        tag = ','.join(lis)  # join tags with commas
        the_img = "I:\\douban\\image\\" + bookName + ".jpg"
        writelist = [i+j, bookName, nickname, rating, nums, the_img, bookurl, notion, tag]
        for k in range(0, 9):
            if k == 5:
                worksheet.insert_image(i+j, k, the_img)
            else:
                worksheet.write(i+j, k, writelist[k])
            txtfile.write(str(writelist[k]))
            txtfile.write('\t')
        txtfile.write(u'\r\n')

end = datetime.now()  # end time
print(end)
print("program time consumption: " + str(end - now))
txtfile.close()
workbookx.close()
The running result is as follows:
2016-03-28 11:40:50.525635
Now to get http://book.douban.com/top250?/start=0
Now to get http://book.douban.com/top250?/start=25
Now to get http://book.douban.com/top250?/start=50
Now to get http://book.douban.com/top250?/start=75
Now to get http://book.douban.com/top250?/start=100
Now to get http://book.douban.com/top250?/start=125
Now to get http://book.douban.com/top250?/start=150
Now to get http://book.douban.com/top250?/start=175
Now to get http://book.douban.com/top250?/start=200
Now to get http://book.douban.com/top250?/start=225
2016-03-28 11:48:14.946184
program time consumption: 0:07:24.420549
All 250 books were crawled successfully, and the run completed correctly!
It took about 7 minutes 24 seconds this time, which is still too slow. The next step is figuring out how to improve efficiency.
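One possible direction, sketched here but not benchmarked: fetch the ten list pages (and, more importantly, the 250 per-book pages and cover downloads) concurrently instead of one after another. A minimal sketch with concurrent.futures; the function name and worker count are illustrative:

from concurrent.futures import ThreadPoolExecutor
import requests

header = {"User-Agent": "Mozilla/5.0", "Referer": "http://book.douban.com/"}

def fetch_page(start):
    # Fetch one 25-book page of the Top 250 list.
    rsp = requests.get("http://book.douban.com/top250",
                       params={"start": start}, headers=header)
    return rsp.content

# Five workers fetch the ten list pages in parallel; the same pattern applies
# to the per-book pages and image downloads, which dominate the 7+ minutes.
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch_page, range(0, 250, 25)))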