[Python Data Analysis] Operating Excel with Python 3: Douban Books Top 250 as an Example



In this article, a Python 3 crawler is used to scrape the Douban Books Top 250 list, the xlwt module is used to store the results in an Excel file, and the cover images are downloaded to a local directory. The goal is simply to get more crawler practice and module experience.

Tools

1. Python 3.5

2. BeautifulSoup and xlwt modules

Start

First look at the URL of the target page: https://book.douban.com/top250?start=0. I first tried to change the number after "start=" by string concatenation in the code, traversing all 250/25 = 10 pages, but found that this did not work: every request simply returned the first page. So I checked with the browser's F12 developer tools and found that start is sent as POST data, for example:

(Figure 1)

Therefore, create a postData dict:

postData = {"start": i}  # i is 0, 25, ..., 225

POSTing this data with each request solves the problem of always getting the first page back.
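A minimal sketch of this idea (the url, header, and session setup here simply mirror the full script shown later in the article; the shortened User-Agent is a placeholder):

import requests
from bs4 import BeautifulSoup

url = "http://book.douban.com/top250?"
header = {"User-Agent": "Mozilla/5.0", "Referer": "http://book.douban.com/"}

s = requests.Session()          # one session for all ten pages
s.get(url, headers=header)      # initial GET to pick up cookies
for i in range(0, 250, 25):     # i = 0, 25, ..., 225
    postData = {"start": i}     # offset of the page being requested
    res = s.post(url, data=postData, headers=header)
    soup = BeautifulSoup(res.content, "html.parser")
    # ... parse the 25 book entries on this page ...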

Analyzing the page shows that a book's listing, and the fields to be crawled from it, look like this:

(Figure 2)

The information to be crawled from top to bottom is:

1. Book URL

2. Cover image link. I open this link and download the image to the local machine (the download_img function).

3. Title. Note that the title is taken from the title attribute here, rather than from the string inside the a tag, because that string may contain spaces and line breaks that are troublesome to process; the title attribute is already in the right form and needs no extra handling.

4. Alias. This is mainly a subtitle or a foreign-language title. If a book has no alias, a "none" is written rather than an empty string; otherwise an error occurs, which I mention below.

5. Publishing information, such as author, translator, publisher, publication year, and price. This is also important, since several books may share the same title.

6. Rating

7. Number of ratings

In addition, I also crawled the "tag" information, which is found on the page that the book link opens. Its location is as follows:

(Figure 3)

When tags are crawled, they are joined with commas (,) as tag values.

Now that the fields to be crawled are defined and the page structure and the location of each field in the HTML are understood, the following code can be written:

geturl = url + "/start=" + str(i)
# print("Now to get " + geturl)
postData = {"start": i}                                     # POST data
res = s.post(url, data=postData, headers=header)            # send the POST request
soup = BeautifulSoup(res.content, "html.parser")            # parse with BeautifulSoup
table = soup.findAll('table', {"width": "100%"})            # find the book tables
sz = len(table)                                             # sz = 25; 25 books are listed per page
for j in range(1, sz + 1):                                  # j = 1 ~ 25
    sp = BeautifulSoup(str(table[j - 1]), "html.parser")    # parse one book's entry
    # print(sp.div)
    imageurl = sp.img['src']                                # cover image link
    bookurl = sp.a['href']                                  # book link
    bookName = sp.div.a['title']                            # title
    nickname = sp.div.span                                  # alias
    if nickname:                                            # store the alias if it exists, otherwise "None"
        nickname = nickname.string.strip()
    else:
        nickname = "None"
    notion = str(sp.find('p', {"class": "pl"}).string)      # publishing info; note that .string is not really of type str
    rating = str(sp.find('span', {"class": "rating_nums"}).string)   # rating
    nums = sp.find('span', {"class": "pl"}).string          # number of ratings
    nums = nums.replace('(', '').replace(')', '').replace('\n', '').strip()
    nums = re.findall(r'(\d+)', nums)[0]                    # keep only the digits
    download_img(imageurl, bookName)                        # download the cover image
    book = requests.get(bookurl)                            # open the book's own page
    sp3 = BeautifulSoup(book.content, "html.parser")        # parse it
    taglist = sp3.find_all('a', {"class": "tag"})           # tags
    lis = []
    for tagurl in taglist:
        sp4 = BeautifulSoup(str(tagurl), "html.parser")     # parse each tag
        lis.append(str(sp4.a.string))
    tag = ','.join(lis)                                     # join with commas
    if tag == "":                                           # if there are no tags, use "None"
        tag = "None"

Storing the .xls file with the xlwt module, and the problems encountered

I wanted to save the crawled data to an Excel file (.xls). Python can use the xlwt and xlrd modules for Excel operations. Although they only support the Excel 2003 format for now, that is enough. xlwt is the module that generates .xls files, and xlrd is the one that reads them. Since I only need to generate an xls file and do not need xlrd yet, I installed xlwt first.

Directly go to the Python directory and run the following command to install xlwt:

pip3 install xlwt

After the installation is complete, write the code that does the writing. A .txt file is written at the same time to make comparison easier:

# create the Excel workbook
workbook = xlwt.Workbook(encoding='utf-8')
sheet = workbook.add_sheet('book_top250', cell_overwrite_ok=True)
item = ['title', 'alias', 'rating', 'rating count', 'cover', 'book link', 'publishing info', 'tag']
for i in range(1, 9):
    sheet.write(0, i, item[i - 1])              # header row; column 0 holds the rank
...
for j in range(1, sz + 1):
    ...
    writelist = [i + j, bookName, nickname, float(rating), int(nums),
                 "I:\\douban\\image\\" + bookName + ".jpg", bookurl, notion, tag]
    for k in range(0, 9):
        sheet.write(i + j, k, writelist[k])     # one row per book
        txtfile.write(str(writelist[k]))
        txtfile.write('\t')
    txtfile.write(u'\r\n')
workbook.save("I:\\douban\\booktop250.xls")
...

I thought this would work, but there were still some errors.

For example, there was a case where nothing could be written at all (the screenshot below is not the output of the code above):

(Figure 4)

At that point I converted everything that was not str into str, and tried using numbers (int, float) where possible instead of str. Then I ran into the following situation:

(Figure 5)

Nothing beyond the 64th entry could be written, yet all the int and float values were written correctly, and the 'None' entries showed up only intermittently. Since I could not pin down the cause, I decided to narrow it down step by step. It is very likely a Chinese-encoding problem: the non-str values (int and float) all display correctly, and in the screenshot the output breaks right at the path name. So I stopped writing that column, and it actually succeeded!

(Figure 6)

Except for the omitted column, the other columns are completely normal, so it can be determined that the error comes from the following line:

writelist=[i+j,bookName,nickname,float(rating),int(nums),"I:\\douban\\image\\"+bookName+".jpg",bookurl,notion,tag]

My image path is built by direct string concatenation, so there may be an encoding error there. Let's wrap it in str() and try again:

imgpath = str("I:\\douban\\image\\"+bookName+".jpg");writelist=[i+j,bookName,nickname,float(rating),int(nums),imgpath,bookurl,notion,tag]

Still no luck: the same problem as in Figure 5 appears, even though printing the value in Python IDLE looks correct.

Next, I changed every image path to the same all-English one:

imgpath = str("I:\\douban\\image\\"+"a"+".jpg")writelist=[i+j,bookName,nickname,float(rating),int(nums),imgpath,bookurl,notion,tag]

Now it works ('none' has also been changed to 'None'):

(Figure 7)

So it is indeed the image path that causes the problem. As a workaround, I simply replaced that column with the image link instead of the local path; it hardly matters which one is stored, since the images themselves are already in the image directory. I also added timing code to measure the total crawl time, because the crawl feels far too slow (close to 10 minutes); I want to try multithreading later, so recording the time now gives a baseline for comparison.

Then I found that the problem still persists! As long as imageurl is replaced by a fixed string (even one containing Chinese characters), the row is written to the xls smoothly; otherwise it is not. Very strange. So I tried shortening imageurl: the experiment shows that imageurl[:-6] works but imageurl[:-5] does not, and the imageurl column is the last one that gets written. Leaving out the alias or book-link column instead keeps everything normal, but the tag column still cannot be written. The issue remains unsolved. My guess is that the number of characters that can be written is limited; more experiments are needed. It may also be a Windows encoding problem, which would need to be checked on Linux. So there is still a lot to do; for now I will get most of it working correctly.
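To test the character-limit guess, a small probe script like the one below might help. This is my own hypothetical experiment, not part of the original program: it writes strings of increasing length, built from a path-like fragment containing Chinese characters, into an .xls sheet; opening the resulting file shows which rows survive.

import xlwt

wb = xlwt.Workbook(encoding='utf-8')
sheet = wb.add_sheet('probe')
fragment = "I:\\douban\\image\\测试"              # hypothetical path piece with Chinese characters
for row, length in enumerate([16, 64, 256, 1024, 4096]):
    value = (fragment * (length // len(fragment) + 1))[:length]
    sheet.write(row, 0, length)                   # length under test
    sheet.write(row, 1, value)                    # the test string itself
wb.save("probe.xls")                              # inspect the file to see which rows were written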

Therefore, I simply skip writing that column to the spreadsheet and arrive at the following final code:

# -*- coding: utf-8 -*-
import requests
import re
import xlwt
from bs4 import BeautifulSoup
from datetime import datetime
import codecs

now = datetime.now()                              # start timing
print(now)
txtfile = codecs.open("top250.txt", 'w', 'utf-8')
url = "http://book.douban.com/top250?"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.13 Safari/537.36",
          "Referer": "http://book.douban.com/"}
image_dir = "I:\\douban\\image\\"

# download the cover image
def download_img(imageurl, imageName="xxx"):
    rsp = requests.get(imageurl, stream=True)
    image = rsp.content
    path = image_dir + imageName + '.jpg'
    # print(path)
    with open(path, 'wb') as file:
        file.write(image)

# create the Excel workbook
workbook = xlwt.Workbook(encoding='utf-8')
sheet = workbook.add_sheet('book_top250', cell_overwrite_ok=True)
item = ['title', 'alias', 'rating', 'rating count', 'cover', 'book link', 'publishing info', 'tag']
for i in range(1, 9):
    sheet.write(0, i, item[i - 1])                # header row; column 0 holds the rank

s = requests.Session()                            # create the session
s.get(url, headers=header)

for i in range(0, 250, 25):
    geturl = url + "/start=" + str(i)
    # print("Now to get " + geturl)
    postData = {"start": i}                       # POST data
    res = s.post(url, data=postData, headers=header)
    soup = BeautifulSoup(res.content.decode(), "html.parser")   # parse with BeautifulSoup
    table = soup.findAll('table', {"width": "100%"})            # find the book tables
    sz = len(table)                               # sz = 25; 25 books are listed per page
    for j in range(1, sz + 1):                    # j = 1 ~ 25
        sp = BeautifulSoup(str(table[j - 1]), "html.parser")    # parse one book's entry
        imageurl = sp.img['src']                  # cover image link
        bookurl = sp.a['href']                    # book link
        bookName = sp.div.a['title']              # title
        nickname = sp.div.span                    # alias
        if nickname:                              # store the alias if it exists, otherwise "None"
            nickname = nickname.string.strip()
        else:
            nickname = "None"
        notion = str(sp.find('p', {"class": "pl"}).string)      # publishing info; .string is not really str
        rating = str(sp.find('span', {"class": "rating_nums"}).string)   # rating
        nums = sp.find('span', {"class": "pl"}).string           # number of ratings
        nums = nums.replace('(', '').replace(')', '').replace('\n', '').strip()
        nums = re.findall(r'(\d+)', nums)[0]      # keep only the digits
        download_img(imageurl, bookName)          # download the cover image
        book = requests.get(bookurl)              # open the book's own page
        sp3 = BeautifulSoup(book.content, "html.parser")
        taglist = sp3.find_all('a', {"class": "tag"})             # tags
        lis = []
        for tagurl in taglist:
            sp4 = BeautifulSoup(str(tagurl), "html.parser")       # parse each tag
            lis.append(str(sp4.a.string))
        tag = ','.join(lis)                       # join with commas
        if tag == "":                             # if there are no tags, use "None"
            tag = "None"
        writelist = [i + j, bookName, nickname, float(rating), int(nums),
                     imageurl, bookurl, notion, tag]
        for k in range(0, 9):
            if k == 5:                            # skip the cover-link column
                continue
            sheet.write(i + j, k, writelist[k])
            txtfile.write(str(writelist[k]))
            txtfile.write('\t')
        txtfile.write(u'\r\n')

workbook.save("I:\\douban\\booktop250.xls")
end = datetime.now()                              # end timing
print(end)
print("program time consumption: " + str(end - now))
txtfile.close()

Run it (it takes 7 minutes or more):

(Figure 8)

If this had still failed I would not have known what to do next. So I changed the approach: first write the TXT text file, then import it into the xls file, the title of this article notwithstanding.

Run:

2016-03-27 21:47:17.914149
Now to get http://book.douban.com/top250?/start=0
Now to get http://book.douban.com/top250?/start=25
Now to get http://book.douban.com/top250?/start=50
Now to get http://book.douban.com/top250?/start=75
Now to get http://book.douban.com/top250?/start=100
Now to get http://book.douban.com/top250?/start=125
Now to get http://book.douban.com/top250?/start=150
Now to get http://book.douban.com/top250?/start=175
Now to get http://book.douban.com/top250?/start=200
Now to get http://book.douban.com/top250?/start=225
2016-03-27 21:56:16.046792
program time consumption: 0:08:58.132643

The .txt output is correct:

(Figure 9)

Then select "data"> "import data" in the xls file to get the final result:

(Figure 10)
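The manual import works, but the same step could probably be scripted. Here is a minimal sketch of my own (not part of the article's workflow) that reads the tab-separated top250.txt and writes it into an .xls file with xlwt; every cell is written as text, and the output file name is made up for this example:

import codecs
import xlwt

wb = xlwt.Workbook(encoding='utf-8')
sheet = wb.add_sheet('book_top250')
with codecs.open("top250.txt", 'r', 'utf-8') as f:
    for r, line in enumerate(f):
        for c, cell in enumerate(line.rstrip('\r\n\t').split('\t')):
            sheet.write(r, c, cell)               # one txt field per cell
wb.save("booktop250_from_txt.xls")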

Cover image:

(Figure 11)

The problem is worked around for now; further research is needed.

Later improvements

1. Accelerate crawling with multiple processes or threads (see the first sketch after this list)

2. Use the xlutils module to write data to Excel in multiple steps (see the second sketch after this list)

3. Consider switching to a different Excel-processing module

4. Test in a Linux environment
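A rough sketch of what improvement 1 might look like (my own idea, not the article's code): the ten pages are fetched in parallel with a thread pool, assuming a fetch_page function that requests one page and returns its 25 table entries; the parsing of those entries and the xls writing would stay in the main thread, since an xlwt worksheet should not be written from several threads at once.

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup

url = "http://book.douban.com/top250?"
header = {"User-Agent": "Mozilla/5.0", "Referer": "http://book.douban.com/"}

def fetch_page(i):
    # hypothetical wrapper around the per-page request from the main script
    res = requests.post(url, data={"start": i}, headers=header)
    soup = BeautifulSoup(res.content, "html.parser")
    return soup.findAll('table', {"width": "100%"})   # the 25 book entries of this page

with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch_page, range(0, 250, 25)))
# parse each entry and write the rows in the main thread afterwards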
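And a hedged sketch of improvement 2: xlrd together with xlutils.copy can reopen an existing .xls file and append below its last row, so the crawl could be written out in several runs (the file name is taken from the script above; the appended cell content is just an example).

import xlrd
from xlutils.copy import copy

rb = xlrd.open_workbook("I:\\douban\\booktop250.xls", formatting_info=True)  # read the existing file
wb = copy(rb)                                     # writable copy via xlutils
sheet = wb.get_sheet(0)
next_row = rb.sheet_by_index(0).nrows             # first free row
sheet.write(next_row, 0, "appended in a later run")
wb.save("I:\\douban\\booktop250.xls")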

------------------------------------------------------------------------------------------------------

I have heard people say that most of the time in data analysis is spent on data collection and cleaning. I did not think much of it before, but now I have experienced a little of it myself, so there is still a long way to go.

If you have any opinions on my methods, please leave your comments here :-)
