Friends often ask me for book recommendations, so after some thought I decided to scrape the novel listings on Douban. I used the old approach, urllib plus regular expressions, nothing as high-tech as Scrapy (mainly because it is such a pain to install on Windows). This time I used Python 3; I only had to look up how to write the HTTP headers in Python 3, and the rest carried over. Python 2 should be similar. The method itself is quite simple. First look at the address of the books tagged "novel" (小说) on Douban: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4. Page back and forth a bit and you can see the URL gains
?start=0&type=T
two GET parameters, where start is the index of the first book on the page (20 books per page) and type is the sort order. Douban does sort by rating by default, but the order is not reliable, so I just take the default order, download everything, and sort it manually afterwards. First, fetch a page:
    import http.cookiejar
    import re
    import sys
    import time
    import urllib.request

    def get_html(url):
        """Fetch a web page."""
        cj = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
        opener.addheaders = [
            ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                           '(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'),
            ('Cookie', '4564564564564564565646540')]
        urllib.request.install_opener(opener)
        try:
            while True:
                page = urllib.request.urlopen(url)
                html = page.read().decode("utf-8")
                page.close()
                anti_spider = re.findall(r'403 Forbidden', html)
                if anti_spider:
                    print("Blocked by the anti-crawler, resting for 10 minutes...")
                    time.sleep(600)
                else:
                    return html
        except Exception as e:
            print(e)
            sys.exit()
Then run regular expressions over this page. While extracting, I found that the listing page only contains each book's link and title; the rest, such as the author's name and the score, is generated dynamically by JS. So I have to follow each book's link to get its author and score.
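The listing-page extraction is easy to sanity-check against a made-up fragment shaped like Douban's markup (the subject IDs and titles below are invented, not real data):

```python
import re

# An invented fragment in the shape of Douban's listing markup (illustrative only).
html = '''
<a href="https://book.douban.com/subject/1234567/" title="Some Novel">
<a href="https://book.douban.com/subject/7654321/" title="Another Novel">
'''

# Same pattern as in get_books_info below: capture the subject link and the title.
pattern = r'href="(https://book\.douban\.com/subject/\d+/)" title="(.*?)"'
books = re.findall(pattern, html)
print(books)
```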
    def get_books_info(html):
        """Get info for all the books on one listing page."""
        books = []
        one_page_books = re.findall(
            r'href="(https://book\.douban\.com/subject/\d+/)" title="(.*?)"', html)
        for url, name in one_page_books:
            one_book_html = get_html(url)
            author, score = get_book_info(one_book_html)
            if not author and not score:
                break
            name = "《" + name + "》"
            print('{"name": %s, "author": %s, "score": %s, "url": %s}'
                  % (name, author, score, url))
            books.append({"name": name, "author": author, "score": score, "url": url})
            # Don't crawl too fast; take a break.
            time.sleep(3)
        return books
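As an aside, the fixed time.sleep(3) above is easy for anti-crawler systems to spot. One common variation (my addition, not part of the original script) is to add random jitter to each pause:

```python
import random
import time

def polite_sleep(base=3.0, jitter=2.0):
    """Sleep for base seconds plus a random extra, so request
    timing looks less robotic. Returns the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Example: polite_sleep() pauses somewhere between 3 and 5 seconds.
```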
One very important thing: when you write a crawler, always remember to sleep. It not only mimics human behaviour to confuse anti-crawler systems, it is also a matter of respect for the site. By some accounts 90% of the traffic reaching many sites is already crawlers, most of them search engines; don't make trouble for other people's sites. Handling your crawler costs them server resources, and hammering a site non-stop is plain rude. I didn't sleep at first, and while debugging I crawled hundreds of pages and got hit by the anti-crawler. I really should apologize to Douban: they were quite polite about it, letting me crawl hundreds of books before blocking me, and Douban even provides an official API for this kind of thing, yet I insisted on hand-writing regexes. Here is how to get the author and score from a single book page:
    def get_book_info(html):
        """Get a single book's score and author."""
        try:
            score = re.findall(r'property="v:average">(.*?)</strong>', html)[0]
            author = re.findall(
                r'<span class="pl"> 作者</span>[\w\W]*?<a class="" href=".*?">(.*?)</a>',
                html)[0]
            return author, score
        except Exception as e:
            print("Something went wrong with the score and author:")
            print(e)
            return "", ""
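The score regex can be checked the same way against a hand-written fragment mimicking Douban's rating markup (the class names and the value 8.8 are my invention for illustration):

```python
import re

# Invented fragment in the shape of Douban's rating markup.
html = '<strong class="ll rating_num" property="v:average"> 8.8 </strong>'

# Same pattern as get_book_info; strip() removes the padding spaces.
score = re.findall(r'property="v:average">(.*?)</strong>', html)[0].strip()
print(score)
```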
There is nothing much to say about this function; it is just some lazily hand-rolled regexes. Now combine the functions above:
    def main():
        page = 0
        books = []
        while True:
            # %% escapes the literal % signs in the URL-encoded tag name.
            url = "https://book.douban.com/tag/%%E5%%B0%%8F%%E8%%AF%%B4?start=%s&type=T" % page
            html = get_html(url)
            one_page_books = get_books_info(html)
            books.extend(one_page_books)
            if len(one_page_books) == 0:
                print("All right, these are all the books.")
                break
            page += 20
        # Some books have too few ratings and no score; fill in the blanks
        # so the information is complete.
        for b in books:
            b["author"] = b["author"] or "(anonymous)"
            b["score"] = b["score"] or "0.0"
        # Sort by score, descending.
        books.sort(key=lambda x: float(x["score"]), reverse=True)
        # Save the book information.
        with open("douban.csv", "w", encoding="gbk", errors="ignore") as f:
            line_template = "%(name)s,%(author)s,%(score)s,%(url)s\n"
            f.write(line_template % {"name": "Title", "author": "Author",
                                     "score": "Douban score", "url": "Douban link"})
            for book in books:
                f.write(line_template % book)
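One weakness of the %-template save step: a comma inside a title or author name breaks the CSV columns. The standard csv module handles quoting for you. A sketch of the save step rewritten with csv.writer (my variation, not the original code; the sample book is invented):

```python
import csv

# Invented sample row; note the comma inside the author field.
books = [
    {"name": "《Some Novel》", "author": "A, Author", "score": "9.1",
     "url": "https://book.douban.com/subject/1234567/"},
]

# newline="" is the documented idiom for csv files; gbk + ignore as in the post.
with open("douban.csv", "w", encoding="gbk", errors="ignore", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Author", "Douban score", "Douban link"])
    for b in books:
        writer.writerow([b["name"], b["author"], b["score"], b["url"]])
```

The embedded comma comes back as one field because csv.writer quotes it automatically.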
Finally, open() gave me a little trouble. The pages are UTF-8 encoded, so I decode("utf-8") right after downloading. But on Windows, open("douban.csv", "w") defaults to the GBK encoding for Chinese systems, so writing immediately threw an encoding exception. Once I figured that out I added the encoding="utf-8" parameter. That saved successfully, but the CSV was still garbled when opened in Excel, because Excel opens it as GBK. You can fix it by hand: open the file in Notepad and re-save it as ANSI. But I disdained that approach, so I asked around, and a classmate told me to use ignore. Good grief, manually encoding and decoding every string in Python 3 would be hideous! So I searched some more and finally found that in Python 3 the open function accepts an errors="ignore" parameter directly. Done! The final result: see for yourself. The code has been uploaded to GitHub (https://github.com/anpengapple/doubanbooks); anyone who wants to play with it is welcome. It's still not perfect: first, the novels aren't broken down by sub-genre; second, it only covers novels, not other books; third, it doesn't save directly to Excel, so you still have to do the formatting yourself. Maybe I'll improve it someday. (Not writing Excel output is also laziness: it feels like pure manual labour with no skill involved, no fun at all.)
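What errors="ignore" actually does is easy to demonstrate: characters GBK cannot represent are silently dropped instead of raising UnicodeEncodeError (the sample string here is mine, not from the scraped data):

```python
# "测试" is ordinary Chinese (encodable in GBK); the emoji is not in GBK.
text = "测试😀"

try:
    text.encode("gbk")
except UnicodeEncodeError as e:
    print("strict GBK fails:", e.reason)

# errors="ignore" just drops what it cannot encode -- the same idea as
# open(..., encoding="gbk", errors="ignore") in the script above.
lossy = text.encode("gbk", errors="ignore")
print(lossy.decode("gbk"))
```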
And that was it: a quick crawl of Douban's novels.