Crawling Douban's novels

Source: Internet
Author: User

Friends who like books often run into a book shortage and ask me for recommendations, so after some thought I crawled the novels off Douban. I used the old method, urllib plus regular-expression extraction, and none of the high-tech stuff like Scrapy (honestly because it is too hard to install on Windows). This time I used Python 3; I just looked up online how to write the HTTP headers in Python 3, and Python 2 should be similar. The method is quite simple. First look at the address of the books tagged "novel" on Douban: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4. Page back and forth a bit and you can see the URL gains
?start=0&type=t
two GET parameters: start is the offset of the first book shown (20 books per page), and type is the sort order. Douban does sort by rating by default, but the ranking is not reliable, so I just crawl in the default order, download everything, and sort it myself afterwards. First, grab a page:
    import http.cookiejar
    import re
    import sys
    import time
    import urllib.request

    def get_html(url):
        """Fetch a web page."""
        cj = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(cj))
        opener.addheaders = [
            ('User-Agent',
             'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'),
            ('Cookie', '4564564564564564565646540'),
        ]
        urllib.request.install_opener(opener)
        try:
            while True:
                page = urllib.request.urlopen(url)
                html = page.read().decode("utf-8")
                page.close()
                anti_spider = re.findall(r'403 Forbidden', html)
                if anti_spider:
                    print("Anti-crawler triggered, resting for 10 minutes...")
                    time.sleep(600)
                else:
                    return html
        except Exception as e:
            print(e)
            sys.exit()
Then run regular-expression extraction on this page. While extracting I found that the page itself only contains the links and titles of the books; other content, such as the author's name and the score, is generated dynamically by JS. Fine. Then I have to follow each book's link to get its author and score.
    def get_books_info(html):
        """Get the information for all the books on one listing page."""
        books = []
        one_page_books = re.findall(
            r'href="(https://book\.douban\.com/subject/\d+/)" title="(.*?)"',
            html)
        for url, name in one_page_books:
            one_book_html = get_html(url)
            author, score = get_book_info(one_book_html)
            if not author and not score:
                break
            name = "《" + name + "》"
            print('{"name": %s, "author": %s, "score": %s, "url": %s}'
                  % (name, author, score, url))
            books.append({"name": name, "author": author,
                          "score": score, "URL": url})
            # Don't crawl too fast -- take a break
            time.sleep(3)
        return books

One very important thing to say: when writing a crawler, always remember to sleep. Not only does it disguise the crawler as human behavior and confuse anti-crawler measures, it is also a matter of respect for the site. Reportedly up to 90% of the traffic to some sites comes from crawlers, most of them search engines; we should not make trouble for other people's sites. Handling your crawler's requests consumes server resources, and hammering a site continuously is uncivilized behavior. I did not sleep at first, crawled hundreds of pages during debugging, and promptly got blocked by the anti-crawler. I really should apologize to Douban: they were polite enough to let me crawl hundreds of books before blocking me, and Douban even provides an official API for crawlers to use, yet I still struggled through writing the regular expressions myself. Here is how the author and rating are extracted from a book's page:
    def get_book_info(html):
        """Get one book's rating and author."""
        try:
            score = re.findall(
                r'property="v:average">(.*?)</strong>', html)[0]
            author = re.findall(
                r'<span class="pl"> 作者</span>[\w\W]*?'
                r'<a class="" href=".*?">(.*?)</a>', html)[0]
            return author, score
        except Exception as e:
            print("There was a little problem with the score and author:")
            print(e)
            return "", ""
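The "remember to sleep" advice above can be taken one step further: a fixed 3-second pause is easy for anti-crawler heuristics to spot, so random jitter makes the request rhythm less machine-like. A minimal sketch; polite_sleep is my own name, not part of the original code:

```python
import random
import time

def polite_sleep(base=3.0, jitter=2.0):
    # Wait a base interval plus random jitter between requests, so the
    # crawl does not hit the server at a fixed, machine-like rhythm.
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay
```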
There is not much to say about this part; it is all lazily pieced-together regular expressions. Now combine the functions above:
    def main():
        page = 0
        books = []
        while True:
            url = ("https://book.douban.com/tag/%%E5%%B0%%8F%%E8%%AF%%B4"
                   "?start=%s&type=t" % str(page))
            html = get_html(url)
            one_page_books = get_books_info(html)
            books.extend(one_page_books)
            if len(one_page_books) == 0:
                print("All right, these are all the books.")
                break
            page += 20
        # Some books have too few ratings to have a score; fill in the
        # blanks so the information is complete
        for b in books:
            b["author"] = b["author"] or "(anonymous)"
            b["score"] = b["score"] or "0.0"
        # Sort by score
        books.sort(key=lambda x: float(x["score"]), reverse=True)
        # Save the book information
        with open("douban.csv", "w", encoding="gbk", errors="ignore") as f:
            line_template = "%(name)s,%(author)s,%(score)s,%(URL)s\n"
            f.write(line_template % {"name": "Title", "author": "Author",
                                     "score": "Douban score",
                                     "URL": "Douban link"})
            for book in books:
                f.write(line_template % book)
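As an aside, the %%-escaped URL template in main() can also be built with the standard library's urllib.parse, which percent-encodes the tag for you. A minimal sketch; tag_url is a hypothetical helper, not part of the original code:

```python
from urllib.parse import quote, urlencode

def tag_url(tag, start=0, sort="t"):
    # quote() percent-encodes the UTF-8 bytes of the tag
    # (小说 -> %E5%B0%8F%E8%AF%B4); urlencode() builds the query string.
    return ("https://book.douban.com/tag/" + quote(tag)
            + "?" + urlencode({"start": start, "type": sort}))

print(tag_url("小说", start=20))  # start=20 requests the second page
```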

Finally, I ran into a little trouble when saving. The pages are UTF-8 encoded, so I decode("utf-8") right after downloading. But on Windows, open("douban.csv", "w") defaults to GBK for Chinese text, so writing immediately threw an encoding exception. After figuring that out, I added the encoding="utf-8" parameter. The save then succeeded, but the CSV was still garbled when opened in Excel, because Excel opens it as GBK! You can work around this by opening the file in Notepad first and saving it as ANSI, but I disdained that approach. I asked in a group chat and a classmate told me to use ignore; but good grief, encoding and decoding every string by hand in Python 3 would be hideous. So I searched some more and finally found that in Python 3 the open function accepts an errors="ignore" parameter directly. Done! That's the final result. The code has been uploaded to GitHub (https://github.com/anpengapple/doubanbooks); anyone interested is welcome to play with it. It is still not perfect: first, the novels are not broken down by sub-genre; second, it grabs only novels and no other books; third, it does not save directly to Excel, so you still have to do the formatting yourself. These can be improved slowly later. (Not having learned to write Excel files is also down to laziness; it feels like pure manual work with no skill involved, which is no fun.)
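The encoding pitfall above can be reproduced in a few lines. A minimal sketch showing why errors="ignore" is needed when writing GBK; the emoji is just an example of a character GBK cannot represent:

```python
text = "豆瓣 😀"  # the emoji has no GBK mapping

# Strict GBK encoding fails on the unencodable character ...
try:
    text.encode("gbk")
except UnicodeEncodeError as e:
    print("strict gbk fails:", e.reason)

# ... while errors="ignore" silently drops it, which is exactly what
# open(..., encoding="gbk", errors="ignore") does when writing the CSV.
print(text.encode("gbk", errors="ignore").decode("gbk"))
```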

