Python Crawler: Crawling a Novel

Source: Internet
Author: User

Without further ado, let's get straight to the point.

The site I'm crawling today is Qidian (起点中文网, literally the "starting point" of the Chinese web), and the content is a novel.

First, import the libraries:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

Then assign the URL:

    html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")  # URL of the novel's first chapter
    bsObj = BeautifulSoup(html)  # create a BeautifulSoup object

First, try to crawl the novel's text on this one page:

    firstChapter = bsObj.find("div", {"class": "read-content"})  # find() is a method of the BeautifulSoup object
    print(firstChapter.get_text())

The find method can also be used in conjunction with regular expressions, which is handy for crawling resources such as pictures and videos.
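As a sketch of the regular-expression case: find and findAll accept a compiled pattern in place of a plain string for an attribute value. The markup and image URLs below are invented for illustration; only the call pattern follows the bs4 API.

```python
import re
from bs4 import BeautifulSoup

# A small inline page standing in for a real chapter; the tags and
# URLs here are made up for illustration.
html = """
<div class="read-content">text</div>
<img src="//example.com/cover1.jpg">
<img src="//example.com/cover2.png">
"""

bsObj = BeautifulSoup(html, "html.parser")

# A compiled regex matches every <img> whose src ends in .jpg or .png
images = bsObj.findAll("img", {"src": re.compile(r"\.(jpg|png)$")})
for img in images:
    print(img.get("src"))
```

The same pattern works for video tags or any other attribute you want to match loosely.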

Since all the text here sits in a single box whose class attribute is read-content, the find method suffices. If the text were spread across multiple boxes on the page, the findAll method should be used instead; it returns a collection that must be iterated over to print the output.
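A minimal sketch of the findAll case, using invented markup that reuses the read-content class name from above:

```python
from bs4 import BeautifulSoup

# Invented markup: the text is split across several boxes
html = """
<div class="read-content">First paragraph.</div>
<div class="read-content">Second paragraph.</div>
"""

bsObj = BeautifulSoup(html, "html.parser")

# findAll() returns a collection, so the results must be looped over
boxes = bsObj.findAll("div", {"class": "read-content"})
for box in boxes:
    print(box.get_text())
```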

After consolidating the code, I found it could indeed crawl the article. But this only crawls a single chapter of the novel, so how do we crawl the following chapters?

From the previous steps we can see that as long as we know the URL of the next chapter, we can crawl it. First, encapsulate the text-printing part as a function; then, each time we get a new address, we can print out the corresponding text:

    def writeNovel(html):
        bsObj = BeautifulSoup(html)
        chapter = bsObj.find("div", {"class": "read-content"})
        print(chapter.get_text())

Now the question is how to get the URL of the next chapter. Observing the structure of the page, the next-chapter button is an a tag with the id J_chapternext, so the next chapter's URL can be obtained from that tag.

Repacking the function, the organized code is:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def writeNovel(html):
        bsObj = BeautifulSoup(html)
        chapter = bsObj.find("div", {"class": "read-content"})
        print(chapter.get_text())
        bsoup = bsObj.find("", {"id": "J_chapternext"})
        html2 = "http:" + bsoup.get('href') + ".html"
        return urlopen(html2)

    html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")

    i = 1
    while i < 10:
        html = writeNovel(html)
        i = i + 1

To write the text to a file:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def writeNovel(html):
        bsObj = BeautifulSoup(html)
        chapter = bsObj.find("div", {"class": "read-content"})
        print(chapter.get_text())
        fo = open("novel.text", "a")
        fo.write(chapter.get_text())
        fo.close()
        bsoup = bsObj.find("", {"id": "J_chapternext"})
        html2 = "http:" + bsoup.get('href') + ".html"
        return urlopen(html2)

    html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")

    i = 1
    while i < 8:
        html = writeNovel(html)
        i = i + 1
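As an aside, the crawler above opens and closes novel.text by hand on every chapter. A with block closes the file automatically, even if an exception occurs mid-write. Here is a sketch of just the writing step, with chapter_text standing in for chapter.get_text():

```python
# chapter_text stands in for chapter.get_text() from the crawler above
chapter_text = "Chapter one body text\n"

# "a" keeps appending chapters to the same file, as in the crawler;
# the file is closed automatically when the with block exits
with open("novel.text", "a", encoding="utf-8") as fo:
    fo.write(chapter_text)
```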
