Without further ado, let's get straight to the point.
The site I'm going to crawl today is Qidian (起点中文网), and the content is a novel.
First, import the libraries:
from urllib.request import urlopen
from bs4 import BeautifulSoup
Then open the first chapter's URL and build a BeautifulSoup object from it:

html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")  # the URL of the novel's first chapter
bsObj = BeautifulSoup(html)  # create the BeautifulSoup object
First, try to scrape the novel text on this page:

firstChapter = bsObj.find("div", {"class": "read-content"})  # find() is a method of the BeautifulSoup object
print(firstChapter.get_text())
The find() method can also be used together with regular expressions, which is handy when scraping resources such as images and videos.
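For example, here is a minimal sketch that collects all images whose src ends in .jpg; the tag and the pattern are illustrative assumptions of mine, not something taken from the Qidian page:

import re

# reuses the bsObj created above; a regex replaces the fixed attribute value
images = bsObj.findAll("img", {"src": re.compile(r"\.jpg$")})
for image in images:
    print(image["src"])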
Since the text of this chapter sits entirely in one box whose class attribute is read-content, the find() method is enough. If the text were spread across multiple boxes on the page, the findAll() method would be needed instead; it returns a collection, which has to be iterated over to print the output, as shown below.
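As an illustration (a hypothetical variant, since on this page the text really is in a single box), iterating over a findAll() result looks like this:

for box in bsObj.findAll("div", {"class": "read-content"}):
    print(box.get_text())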
Putting the code together confirms that we can scrape the text of one chapter. The question now is: how do we crawl the following chapters?
From the previous steps we can see that as long as we know the next chapter's URL, we can scrape it. So first wrap the printing code into a function; then every time we get a new URL, we can print the corresponding text:
def writeNovel(html):
    bsObj = BeautifulSoup(html)
    chapter = bsObj.find("div", {"class": "read-content"})
    print(chapter.get_text())
Now the question is how to get the next chapter's URL. Looking at the page structure, the "next chapter" button is a link whose id is j_chapterNext, so the next chapter's URL can be read from that tag.
Repackaging the function, the organized code is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def writeNovel(html):
    bsObj = BeautifulSoup(html)
    chapter = bsObj.find("div", {"class": "read-content"})
    print(chapter.get_text())
    bsoup = bsObj.find("", {"id": "j_chapterNext"})  # the "next chapter" link
    html2 = "http:" + bsoup.get("href") + ".html"
    return urlopen(html2)

html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")
i = 1
while i < 10:
    html = writeNovel(html)
    i = i + 1
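One caveat, which is my addition rather than something from the original: if a page has no "next chapter" link (for instance on the last available chapter), find() returns None and the .get("href") call raises an AttributeError. A defensive sketch that stops cleanly in that case:

def writeNovel(html):
    bsObj = BeautifulSoup(html)
    chapter = bsObj.find("div", {"class": "read-content"})
    print(chapter.get_text())
    bsoup = bsObj.find("", {"id": "j_chapterNext"})
    if bsoup is None or not bsoup.get("href"):
        return None  # no next-chapter link; signal the caller to stop
    return urlopen("http:" + bsoup.get("href") + ".html")

html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")
i = 1
while html is not None and i < 10:
    html = writeNovel(html)
    i = i + 1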
Finally, write the chapter text to a file:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def writeNovel(html):
    bsObj = BeautifulSoup(html)
    chapter = bsObj.find("div", {"class": "read-content"})
    print(chapter.get_text())
    fo = open("novel.text", "a")  # append each chapter to the file
    fo.write(chapter.get_text())
    fo.close()
    bsoup = bsObj.find("", {"id": "j_chapterNext"})
    html2 = "http:" + bsoup.get("href") + ".html"
    return urlopen(html2)

html = urlopen("http://read.qidian.com/chapter/dVQvL2RfE4I1/hJBflakKUDMex0RJOkJclQ2.html")
i = 1
while i < 8:
    html = writeNovel(html)
    i = i + 1
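One more improvement worth mentioning (my suggestion, not part of the original): opening the file with a with statement guarantees it is closed even if an exception occurs, so the three fo lines can be replaced by:

with open("novel.text", "a") as fo:
    fo.write(chapter.get_text())  # the file is closed automatically on exit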