Configure Python 2.7 with the BS4 and requests libraries.
Install them with pip:
sudo pip install bs4
sudo pip install requests
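As a quick sanity check (just a minimal sketch), both packages should be importable afterwards:

# quick check that both packages are importable after installation
import bs4
import requests

print bs4.__version__, requests.__version__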
Let me briefly explain BS4. Since this is about crawling the web, I'll only introduce find and find_all.
The difference between find and find_all is what they return. find returns the first matching tag, along with that tag's contents.
find_all returns a list of all matching tags.
For example, we write a test.html to test the difference between find and find_all. Its content is:
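(The exact markup below is my own reconstruction; it only needs a few divs whose ids match the queries in test.py.)

<html>
<body>
    <!-- assumed minimal content for test.html -->
    <div id="one">one</div>
    <div id="">empty id</div>
    <div id="three">three, first</div>
    <div id="three">three, second</div>
    <!-- no div with id="four", so that query finds nothing -->
</body>
</html>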
Then the test.py code is:
from bs4 import BeautifulSoup
import lxml

if __name__ == '__main__':
    s = BeautifulSoup(open('test.html'), 'lxml')
    print s.prettify()
    print "------------------------------"
    print s.find('div')
    print s.find_all('div')
    print "------------------------------"
    print s.find('div', id='one')
    print s.find_all('div', id='one')
    print "------------------------------"
    print s.find('div', id="")
    print s.find_all('div', id="")
    print "------------------------------"
    print s.find('div', id="three")
    print s.find_all('div', id="three")
    print "------------------------------"
    print s.find('div', id="four")
    print s.find_all('div', id="four")
    print "------------------------------"
Running this shows that when you ask for one specific tag the two look similar, but as soon as a query matches a group of tags, the difference between them shows up.
So pay attention to which of the two you are using, otherwise you will get errors.
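For example (a minimal sketch, reusing the s from test.py above), treating the list returned by find_all as if it were a single tag is the typical mistake:

# find returns a single Tag (or None), so .text works directly
tag = s.find('div', id='one')
print tag.text

# find_all returns a list, so .text on the list itself raises AttributeError
tags = s.find_all('div', id='one')
# print tags.text             # AttributeError
print [t.text for t in tags]  # iterate over the list instead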
The next step is getting the web pages with requests. I don't quite understand why other people bother writing headers and that sort of thing.
I just access the site directly: I take the prose site's handful of category (second-level) pages, and then loop over a set of page numbers to crawl every listing page.
import requests

def get_html():
    url = ""  # base URL of the prose site; the article later refers to www.sanwen.net
    two_html = ['sanwen', 'shige', 'zawen', 'suibi', 'rizhi', 'novel']
    for doc in two_html:
        i = 1
        if doc == 'sanwen':
            print "running sanwen -----------------------------"
        if doc == 'shige':
            print "running shige ------------------------------"
        if doc == 'zawen':
            print "running zawen ------------------------------"
        if doc == 'suibi':
            print "running suibi ------------------------------"
        if doc == 'rizhi':
            print "running rizhi ------------------------------"
        if doc == 'novel':
            print "running xiaoxiaoshuo ------------------------"
        while i < 10:
            par = {'p': i}
            res = requests.get(url + doc + '/', params=par)
            if res.status_code == 200:
                soup(res.text)
            i += 1  # the original had i += i, which skips most page numbers
In this part of the code I did not handle the case where res.status_code is not 200. The consequence is that failures are never reported and the content of those pages is simply lost. I then analyzed the prose site's listing pages and found they look like www.sanwen.net/rizhi/&p=1.
Why p tops out at 10 I don't really understand; the last time I crawled there were well over 100 pages. Never mind, I'll analyze that later. The content of each listing page is then fetched with a GET request.
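As a sketch of how the non-200 case could at least be reported instead of silently dropped (this reuses the variable names from the loop above):

res = requests.get(url + doc + '/', params={'p': i})
if res.status_code == 200:
    soup(res.text)
else:
    # report the failure so the missing page can be re-fetched later
    print 'failed to fetch %s page %d (status %d)' % (doc, i, res.status_code)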
Getting each listing page's content means parsing out the title and the author. The code for that is this:
import re
from bs4 import BeautifulSoup

def soup(html_text):
    s = BeautifulSoup(html_text, 'lxml')
    link = s.find('div', class_='categorylist').find_all('li')
    for i in link:
        if i != s.find('li', class_='page'):
            title = i.find_all('a')[1]
            author = i.find_all('a')[2].text
            url = title.attrs['href']
            sign = re.compile(r'(//)|/')
            match = sign.search(title.text)
            file_name = title.text
            if match:
                file_name = sign.sub('a', str(title.text))
Getting the title was a real pain. To the folks who write the prose: why do you put slashes in your titles? And not just one, sometimes two. This directly caused file-name errors later when I wrote the files out, so I wrote a regular expression and substituted the slashes away.
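A quick illustration of what that substitution does (the sample title here is made up):

import re

sign = re.compile(r'(//)|/')
title_text = u'some//title/with slashes'  # hypothetical title
print sign.sub('a', title_text)           # -> someatitleawith slashes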
The last step is getting the prose content itself. Having parsed each listing page, I already have the article addresses, so I just fetch the content directly. Originally I also wanted to grab the articles one by one simply by changing the page address, which would have been convenient as well.
def get_content(url):
    res = requests.get('' + url)  # prepend the site's base URL; it was left blank in the original post
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        contents = soup.find('div', class_='content').find_all('p')
        content = ''
        for i in contents:
            content += i.text + '\n'
        return content
Finally, write the result to a file and we're done.
f = open(file_name + '.txt', 'w')
print 'running w txt ' + file_name + '.txt'
f.write(title.text + '\n')
f.write(author + '\n')
content = get_content(url)
f.write(content)
f.close()
With these three functions I get the prose from the prose site, but there are still problems. The problem is that I don't know why some articles are lost; I can only get about 400 of them, far fewer than the site actually has, even though I fetch it page by page. I hope someone can help me look into this. I probably should handle pages that fail to load; of course, I also suspect my dorm's lousy network has something to do with it.
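A sketch of the kind of "page inaccessible" handling I mean: a small retry wrapper around requests.get for when the connection itself fails (the function name and retry count here are my own choices):

import time
import requests

def get_with_retry(url, params=None, retries=3):
    # retry a few times on network errors or non-200 responses before giving up,
    # so a flaky connection does not silently drop articles
    for attempt in range(retries):
        try:
            res = requests.get(url, params=params)
            if res.status_code == 200:
                return res
        except requests.exceptions.RequestException:
            pass
        time.sleep(1)  # brief pause before retrying
    return None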
I almost forgot.
The code is messy, but I never stop