How I crawled articles from a prose website with Python


Configure Python 2.7

Required packages: bs4 and requests.

Install them with pip:

sudo pip install bs4
sudo pip install requests

Let me briefly explain the parts of bs4 we use for crawling: find and find_all.

The difference between find and find_all is in what they return: find returns the first matching tag together with its contents, while find_all returns a list of all matching tags.

For example, let's write a test.html to test the difference between find and find_all. Its content is:
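(The test.html content did not survive in this copy of the article; below is only a minimal sketch, assuming a handful of divs whose ids match the ones queried in the test script further down.)

<!DOCTYPE html>
<html>
  <body>
    <div id="one">first div</div>
    <div id="two">second div</div>
    <div id="three">third div</div>
    <div id="four">fourth div</div>
  </body>
</html>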



Then the test.py code is:

# test.py
from bs4 import BeautifulSoup
import lxml

if __name__ == '__main__':
    s = BeautifulSoup(open('test.html'), 'lxml')
    print s.prettify()
    print "------------------------------"
    print s.find('div')
    print s.find_all('div')
    print "------------------------------"
    print s.find('div', id='one')
    print s.find_all('div', id='one')
    print "------------------------------"
    print s.find('div', id='two')      # the id value was missing in this copy; 'two' assumed from the pattern
    print s.find_all('div', id='two')
    print "------------------------------"
    print s.find('div', id='three')
    print s.find_all('div', id='three')
    print "------------------------------"
    print s.find('div', id='four')
    print s.find_all('div', id='four')
    print "------------------------------"


From the output you can see the results: when you fetch a single specified tag, find gives you the tag itself, and when you fetch a group of tags, find_all gives you a list, which is where the difference shows.



So pay attention to which one you are using, otherwise you will get errors.
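For example, a minimal sketch of the kind of mistake this causes, reusing the s object from test.py above:

tag = s.find('div', id='one')
print tag.text              # fine: find returns a single Tag

tags = s.find_all('div', id='one')
# print tags.text           # AttributeError: find_all returns a list-like ResultSet, not a Tag
print tags[0].text          # index into the result first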
The next step is to fetch the pages with requests. I don't quite understand why other people bother writing request headers and so on.
I just access the site directly: first get the second-level pages for each category of the prose site, and then loop over the page numbers to crawl all of the listing pages.
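For comparison, a minimal sketch of a plain GET versus one that also sends request headers (the URL and the User-Agent value here are only illustrative, not taken from the original code):

import requests

# plain access, as in this article
res = requests.get('https://www.sanwen.net/sanwen/')

# the same request with an explicit header, which other crawler write-ups often add
headers = {'User-Agent': 'Mozilla/5.0'}   # illustrative value
res = requests.get('https://www.sanwen.net/sanwen/', headers=headers)
print res.status_code

The article itself just uses the plain form. The crawling function is: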

# requires: import requests (the soup() helper used below is defined later in the article)
def get_html():
    url = ""   # the base URL was left blank in this copy; the article later refers to www.sanwen.net
    two_html = ['sanwen', 'shige', 'zawen', 'suibi', 'rizhi', 'novel']
    for doc in two_html:
        i = 1
        if doc == 'sanwen':
            print "running sanwen -----------------------------"
        if doc == 'shige':
            print "running shige ------------------------------"
        if doc == 'zawen':
            print "running zawen ------------------------------"
        if doc == 'suibi':
            print "running suibi ------------------------------"
        if doc == 'rizhi':
            print "running rizhi ------------------------------"
        if doc == 'novel':
            print "running xiaoxiaoshuo -----------------------"
        while i < 10:
            par = {'p': i}
            res = requests.get(url + doc + '/', params=par)
            if res.status_code == 200:
                soup(res.text)
            i += 1   # the original had "i += i", which skips page numbers


In this part of the code I did not handle the case where res.status_code is not 200. The problem this causes is that no error is shown and the content of those pages is silently lost. Analyzing the prose site's pages, I found the addresses look like www.sanwen.net/rizhi/&p=1.
I cap p at 10, which I don't really understand, because the last time I crawled there were well over 100 pages; never mind, I'll analyze that later. The content of each listing page is then fetched with a GET request.
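A minimal sketch of how the paginated requests look and how a failed page could at least be reported instead of silently dropped (the base URL and scheme are assumed from the address above):

import requests

base = 'https://www.sanwen.net/'          # assumed from www.sanwen.net/rizhi/&p=1 above
for i in range(1, 10):
    res = requests.get(base + 'rizhi/', params={'p': i})
    if res.status_code == 200:
        print 'page %d ok, %d characters' % (i, len(res.text))
    else:
        # record the failure so the lost pages can be re-crawled later
        print 'page %d failed with status %d' % (i, res.status_code)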
After fetching each listing page, the next step is to parse out the author and the title. The code for that is this:

# requires: import re, from bs4 import BeautifulSoup
def soup(html_text):
    s = BeautifulSoup(html_text, 'lxml')
    link = s.find('div', class_='categorylist').find_all('li')
    for i in link:
        if i != s.find('li', class_='page'):
            title = i.find_all('a')[1]
            author = i.find_all('a')[2].text
            url = title.attrs['href']
            sign = re.compile(r'(//)|/')
            match = sign.search(title.text)
            file_name = title.text
            if match:
                file_name = sign.sub('a', str(title.text))


Getting the title was a real pain. Dear prose authors, why do you put slashes in your titles, and not just one, sometimes two? This later caused errors when I used the title as a file name, so I wrote a regular expression and swapped the slashes out for you.
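A small usage sketch of that regular expression, replacing single or double slashes in a title with the letter 'a' just as the code above does (the sample titles are made up):

import re

sign = re.compile(r'(//)|/')
print sign.sub('a', 'spring/rain')        # -> 'springarain'
print sign.sub('a', 'spring//rain')       # -> 'springarain' (a double slash becomes a single 'a')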
The last step is to get the prose content itself: from each listing page we already have each article's address, so we just fetch the content directly. I had originally also thought of grabbing the articles one by one straight from their URLs, which would have been convenient as well.

def get_content(url):
    # the base-URL string literal in front of url was lost in this copy;
    # presumably it is the site root (www.sanwen.net)
    res = requests.get('' + url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        contents = soup.find('div', class_='content').find_all('p')
        content = ''
        for i in contents:
            content += i.text + '\n'
        return content


Finally, write the content to a file and we're done:

    f = open(file_name + '.txt', 'w')
    print 'running w txt ' + file_name + '.txt'
    f.write(title.text + '\n')
    f.write(author + '\n')
    content = get_content(url)
    f.write(content)
    f.close()
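One thing to watch out for in Python 2: res.text and title.text are unicode, so writing Chinese text through a plain open() can raise UnicodeEncodeError. A minimal sketch of a safer variant using codecs (this is an addition, not the original code):

import codecs

f = codecs.open(file_name + '.txt', 'w', encoding='utf-8')
f.write(title.text + '\n')
f.write(author + '\n')
f.write(get_content(url))
f.close()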

So, three functions to fetch the prose from the prose site. But there are problems: I don't know why, some articles get lost and I can only collect about 400 of them, far fewer than the site actually has, even though I crawl it page by page. I hope someone more experienced can take a look. It probably needs handling for pages that are temporarily unreachable; of course, I also suspect my dorm's broken network has something to do with it.
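A minimal sketch of what that handling could look like, retrying a failed request a few times before giving up (this is an addition, not part of the original code):

import time
import requests

def get_with_retry(url, params=None, retries=3):
    # retry a few times so a brief network hiccup does not silently drop a page
    for attempt in range(retries):
        try:
            res = requests.get(url, params=params, timeout=10)
            if res.status_code == 200:
                return res
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)          # back off a little before retrying
    return None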


I almost forgot.

The code is messy, but I never stop.
