Use Python to crawl articles from the prose network (sanwen.net)


Configure Python 2.7 with two packages:

    bs4
    requests

Install them with pip:

    sudo pip install bs4
    sudo pip install requests
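A quick import check (just a sanity check, not part of the original walkthrough) confirms that both packages are available:

    # verify that bs4 and requests import, and print their versions (Python 2.7)
    import bs4
    import requests
    print bs4.__version__, requests.__version__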

This section briefly describes how to use bs4, since it does the actual parsing of the crawled pages; we will only introduce find and find_all here.

The difference between find and find_all is in what they return: find returns the first matching tag together with its contents, while find_all returns a list of all matching tags.

For example, we wrote a small test.html to check the difference between find and find_all.
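A minimal test.html consistent with the ids that test.py queries below (the exact markup here is an assumption) could look like this:

    <html>
    <body>
        <div id="one">first div</div>
        <div id="two">second div</div>
        <div id="three">third div</div>
        <div id="four">fourth div</div>
    </body>
    </html>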

 

 

Then the code for test.py is:

    from bs4 import BeautifulSoup
    import lxml

    if __name__ == '__main__':
        s = BeautifulSoup(open('test.html'), 'lxml')
        print s.prettify()
        print "------------------------------"
        print s.find('div')
        print s.find_all('div')
        print "------------------------------"
        print s.find('div', id='one')
        print s.find_all('div', id='one')
        print "------------------------------"
        print s.find('div', id="two")
        print s.find_all('div', id="two")
        print "------------------------------"
        print s.find('div', id="three")
        print s.find_all('div', id="three")
        print "------------------------------"
        print s.find('div', id="four")
        print s.find_all('div', id="four")
        print "------------------------------"

 

 

After running it, we can see that the results are almost the same when a single tag matches; the difference only shows up when a group of tags matches: find still returns just the first one, while find_all returns the whole list.



So pay attention to which of the two you actually need when using them; otherwise an error will occur.
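For example (a small illustration reusing the test.html sketch above, not code from the original article), calling .text on the single tag returned by find works, while the same call on the list returned by find_all raises an AttributeError:

    from bs4 import BeautifulSoup

    s = BeautifulSoup(open('test.html'), 'lxml')
    print s.find('div', id='one').text                     # fine: find returns a single Tag
    # print s.find_all('div', id='one').text               # AttributeError: the ResultSet (a list) has no .text
    print [d.text for d in s.find_all('div', id='one')]    # iterate over the list instead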
The next step is to fetch the pages with requests. I don't know why other people bother to set headers and other options.
I just access the site directly, fetch the prose network's section pages with the get method, and then crawl all the list pages in a loop.

    # assumes "import requests" at the top of the script; soup() is defined below
    def get_html():
        url = "https://www.sanwen.net/"
        two_html = ['sanwen', 'shige', 'zawen', 'suibi', 'rizhi', 'novel']
        for doc in two_html:
            i = 1
            if doc == 'sanwen':
                print "running sanwen -----------------------------"
            if doc == 'shige':
                print "running shige ------------------------------"
            if doc == 'zawen':
                print 'running zawen -------------------------------'
            if doc == 'suibi':
                print 'running suibi -------------------------------'
            if doc == 'rizhi':
                print 'running rizhi -------------------------------'
            if doc == 'novel':   # the original compared against 'nove', so this branch never fired
                print 'running xiaoxiaoshuo -------------------------'
            while i < 10:
                par = {'p': i}
                res = requests.get(url + doc + '/', params=par)
                if res.status_code == 200:
                    soup(res.text)
                i += i   # doubles i each pass (1, 2, 4, 8), so only pages 1, 2, 4 and 8 are fetched

 

 

In this part of the code I did not handle the case where res.status_code is not 200. The problem with that is no error is shown and the crawled content is simply lost. I then analyzed the prose network's list pages and found that the URLs have the form www.sanwen.net/rizhi/?p=1.
The maximum value of p is 10, which is hard to understand, since the last time I crawled the site there were more than 100 pages; I will analyze that later. The content of each list page is then fetched with the get method.
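As a quick sanity check on that URL form (an illustration only, not part of the original script), requests builds exactly that address from the params argument:

    import requests

    # show the URL that requests builds from the params dict
    req = requests.Request('GET', 'https://www.sanwen.net/rizhi/', params={'p': 1}).prepare()
    print req.url   # https://www.sanwen.net/rizhi/?p=1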
After each list page is obtained, the author and title are parsed out. The code looks like this.

    # assumes "from bs4 import BeautifulSoup" and "import re" at the top of the script
    def soup(html_text):
        s = BeautifulSoup(html_text, 'lxml')
        # every article on a list page sits in an <li> inside div.categorylist
        link = s.find('div', class_='categorylist').find_all('li')
        for i in link:
            if i != s.find('li', class_='page'):   # skip the pagination <li>
                title = i.find_all('a')[1]
                author = i.find_all('a')[2].text
                url = title.attrs['href']
                # some titles contain '/' or '//', which breaks the file name later,
                # so they are replaced before the title is used as a file name
                sign = re.compile(r'(//)|/')
                match = sign.search(title.text)
                file_name = title.text
                if match:
                    file_name = sign.sub('a', str(title.text))

 

 

There is one catch when getting the title. Why do some titles on the prose network contain a slash, and sometimes even two in a row? This directly caused an error in the file name when writing the files later, so I wrote a regular expression to replace the slashes.
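A small illustration of what that substitution does, using a made-up title:

    import re

    sign = re.compile(r'(//)|/')                  # matches '//' or a single '/'
    print sign.sub('a', u'title/with//slashes')   # -> titleawithaslashes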
Finally the essay content itself is obtained. The article address is taken from each list page, and the content is then fetched directly from it. Originally I wanted to get the articles one by one just by modifying the web page address, to save some work.

    def get_content(url):
        # url is the relative href taken from the list page
        res = requests.get('https://www.sanwen.net' + url)
        if res.status_code == 200:
            soup = BeautifulSoup(res.text, 'lxml')
            contents = soup.find('div', class_='content').find_all('p')
            content = ''
            for i in contents:
                content += i.text + '\n'
            return content
        # note: a non-200 response falls through and returns None,
        # which later makes f.write(content) fail

 

 

The last step is to write everything to a file, and we are done.

    # this snippet continues inside the loop in soup(); in Python 2 the unicode text
    # may need an explicit .encode('utf-8') before writing, otherwise non-ASCII titles
    # and content can raise UnicodeEncodeError
    f = open(file_name + '.txt', 'w')
    print 'running w txt ' + file_name + '.txt'
    f.write(title.text + '\n')
    f.write(author + '\n')
    content = get_content(url)
    f.write(content)
    f.close()

 

The problem is that I don't know why some essays are lost. I can only get a bit more than 400 articles, which is far fewer than the prose network actually holds, even though they are collected page by page. I hope someone can take a look at this issue; some pages may simply not be reachable, and of course I suspect my dormitory network has something to do with it too.
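One likely contributor, judging from the code above rather than from anything stated in the write-up, is the i += i increment in get_html, which only ever requests pages 1, 2, 4 and 8 of each section. A version of that loop stepping by one (a sketch reusing url, doc and soup() from get_html) would visit every list page:

    i = 1
    while i < 10:
        res = requests.get(url + doc + '/', params={'p': i})
        if res.status_code == 200:
            soup(res.text)
        i += 1   # step by one so that pages 1 through 9 are all fetched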


I almost forgot.

The code is messy, but I never stop.
