Python Crawling Example: Scraping Articles from a Prose Website

Source: Internet
Author: User
This article introduces how to crawl articles from a prose website (sanwen.net) with Python. The walkthrough is fairly detailed and should have some reference value for anyone learning web scraping; let's go through it together.

The environment is as follows:

Python 2.7

bs4 and requests

Install both with pip:

sudo pip install bs4


sudo pip install requests

A quick explanation of how bs4 is used: since we are crawling web pages, I will only introduce find and find_all.

The difference between find and find_all is what they return: find returns the first matching tag together with its contents.

find_all returns a list of all matching tags.

For example, let's write a test.html to test the difference between find and find_all.

The content is:
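The test.html itself is not reproduced here; a minimal version that matches the ids queried by the test script below could look like this, though the exact markup is my guess:

<html>
 <head>
  <title>test</title>
 </head>
 <body>
  <p id="one">first paragraph</p>
  <p id="">second paragraph</p>
  <p id="three">third paragraph</p>
  <p id="three">another paragraph reusing id "three"</p>
 </body>
</html>

With markup like this, the queries for id "one" and "three" match something, while the queries for id "four" match nothing, which is exactly the contrast the script is meant to show.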


Then the test.py code is:


from bs4 import BeautifulSoup
import lxml

if __name__ == '__main__':
    s = BeautifulSoup(open('test.html'), 'lxml')
    print s.prettify()
    print "------------------------------"
    print s.find('p')
    print s.find_all('p')
    print "------------------------------"
    print s.find('p', id='one')
    print s.find_all('p', id='one')
    print "------------------------------"
    print s.find('p', id='')
    print s.find_all('p', id='')
    print "------------------------------"
    print s.find('p', id='three')
    print s.find_all('p', id='three')
    print "------------------------------"
    print s.find('p', id='four')
    print s.find_all('p', id='four')
    print "------------------------------"

Running it, we can see the results both when a single specified tag is fetched and when a group of tags is fetched, and the difference between the two is clear in the output.


So pay attention to which one you are using, otherwise you will get an error.
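In particular (a small illustrative sketch, not part of the original article): when nothing matches, find returns None while find_all returns an empty list, so chaining .text onto a failed find raises an AttributeError.

from bs4 import BeautifulSoup

s = BeautifulSoup(open('test.html'), 'lxml')

print s.find('p', id='does-not-exist')      # prints: None
print s.find_all('p', id='does-not-exist')  # prints: []

# This is where the error comes from:
# s.find('p', id='does-not-exist').text
#   -> AttributeError: 'NoneType' object has no attribute 'text'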

The next step is to fetch the pages with requests. I don't quite understand why other people bother writing headers and that sort of thing.
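For reference, this is roughly what those header-carrying requests look like; the User-Agent value here is just a placeholder, and this is not needed for the approach in this article:

import requests

# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder value
res = requests.get('https://www.sanwen.net/', headers=headers)
print res.status_code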

I just access the site directly: I get the prose site's category pages (the second-level pages), then run through a set of page numbers and crawl every listing page in each category.


import requests

def get_html():
    url = "https://www.sanwen.net/"
    two_html = ['sanwen', 'shige', 'zawen', 'suibi', 'rizhi', 'novel']
    for doc in two_html:
        i = 1
        if doc == 'sanwen':
            print "running sanwen -----------------------------"
        if doc == 'shige':
            print "running shige ------------------------------"
        if doc == 'zawen':
            print "running zawen -------------------------------"
        if doc == 'suibi':
            print "running suibi -------------------------------"
        if doc == 'rizhi':
            print "running rizhi -------------------------------"
        if doc == 'novel':
            print "running xiaoxiaoshuo -------------------------"
        while i < 10:
            par = {'p': i}
            res = requests.get(url + doc + '/', params=par)
            if res.status_code == 200:
                soup(res.text)  # parse the listing page (defined below)
                i += 1

In this part of the code I did not handle the case where res.status_code is not 200. The consequence is that the error is never displayed and the crawled content of those pages is silently lost. Analyzing the prose site's pages, the listing URLs turn out to look like www.sanwen.net/rizhi/&p=1
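A minimal sketch of what that missing handling could look like (my own addition, with an assumed retry count, not part of the original code):

import time
import requests

def fetch(url, params, retries=3):
    # Retry a few times and report failures instead of dropping them silently.
    for attempt in range(retries):
        res = requests.get(url, params=params)
        if res.status_code == 200:
            return res.text
        print "got status %d for %s (attempt %d)" % (res.status_code, res.url, attempt + 1)
        time.sleep(1)
    return None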

The maximum value of p seems to be 10, which I don't really understand, since the last time I crawled there were something like 100 pages. Never mind, I'll analyze that later. The content of each listing page is then obtained with a GET request.
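For reference, the params argument is what appends the page number to the URL as a query parameter (a small sketch, not from the original article):

import requests

res = requests.get('https://www.sanwen.net/rizhi/', params={'p': 2})
print res.url          # the page number is appended as a query parameter
print res.status_code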

After fetching each listing page, the next step is to parse out the author and the title. The code for that is this:


import re
from bs4 import BeautifulSoup

def soup(html_text):
    s = BeautifulSoup(html_text, 'lxml')
    link = s.find('p', class_='categorylist').find_all('li')
    for i in link:
        if i != s.find('li', class_='page'):
            title = i.find_all('a')[1]
            author = i.find_all('a')[2].text
            url = title.attrs['href']
            sign = re.compile(r'(//)|/')
            match = sign.search(title.text)
            file_name = title.text
            if match:
                file_name = sign.sub('a', str(title.text))

Getting the title was a real pain. To the people posting prose on this site: why do you put slashes in your titles, and not just one but sometimes two? That directly caused file-name errors when writing the files later, so I wrote a regular expression to replace them for you.
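As a quick illustration of what that substitution does (my own example titles, not from the site):

import re

sign = re.compile(r'(//)|/')

print sign.sub('a', 'my/title')    # prints: myatitle
print sign.sub('a', 'my//title')   # prints: myatitle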

The last step is to fetch the prose content itself. From the analysis of each listing page we already have the article address, so we can request the content directly. Originally I also wanted to obtain the articles one by one straight from their URLs, which would have been convenient as well.


def get_content(url):
    res = requests.get('https://www.sanwen.net' + url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        contents = soup.find('p', class_='content').find_all('p')
        content = ''
        for i in contents:
            content += i.text + '\n'
        return content

Finally, write the content to a file and we are done.


f = open(file_name + '.txt', 'w')
print 'running w txt ' + file_name + '.txt'
f.write(title.text + '\n')
f.write(author + '\n')
content = get_content(url)
f.write(content)
f.close()
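One aside from me (not in the original article): under Python 2, writing Chinese text to a file opened this way can raise UnicodeEncodeError, so a safer variant is to open the file with an explicit encoding:

import codecs

# Same write logic, but with an explicit utf-8 encoding on the file handle.
f = codecs.open(file_name + '.txt', 'w', encoding='utf-8')
f.write(title.text + '\n')
f.write(author + '\n')
f.write(get_content(url))
f.close()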

These three functions are enough to crawl the prose site's essays, but there are still problems. I don't know why some essays are lost; I can only get about 400 articles, far fewer than the site actually has, even though I fetch it page by page. I hope someone more experienced can take a look at this. It probably needs handling for pages that fail to load, although I suspect my dorm's terrible network also has something to do with it.



Oh, and I almost forgot:


Timeouts can occur. All I can say is: when you're at university, make sure you pick a place with a good network!
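If you want the crawler to survive those timeouts instead of hanging or dying, a small sketch like this helps (the timeout value is my own assumption):

import requests

try:
    # Give up on a page after 10 seconds instead of waiting forever.
    res = requests.get('https://www.sanwen.net/', timeout=10)
except requests.exceptions.Timeout:
    print 'request timed out, skipping this page'
except requests.exceptions.RequestException as e:
    print 'request failed: %s' % e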
