Python crawler with BeautifulSoup: a detailed example of recursive fetching
Overview:
The main purpose of a crawler is to fetch the content it needs by following links across the network, and it is essentially a recursive process: first get the content of a page, then parse that content to find further URLs, then fetch the pages behind those URLs, and repeat.
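Before turning to the concrete example, here is a minimal sketch of that recursive shape. It is only an illustration (crawl and visited are names invented here, not code used later in this article); it keeps a set of visited URLs so that no page is fetched twice.

from urllib.request import urlopen
from bs4 import BeautifulSoup

visited = set()

def crawl(url):
    # stop if we have already processed this page
    if url in visited:
        return
    visited.add(url)
    # 1. get the content of the page
    html = urlopen(url)
    # 2. analyze the page content and find other URLs
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # 3. repeat the process for every absolute link found
        if href.startswith("http"):
            crawl(href)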
Let's take Wikipedia as an example.
We want to extract, from the Kevin Bacon entry on Wikipedia, all the links that point to other entries.
# -*- coding: utf-8 -*-
# @Author: haonanwu
# @Date:   2016-12-25 10:35:00
# @Last Modified by:   haonanwu
# @Last Modified time: 2016-12-25 10:52:26
from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print link.attrs['href']
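A quick aside in case you are on Python 3: the urllib2 module was folded into urllib.request there, and recent versions of BeautifulSoup prefer the find_all spelling (findAll is kept only as a legacy alias). A roughly equivalent Python 3 version of the snippet above would look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.find_all("a"):
    if "href" in link.attrs:
        print(link.attrs["href"])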
This code extracts every hyperlink on the page; the output looks something like this:
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
First, the extracted URLs may contain duplicates.
Second, some of the URLs are ones we do not need, such as links from the sidebar, the header, the footer, and the table of contents.
By inspecting the page, we can see that all the links pointing to entry pages share three features:
They all sit inside the div tag whose id is bodyContent.
The URL does not contain a colon.
The URL is a relative path beginning with /wiki/ (rather than a full absolute URL beginning with http).
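The last two features can be captured by a single regular expression (the first one is handled separately by restricting the search to the bodyContent div). The small standalone check below, which reuses a few hrefs from the sample output above, shows how the pattern used in the next code block keeps entry links and drops everything else:

import re

# The href must start with /wiki/ and must not contain a colon after that.
entry_link = re.compile("^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kyra_Sedgwick",                     # entry page  -> kept
    "/wiki/Wikipedia:Protection_policy#semi",  # has a colon -> dropped
    "#mw-head",                                # not /wiki/  -> dropped
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",    # has a colon -> dropped
]
for href in samples:
    print(href, bool(entry_link.match(href)))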
from urllib2 import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

pages = set()
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    if newArticle not in pages:
        print(newArticle)
        pages.add(newArticle)
        links = getLinks(newArticle)
getLinks takes a parameter of the form /wiki/<entry name>, joins it with the Wikipedia domain to get the full URL of the page, uses the regular expression to capture all the links that point to other entries, and returns them to the main program.
The main program calls getLinks in a loop and randomly follows one of the returned links that has not been visited yet, until no more links can be obtained or the program is stopped manually.
The following code can crawl the whole of Wikipedia, starting from the home page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
By default Python limits recursion to a depth of 1000, so to keep this code running beyond 1000 nested calls you have to raise the recursion limit yourself, or restructure the code so that it does not recurse at all; both options are sketched below.
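Two possible ways to deal with that limit, sketched here under the assumption that the crawl logic stays the same as above: raise the interpreter's limit with sys.setrecursionlimit, or drop the recursion entirely and keep an explicit list of pages still to visit, which has no depth limit at all.

import re
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Option 1: raise the recursion limit (the default is 1000). Very deep
# recursion can still crash the interpreter, so treat this as a stopgap.
sys.setrecursionlimit(10000)

# Option 2: replace the recursion with an explicit to-do list.
pages = set()
toVisit = [""]  # "" is the Wikipedia home page, as in getLinks("")
while toVisit:
    pageUrl = toVisit.pop()
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        newPage = link.attrs["href"]
        if newPage not in pages:
            print("----------------\n" + newPage)
            pages.add(newPage)
            toVisit.append(newPage)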
Thank you for reading. I hope this helps, and thank you for supporting this site!