Python crawler with BeautifulSoup: a detailed example of recursive fetching

Overview:

The main purpose of a crawler is to collect the content you need from across the network. At its core this is a recursive process: the crawler first gets the content of a page, then analyzes that content to find another URL, then gets the content of that URL's page, and repeats the process.
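
Before turning to the Wikipedia example, that loop can be sketched in a few lines. The crawl function below is only an illustration of the process (its names are not taken from the article's code), and it naively follows every absolute link it finds:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawl(url, visited=None):
    # Minimal sketch of the recursive crawl described above.
    if visited is None:
        visited = set()
    if url in visited:
        return
    visited.add(url)
    html = urlopen(url)                          # 1. get the content of the page
    bsObj = BeautifulSoup(html, "html.parser")   # 2. analyze the page content
    for link in bsObj.findAll("a", href=True):   # 3. find other URLs
        href = link.attrs["href"]
        if href.startswith("http"):              # follow only absolute links in this sketch
            crawl(href, visited)                 # 4. repeat the process for each new URL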

Let's take Wikipedia as an example.

We want to extract, from the Kevin Bacon entry on Wikipedia, all the links that point to other entries.

# -*- coding: utf-8 -*-
# @Author: haonanwu
# @Date:   2016-12-25 10:35:00
# @Last Modified by:   haonanwu
# @Last Modified time: 2016-12-25 10:52:26
from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print link.attrs['href']

The above code extracts all the hyperlinks on the page:

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick

First, the extracted URLs contain duplicates.

Second, there are URLs we don't need, such as the links in the sidebar, header, footer, and table of contents.

So, by observing the page, we can find that all links pointing to entry pages share three features (a quick check of the filtering rules is sketched right after this list):

They are all inside the div tag whose id is bodyContent.

The URLs do not contain a colon.

The URLs are relative paths beginning with /wiki/ (not full absolute paths beginning with http).
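
The second and third rules can be expressed with a single regular expression, the same one used in the code below. Here is a quick sanity check against the hrefs shown earlier (the wikiLink name is only for illustration):

import re

wikiLink = re.compile("^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kevin_Bacon",                           # entry link -> keep
    "/wiki/Wikipedia:Protection_policy#semi",      # contains a colon -> drop
    "#mw-head",                                    # in-page anchor -> drop
    "http://en.wikipedia.org/wiki/Philadelphia",   # absolute path -> drop
]

for href in samples:
    print(href, "->", bool(wikiLink.match(href)))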

from urllib2 import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

pages = set()
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    if newArticle not in pages:
        print(newArticle)
        pages.add(newArticle)
        links = getLinks(newArticle)

The getLinks function takes a path of the form /wiki/<entry name> as its parameter and joins it with Wikipedia's domain to get the full URL of the page. It then captures all URLs that point to other entries with a regular expression and returns them to the main routine.

The main routine calls getLinks repeatedly, each time randomly following a URL that has not been visited yet, until no more entry links can be found or the program is stopped manually.

The following code can crawl the whole of Wikipedia:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

By default, Python limits recursion to a depth of about 1000, so you need to raise the recursion limit manually or switch to a non-recursive approach if the crawler is to keep running after 1000 levels of recursion.
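
One way to raise the limit is with the standard sys module before calling getLinks. This is only a sketch (5000 is an arbitrary value), and very deep recursion can still exhaust the interpreter's stack, so rewriting the crawl as a loop over a list of pending links is the more robust alternative.

import sys

print(sys.getrecursionlimit())   # the default is typically 1000
sys.setrecursionlimit(5000)      # allow getLinks() to recurse more deeply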

Thank you for reading. I hope this helps, and thank you for supporting this site!
