Python crawler with BeautifulSoup: a detailed example of recursive fetching
Overview:
The main purpose of a crawler is to fetch the content it needs by following links across the network, and it is essentially a recursive process: first get the content of a page, then parse that content to find further URLs, then fetch the pages behind those URLs, and repeat.
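Before turning to the concrete example, here is a minimal sketch of that recursive shape. It is only an illustration (crawl and visited are names invented here, not code used later in this article); it keeps a set of visited URLs so that no page is fetched twice.

from urllib.request import urlopen
from bs4 import BeautifulSoup

visited = set()

def crawl(url):
    # stop if we have already processed this page
    if url in visited:
        return
    visited.add(url)
    # 1. get the content of the page
    html = urlopen(url)
    # 2. analyze the page content and find other URLs
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # 3. repeat the process for every absolute link found
        if href.startswith("http"):
            crawl(href)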
Let's take Wikipedia as an example.
We want to extract, from the Kevin Bacon entry on Wikipedia, all the links that point to other entries.
# -*- coding: utf-8 -*-
# @Author: haonanwu
# @Date:   2016-12-25 10:35:00
# @Last Modified by:   haonanwu
# @Last Modified time: 2016-12-25 10:52:26
from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print link.attrs['href']
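A quick aside in case you are on Python 3: the urllib2 module was folded into urllib.request there, and recent versions of BeautifulSoup prefer the find_all spelling (findAll is kept only as a legacy alias). A roughly equivalent Python 3 version of the snippet above would look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.find_all("a"):
    if "href" in link.attrs:
        print(link.attrs["href"])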
This code extracts every hyperlink on the page; the output looks something like this:
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
First, the extracted URLs may contain duplicates.
Second, some of the URLs are ones we do not need, such as links from the sidebar, the header, the footer, and the table of contents.
By inspecting the page, we can see that all the links pointing to entry pages share three features:
They all sit inside the div tag whose id is bodyContent.
The URL does not contain a colon.
The URL is a relative path beginning with /wiki/ (rather than a full absolute URL beginning with http).
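The last two features can be captured by a single regular expression (the first one is handled separately by restricting the search to the bodyContent div). The small standalone check below, which reuses a few hrefs from the sample output above, shows how the pattern used in the next code block keeps entry links and drops everything else:

import re

# The href must start with /wiki/ and must not contain a colon after that.
entry_link = re.compile("^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kyra_Sedgwick",                     # entry page  -> kept
    "/wiki/Wikipedia:Protection_policy#semi",  # has a colon -> dropped
    "#mw-head",                                # not /wiki/  -> dropped
    "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",    # has a colon -> dropped
]
for href in samples:
    print(href, bool(entry_link.match(href)))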
from urllib2 import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

pages = set()
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    if newArticle not in pages:
        print(newArticle)
        pages.add(newArticle)
        links = getLinks(newArticle)
getLinks takes a parameter of the form /wiki/<entry name>, joins it with the Wikipedia domain to get the full URL of the page, uses the regular expression to capture all the links that point to other entries, and returns them to the main program.
The main program calls getLinks in a loop and randomly follows one of the returned links that has not been visited yet, until no more links can be obtained or the program is stopped manually.
The following code can crawl the whole of Wikipedia, starting from the home page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
By default Python limits recursion to a depth of 1000, so to keep this code running beyond 1000 nested calls you have to raise the recursion limit yourself, or restructure the code so that it does not recurse at all; both options are sketched below.
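Two possible ways to deal with that limit, sketched here under the assumption that the crawl logic stays the same as above: raise the interpreter's limit with sys.setrecursionlimit, or drop the recursion entirely and keep an explicit list of pages still to visit, which has no depth limit at all.

import re
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Option 1: raise the recursion limit (the default is 1000). Very deep
# recursion can still crash the interpreter, so treat this as a stopgap.
sys.setrecursionlimit(10000)

# Option 2: replace the recursion with an explicit to-do list.
pages = set()
toVisit = [""]  # "" is the Wikipedia home page, as in getLinks("")
while toVisit:
    pageUrl = toVisit.pop()
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        newPage = link.attrs["href"]
        if newPage not in pages:
            print("----------------\n" + newPage)
            pages.add(newPage)
            toVisit.append(newPage)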
Thank you for reading. I hope this helps, and thank you for supporting this site!