This article walks through a detailed example of recursive crawling with the Python crawler package BeautifulSoup.
Summary:
A crawler's main job is to follow links across the web and collect the content we want, and this is essentially a recursive process: obtain the content of a webpage, analyze that content to find another URL, obtain the content of that page, and repeat.
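In outline, the process looks something like the sketch below. Here fetch and extract_links are only hypothetical placeholders for "download a page" and "find the URLs on it"; the concrete BeautifulSoup versions appear later in this article.

def crawl(url, visited=None):
    # Conceptual outline only: fetch() and extract_links() are hypothetical placeholders.
    if visited is None:
        visited = set()
    if url in visited:                     # stop if we have already seen this page
        return
    visited.add(url)
    page = fetch(url)                      # obtain the content of the webpage
    for link in extract_links(page):       # analyze the page and find other URLs
        crawl(link, visited)               # repeat the process for each new URL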
Let's take Wikipedia as an example.
We want to extract all links to other entries from the Kevin Bacon entry on Wikipedia.
# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date: 2016-12-25 10:35:00
# @Last Modified by: HaonanWu
# @Last Modified time: 2016-12-25 10:52:26
from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")

for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print link.attrs['href']
The above code can extract all the hyperlinks on the page.
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
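Note that this snippet is written for Python 2 (urllib2 and the print statement). On Python 3, where urllib2 no longer exists, a roughly equivalent sketch would be:

# Python 3 variant of the snippet above; urllib2 was replaced by urllib.request.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")

for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])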
Two problems stand out in the output above. First, the extracted URLs contain duplicates. Second, there are URLs we do not need, such as the links in the sidebar, header, footer, and directory bar.
By inspecting the page, we can see that all links pointing to other entry pages share three features:
They are all inside the div tag whose id is bodyContent.
The URL does not contain a colon.
The URLs are relative paths that start with /wiki/ (the page also contains complete absolute paths starting with http, which would otherwise be picked up as well).
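These three rules can be expressed with a single regular expression, which the next snippet passes to findAll. As a quick standalone check of how that pattern behaves (the sample URLs here are only illustrations):

import re

# The pattern used below: starts with /wiki/ and contains no colon after that.
entry_link = re.compile("^(/wiki/)((?!:).)*$")

print(bool(entry_link.match("/wiki/Kevin_Bacon")))                        # True  - a normal entry link
print(bool(entry_link.match("/wiki/Wikipedia:Protection_policy")))        # False - contains a colon
print(bool(entry_link.match("#mw-head")))                                 # False - not a /wiki/ path
print(bool(entry_link.match("http://en.wikipedia.org/wiki/Kevin_Bacon"))) # False - absolute URL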
from urllib2 import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

pages = set()
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    # Fetch the entry page and return all links that point to other entries.
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Randomly pick one of the entry links on the current page.
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    if newArticle not in pages:
        print(newArticle)
        pages.add(newArticle)
        links = getLinks(newArticle)
getLinks takes a parameter of the form /wiki/<entry name>, joins it with Wikipedia's domain to build the full URL of the page, uses a regular expression to capture all URLs pointing to other entries, and returns them to the caller.
The main loop keeps calling getLinks and randomly follows a link that has not been visited yet, until no links are left or the program is stopped.
The following code can crawl the entire Wikipedia site:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Note that Python's default recursion limit is 1000, so for a crawl this deep you need to raise the recursion limit manually, or restructure the code (for example, as an iterative loop) so it can keep running past 1000 levels of recursion.
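As a rough sketch, the limit can be raised with the standard sys module; the value 10000 below is only an illustrative choice, and very deep recursion may still exhaust the interpreter's C stack:

import sys

# Raise CPython's recursion limit before starting a deep recursive crawl.
# 10000 is an arbitrary example value, not a recommendation.
sys.setrecursionlimit(10000)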
Thank you for reading this article. I hope it helps you, and thank you for your support of this site!