Recursive crawling with the Python crawler package BeautifulSoup: a detailed example

Summary:

A crawler's main job is to work its way along the web and grab the content we want. At its core this is a recursive process: obtain the content of a webpage, analyze it, find another URL on it, obtain that page's content in turn, and repeat.
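Stripped of any site-specific details, that fetch-analyze-follow loop can be sketched roughly as follows (a minimal illustration, not code from this article; the crawl function and its stopping rule are assumptions):

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(url, visited):
  # Skip pages we have already seen so the recursion eventually stops.
  if url in visited:
    return
  visited.add(url)
  html = urlopen(url)                         # 1. obtain the page content
  bsObj = BeautifulSoup(html, "html.parser")  # 2. analyze the content
  for link in bsObj.findAll("a", href=True):  # 3. find other URLs
    crawl(urljoin(url, link['href']), visited)  # 4. repeat the process

# crawl("http://en.wikipedia.org/wiki/Kevin_Bacon", set())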

Let's take Wikipedia as an example.

We want to extract all the links that point to other entries from the Kevin Bacon entry on Wikipedia.

# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-25 10:35:00
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-25 10:52:26
from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bsObj = BeautifulSoup(html, "html.parser")

for link in bsObj.findAll("a"):
  if 'href' in link.attrs:
    print link.attrs['href']

The above code extracts all the hyperlinks on the page; part of the output looks like this:

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/San_Diego_Comic-Con
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick

First, the extracted URLs may contain duplicates.

Second, some of these URLs are ones we don't need, such as links from the sidebar, header, footer, and table of contents.

By observation, we can see that all the links pointing to entry pages share three features (a quick check follows the list):

They all sit inside the div whose id is bodyContent.

The URL does not contain a colon.

The URLs are all relative paths starting with /wiki/ (complete absolute paths starting with http are not matched).
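These three rules are exactly what the regular expression in the next listing encodes. Here is a quick check against a few of the links printed above (this snippet is not part of the original article):

import re

entryLink = re.compile("^(/wiki/)((?!:).)*$")

samples = [
  "/wiki/Kyra_Sedgwick",                        # entry link
  "/wiki/Wikipedia:Protection_policy#semi",     # contains a colon
  "/wiki/File:Kevin_Bacon_SDCC_2014.jpg",       # file page, contains a colon
  "#mw-head",                                   # page anchor, not /wiki/
  "http://en.wikipedia.org/wiki/Philadelphia",  # absolute http URL
]
for href in samples:
  print(href, "->", bool(entryLink.match(href)))

Only the first sample matches; the rest are filtered out.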

from urllib2 import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

pages = set()
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
  html = urlopen("http://en.wikipedia.org"+articleUrl)
  bsObj = BeautifulSoup(html, "html.parser")
  # bodyContent is a div on Wikipedia entry pages; entry links start with
  # /wiki/ and contain no colon.
  return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
  newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
  if newArticle not in pages:
    print(newArticle)
    pages.add(newArticle)
    links = getLinks(newArticle)

getLinks takes a parameter of the form /wiki/<entry name>, joins it with Wikipedia's absolute domain to get the full URL of the page, then uses a regular expression to capture all the URLs pointing to other entries and returns them to the main program.

The main loop then calls getLinks repeatedly, each time randomly following one of the returned URLs that has not been visited yet, until no links are left or the program is stopped.

The following code can crawl the entire Wikipedia site:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
  global pages
  html = urlopen("http://en.wikipedia.org"+pageUrl)
  bsObj = BeautifulSoup(html, "html.parser")
  try:
    print(bsObj.h1.get_text())
    print(bsObj.find(id="mw-content-text").findAll("p")[0])
    print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
  except AttributeError:
    print("This page is missing something! No worries though!")

  for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
    if 'href' in link.attrs:
      if link.attrs['href'] not in pages:
        # We have encountered a new page
        newPage = link.attrs['href']
        print("----------------\n"+newPage)
        pages.add(newPage)
        getLinks(newPage)

getLinks("")

By default, Python limits recursion to a depth of about 1000 calls. You therefore need to set a larger recursion limit manually, or restructure the code so it can keep running past 1000 levels of recursion.
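As a rough sketch of both options (my assumptions, not from the article: getLinks is taken to be the function from the listing above that returns the <a> tags pointing to other entries):

import sys

# Option 1: raise the recursion limit. Deep recursion can still exhaust
# the interpreter stack, so treat this as a stopgap.
sys.setrecursionlimit(10000)

# Option 2: replace the recursion with a loop over a queue of pending URLs.
def crawlIteratively(getLinks, start="/wiki/Kevin_Bacon"):
  pages = set()
  pending = [start]
  while pending:
    url = pending.pop()
    if url in pages:
      continue
    pages.add(url)
    print(url)
    pending.extend(link.attrs["href"] for link in getLinks(url))
  return pages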

Thank you for reading this article. I hope it helps you, and thank you for your support of this site!
