Use Python's BeautifulSoup library to implement a crawler that crawls 1,000 Baidu Encyclopedia pages

Source: Internet
Author: User

BeautifulSoup Module Introduction and Installation
    • BeautifulSoup
      • BeautifulSoup is a third-party Python library that extracts data from HTML or XML documents and is typically used as a parser for web pages
      • BeautifulSoup Official website: https://www.crummy.com/software/BeautifulSoup/
      • Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
      • Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Installing BeautifulSoup is very simple; we can install it directly with pip using the following command:

pip install beautifulsoup4

If your IDE is PyCharm, installation is even simpler: write the import statement import bs4, and when PyCharm flags the missing module, press Alt+Enter and choose the quick-fix that installs the module automatically.

After the installation is complete, write some test code:

import bs4
print(bs4)

If this code runs and prints output without an error, the installation was successful.

Syntax for BeautifulSoup:

Access node Information:

Syntax format:

from bs4 import BeautifulSoup
import re

# Create a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(html_doc,              # HTML document string
                     'html.parser',         # HTML parser
                     from_encoding='utf-8'  # encoding of the HTML document; not needed in Python 3
                     )

# Method: find_all(name, attrs, string)
# Find all nodes with tag a
soup.find_all('a')
# Find all nodes with tag a whose link has the form /view/123.html
soup.find_all('a', href='/view/123.html')
soup.find_all('a', href=re.compile(r'/view/\d+\.html'))
# Find all div nodes with class abc and text Python
soup.find_all('div', class_='abc', string='Python')

# Given a found node such as: <a href='1.html'>Python</a>
# Get the tag name of the node
node.name
# Get the href attribute of the a node
node['href']
# Get the link text of the a node
node.get_text()

The actual test code:

from bs4 import BeautifulSoup
import re

html_doc = ""
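The html_doc content was left empty above; as a stand-in, a minimal, self-contained test exercising the calls shown earlier might look like this (the sample HTML is ours, not from the original):

from bs4 import BeautifulSoup
import re

# A small HTML snippet to exercise the find_all calls shown above
html_doc = """
<html><body>
<a href="/view/123.html">Link one</a>
<a href="/view/456.html">Link two</a>
<div class="abc">Python</div>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# All <a> nodes
print(soup.find_all('a'))
# Only the <a> nodes whose href matches /view/<digits>.html
print(soup.find_all('a', href=re.compile(r'/view/\d+\.html')))
# The <div> with class abc and text Python
print(soup.find_all('div', class_='abc', string='Python'))

# Node information for the first link
node = soup.find('a')
print(node.name)        # a
print(node['href'])     # /view/123.html
print(node.get_text())  # Link one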
Example Crawler

After this brief look at BeautifulSoup and its installation, we can begin writing our crawler.

We usually need to complete the following steps to write a simple crawler:
    • Identify the target
      • Determine which web pages to crawl; in this example we crawl the Baidu Encyclopedia pages related to the Python entry, extracting the title and introduction
    • Analyze the target
      • Analyze the URL format of the target pages to avoid crawling irrelevant URLs
      • Analyze the format of the data to be crawled, here the title and introduction
      • Analyze the encoding of the target pages; otherwise the content may come out garbled when the parser processes it
    • Write the code
      • Once the target pages have been analyzed, write the code that crawls the data
    • Run the crawler
      • With the code written, run the crawler and verify that the data is crawled correctly
Start by analyzing the target pages this example needs to crawl:
    • Target: the title and introduction of pages related to the Baidu Encyclopedia Python entry
    • Entry page: https://baike.baidu.com/item/Python/407313
    • URL format (a quick check of this format follows the list):
      • Entry page URL: /item/name/id or /item/name/, e.g. /item/c/7252092 or /item/guido%20van%20rossum
    • Data format:
      • Title format:
        • <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1>***</dd>
      • Introduction format:
        • <div class="lemma-summary" label-module="lemmaSummary">***</div>
    • Page encoding: UTF-8
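As referenced in the URL format item above, here is a quick check that the two example paths match a simple /item/ pattern and can be joined into full URLs; the exact regular expression is our own assumption, consistent with the parser shown later:

import re
import urllib.parse

entry_url_pattern = re.compile(r'/item/(.*)')
examples = ['/item/c/7252092', '/item/guido%20van%20rossum']

for href in examples:
    # Both example paths should match the /item/... format
    print(bool(entry_url_pattern.match(href)))
    # Join the relative path with the entry page URL to get a full URL
    print(urllib.parse.urljoin('https://baike.baidu.com/item/Python/407313', href))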
Once the analysis is complete, we can start writing the example code.
    • The goal the crawler needs to accomplish: crawl 1,000 pages of data related to the Baidu Encyclopedia Python entry

Start by creating a project directory, create a Python package under it, and then create the appropriate module files under the package (see the layout sketch after this list), for example:

    • spider_main: the crawler scheduler and main entry file
    • url_manager: URL manager, which manages and stores the URLs to crawl
    • html_downloader: downloader, which downloads the content of the target pages
    • html_parser: parser, which parses the downloaded page content
    • html_outputer: outputer, which outputs the parsed data to a web page or the console
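A possible layout, assuming the package is named baike_spider (the package name is only illustrative; the GitHub repository linked at the end has the authoritative structure):

easy-spider/
└── baike_spider/
    ├── __init__.py
    ├── spider_main.py       # crawler scheduler, main entry file
    ├── url_manager.py       # URL manager
    ├── html_downloader.py   # downloader
    ├── html_parser.py       # parser
    └── html_outputer.py     # outputer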

Crawler Scheduler Code:

"Crawler Scheduler, also main portal file" ' Import Url_manager, Html_downloader, Html_parser, Html_outputerclass spidermain (object): # initialization Each object Def __init__ (self): Self.urls = Url_manager. Urlmanager () # URL Manager self.downloader = Html_downloader. Htmldownloader () # Downloader self.parser = Html_parser. Htmlparser () # parser self.outputer = Html_outputer. Htmloutputer () # output # Crawler scheduling method Def craw (self, root_url): # Record The current crawl is the number of URLs count = 1 # will be the URL of the portal page                Add to URL Manager self.urls.add_new_url (root_url) # start crawler loop while Self.urls.has_new_url (): Try:                # get the URL to crawl new_url = Self.urls.get_new_url () # Every crawl takes a page and prints it in the console. Print ("Craw", Count, New_url) # launches the downloader to download the page content of the URL Html_cont = self.downloader.download (new _url) # Call the parser to parse the downloaded page content, get a new URL list and new data new_urls, New_data = Self.parser.parse (New_url, HTM L_cont) # Adds a new URLThe list is added to the URL Manager self.urls.add_new_urls (new_urls) # collects parsed data Self.outputer.colle                Ct_data (new_data) # stops crawling if count = = 1000:break when crawling to 1000 pages Count + = 1 except: # exception occurred while crawling output a text print ("Craw failed") in the console # output processed Data self.outputer.output_html () # Determines if this module is executed as a portal file if __name__ = = "main": # URL of the target portal page Root_url = "Https://baike . baidu.com/item/python/407313 "Obj_spider = Spidermain () # boot crawler obj_spider.craw (root_url)

URL Manager code:

'''
    URL manager, which manages and stores the URLs to be crawled.
    The URL manager maintains two sets: the URLs still to be crawled
    and the URLs that have already been crawled.
'''


class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        '''
        Add a single new URL (a URL to crawl) to the manager
        :param url: new URL
        :return:
        '''
        # End if the URL is empty
        if url is None:
            return
        # A URL that is in neither set is a new URL
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        '''
        Add new URLs to the manager in bulk
        :param urls: list of new URLs
        :return:
        '''
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        '''
        Determine whether the manager still has URLs to crawl
        :return: True or False
        '''
        return len(self.new_urls) != 0

    def get_new_url(self):
        '''
        Get a URL to crawl from the URL manager
        :return: a URL to crawl
        '''
        # Pop a URL and add it to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
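A minimal check of the UrlManager behaviour (assuming the class above lives in url_manager.py, per the module list):

from url_manager import UrlManager

manager = UrlManager()
manager.add_new_url('/item/Python/407313')
manager.add_new_url('/item/Python/407313')   # duplicate, ignored
print(manager.has_new_url())                 # True
url = manager.get_new_url()
print(url)                                   # /item/Python/407313
print(manager.has_new_url())                 # False: the URL moved to the crawled set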

Downloader code:

'''
    Downloader, used to download the content of the target page
'''
from urllib import request


class HtmlDownloader(object):
    def download(self, url):
        '''
        Download the page content at the given URL
        :param url: the URL to download
        :return: None or the page content
        '''
        if url is None:
            return None
        response = request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
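As a quick smoke test of the downloader against the entry page (network access is required, and Baidu may serve different markup or reject script traffic, so treat the result only as a rough check):

from html_downloader import HtmlDownloader

downloader = HtmlDownloader()
content = downloader.download('https://baike.baidu.com/item/Python/407313')
# Print how many bytes were downloaded; None means the request failed
print(len(content) if content is not None else 'download failed')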

Parser Code:

'''
    Parser, parses the downloaded page content
'''
import re
import urllib.parse
from bs4 import BeautifulSoup


class HtmlParser(object):
    def parse(self, page_url, html_cont):
        '''
        Parse the downloaded page content
        :param page_url: page URL
        :param html_cont: page content
        :return: the new URL list and the parsed data
        '''
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        '''
        Get the list of new URLs
        :param page_url:
        :param soup:
        :return:
        '''
        new_urls = set()
        # Entry page URL: /item/name/id or /item/name/,
        # e.g. /item/c/7252092 or /item/guido%20van%20rossum
        links = soup.find_all('a', href=re.compile(r'/item/(.*)'))
        for link in links:
            new_url = link['href']
            # Join into a full URL
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        '''
        Parse the data and return the parsed result
        :param page_url:
        :param soup:
        :return:
        '''
        # Use a dictionary to store the parsed data
        res_data = {}
        # URL
        res_data['url'] = page_url
        # Title tag format: <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        res_data['title'] = title_node.get_text()
        # Introduction tag format: <div class="lemma-summary" label-module="lemmaSummary">***</div>
        summary_node = soup.find('div', class_='lemma-summary')
        res_data['summary'] = summary_node.get_text()
        return res_data
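A small check of the parser against a hand-written fragment that follows the title and introduction formats identified earlier (the HTML fragment is made up for illustration, not taken from Baidu Encyclopedia):

from html_parser import HtmlParser

html_cont = '''
<html><body>
<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
<div class="lemma-summary" label-module="lemmaSummary">Python is a programming language.</div>
<a href="/item/Guido%20van%20Rossum">Guido van Rossum</a>
</body></html>
'''

parser = HtmlParser()
new_urls, new_data = parser.parse('https://baike.baidu.com/item/Python/407313', html_cont)
print(new_urls)   # {'https://baike.baidu.com/item/Guido%20van%20Rossum'}
print(new_data)   # {'url': ..., 'title': 'Python', 'summary': 'Python is a programming language.'}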

Outputer code:

"' output to output parsed data to a Web page ' class Htmloutputer (object): Def __init__ (self): # store parsed data Self.datas = [] def collect_data (self, data): "Collects data:p Aram:: Return:" If data is None : Return Self.datas.append (data) def output_html (self): "The data collected is exported in HTML format to the HTML file, which I Bootstrap:return is used: ' Fout = open (' output.html ', ' W ', encoding= ' Utf-8 ') fout.write ("<! DOCTYPE html> ") fout.write (" 

Running results:

Console output:

Generated HTML file:

At this point, our simple crawler is complete.

Source code on GitHub:

https://github.com/Binary-ZeroOne/easy-spider
