Use Python's BeautifulSoup library to implement a crawler that crawls 1,000 pages of Baidu Encyclopedia data


BeautifulSoup Module Introduction and Installation
    • BeautifulSoup
      • BeautifulSoup is a third-party Python library that extracts data from HTML or XML and is typically used as a parser for web pages
      • BeautifulSoup official website: https://www.crummy.com/software/BeautifulSoup/
      • Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
      • Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Installing BeautifulSoup is very simple; we can install it directly with pip using the following command:

pip install beautifulsoup4

If your IDE is PyCharm, installation is even simpler: write the import statement import bs4, and when PyCharm flags the module as missing, press Alt + Enter and choose the quick-fix that installs the module automatically.

After the installation is complete, write a short test:

import bs4
print(bs4)

If this code runs and produces output without raising an error, the installation was successful.
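
For a slightly more informative check, you can also print the installed version; bs4 exposes it through its __version__ attribute (a minimal sketch):

import bs4

# Show where the module was loaded from and which version is installed
print(bs4)
print(bs4.__version__)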

BeautifulSoup syntax: creating the BeautifulSoup object, searching for nodes with find_all, and accessing node information.

Syntax format:

from bs4 import BeautifulSoup
import re

# Create a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(html_doc,              # HTML document string
                     'html.parser',         # HTML parser
                     from_encoding='utf-8'  # encoding of the HTML document; not needed in Python 3
                     )

# Method: find_all(name, attrs, string)
# Find all nodes with the tag a
soup.find_all('a')
# Find all a nodes whose link matches the form /view/123.html
soup.find_all('a', href='/view/123.html')
soup.find_all('a', href=re.compile(r'/view/\d+\.html'))
# Find all div nodes whose class is abc and whose text is "Python"
soup.find_all('div', class_='abc', string='Python')

# Given a node such as: <a href='1.html'>Python</a>
# Get the tag name of the node
node.name
# Get the href attribute of the a node
node['href']
# Get the link text of the a node
node.get_text()

The actual test code:

from bs4 import BeautifulSoup
import re

html_doc = ""
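
A fuller, self-contained sketch of the same syntax, using an invented html_doc string (the links and class names below are made up purely for illustration):

from bs4 import BeautifulSoup
import re

# A small, invented HTML document used only to exercise the syntax shown above
html_doc = """
<html>
  <body>
    <a href="/view/123.html">First link</a>
    <a href="/view/456.html">Second link</a>
    <div class="abc">Python</div>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all a nodes
print(soup.find_all('a'))

# Find the a nodes whose href matches the /view/<number>.html pattern
for link in soup.find_all('a', href=re.compile(r'/view/\d+\.html')):
    print(link.name, link['href'], link.get_text())

# Find the div node whose class is abc and whose text is "Python"
print(soup.find_all('div', class_='abc', string='Python'))
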
Example Crawler

After this brief look at BeautifulSoup and its installation, we can begin writing our crawler.

We usually need to complete the following steps to write a simple crawler:
    • Identify the goal
      • Determine which web pages you want to crawl; in this example, the pages related to the Baidu Encyclopedia Python entry, together with each page's title and introduction
    • Analyze the goal
      • Work out the URL format of the target pages so the crawler does not wander off to irrelevant URLs
      • Work out the format of the data to be crawled, in this example the title and the introduction
      • Work out the encoding of the target pages, otherwise the content may come out garbled when it is parsed
    • Write the code
      • Once the analysis of the target pages is done, write the code that crawls the data.
    • Run the crawler
      • After the code is written, run the crawler to verify that the data is crawled correctly
Now let's analyze the target page this example needs to crawl (the URL pattern and data selectors identified below are tried out in a short sketch after the list):
    • Target: the title and introduction of the pages related to the Baidu Encyclopedia Python entry
    • Entry page: https://baike.baidu.com/item/Python/407313
    • URL format:
      • Entry page URLs look like /item/name/id or /item/name, for example /item/c/7252092 or /item/guido%20van%20rossum
    • Data format:
      • Title format:
        • <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1>***</dd>
      • Introduction format:
        • <div class="lemma-summary" label-module="lemmaSummary">***</div>
    • Page encoding: UTF-8
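
As a sanity check of this analysis, the following sketch tests the /item/ URL pattern with a regular expression and tries the title and introduction selectors against a tiny invented HTML fragment (only the class names come from the analysis; the fragment itself is made up):

import re
from bs4 import BeautifulSoup

# URL pattern derived from the analysis: /item/name/id or /item/name
item_pattern = re.compile(r'/item/(.*)')
for path in ['/item/c/7252092', '/item/guido%20van%20rossum', '/view/123.html']:
    print(path, bool(item_pattern.match(path)))

# Invented fragment that mimics the title and introduction markup described above
fragment = '''
<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
<div class="lemma-summary" label-module="lemmaSummary">Python is a programming language.</div>
'''
soup = BeautifulSoup(fragment, 'html.parser')
print(soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1').get_text())
print(soup.find('div', class_='lemma-summary').get_text())
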
With the analysis done, we can start writing the example code.
    • Goal: crawl 1,000 pages of data related to the Baidu Encyclopedia Python entry

Start by creating a project directory, create a Python package under it, and add the appropriate module files to the package (a possible layout is sketched after the list), for example:

    • spider_main: crawler scheduler, also the main entry file
    • url_manager: URL manager, which manages and stores the URLs to be crawled
    • html_downloader: downloader, which downloads the content of the target pages
    • html_parser: parser, which parses the downloaded page content
    • html_outputer: outputer, which outputs the parsed data to a web page or to the console
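
A possible layout of the package, assuming it is named baike_spider (the name is only an example, not something the original mandates):

baike_spider/
    __init__.py
    spider_main.py
    url_manager.py
    html_downloader.py
    html_parser.py
    html_outputer.py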

Crawler Scheduler Code:

"Crawler Scheduler, also main portal file" ' Import Url_manager, Html_downloader, Html_parser, Html_outputerclass spidermain (object): # initialization Each object Def __init__ (self): Self.urls = Url_manager. Urlmanager () # URL Manager self.downloader = Html_downloader. Htmldownloader () # Downloader self.parser = Html_parser. Htmlparser () # parser self.outputer = Html_outputer. Htmloutputer () # output # Crawler scheduling method Def craw (self, root_url): # Record The current crawl is the number of URLs count = 1 # will be the URL of the portal page                Add to URL Manager self.urls.add_new_url (root_url) # start crawler loop while Self.urls.has_new_url (): Try:                # get the URL to crawl new_url = Self.urls.get_new_url () # Every crawl takes a page and prints it in the console. Print ("Craw", Count, New_url) # launches the downloader to download the page content of the URL Html_cont = self.downloader.download (new _url) # Call the parser to parse the downloaded page content, get a new URL list and new data new_urls, New_data = Self.parser.parse (New_url, HTM L_cont) # Adds a new URLThe list is added to the URL Manager self.urls.add_new_urls (new_urls) # collects parsed data Self.outputer.colle                Ct_data (new_data) # stops crawling if count = = 1000:break when crawling to 1000 pages Count + = 1 except: # exception occurred while crawling output a text print ("Craw failed") in the console # output processed Data self.outputer.output_html () # Determines if this module is executed as a portal file if __name__ = = "main": # URL of the target portal page Root_url = "Https://baike . baidu.com/item/python/407313 "Obj_spider = Spidermain () # boot crawler obj_spider.craw (root_url)

URL Manager code:

'''
    URL manager, which manages and stores the URLs to be crawled.
    It maintains two sets: the URLs still to be crawled and the URLs already crawled.
'''


class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        '''
        Add a single new URL (a URL to be crawled) to the manager
        :param url: new URL
        :return:
        '''
        # End if the URL is empty
        if url is None:
            return
        # A URL that is in neither set is a new URL
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        '''
        Add new URLs to the manager in bulk
        :param urls: list of new URLs
        :return:
        '''
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        '''
        Check whether the manager still has URLs to crawl
        :return: True or False
        '''
        return len(self.new_urls) != 0

    def get_new_url(self):
        '''
        Get one URL to crawl from the URL manager
        :return: a URL to crawl
        '''
        # Pop a URL and add it to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
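
A quick, standalone way to exercise the URL manager (the URLs below are placeholders used only for this check):

from url_manager import UrlManager

manager = UrlManager()
manager.add_new_url('https://baike.baidu.com/item/Python/407313')
manager.add_new_urls(['/item/a', '/item/b'])

while manager.has_new_url():
    # Each URL comes back exactly once and is then moved to the crawled set
    print(manager.get_new_url())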

Downloader code:

'''
    Downloader, used to download the content of the target page
'''
from urllib import request


class HtmlDownloader(object):
    def download(self, url):
        '''
        Download the page content at the given URL
        :param url: the URL to download
        :return: None or the page content
        '''
        if url is None:
            return None
        response = request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
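
Baidu Encyclopedia may reject plain urllib requests that carry no browser-like headers; if that happens, a variant of the downloader along the following lines can be used (the User-Agent string is only an example and this is not part of the original code):

from urllib import request


class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        # Attach a browser-like User-Agent so the request is less likely to be rejected
        req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        response = request.urlopen(req)
        if response.getcode() != 200:
            return None
        return response.read()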

Parser Code:

'''
    Parser, which parses the downloaded page content
'''
import re
import urllib.parse
from bs4 import BeautifulSoup


class HtmlParser(object):
    def parse(self, page_url, html_cont):
        '''
        Parse the downloaded page content
        :param page_url: page URL
        :param html_cont: page content
        :return: the list of new URLs and the parsed data
        '''
        if page_url is None or html_cont is None:
            return

        soup = BeautifulSoup(html_cont, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        '''
        Get the list of new URLs
        :param page_url:
        :param soup:
        :return:
        '''
        new_urls = set()
        # Entry page URL: /item/name/id or /item/name,
        # e.g. /item/c/7252092 or /item/guido%20van%20rossum
        links = soup.find_all('a', href=re.compile(r'/item/(.*)'))
        for link in links:
            new_url = link['href']
            # Join into a full URL
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        '''
        Parse the data and return it
        :param page_url:
        :param soup:
        :return:
        '''
        # Use a dictionary to store the parsed data
        res_data = {}

        # url
        res_data['url'] = page_url

        # Title format: <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1>***</dd>
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        res_data['title'] = title_node.get_text()

        # Introduction format: <div class="lemma-summary" label-module="lemmaSummary">***</div>
        summary_node = soup.find('div', class_='lemma-summary')
        res_data['summary'] = summary_node.get_text()

        return res_data

Outputer code:

"' output to output parsed data to a Web page ' class Htmloutputer (object): Def __init__ (self): # store parsed data Self.datas = [] def collect_data (self, data): "Collects data:p Aram:: Return:" If data is None : Return Self.datas.append (data) def output_html (self): "The data collected is exported in HTML format to the HTML file, which I Bootstrap:return is used: ' Fout = open (' output.html ', ' W ', encoding= ' Utf-8 ') fout.write ("<! DOCTYPE html> ") fout.write (" 

Running the crawler prints one line to the console for each page it crawls and, when it finishes, generates an output.html file containing the collected titles and introductions.

At this point, our simple crawler is complete.

Source code on GitHub:

https://github.com/Binary-ZeroOne/easy-spider

