After several days of introducing crawlers, we already have some background, so today we will introduce a simple crawler architecture and explain each of its modules. The crawlers in later posts are extensions of today's architecture. This architecture is simple to implement and its optimization and crawling strategy are far from perfect, but the goal is mainly to make crawlers easier to understand and to prepare for the programming that follows.
1 Basic architecture and process
The simple crawler architecture consists of the following components:
Crawler scheduler: coordinates the work of the other modules
URL manager: manages URLs, maintaining the set of crawled URLs and the set of URLs not yet crawled
HTML downloader: downloads the pages for URLs that have not been crawled yet
HTML parser: parses the downloaded HTML, feeds the new URLs it finds back to the URL manager, and hands the data to the data store
Data store: stores the data extracted from the HTML
The architecture diagram is as follows:
The crawler flowchart is as follows:
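Since the diagrams are not reproduced here, the flow can be summarized with a minimal sketch in plain Python. It uses the class names defined in the sections below, and the real scheduler at the end of the post does essentially the same thing:

# Minimal sketch of the crawl loop (UrlManager, HtmlDownload, HtmlParse and
# DataStore are the classes defined in the sections below)
def crawl(seed_url, limit=5):
    manager, downloader = UrlManager(), HtmlDownload()
    parser, store = HtmlParse(), DataStore()

    manager.add_new_url(seed_url)                      # 1. seed the URL manager
    while manager.has_new_url() and manager.old_url_size() < limit:
        url = manager.get_new_url()                    # 2. take an uncrawled URL
        html = downloader.download(url)                # 3. download the page
        manager.add_new_urls(parser.parse_url(html))   # 4. feed new URLs back
        store.store_data(parser.parse_data(html))      # 5. store the parsed data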
Let's look at each part separately.
This time we use Baidu Baike (Baidu's encyclopedia) search terms for the demonstration: we crawl the title and summary of an encyclopedia entry, and if the summary contains links, we also download the title and summary of the linked entries. For example:
2 URL Manager
Basic Features:
- Determine whether there are URLs left to crawl
- Add a new URL to the set of URLs to crawl
- Get a URL that has not been crawled yet
- Get the size of the set of URLs to crawl
- Get the size of the set of crawled URLs
- Move a URL that has just been crawled from the set of URLs to crawl into the set of crawled URLs
The URL manager also relies on set deduplication to prevent the program from entering an endless loop. Large Internet companies typically store URLs in a cache database because of its high performance; small companies usually keep URLs in memory and, if they need to persist them, store them in a relational database.
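As an aside, a minimal sketch of persisting URLs to a relational database with Python's built-in sqlite3 module might look like the following. The table and column names are made up for illustration; the crawler in this post simply keeps both sets in memory.

import sqlite3

# Hypothetical persistence layer for the two URL sets (illustration only)
conn = sqlite3.connect('urls.db')
conn.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, crawled INTEGER)')

def save_url(url, crawled=0):
    # INSERT OR REPLACE keeps the table deduplicated, much like a set
    conn.execute('INSERT OR REPLACE INTO urls VALUES (?, ?)', (url, crawled))
    conn.commit()

def load_new_urls():
    # Load the not-yet-crawled URLs back into a set
    return {row[0] for row in conn.execute('SELECT url FROM urls WHERE crawled = 0')}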
URL Manager Code
# -*- coding: utf-8 -*-


class UrlManager:
    """URL manager class"""

    def __init__(self):
        """Initialize the uncrawled set new_urls and the crawled set old_urls"""
        # Python's set, like in other languages, is an unordered collection of distinct elements
        self.new_urls = set()
        self.old_urls = set()

    def has_new_url(self):
        """
        Check whether there are URLs that have not been crawled
        :return:
        """
        return self.new_url_size() != 0

    def get_new_url(self):
        """
        Get a URL that has not been crawled
        :return:
        """
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def add_new_url(self, url):
        """
        Add a new URL to the uncrawled set
        :param url: a single URL
        :return:
        """
        if url is None:
            return
        # Check whether the URL is already in new_urls or old_urls
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """
        Add new URLs to the uncrawled set
        :param urls: URL collection
        :return:
        """
        if urls is None or len(urls) == 0:
            return
        # Loop over the URLs and add them to new_urls
        for url in urls:
            self.add_new_url(url)

    def new_url_size(self):
        """
        Get the size of the uncrawled set
        :return:
        """
        return len(self.new_urls)

    def old_url_size(self):
        """
        Get the size of the crawled set
        :return:
        """
        return len(self.old_urls)
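A quick usage example of the class above (the URLs are just placeholders). Because of the set-based deduplication, adding the same URL twice has no effect:

manager = UrlManager()
manager.add_new_url('https://baike.baidu.com/item/foo')
manager.add_new_url('https://baike.baidu.com/item/foo')  # duplicate, ignored
print(manager.new_url_size())   # 1
url = manager.get_new_url()     # moves the URL into old_urls
print(manager.old_url_size())   # 1
manager.add_new_url(url)        # already crawled, ignored
print(manager.new_url_size())   # 0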
3 HTML Downloader
The HTML downloader is relatively simple: it just fetches the HTML content with requests (note: use a session, otherwise a TooManyRedirects exception is raised). See the code:
# -*- coding: utf-8 -*-
import requests


class HtmlDownload:
    """HTML downloader class"""

    def download(self, url):
        """
        Download the HTML
        :param url: the URL whose HTML content is downloaded
        :return: the HTML as a string
        """
        # Create a session object; without a session a TooManyRedirects exception is raised
        s = requests.session()
        # Add a User-Agent header
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
        # Fetch the page
        resp = s.get(url)
        # Return the content bytes, decoded to a string
        return resp.content.decode()
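A quick smoke test of the downloader (the URL is just the site's front page and assumes it is reachable; also note that resp.content.decode() assumes the page is UTF-8 encoded):

downloader = HtmlDownload()
html = downloader.download('https://baike.baidu.com')
print(html[:200])  # print the first 200 characters of the downloaded HTML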
4 HTML Parser
First, analyze the web page you want to crawl:
As you can see from the analysis above, the title we need is in the <dd class="lemmaWgt-lemmaTitle-title"> tag, the summary information is in the <div class="para" label-module="para"> tag, and the href of the <a> tags inside that <div class="para" label-module="para"> tag is the relative URL we need to continue crawling.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup


class HtmlParse:
    """HTML parser class"""

    def __init__(self, baseurl='https://baike.baidu.com'):
        """
        Initialize the base URL
        :param baseurl: base URL
        """
        self.baseurl = baseurl

    def parse_data(self, page):
        """
        Extract the page data
        :param page: the HTML to parse
        :return: the data to be stored
        """
        # Create a BeautifulSoup instance
        soup = BeautifulSoup(page, 'lxml')
        # Get the title
        title = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1').string
        # Get the summary
        summary = soup.find('div', class_='para').text
        # Full data record
        data = title + summary
        return data

    def parse_url(self, page):
        """
        Extract the new URLs from the page
        :param page: the HTML to parse
        :return: the set of new URLs found
        """
        # Create a BeautifulSoup instance
        soup = BeautifulSoup(page, 'lxml')
        # All <a> tags inside <div class="para" label-module="para">
        anodes = soup.find('div', class_='para').find_all('a')
        # Set of new URLs
        new_urls = set()
        # Loop over the relative paths (href) and join them with baseurl to build full URLs
        for anode in anodes:
            link = anode.get('href')
            # Full URL
            fullurl = self.baseurl + link
            # Add to the set of new URLs
            new_urls.add(fullurl)
        return new_urls
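To make the page structure from the analysis above concrete, here is a usage sketch with a stripped-down, made-up HTML fragment that only mimics the tags the parser looks for (it is not the real Baidu Baike markup):

# Made-up fragment containing the tags the parser expects
page = '''
<dd class="lemmaWgt-lemmaTitle-title"><h1>网络爬虫</h1></dd>
<div class="para" label-module="para">
  A crawler is a program that fetches pages automatically, see
  <a href="/item/HTTP">HTTP</a> and <a href="/item/HTML">HTML</a>.
</div>
'''

parser = HtmlParse()
print(parser.parse_data(page))  # title followed by the summary text
print(parser.parse_url(page))   # {'https://baike.baidu.com/item/HTTP', 'https://baike.baidu.com/item/HTML'}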
5 Data Store
The parsed data is simply appended to a file.
# -*- coding: utf-8 -*-


class DataStore:
    """Data store class"""

    def store_data(self, data, name='baike.txt'):
        """
        Append the parsed data to a local file
        :param data: the parsed data
        :param name: local file name
        :return:
        """
        with open(name, 'a', encoding='utf-8') as fp:
            fp.write('\r\n')
            fp.write(data)
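Usage is a one-liner; each call appends a blank line followed by the record to baike.txt (or any file name you pass in). For example, with made-up data:

store = DataStore()
store.store_data('网络爬虫 A web crawler is a program that fetches pages automatically.')
store.store_data('HTTP The protocol the crawler uses to download pages.', name='baike.txt')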
6 Crawler Scheduler
The crawler scheduler is mainly responsible for coordinating the modules above:
# -*- coding: utf-8 -*-
# Import each module
from spider.DataStore import DataStore
from spider.HtmlDownload import HtmlDownload
from spider.HtmlParse import HtmlParse
from spider.UrlManager import UrlManager


class SpiderMain:
    """Crawler scheduler class"""

    def __init__(self):
        """Initialize each module"""
        self.manager = UrlManager()
        self.download = HtmlDownload()
        self.parse = HtmlParse()
        self.output = DataStore()

    def spider(self, url):
        """
        Main crawler program
        :param url: initial URL
        :return:
        """
        # Add the initial URL to new_urls
        self.manager.add_new_url(url)
        # Keep looping while there are new URLs; stop after 5 links to avoid an endless crawl
        while self.manager.has_new_url() and self.manager.old_url_size() < 5:
            try:
                # Get a URL that has not been crawled
                new_url = self.manager.get_new_url()
                # Download the HTML
                html = self.download.download(new_url)
                # Parse the new URLs out of the HTML
                new_urls = self.parse.parse_url(html)
                # Parse the data to be stored
                data = self.parse.parse_data(html)
                # Add the parsed new URLs to the new_urls set
                self.manager.add_new_urls(new_urls)
                # Save the data
                self.output.store_data(data)
                print('Already crawled %s link(s)' % self.manager.old_url_size())
            except Exception as e:
                print('Crawl failed')
                print(e)


if __name__ == '__main__':
    # Instantiate the crawler scheduler class
    spidermain = SpiderMain()
    # Pass in the initial URL to crawl
    spidermain.spider('https://baike.baidu.com/item/%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab')
Run results (only 5 links are crawled):
Already crawled 1 link(s)
Already crawled 2 link(s)
Already crawled 3 link(s)
Already crawled 4 link(s)
Already crawled 5 link(s)
Well, that's all for today's basic crawler framework.
Parts of this article reference Fan's book Python Crawler Development and Project Practice.
Ops Learning Python Crawler, Intermediate (VI): Basic Crawler