This article describes the basic concepts of Python web crawlers in detail: the definition of a crawler, the main components of a crawler framework, and a complete working example.
1. Crawler definition
Crawler: a program that automatically fetches data from the Internet.
2. Main crawler frameworks
The figure (omitted here) shows the main framework of a crawler. The crawler scheduler asks the URL manager for a URL to crawl. As long as the URL manager still contains URLs to be crawled, the scheduler calls the webpage downloader to download the corresponding page, then calls the webpage parser to parse it; new URLs found in the page are added back to the URL manager, and the valuable data is output.
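The scheduler loop described above can be sketched in a few lines of Python. The class and function names below (UrlManager, crawl) are illustrative, not from a real framework, and the "web" is a canned dictionary so the sketch runs without network access:

```python
# A minimal sketch of the crawler scheduler loop. UrlManager and crawl are
# illustrative names; FAKE_WEB stands in for real download + parse steps.

class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()

    def add(self, url):
        # Only queue a URL we have neither queued nor finished.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new(self):
        return bool(self.new_urls)

    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

# Canned "web": each page maps to (outgoing links, valuable data).
FAKE_WEB = {
    "/a": (["/b", "/c"], "data-a"),
    "/b": (["/a"], "data-b"),   # links back to /a: would loop without the manager
    "/c": ([], "data-c"),
}

def crawl(root):
    manager, results = UrlManager(), []
    manager.add(root)
    while manager.has_new():            # the scheduler loop
        url = manager.get()             # 1. take a URL from the manager
        links, data = FAKE_WEB[url]     # 2. "download" and "parse" the page
        for link in links:              # 3. feed newly found URLs back in
            manager.add(link)
        results.append(data)            # 4. collect the valuable data
    return sorted(results)

print(crawl("/a"))  # each page is visited exactly once, despite the /b -> /a cycle
```

Note how the back-link from /b to /a does not cause an infinite loop: the manager refuses URLs it has already handed out, which is exactly the circular-crawling protection discussed in the URL manager section below.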
3. Crawler timeline
4. URL Manager
The URL manager maintains two sets of URLs, those still to be crawled and those already crawled, which prevents repeated crawling and circular crawling. The figure (omitted here) shows the main functions of the URL manager:
In terms of implementation, a URL manager in Python mainly uses either memory (a set) or a relational database (such as MySQL). Small programs are generally implemented in memory: Python's built-in set() type automatically eliminates duplicate elements. Larger programs generally use a database.
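The in-memory approach relies on the deduplication that set() gives for free. A minimal sketch (the URLs are made-up examples):

```python
# In-memory URL manager state: two sets, as described above.
to_crawl = set()   # URLs waiting to be crawled
crawled = set()    # URLs already crawled

def add_url(url):
    # Accept a URL only if it is neither queued nor already crawled;
    # set membership tests make the duplicate check O(1).
    if url not in to_crawl and url not in crawled:
        to_crawl.add(url)

add_url("http://example.com/page1")
add_url("http://example.com/page1")  # duplicate: silently ignored
add_url("http://example.com/page2")
print(len(to_crawl))  # → 2
```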
5. Webpage downloader
The webpage downloader in Python mainly uses the urllib library, a built-in Python module. The urllib2 library from Python 2.x was merged into urllib in Python 3.x, under the request and other sub-modules. The urlopen function in urllib.request opens a URL and fetches its data; its parameter can be either a URL string or a Request object. For simple web pages, passing the URL string directly is sufficient. Complex pages need more: for a page with an anti-crawler mechanism you must add an HTTP header when calling urlopen, and for a page with a login mechanism you must also set cookies.
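The three cases can be sketched with the standard library alone. The User-Agent string and cookie value below are placeholders, and the actual network calls are left commented out so the sketch stays self-contained:

```python
# Sketch of the three download cases, using only urllib.request.
from urllib.request import urlopen, Request

# Case 1: simple page -- a plain URL string is enough.
# response = urlopen("http://www.example.com")

# Case 2: page with an anti-crawler check -- build a Request that carries
# an HTTP header (here a placeholder browser User-Agent).
req = Request("http://www.example.com",
              headers={"User-Agent": "Mozilla/5.0"})
# response = urlopen(req)

# Case 3: page behind a login -- also attach a cookie header.
# "sessionid=PLACEHOLDER" stands in for a real session cookie.
req.add_header("Cookie", "sessionid=PLACEHOLDER")
# response = urlopen(req)

# urllib normalizes stored header names to capitalized form:
print(req.get_header("User-agent"))  # → Mozilla/5.0
print(req.get_header("Cookie"))      # → sessionid=PLACEHOLDER
```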
6. Web parser
The web parser extracts valuable data and new URLs from the page data fetched by the downloader. Data can be extracted with regular expressions or with BeautifulSoup. Regular expressions use string-based fuzzy matching, which works well for target data with distinctive features but does not generalize well. BeautifulSoup is a third-party module that parses page content in a structured way, turning the downloaded page into a DOM tree. The figure (omitted here) shows part of the output of a Baidu Encyclopedia page crawled with BeautifulSoup.
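The regular-expression approach can be illustrated without any third-party module. The HTML snippet below is made up for the example; note how the pattern catches only links with the distinctive /view/<digits>.htm shape and knows nothing about the document's structure:

```python
# String-based extraction with a regular expression: works well when the
# target (entry links of the form /view/<digits>.htm) has a distinctive
# pattern, but it is blind to the surrounding HTML structure.
import re

html = '''
<div class="para">
  <a href="/view/10983.htm">Riot Games</a>
  <a href="/about/team.htm">About</a>
  <a href="/view/3049782.htm">League of Legends</a>
</div>
'''

# Capture only the encyclopedia-style entry links.
links = re.findall(r'href="(/view/[0-9]+\.htm)"', html)
print(links)  # → ['/view/10983.htm', '/view/3049782.htm']
```

BeautifulSoup, used in the full example below, solves the same task by walking the parsed DOM tree instead of matching raw strings, which is more robust when the page layout is complex.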
A detailed guide to BeautifulSoup will be written later. The following code uses Python to crawl the League of Legends entry on Baidu Encyclopedia, extract the other League of Legends-related entries it links to, and save them to a new Excel file:
```python
from bs4 import BeautifulSoup
import re
import xlwt
from urllib.request import urlopen

excelFile = xlwt.Workbook()
sheet = excelFile.add_sheet('league of legend')

# Baidu Encyclopedia: League of Legends
html = urlopen("http://baike.baidu.com/subview/3049782/11262116.htm")
bsObj = BeautifulSoup(html.read(), "html.parser")
# print(bsObj.prettify())

row = 0
for node in bsObj.find("div", {"class": "main-content"}).findAll("div", {"class": "para"}):
    links = node.findAll("a", href=re.compile(r"^(/view/)[0-9]+\.htm$"))
    for link in links:
        if 'href' in link.attrs:
            print(link.attrs['href'], link.get_text())
            sheet.write(row, 0, link.attrs['href'])
            sheet.write(row, 1, link.get_text())
            row = row + 1
excelFile.save('e:\\Project\\Python\\lol.xls')
```
The console output and the resulting Excel file were shown as screenshots in the original article (omitted here).
That is all for this article; I hope it helps you learn about Python web crawlers.