Introduction to Python web crawlers with an example

This article describes the basic concepts of Python web crawlers in detail: the definition of a crawler, the main crawler framework and its components, and a complete working example.

1. Crawler definition

A crawler is a program that automatically captures data from the Internet.

2. Main crawler framework

The figure below shows the main framework of a crawler:

[Figure: main crawler framework]

The crawler dispatcher obtains a URL to be crawled from the URL manager. As long as the URL manager still contains URLs to be crawled, the dispatcher calls the webpage downloader to download the corresponding page, then calls the webpage parser to parse it; new URLs found in the page are added back to the URL manager, and the valuable data is output.
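
This loop can be sketched in a few lines of Python (the component names here are illustrative, not from the original article; the URL manager, downloader, and parser are described in the sections below):

# A minimal sketch of the dispatch loop described above; url_manager,
# downloader, and parser are assumed components, not a real library.
def crawl(seed_url, url_manager, downloader, parser):
    url_manager.add_new_url(seed_url)
    while url_manager.has_new_url():
        url = url_manager.get_new_url()           # next URL that has not been crawled yet
        page = downloader.download(url)           # fetch the page content
        new_urls, data = parser.parse(url, page)  # extract new links and valuable data
        url_manager.add_new_urls(new_urls)        # feed discovered links back into the manager
        print(data)                               # output the valuable data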

3. Crawler sequence diagram

[Figure: crawler sequence diagram]

4. URL Manager

The URL manager manages the set of URLs to be crawled and the set of URLs that have already been crawled, in order to prevent repeated crawling and circular crawling. Its main functions are shown below:

[Figure: main functions of the URL manager]

In terms of implementation, the URL manager in Python is mainly backed either by memory (a set) or by a relational database (such as MySQL). Small programs are generally implemented in memory: the set() type built into Python automatically rejects duplicate elements. Larger programs generally use a database.
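
A minimal in-memory version might look like this (a sketch based on the description above, not code from the original article):

# An in-memory URL manager backed by two Python sets.
class UrlManager:
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs that have already been crawled

    def add_new_url(self, url):
        # set membership checks prevent repeated and circular crawling
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url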

5. Webpage downloader

The webpage downloader in Python is mainly based on the urllib library, a built-in Python module. The urllib2 library from Python 2.x has been merged into urllib in Python 3.x, under the request and other submodules. The urlopen function in urllib is used to open a URL and fetch its data; its parameter can be either a URL string or a Request object. For simple web pages, a plain URL string is sufficient. For complex pages, however, more is needed: for a page with an anti-crawler mechanism, you need to add an HTTP header to the request, and for a page with a login mechanism, you need to set cookies.
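
For illustration (example.com stands in for a real target site), the three cases just described might look like this in Python 3:

from urllib.request import urlopen, Request, build_opener, HTTPCookieProcessor
from http.cookiejar import CookieJar

# Simple page: a plain URL string is enough.
html = urlopen("http://www.example.com").read()

# Page with an anti-crawler mechanism: pass a Request object carrying an HTTP header.
req = Request("http://www.example.com", headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

# Page with a login mechanism: attach a cookie jar so cookies persist across requests.
opener = build_opener(HTTPCookieProcessor(CookieJar()))
html = opener.open("http://www.example.com").read()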

6. Webpage parser

The webpage parser extracts valuable data and new URLs from the page data fetched by the webpage downloader. Data can be extracted with regular expressions or with BeautifulSoup. Regular expressions perform fuzzy string matching, which works well when the target data has distinctive features, but they do not generalize well. BeautifulSoup is a third-party module for parsing page content in a structured way: the downloaded page content is parsed into a DOM tree. The figure below shows part of a Baidu Encyclopedia page parsed with BeautifulSoup:

[Figure: part of a Baidu Encyclopedia page parsed into a DOM tree by BeautifulSoup]
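
As a small illustration of the two approaches (the HTML string here is made up), compare a regular expression with BeautifulSoup on the same snippet:

import re
from bs4 import BeautifulSoup

html = '<div class="para"><a href="/view/123.htm">League of Legends</a></div>'

# Regular expression: fuzzy string matching; quick when the pattern is distinctive.
print(re.findall(r'href="(/view/\d+\.htm)"', html))

# BeautifulSoup: structured parsing of the same content into a DOM tree.
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=re.compile(r"^/view/\d+\.htm$")):
    print(a["href"], a.get_text())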


Detailed usage of BeautifulSoup will be covered in a later article. The following code uses Python to crawl the League of Legends entry in Baidu Encyclopedia, extract links to other League of Legends-related entries, and save those entries to a new Excel file:

from bs4 import BeautifulSoup
import re
import xlwt
from urllib.request import urlopen

excelFile = xlwt.Workbook()
sheet = excelFile.add_sheet('league of legend')

# Baidu Encyclopedia: League of Legends entry
html = urlopen("http://baike.baidu.com/subview/3049782/11262116.htm")
bsObj = BeautifulSoup(html.read(), "html.parser")
# print(bsObj.prettify())

row = 0
# The article body is wrapped in div.main-content; each paragraph is div.para.
for node in bsObj.find("div", {"class": "main-content"}).findAll("div", {"class": "para"}):
    links = node.findAll("a", href=re.compile(r"^(/view/)[0-9]+\.htm$"))
    for link in links:
        if 'href' in link.attrs:
            print(link.attrs['href'], link.get_text())
            sheet.write(row, 0, link.attrs['href'])  # column 0: entry URL
            sheet.write(row, 1, link.get_text())     # column 1: entry name
            row = row + 1

excelFile.save(r'e:\Project\Python\lol.xls')
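
Note that bs4 (BeautifulSoup) and xlwt are third-party modules and must be installed first, for example with pip install beautifulsoup4 xlwt. xlwt writes the legacy .xls format, which is why the output file above ends in .xls.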

Part of the output is as follows:

[Screenshot: console output]

Part of the Excel file is as follows:

[Screenshot: the generated Excel sheet]

That is all for this article. I hope it helps you learn about Python web crawlers.
