Introduction to Python web crawlers with an example

This article describes the basic concepts of Python web crawlers in detail: the definition of a crawler, the main crawler framework and its components, and a complete working example.

1. Crawler definition

A crawler is a program that automatically captures data from the Internet.

2. Main crawler framework

The figure below shows the main framework of a crawler:

[Figure: main crawler framework]

The crawler dispatcher obtains a URL to be crawled from the URL manager. As long as the URL manager still contains URLs to be crawled, the dispatcher calls the webpage downloader to download the corresponding page, then calls the webpage parser to parse it; new URLs found in the page are added back to the URL manager, and the valuable data is output.
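
This loop can be sketched in a few lines of Python (the component names here are illustrative, not from the original article; the URL manager, downloader, and parser are described in the sections below):

# A minimal sketch of the dispatch loop described above; url_manager,
# downloader, and parser are assumed components, not a real library.
def crawl(seed_url, url_manager, downloader, parser):
    url_manager.add_new_url(seed_url)
    while url_manager.has_new_url():
        url = url_manager.get_new_url()           # next URL that has not been crawled yet
        page = downloader.download(url)           # fetch the page content
        new_urls, data = parser.parse(url, page)  # extract new links and valuable data
        url_manager.add_new_urls(new_urls)        # feed discovered links back into the manager
        print(data)                               # output the valuable data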

3. Crawler sequence diagram

[Figure: crawler sequence diagram]

4. URL Manager

The URL manager manages the set of URLs to be crawled and the set of URLs that have already been crawled, in order to prevent repeated crawling and circular crawling. Its main functions are shown below:

[Figure: main functions of the URL manager]

In terms of implementation, the URL manager in Python is mainly backed either by memory (a set) or by a relational database (such as MySQL). Small programs are generally implemented in memory: the set() type built into Python automatically rejects duplicate elements. Larger programs generally use a database.
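
A minimal in-memory version might look like this (a sketch based on the description above, not code from the original article):

# An in-memory URL manager backed by two Python sets.
class UrlManager:
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs that have already been crawled

    def add_new_url(self, url):
        # set membership checks prevent repeated and circular crawling
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url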

5. Webpage downloader

The webpage downloader in Python is mainly based on the urllib library, a built-in Python module. The urllib2 library from Python 2.x has been merged into urllib in Python 3.x, under the request and other submodules. The urlopen function in urllib is used to open a URL and fetch its data; its parameter can be either a URL string or a Request object. For simple web pages, a plain URL string is sufficient. For complex pages, however, more is needed: for a page with an anti-crawler mechanism, you need to add an HTTP header to the request, and for a page with a login mechanism, you need to set cookies.
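
For illustration (example.com stands in for a real target site), the three cases just described might look like this in Python 3:

from urllib.request import urlopen, Request, build_opener, HTTPCookieProcessor
from http.cookiejar import CookieJar

# Simple page: a plain URL string is enough.
html = urlopen("http://www.example.com").read()

# Page with an anti-crawler mechanism: pass a Request object carrying an HTTP header.
req = Request("http://www.example.com", headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

# Page with a login mechanism: attach a cookie jar so cookies persist across requests.
opener = build_opener(HTTPCookieProcessor(CookieJar()))
html = opener.open("http://www.example.com").read()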

6. Webpage parser

The webpage parser extracts valuable data and new URLs from the page data fetched by the webpage downloader. Data can be extracted with regular expressions or with BeautifulSoup. Regular expressions perform fuzzy string matching, which works well when the target data has distinctive features, but they do not generalize well. BeautifulSoup is a third-party module for parsing page content in a structured way: the downloaded page content is parsed into a DOM tree. The figure below shows part of a Baidu Encyclopedia page parsed with BeautifulSoup:

[Figure: part of a Baidu Encyclopedia page parsed into a DOM tree by BeautifulSoup]
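
As a small illustration of the two approaches (the HTML string here is made up), compare a regular expression with BeautifulSoup on the same snippet:

import re
from bs4 import BeautifulSoup

html = '<div class="para"><a href="/view/123.htm">League of Legends</a></div>'

# Regular expression: fuzzy string matching; quick when the pattern is distinctive.
print(re.findall(r'href="(/view/\d+\.htm)"', html))

# BeautifulSoup: structured parsing of the same content into a DOM tree.
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=re.compile(r"^/view/\d+\.htm$")):
    print(a["href"], a.get_text())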


Detailed usage of BeautifulSoup will be covered in a later article. The following code uses Python to crawl the League of Legends entry in Baidu Encyclopedia, extract links to other League of Legends-related entries, and save those entries to a new Excel file:

from bs4 import BeautifulSoup
import re
import xlwt
from urllib.request import urlopen

excelFile = xlwt.Workbook()
sheet = excelFile.add_sheet('league of legend')

# Baidu Encyclopedia: League of Legends entry
html = urlopen("http://baike.baidu.com/subview/3049782/11262116.htm")
bsObj = BeautifulSoup(html.read(), "html.parser")
# print(bsObj.prettify())

row = 0
# The article body is wrapped in div.main-content; each paragraph is div.para.
for node in bsObj.find("div", {"class": "main-content"}).findAll("div", {"class": "para"}):
    links = node.findAll("a", href=re.compile(r"^(/view/)[0-9]+\.htm$"))
    for link in links:
        if 'href' in link.attrs:
            print(link.attrs['href'], link.get_text())
            sheet.write(row, 0, link.attrs['href'])  # column 0: entry URL
            sheet.write(row, 1, link.get_text())     # column 1: entry name
            row = row + 1

excelFile.save(r'e:\Project\Python\lol.xls')
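
Note that bs4 (BeautifulSoup) and xlwt are third-party modules and must be installed first, for example with pip install beautifulsoup4 xlwt. xlwt writes the legacy .xls format, which is why the output file above ends in .xls.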

Part of the output is as follows:

[Screenshot: console output]

Part of the Excel file is as follows:

[Screenshot: the generated Excel sheet]

That is all for this article. I hope it helps you learn about Python web crawlers.
