Describes the basic method of the Python web crawler function.

Source: Internet
Author: User
Tags: python, web crawler

"Web crawler" is a figurative name: if the Internet is compared to a spider's web, then a spider crawling across that web is a web crawler.

1. Web Crawler Definition

Web crawlers find web pages by following link addresses. Starting from one page of a website (usually the homepage), a crawler reads the content of that page, finds the other link addresses it contains, and then uses those addresses to reach the next pages. It repeats this process until all the web pages of the website have been crawled. If the whole Internet is regarded as one website, a web spider can use this principle to capture every web page on the Internet. In this sense, a web crawler is simply a program that fetches pages and follows links; its basic operation is to capture webpages.
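The repeat-until-done loop described above can be sketched as a breadth-first traversal over a queue of pending URLs and a set of visited ones. In this sketch, fetch_links and the FAKE_WEB link graph are hypothetical stand-ins for real page downloading and link extraction:

```python
from collections import deque

# Toy link graph standing in for real pages; in a real crawler,
# fetch_links would download the page and extract its <a href> targets.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def fetch_links(url):
    return FAKE_WEB.get(url, [])

def crawl(start_url):
    """Breadth-first crawl: visit each reachable page exactly once."""
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:          # skip pages we have already captured
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return visited
```

The visited set is what keeps the loop from running forever when pages link back to each other.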

2. Web Page Browsing Process

In fact, capturing a webpage is the same process as browsing one in a browser such as IE. For example, enter www.baidu.com in the browser's address bar.

The process of opening a webpage is actually the browser, acting as the "client", sending a request to the server, "grabbing" the server's file, and then interpreting and presenting it.

HTML is a markup language that uses tags to mark up content so it can be parsed and distinguished. The browser's job is to parse the HTML code it obtains and render that source code into the website page we actually see.

3. Python-Based Web Crawlers

1) Obtain the html page using Python

In fact, the most basic page capture needs only two lines:

import urllib2
content = urllib2.urlopen('http://XXXX').read()

In this way, we obtain the entire html document. The key issue is that we usually need only the useful information from this document, not the whole thing, and that requires parsing html that is filled with all kinds of tags.
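The snippet above is Python 2; in Python 3, urllib2 was merged into urllib.request. A minimal sketch of the same fetch in Python 3 (the fetch helper name and the utf-8 decoding are this sketch's own choices, not from the original):

```python
# Python 3 equivalent of the urllib2 snippet above:
# urllib2.urlopen lives at urllib.request.urlopen in Python 3,
# and read() returns bytes, so we decode to get text.
from urllib.request import urlopen

def fetch(url):
    """Return the body of the resource at url as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

urlopen also accepts file:// URLs, which is handy for testing a parser against a local page without hitting the network.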

2) Parse the html after the Python crawler fetches the page

Python crawler html Parser library SGMLParser

Python ships with HTMLParser, SGMLParser, and so on by default (note that sgmllib, which provides SGMLParser, exists only in Python 2). The former is awkward to use, so here is a sample program using SGMLParser:

import urllib2
from sgmllib import SGMLParser

class ListName(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = ""        # truthy while inside an h4 tag
        self.name = []         # collected h4 texts

    def start_h4(self, attrs):
        self.is_h4 = 1

    def end_h4(self):
        self.is_h4 = ""

    def handle_data(self, text):
        if self.is_h4 == 1:
            self.name.append(text)

content = urllib2.urlopen('http://169it.com/xxx.htm').read()
listname = ListName()
listname.feed(content)
for item in listname.name:
    print item.decode('gbk').encode('utf8')

A class called ListName is defined here, inheriting the methods of SGMLParser. The is_h4 variable marks whether we are currently inside an h4 tag in the html file; whenever an h4 tag is encountered, the content inside it is appended to the list variable name. The start_h4() and end_h4() functions have the prototypes:

start_tagname(self, attrs)
end_tagname(self)

tagname is the tag name. For example, when <pre> is encountered, start_pre is called; when </pre> is encountered, end_pre is called. attrs holds the tag's attributes, returned in the form [(attribute, value), (attribute, value), ...].
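Since sgmllib was removed in Python 3, the same pattern can be sketched with the standard html.parser module, which uses generic handle_starttag/handle_endtag hooks instead of per-tag start_h4/end_h4 methods (the sample input string here is made up for illustration):

```python
# Python 3 sketch of the ListName idea using the standard library's
# html.parser instead of the removed sgmllib.SGMLParser.
from html.parser import HTMLParser

class ListName(HTMLParser):
    def __init__(self):
        super().__init__()
        self.is_h4 = False   # are we currently inside an <h4> tag?
        self.name = []       # collected h4 texts

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.is_h4 = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self.is_h4 = False

    def handle_data(self, data):
        if self.is_h4:
            self.name.append(data)

parser = ListName()
parser.feed("<div><h4>first</h4><p>skip</p><h4>second</h4></div>")
# parser.name now holds the text of every h4 tag in document order
```

The difference from SGMLParser is that tag dispatch is done by comparing the tag argument rather than by defining magically named start_h4/end_h4 methods.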

Python crawler html library pyQuery

PyQuery is a Python implementation of jQuery; it makes parsing HTML documents with jQuery syntax very convenient. Install it with easy_install pyquery or:

sudo apt-get install python-pyquery

Example:

from pyquery import PyQuery as pyq

doc = pyq(url=r'http://169it.com/xxx.html')
cts = doc('.market-cat')

for i in cts:
    print '====', pyq(i).find('h4').text(), '===='
    for j in pyq(i).find('.sub'):
        print pyq(j).text(),
    print '\n'

Python crawler html library BeautifulSoup

One headache is that most web pages do not fully comply with the standards, and all sorts of inexplicable errors make them hard to parse. To solve this problem, we can choose the famous BeautifulSoup to parse html documents; it has good fault tolerance.
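A minimal sketch of that fault tolerance, assuming the third-party beautifulsoup4 package is installed (pip install beautifulsoup4); the broken sample markup below is invented for illustration:

```python
# BeautifulSoup copes with malformed markup: the <p> tag below is never
# closed, yet both <h4> headings are still recovered cleanly.
from bs4 import BeautifulSoup

broken_html = "<div><h4>first</h4><p>skip<h4>second</h4></div>"
soup = BeautifulSoup(broken_html, "html.parser")
names = [h4.get_text(strip=True) for h4 in soup.find_all("h4")]
```

find_all walks the repaired tree, so the parser's recovery decisions are invisible to the extraction code.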

That is all the content of this article. It has analyzed and introduced the implementation of the Python web crawler function in detail; I hope it helps you learn.
