The basics of writing a Python web crawler

Source: Internet
Author: User
Tags: python, web crawler
This article mainly describes the basics of writing a Python web crawler. "Web crawler" (or "web spider") is a very vivid name: if the Internet is likened to a spider's web, then the crawler is a spider moving across that web. Readers interested in web crawlers can refer to this article.

The web crawler, or web spider, is a very vivid name. The Internet is likened to a spider's web, and the spider is a program crawling around that web.

1. The definition of web crawler

A web spider finds Web pages by following their URLs. Starting from one page of a site (usually the homepage), it reads the page's contents, finds the other links in the page, and then follows those links to the next pages, continuing this cycle until every page of the site has been crawled. If the entire Internet is viewed as one big site, a web spider can use this same principle to crawl all of its pages. In short, a web crawler is a program that fetches Web pages, and its basic operation is crawling pages.
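The crawl cycle described above can be sketched as a breadth-first traversal. This is a minimal Python 3 sketch under one assumption: a made-up in-memory link graph (SITE) stands in for the real "fetch a URL and extract its links" step, so the example runs without any network access.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
# A real crawler would fetch the URL and parse links out of the HTML.
SITE = {
    "/index": ["/about", "/news"],
    "/about": ["/index"],
    "/news": ["/news/1", "/news/2"],
    "/news/1": [],
    "/news/2": ["/index"],
}

def crawl(start):
    """Breadth-first crawl: read a page, queue its unseen links, repeat."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)            # "read the contents of the page"
        for link in SITE[page]:       # "find the other links in the page"
            if link not in seen:      # avoid crawling the same page twice
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/index"))  # -> ['/index', '/about', '/news', '/news/1', '/news/2']
```

The `seen` set is what stops the cycle from running forever on pages that link back to the homepage.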

2. The process of browsing the web

The process of crawling a Web page is the same as what happens when a reader browses the Web with a browser such as Internet Explorer. For example, you enter the address www.baidu.com in the browser's address bar.

Opening a Web page actually means that the browser, acting as the browsing "client", sends a request to the server; the server "fetches" the requested file back to the local machine, where the browser then interprets and displays it.

HTML is a markup language that tags content so it can be parsed and differentiated. The browser's job is to parse the HTML it receives and turn that source code into the site page we actually see.

3. Python-based web crawler capabilities

1). Fetching an HTML page with Python

In fact, the most basic fetch of a site takes just two lines:


import urllib2
content = urllib2.urlopen('http://XXXX').read()

This gets the entire HTML document. The key problem is that we usually need only the useful information in it, not the whole document, and that requires parsing HTML that is full of all kinds of tags.
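The urllib2 module above is Python 2. For reference, here is a hedged Python 3 sketch of the same two-line fetch: urllib2 was merged into urllib.request, and a data: URL stands in for a real site address so the snippet runs without a network connection.

```python
import urllib.request

# Python 2's urllib2.urlopen(...) became urllib.request.urlopen(...).
# Against a real site it would be:
#   content = urllib.request.urlopen('http://XXXX').read()
# Here a data: URL carries the page inline so this runs offline.
content = urllib.request.urlopen(
    'data:text/html,<html><body>hello</body></html>').read()
print(content)  # b'<html><body>hello</body></html>'
```

Note that read() returns bytes in Python 3; decode them (e.g. content.decode('utf-8')) before treating the result as text.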

2). Parsing the HTML after the crawler fetches a page

Python crawler HTML parsing library: sgmllib's SGMLParser

Python ships with HTMLParser, SGMLParser, and other parsers by default. The former is really too hard to use, so I wrote a sample program with SGMLParser:


import urllib2
from sgmllib import SGMLParser

class ListName(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = ""
        self.name = []

    def start_h4(self, attrs):
        self.is_h4 = 1

    def end_h4(self):
        self.is_h4 = ""

    def handle_data(self, text):
        if self.is_h4 == 1:
            self.name.append(text)

content = urllib2.urlopen('http://169it.com/xxx.htm').read()
listname = ListName()
listname.feed(content)
for item in listname.name:
    print item.decode('gbk').encode('utf-8')

Quite simply, this defines a class called ListName that inherits the methods of SGMLParser. The variable is_h4 is used as a flag to mark h4 tags in the HTML file: when an h4 tag is encountered, the tag's contents are appended to the list variable name. The start_h4() and end_h4() functions deserve explanation; their prototypes come from SGMLParser:


start_tagname(self, attrs)
end_tagname(self)

tagname is the name of the tag; for example, when <pre> is encountered, start_pre is called, and when </pre> is encountered, end_pre is called. attrs holds the tag's attributes, passed in as a list of the form [(attribute, value), (attribute, value), ...].
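The sgmllib module was removed in Python 3, so the ListName example above no longer runs there. This is a hedged sketch of the same h4-collecting parser using the standard library's html.parser instead; the inline HTML string is made-up sample data standing in for a fetched page.

```python
from html.parser import HTMLParser

class ListName(HTMLParser):
    """Collect the text inside every <h4> tag, like the SGMLParser example."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.is_h4 = False
        self.name = []

    def handle_starttag(self, tag, attrs):
        # html.parser has no per-tag start_h4/end_h4 hooks;
        # instead one generic callback dispatches on the tag name.
        if tag == "h4":
            self.is_h4 = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self.is_h4 = False

    def handle_data(self, data):
        if self.is_h4:
            self.name.append(data)

# Sample data in place of urlopen(...).read() on a real page.
content = "<div><h4>First</h4><p>skip</p><h4>Second</h4></div>"
parser = ListName()
parser.feed(content)
print(parser.name)  # ['First', 'Second']
```

The structure is the same as the sgmllib version; only the callback names change.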

Python crawler HTML parsing library: pyquery

pyquery is a Python implementation of jQuery, and it is very convenient for manipulating and parsing HTML documents with jQuery-style syntax. It must be installed before use, e.g. with easy_install pyquery, or under Ubuntu:


sudo apt-get install python-pyquery

Here is an example:


from pyquery import PyQuery as pyq

doc = pyq(url=r'http://169it.com/xxx.html')
cts = doc('.market-cat')
for i in cts:
    print '====', pyq(i).find('h4').text(), '===='
    for j in pyq(i).find('.sub'):
        print pyq(j).text(),
    print '\n'

Python crawler HTML parsing library: BeautifulSoup

A headache-inducing problem is that most Web pages are not written fully according to standards, and all sorts of inexplicable errors make you want to track down the page's author and beat him up. To solve this problem, we can choose the famous BeautifulSoup to parse HTML documents; it has good fault tolerance.
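A minimal sketch of that fault tolerance, assuming the bs4 package is installed (pip install beautifulsoup4); the broken HTML string is made-up sample data, not a real page.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy HTML: a stray </i> with no opener,
# and a <b> tag that is never closed.
broken = "<html><body></i><b>bold text"
soup = BeautifulSoup(broken, "html.parser")

# BeautifulSoup tolerates both errors and still builds a usable tree:
# the stray end tag is ignored, and open tags are closed at end of input.
print(soup.b.get_text())  # bold text
```

Strict parsers would choke or mis-handle markup like this, which is why BeautifulSoup is the usual choice for real-world pages.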
