Describes the basic method of the Python web crawler function.

Source: Internet
Author: User
Tags: python, web crawler

"Web crawler" is a figurative name: if the Internet is compared to a spider's web, then a spider crawling across that web is a web crawler.

1. Web Crawler Definition

Web crawlers find web pages by following link addresses. Starting from one page of a website (usually the homepage), a crawler reads the content of that page, finds the other link addresses it contains, and then uses those addresses to reach the next pages. It repeats this process until all the web pages of the website have been crawled. If the whole Internet is regarded as one website, a web spider can use this principle to capture every web page on the Internet. In this sense, a web crawler is simply a program that fetches pages and follows links; its basic operation is to capture webpages.
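The repeat-until-done loop described above can be sketched as a breadth-first traversal over a queue of pending URLs and a set of visited ones. In this sketch, fetch_links and the FAKE_WEB link graph are hypothetical stand-ins for real page downloading and link extraction:

```python
from collections import deque

# Toy link graph standing in for real pages; in a real crawler,
# fetch_links would download the page and extract its <a href> targets.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def fetch_links(url):
    return FAKE_WEB.get(url, [])

def crawl(start_url):
    """Breadth-first crawl: visit each reachable page exactly once."""
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:          # skip pages we have already captured
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return visited
```

The visited set is what keeps the loop from running forever when pages link back to each other.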

2. Web Page Browsing Process

In fact, capturing a webpage is the same process as browsing one in a browser such as IE. For example, enter www.baidu.com in the browser's address bar.

The process of opening a webpage is actually the browser, acting as the "client", sending a request to the server, "grabbing" the server's file, and then interpreting and presenting it.

HTML is a markup language that uses tags to mark up content so it can be parsed and distinguished. The browser's job is to parse the HTML code it obtains and render that source code into the website page we actually see.

3. Python-Based Web Crawlers

1) Obtain the html page using Python

In fact, the most basic page capture needs only two lines:

import urllib2
content = urllib2.urlopen('http://XXXX').read()

In this way, we obtain the entire html document. The key issue is that we usually need only the useful information from this document, not the whole thing, and that requires parsing html that is filled with all kinds of tags.
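The snippet above is Python 2; in Python 3, urllib2 was merged into urllib.request. A minimal sketch of the same fetch in Python 3 (the fetch helper name and the utf-8 decoding are this sketch's own choices, not from the original):

```python
# Python 3 equivalent of the urllib2 snippet above:
# urllib2.urlopen lives at urllib.request.urlopen in Python 3,
# and read() returns bytes, so we decode to get text.
from urllib.request import urlopen

def fetch(url):
    """Return the body of the resource at url as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

urlopen also accepts file:// URLs, which is handy for testing a parser against a local page without hitting the network.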

2) Parse the html after the Python crawler fetches the page

Python crawler html Parser library SGMLParser

Python ships with HTMLParser, SGMLParser, and so on by default (note that sgmllib, which provides SGMLParser, exists only in Python 2). The former is awkward to use, so here is a sample program using SGMLParser:

import urllib2
from sgmllib import SGMLParser

class ListName(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = ""        # truthy while inside an h4 tag
        self.name = []         # collected h4 texts

    def start_h4(self, attrs):
        self.is_h4 = 1

    def end_h4(self):
        self.is_h4 = ""

    def handle_data(self, text):
        if self.is_h4 == 1:
            self.name.append(text)

content = urllib2.urlopen('http://169it.com/xxx.htm').read()
listname = ListName()
listname.feed(content)
for item in listname.name:
    print item.decode('gbk').encode('utf8')

A class called ListName is defined here, inheriting the methods of SGMLParser. The is_h4 variable marks whether we are currently inside an h4 tag in the html file; whenever an h4 tag is encountered, the content inside it is appended to the list variable name. The start_h4() and end_h4() functions have the prototypes:

start_tagname(self, attrs)
end_tagname(self)

tagname is the tag name. For example, when <pre> is encountered, start_pre is called; when </pre> is encountered, end_pre is called. attrs holds the tag's attributes, returned in the form [(attribute, value), (attribute, value), ...].
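Since sgmllib was removed in Python 3, the same pattern can be sketched with the standard html.parser module, which uses generic handle_starttag/handle_endtag hooks instead of per-tag start_h4/end_h4 methods (the sample input string here is made up for illustration):

```python
# Python 3 sketch of the ListName idea using the standard library's
# html.parser instead of the removed sgmllib.SGMLParser.
from html.parser import HTMLParser

class ListName(HTMLParser):
    def __init__(self):
        super().__init__()
        self.is_h4 = False   # are we currently inside an <h4> tag?
        self.name = []       # collected h4 texts

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.is_h4 = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self.is_h4 = False

    def handle_data(self, data):
        if self.is_h4:
            self.name.append(data)

parser = ListName()
parser.feed("<div><h4>first</h4><p>skip</p><h4>second</h4></div>")
# parser.name now holds the text of every h4 tag in document order
```

The difference from SGMLParser is that tag dispatch is done by comparing the tag argument rather than by defining magically named start_h4/end_h4 methods.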

Python crawler html library pyQuery

PyQuery is a Python implementation of jQuery; it makes parsing HTML documents with jQuery syntax very convenient. Install it with easy_install pyquery or:

sudo apt-get install python-pyquery

Example:

from pyquery import PyQuery as pyq

doc = pyq(url=r'http://169it.com/xxx.html')
cts = doc('.market-cat')

for i in cts:
    print '====', pyq(i).find('h4').text(), '===='
    for j in pyq(i).find('.sub'):
        print pyq(j).text(),
    print '\n'

Python crawler html library BeautifulSoup

One headache is that most web pages do not fully comply with the standards, and all sorts of inexplicable errors make them hard to parse. To solve this problem, we can choose the famous BeautifulSoup to parse html documents; it has good fault tolerance.
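A minimal sketch of that fault tolerance, assuming the third-party beautifulsoup4 package is installed (pip install beautifulsoup4); the broken sample markup below is invented for illustration:

```python
# BeautifulSoup copes with malformed markup: the <p> tag below is never
# closed, yet both <h4> headings are still recovered cleanly.
from bs4 import BeautifulSoup

broken_html = "<div><h4>first</h4><p>skip<h4>second</h4></div>"
soup = BeautifulSoup(broken_html, "html.parser")
names = [h4.get_text(strip=True) for h4 in soup.find_all("h4")]
```

find_all walks the repaired tree, so the parser's recovery decisions are invisible to the extraction code.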

That is all the content of this article. It has analyzed and introduced the implementation of the Python web crawler function in detail; I hope it helps you learn.
