Python web crawler implementation code

First, let's look at the Python libraries for fetching web pages: urllib and urllib2.

What is the difference between urllib and urllib2?
urllib2 can be thought of as an extension of urllib. Its obvious advantage is that urllib2.urlopen() accepts a Request object as a parameter, which lets you control the headers of the HTTP request. urllib2 should therefore be preferred when making HTTP requests, but urllib.urlretrieve(), urllib.quote(), urllib.unquote(), and related helpers were never carried over to urllib2, so urllib is still needed as well.
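A minimal sketch of both points, assuming Python 2 (the URL and header value are placeholders):

import urllib
import urllib2

# urllib2.urlopen() accepts a Request object, so custom headers can be set
request = urllib2.Request('http://www.baidu.com',
                          headers={'User-Agent': 'Chrome'})
response = urllib2.urlopen(request)
print response.getcode()

# These helpers exist only in urllib, not urllib2
print urllib.quote('a b/c')    # URL-encode: 'a%20b/c'
print urllib.unquote('a%20b')  # decode back: 'a b'
urllib.urlretrieve('http://www.baidu.com', 'index.html')  # save to a local file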

The URL passed to urllib.urlopen() must use a supported scheme, such as http, ftp, or file. For example:

urllib.urlopen('http://www.baidu.com')
urllib.urlopen('file:D:\\Python\\Hello.py')

Here is an example that downloads all GIF images from a web page. The Python code is as follows:

import re
import urllib

def getHtml(url):
    # Fetch the page and return its HTML as a string
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Find every src="...gif" attribute and download each image
    reg = r'src="(.*?\.gif)"'
    imgre = re.compile(reg)
    imgList = re.findall(imgre, html)
    print imgList
    cnt = 1
    for imgurl in imgList:
        # Save with a .gif extension (the original saved GIFs as '%s.jpg')
        urllib.urlretrieve(imgurl, '%s.gif' % cnt)
        cnt += 1

if __name__ == '__main__':
    html = getHtml('http://www.baidu.com')
    getImg(html)

Based on the above approach, we can fetch a web page and then extract the data we need.
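One caveat: the regular expression captures the src attribute verbatim, so relative image URLs will make urlretrieve() fail. A minimal variant (a sketch, not part of the original example) that resolves relative paths against the page URL with urlparse.urljoin:

import re
import urllib
import urlparse

def getImgAbs(html, base_url):
    # Same regex as above, but join each match against the page URL
    imgList = re.findall(r'src="(.*?\.gif)"', html)
    for cnt, imgurl in enumerate(imgList, 1):
        absurl = urlparse.urljoin(base_url, imgurl)  # handles relative src paths
        urllib.urlretrieve(absurl, '%s.gif' % cnt)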

In fact, crawling with the urllib module alone is quite inefficient. Next we introduce Tornado Web Server.
Tornado is a lightweight, highly scalable, non-blocking I/O web server written in Python; the well-known FriendFeed website was built with it. Unlike most mainstream (mainly Python) web server frameworks, Tornado uses epoll for non-blocking I/O, responds quickly, can handle thousands of concurrent connections, and is especially suitable for real-time web services.

Fetching web pages with Tornado's HTTP client is more efficient.
As noted on Tornado's official website, the backports.ssl_match_hostname package also needs to be installed. The official site is:

http://www.tornadoweb.org/en/stable/

import tornado.httpclient

def Fetch(url):
    # Build the request with a custom User-Agent and generous timeouts (seconds)
    http_header = {'User-Agent': 'Chrome'}
    http_request = tornado.httpclient.HTTPRequest(
        url=url, method='GET', headers=http_header,
        connect_timeout=200, request_timeout=600)
    http_client = tornado.httpclient.HTTPClient()  # synchronous (blocking) client
    print 'Start downloading data...'
    http_response = http_client.fetch(http_request)
    print 'Finish downloading data...'
    print http_response.code  # HTTP status code
    for field in http_response.headers.get_all():  # response headers
        print field
    print http_response.body  # page content

if __name__ == '__main__':
    Fetch('http://www.baidu.com')
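The real gain comes from Tornado's non-blocking client, which can keep many fetches in flight at once. A minimal sketch (not from the original article) using AsyncHTTPClient and the IOLoop, written against the classic callback-style Tornado API; the URL list is a placeholder:

import tornado.httpclient
import tornado.ioloop

urls = ['http://www.baidu.com', 'http://www.tornadoweb.org']  # placeholder URLs
pending = [len(urls)]  # count of fetches still in flight (list avoids 'global')

def handle_response(response):
    if response.error:
        print 'Error:', response.error
    else:
        print response.effective_url, response.code, len(response.body)
    pending[0] -= 1
    if pending[0] == 0:
        tornado.ioloop.IOLoop.instance().stop()  # all done, exit the loop

http_client = tornado.httpclient.AsyncHTTPClient()
for url in urls:
    http_client.fetch(url, handle_response)  # returns immediately
tornado.ioloop.IOLoop.instance().start()  # run until stop() is called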

Commonly used methods on the response object returned by urllib2.urlopen() (see the sketch after this list):

(1) info() returns the header information of the response.

(2) getcode() returns the HTTP status code.

(3) geturl() returns the URL that was actually retrieved, which may differ from the input URL after redirects.

(4) read() reads the body content.
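A minimal sketch exercising all four methods (Python 2; the URL is a placeholder):

import urllib2

response = urllib2.urlopen('http://www.baidu.com')
print response.info()     # response headers
print response.getcode()  # e.g. 200
print response.geturl()   # final URL after any redirects
html = response.read()    # body content as a string
print len(html)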
