[Python] Web crawler (ii): Using urllib2 to crawl web content via a specified URL


Version: Python 2.7.5. (Python 3 changed this area significantly: the urllib2 module was split into urllib.request and urllib.error.)
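For readers on Python 3, here is a minimal sketch of the equivalent of the first example below:

import urllib.request  # Python 3: urllib2's urlopen now lives here

response = urllib.request.urlopen('http://www.sina.com/')
html = response.read()  # returns bytes in Python 3; call .decode() for text
print(html)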

So-called web crawling means reading the network resource at a specified URL from the network stream and saving it locally.
It is similar to using a program to simulate what a browser does: send the URL to the server as an HTTP request, then read back the resource in the server's response.
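As a minimal sketch of that idea (the output filename sina.html is just an illustrative choice), reading a URL from the network stream and saving it locally looks like this:

import urllib2

# Read the resource at the specified URL from the network stream...
response = urllib2.urlopen('http://www.sina.com/')
html = response.read()

# ...and save it to a local file (the filename is arbitrary).
with open('sina.html', 'wb') as f:
    f.write(html)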


In Python, we use the urllib2 module to crawl web pages.
urllib2 is a Python component for fetching URLs (Uniform Resource Locators).

It provides a very simple interface in the form of the urlopen function.

The simplest urllib2 application takes only four lines of code.

Let's create a new file, urllib2_test01.py, to see urllib2 in action:

import urllib2

response = urllib2.urlopen('http://www.sina.com/')
html = response.read()
print html

Press F5 to run it and see the result:

If we open the Sina homepage in a browser, right-click, and select "View Page Source" (this works in Firefox or Chrome), we will find exactly the same content.

In other words, those four lines of code print out everything the browser receives when it visits Sina.

This is one of the simplest examples of urllib2.

In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.
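For instance, the very same urlopen call can read a local file through the "file:" scheme; a small sketch (the path below is only a placeholder):

import urllib2

# The same interface handles non-HTTP schemes; here, a local file.
# Substitute any file that exists on your machine.
response = urllib2.urlopen('file:///tmp/example.txt')
print response.read()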

HTTP is based on a request-and-response mechanism: the client sends a request, and the server returns a response.

urllib2 uses a Request object to represent the HTTP request you are making.

In its simplest form, you create a Request object with the address you want to fetch.

Calling urlopen with the Request object returns a response object for the requested URL.

This response object behaves like a file object, so you can call .read() on it.
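Because the response behaves like a file, you can also read it piece by piece instead of all at once; a small sketch:

import urllib2

response = urllib2.urlopen('http://www.sina.com/')

# Like a file, the response supports partial reads...
head = response.read(100)      # just the first 100 bytes
# ...and line-oriented reads.
next_line = response.readline()

print head
print next_line
response.close()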

Let's create a new file, urllib2_test02.py, to try it out:

import urllib2

req = urllib2.Request('http://www.sina.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page

You can see that the output is the same as in test01.

urllib2 uses the same interface to handle all URL schemes. For example, you can create an FTP request as follows:

req = urllib2.Request('ftp://example.com/')

HTTP requests allow you to do two additional things.

1. Sending form data

Anyone who has done web development will be familiar with this.

Sometimes you want to send data to a URL (often a URL that points at a CGI (Common Gateway Interface) script or some other web application).

In HTTP, this is usually done with the well-known POST request.

This is what your browser does when you submit an HTML form.

Not all POSTs come from forms; you can also use POST to submit arbitrary data to your own application.

For an ordinary HTML form, the data needs to be encoded in a standard way and then passed to the Request object as the data argument.

The encoding is done with functions from urllib, not urllib2.

Let's create a new file, urllib2_test03.py, to try it:
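A minimal sketch of such a POST (the URL and the form fields below are illustrative placeholders, not values from the original listing):

import urllib
import urllib2

url = 'http://www.example.com/cgi-bin/register.cgi'  # placeholder URL

# Illustrative form fields.
values = {'name': 'Michael',
          'language': 'Python'}

# Encode the form data with urllib (not urllib2)...
data = urllib.urlencode(values)

# ...then pass it as the data argument; this turns the request into a POST.
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page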
