Crawling Web Pages in Python with urllib2


1. Use the Python library urllib2, specifically its urlopen function and its Request class.

2. The urlopen method prototype

urllib2.urlopen(url[, data][, timeout])

url is the target Web page address; it can be a string or a Request object.

data is the data submitted to the target server by POST.

timeout specifies a timeout for the request, in seconds.

Example:


>>> import urllib2
>>> socket = urllib2.urlopen("http://www.baidu.com")
>>> content = socket.read()
>>> socket.close()
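The data and timeout parameters are not exercised above. Here is a minimal sketch of both; the endpoint URL and form field are hypothetical, the urlopen call pattern is the point:

import urllib
import urllib2

# urlencode the form fields; passing a data argument makes urlopen issue a POST
data = urllib.urlencode({'q': 'python'})  # hypothetical form field
socket = urllib2.urlopen("http://www.example.com/search", data, 10)  # 10-second timeout
content = socket.read()
socket.close()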

 

In this way the contents of the Web page are pulled down. However, some sites prohibit crawlers, and a direct request to such a site fails with the following error:

urllib2.HTTPError: HTTP Error 403: Forbidden
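A script can catch this failure explicitly instead of crashing; a minimal sketch:

import urllib2

try:
    socket = urllib2.urlopen("http://www.baidu.com")
except urllib2.HTTPError as e:
    # e.code holds the HTTP status code, e.g. 403
    print "request failed with status", e.code
else:
    content = socket.read()
    socket.close()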

The workaround is to use the Request class to add browser header information to the request, masquerading as an ordinary browser visit:

3. The Request method prototype

urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

where:

url is the target Web page address; it can be a string or a Request object.

data is the data that the POST method submits to the target server.

headers is a dictionary of HTTP headers, the most important being the user agent. Some sites do not allow scripts to crawl them, so the request needs a browser User-Agent; Firefox's, for example, looks like: Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11. The standard UA format is: browser identification (operating system identification; encryption level identification; browser language) rendering-engine identification version information. If headers is omitted, the default user agent is Python-urllib/2.6.

origin_req_host is the host name or IP address of the original request.

See an example:

>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
>>> req = urllib2.Request(url="http://blog.csdn.net/deqingguo", headers=headers)
>>> socket = urllib2.urlopen(req)
>>> content = socket.read()
>>> socket.close()
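The headers can also be attached after construction with the Request.add_header method, for example:

>>> req = urllib2.Request("http://blog.csdn.net/deqingguo")
>>> req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11')
>>> socket = urllib2.urlopen(req)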

Example 2:

import urllib2 as ul

url = 'http://www.dd.com/products?selected.classification=primary+antibodies&selected.researchareas=metabolism--types+of+disease--cancer'
headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
req = ul.Request(url, headers=headers)
f = ul.urlopen(req)
content = f.read()
print f.getcode()
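Besides getcode(), the response object returned by urlopen also offers geturl() and info(), sketched here as a continuation of the example above:

print f.geturl()   # the final URL, after any redirects
print f.info()     # the response headers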


