Python3 Crawler: Using urllib

Source: Internet
Author: User
Tags: urlencode

Content references: 1. https://www.cnblogs.com/lands-ljk/p/5447127.html 2. https://cuiqingcai.com/947.html

1. Pulling a web page down in minutes

How do we grab a web page? In fact, we fetch its information based on its URL. What we see in the browser is a pretty page, but that is only the browser's rendering; in essence it is a piece of HTML code, plus JS and CSS. If a web page were a person, HTML would be his skeleton, JS his muscles, and CSS his clothes. The most important part lives in the HTML, so let's write an example that pulls a page down.

from urllib.request import urlopen

response = urlopen("http://www.baidu.com")
html = response.read().decode('utf-8')
print(html)

Yes, you read that right: the real program is just two lines. Save it as demo.py, switch into the file's directory, run python demo.py, and take a look at the result.

2. How to analyze the Web page

Let's analyze those two lines of code. The first line:

response = urlopen("http://www.baidu.com")

First we call the urlopen method of the request module inside the urllib library and pass in a URL. This URL is the Baidu homepage, and its protocol is HTTP; you can also replace HTTP with ftp, file, https, and so on, each of which just represents an access protocol. urlopen generally accepts three parameters. Here is its signature, with a short usage sketch after the parameter list:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

- url: the URL that needs to be opened

- data: the data submitted with a POST request

- timeout: the timeout, in seconds, for accessing the site
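
A minimal sketch exercising the timeout parameter, using the same URL as above (note that URLError and socket.timeout are both OSError subclasses, so one except clause covers network failures and timeouts):

from urllib.request import urlopen

try:
    # give up if the server does not respond within 10 seconds
    response = urlopen("http://www.baidu.com", timeout=10)
    print(response.read().decode('utf-8'))
except OSError as e:
    print("request failed:", e)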

The first parameter, url, must be passed in; in this example we passed Baidu's URL. After the urlopen call executes, it returns a response object, and the returned information is stored in it.

If you do not add read(), printing the response directly only shows a description of the object. So remember to add the read method, or don't blame me when no content comes out!
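
A quick comparison of the two (a sketch; the exact object description varies by Python version):

from urllib.request import urlopen

response = urlopen("http://www.baidu.com")
print(response)                         # something like <http.client.HTTPResponse object at 0x...>
print(response.read().decode('utf-8'))  # the actual HTML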

Fetching a page directly with urlopen() from the urllib.request module gives you data of type bytes; it needs decode() to convert it into type str.

from urllib import request

response = request.urlopen(r'http://python.org/')

The object returned by urlopen provides the following methods (a short sketch follows the list):

- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data

- info(): returns an HTTPMessage object representing the header information returned by the remote server

- getcode(): returns the HTTP status code; for an HTTP request, 200 means the request completed successfully, 404 means the URL was not found

- geturl(): returns the requested URL
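
A short sketch exercising these methods on the same python.org response:

from urllib import request

response = request.urlopen(r'http://python.org/')
print(response.getcode())  # 200 if the request completed successfully
print(response.geturl())   # the URL that was actually fetched, after any redirects
print(response.info())     # HTTPMessage holding the response headers
body = response.read()     # bytes; use decode() to get str
response.close()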

3. Constructing a Request

In fact, the url parameter of urlopen above can also be passed a Request object: an instance of the Request class, constructed with the URL, the data, and so on. The two lines of code above can be rewritten like this:

from urllib import request

req = request.Request("http://www.baidu.com")
print(req)
response = request.urlopen(req)
html = response.read()
html = html.decode('utf-8')
print(html)

The result is exactly the same, except that there is now a Request object in the middle. Writing it this way is recommended, because a lot of content often needs to be added when building a request, and modeling the exchange as "build a request, the server responds to it" is logically clear.

urllib.request.Request(url, data=None, headers={}, method=None)

Use Request() to wrap the request, then fetch the page through urlopen().

Another example:

from urllib import request

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)
page = request.urlopen(req).read()
page = page.decode('utf-8')

Header fields commonly used to dress up the request (a sketch of adding headers one at a time follows the list):

- User-Agent: this header can carry information such as the browser name and version, the operating system name and version, and the default language

- Referer: can be used to prevent hotlinking; some sites that serve images from http://***.com check the Referer to verify where the request came from

- Connection: indicates the state of the connection and keeps track of the session (e.g. keep-alive)
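
As an aside, headers can also be attached one at a time after the Request is constructed, via its add_header method; a minimal sketch (the User-Agent string is shortened here for readability):

from urllib import request

req = request.Request(r'http://www.lagou.com/zhaopin/Python/?labelWords=label')
# equivalent to passing the same key in the headers dict above
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')
page = request.urlopen(req).read().decode('utf-8')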

4. POST and GET data transfer

The program above demonstrates the most basic kind of page grab. However, most websites today are dynamic pages that require you to pass parameters to them dynamically, and they respond accordingly. So when visiting, we often need to send data along. What is the most common case? Logging in, of course.

We send the username and password to a URL, and we get the server's response after it processes them. How is that done? Let me spell it out for you!

Data transmission is divided into two modes, POST and GET. What is the difference between the two?

The most important difference is that GET is accessed directly as a link: the link contains all the parameters. That is of course an insecure choice if the parameters include a password, but it lets you see at a glance what you submitted. POST does not display the parameters on the URL, but then it is not very convenient if you want to inspect directly what is being submitted. Choose whichever is appropriate.

(1) POST

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The data parameter of urlopen() defaults to None; when the data parameter is not empty, urlopen() submits the request as a POST.

An example:

from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')
req = request.Request(url, data=data, headers=headers)
page = request.urlopen(req).read()
page = page.decode('utf-8')
print(page)

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)

The main job of urlencode() is to attach the data to be submitted to the URL as a query string.

data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')

After urlencode() converts it, the data is first=true&pn=1&kd=Python (the pairs are joined with &), and the final URL submitted is

http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python

The data for a POST must be bytes or an iterable of bytes, not str, which is why encode() is required.
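
A quick check of what urlencode() produces and why encode() is needed:

from urllib import parse

query = parse.urlencode({'first': 'true', 'pn': 1, 'kd': 'Python'})
print(query)         # first=true&pn=1&kd=Python -- a str
data = query.encode('utf-8')
print(type(data))    # <class 'bytes'> -- the form POST data must take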

Of course, data can also be passed in directly as a parameter of urlopen().
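
A minimal sketch of that variant, reusing the url, headers, and data from the example above:

req = request.Request(url, headers=headers)
page = request.urlopen(req, data=data).read().decode('utf-8')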

Another example:

Now suppose we want to simulate logging in to CSDN. Of course, the code below may well not get you in, because CSDN also has a serial-number field that is not set here, and the full flow is more complex than we will write up; this is just to illustrate the principle of a login. Most login sites are generally written this way.

values = {'username': '[email protected]', 'password': 'xxxx'}
data = parse.urlencode(values).encode('utf-8')

We define a dictionary named values, whose parameters I set to the username and password. We then encode the dictionary with urllib's urlencode method and name the result data. When building the Request we pass two parameters, url and data. Run the program, and it returns the content of the page rendered after the POST.
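
Putting the pieces together, a minimal sketch of the whole POST login (the endpoint is the one used in the GET example below; a real login would involve more fields):

from urllib import request, parse

values = {'username': '[email protected]', 'password': 'xxxx'}  # placeholder credentials
data = parse.urlencode(values).encode('utf-8')
url = 'http://passport.csdn.net/account/login'
req = request.Request(url, data=data)
page = request.urlopen(req).read().decode('utf-8')
print(page)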

(2) GET

As for GET, we can write the parameters directly into the URL, building a URL that already carries the parameters.

from urllib import request, parse

values = {'username': '[email protected]', 'password': 'xxxx'}
data = parse.urlencode(values)  # for GET, keep the query string as str (no encode to bytes)
url = 'http://passport.csdn.net/account/login'
geturl = url + '?' + data
req = request.Request(geturl)
page = request.urlopen(req).read()
page = page.decode('utf-8')
print(page)

You can print geturl and see that the resulting URL is just the original URL with a '?' appended, followed by the encoded parameters, which is the same form as the URLs we visit every day. This is how data is transferred via GET.
