1. Pulling a web page down in minutes
How do we pull a web page down? Essentially, we fetch the page's information based on its URL. What we see in the browser is a nicely rendered page, but that is only the browser's interpretation of what is, at its core, a piece of HTML code, plus JS and CSS. If we compare a page to a person, HTML is the skeleton, JS the muscles, and CSS the clothes. The most important part therefore lives in the HTML, so let's write an example to pull a page down.
Python
# Set a proxy IP if needed; proxy IPs can be obtained at http://zhimaruanjian.com/
import urllib2
response = urllib2.urlopen("http://www.baidu.com")
print response.read()
Yes, you read that right: the real program is only two lines. Save it as demo.py, go to the directory containing the file, and execute the following command to see the result for yourself.
python demo.py
Look, the source code of this page has been pulled down for us. Isn't that satisfying?
2. Analyzing how the page is fetched
Now let's analyze these two lines of code. The first line:
Python
response = urllib2.urlopen("http://www.baidu.com")
First we call the urlopen method inside the urllib2 library, passing in a URL. This URL is the Baidu homepage, and the protocol is HTTP. Of course, you can also replace HTTP with FTP, FILE, HTTPS, and so on; it simply denotes the access protocol. urlopen generally accepts three parameters, as follows:
Python
urlopen(url, data, timeout)
The first parameter, url, is the URL to request; the second, data, is the data to send when accessing the URL; and the third, timeout, sets the timeout.
The second and third parameters are optional: data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.
The first parameter, the url, is required; in this example we passed the URL of Baidu. After the urlopen method executes, it returns a response object, and the returned information is stored in it.
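As a quick illustration of the optional parameters, here is a minimal sketch that leaves data at its default and only sets the timeout (the 5-second value is just an example):
Python
import urllib2

# data is left at its default (None); only the timeout is set here.
# The 5-second value is just an example.
response = urllib2.urlopen("http://www.baidu.com", None, 5)
# Equivalently, using a keyword argument:
# response = urllib2.urlopen("http://www.baidu.com", timeout=5)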
Python
print response.read()
The response object has a read method that returns the content of the fetched web page.
What if we print the response directly without calling read? The answer looks like this:
Python
<addinfourl at 139728495260376 whose fp = <socket._fileobject object at 0x7f1513fb3ad0>>
It prints a description of the object instead, so remember to add the read method; otherwise, don't blame me when no content comes out!
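Besides read, the response object returned by urlopen also provides a few other standard-library methods that are handy when debugging; here is a minimal sketch:
Python
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.getcode()  # HTTP status code, e.g. 200
print response.geturl()   # the URL actually retrieved (after any redirects)
print response.info()     # the response headers
print response.read()     # the page content itself, as a string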
3. Constructing a Request
In fact, the url argument to urlopen above can also be a Request object, an instance of the Request class, which is constructed by passing in the URL, data, and so on. We can rewrite the two lines of code above like this:
Python
import urllib2

request = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(request)
print response.read()
The result is exactly the same, except that there is a Request object in the middle. Writing it this way is recommended, because building a Request often requires adding quite a bit of extra content, and having the server respond to a request we construct makes the logic clearer.
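For instance, one common piece of extra content is a request header such as User-Agent; here is a minimal sketch of attaching one to a Request (the header value is only an illustrative example):
Python
import urllib2

request = urllib2.Request("http://www.baidu.com")
# Add a User-Agent header; the value below is just an example string
request.add_header("User-Agent", "Mozilla/5.0")
response = urllib2.urlopen(request)
print response.read()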
4. POST and GET data transfer
The program above demonstrates the most basic page fetch. However, most websites today are dynamic pages, which require you to pass parameters to them dynamically so that they can respond accordingly. So when we visit, we need to send data to them. What is the most common case? That's right: logging in or registering.
The username and password are sent to a URL, and you get the server's response after it processes them. What should we do? Let me explain it for you!
Data transfer comes in two forms, POST and GET. What's the difference between the two?
The most important difference is that the GET method is accessed directly as a link that contains all the parameters; this is of course unsafe if a password is included, but it lets you see at a glance what you submitted. POST does not show the parameters on the URL, which is safer but less convenient if you want to inspect directly what was submitted. Choose whichever suits your situation.
POST method:
Remember the data parameter mentioned above? This is where it is used: the data we transmit is this parameter, and supplying it makes the request a POST.
Python
import urllib
import urllib2

values = {"username": "[email protected]", "password": "XXXX"}
data = urllib.urlencode(values)
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()
We imported the urllib library, and here we simulate logging in to CSDN. Of course, the code above may not actually log you in, because there is still some header-setting work to do and some parameters are not fully configured; those haven't been covered yet, so they are not included here. This only illustrates the principle of logging in. We define a dictionary named values, in which I set the username and password parameters, then use urllib's urlencode method to encode the dictionary into a string named data. When we build the Request we pass two arguments, url and data. Run the program, and the login can be performed; what is returned is the rendered content of the page after login. Of course, you can also build your own server to test this.
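When such a request is refused (for example because headers or parameters are missing), urlopen raises an exception rather than returning a response; here is a minimal sketch of catching it with urllib2's standard error classes:
Python
import urllib
import urllib2

values = {"username": "[email protected]", "password": "XXXX"}
data = urllib.urlencode(values)
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
try:
    response = urllib2.urlopen(urllib2.Request(url, data))
    print response.read()
except urllib2.HTTPError, e:
    # The server replied, but with an error status such as 403 or 404
    print "HTTP error:", e.code
except urllib2.URLError, e:
    # The server could not be reached at all
    print "URL error:", e.reason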
Note that there is another way to define the dictionary above; the following notation is equivalent:
Python
import urllib
import urllib2

values = {}
values['username'] = "[email protected]"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()
The method above implements data transfer via POST.
GET method:
With the GET method, we can write the parameters directly into the URL, building a URL that carries the parameters.
Python
import urllib
import urllib2

values = {}
values['username'] = "[email protected]"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
You can print geturl to see the resulting URL: it is just the original URL with a "?" appended, followed by the encoded parameters.
Python
http://passport.csdn.net/account/login?username=1016903103%40qq.com&password=XXXX
This is exactly like the GET requests we make every day, which gives us another way to send the data.
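To see concretely what urlencode produces, you can also print the encoded string by itself; here is a minimal sketch (note that dictionary ordering is not guaranteed, so the two pairs may appear in either order):
Python
import urllib

values = {'username': '[email protected]', 'password': 'XXXX'}
print urllib.urlencode(values)
# Prints something like: username=1016903103%40qq.com&password=XXXX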
This section explained some basic usage, with which you can already fetch some basic web page information. Keep at it, friends!