4.1 urllib--open any resource via URL--2

Source: Internet
Author: User

At this point we have successfully crawled a web page. How do we save the retrieved page to a local file in web-page form?

The idea is as follows:
1. First crawl a web page and assign the crawled content to a variable.
2. Open a local file for writing, named with a web-page extension such as *.html.
3. Write the value of the variable from step 1 to the file.
4. Close the file.
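Assuming a Python 3 environment (as this chapter uses) and network access, the four steps above can be sketched as follows; the URL and the local file name are illustrative:

```python
import urllib.request

# Step 1: crawl a page and assign the content to a variable
data = urllib.request.urlopen("http://www.baidu.com").read()

# Step 2: open a local file in binary write mode ("wb"),
# since read() returns bytes in Python 3
fhandle = open("1.html", "wb")

# Step 3: write the crawled content to the file
fhandle.write(data)

# Step 4: close the file
fhandle.close()
```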

We have just obtained the content of the Baidu homepage and assigned it to the variable data, so the following code saves the crawled page locally.

fhandle = open("e:/1.html", "wb")
fhandle.write(data)
fhandle.close()

After performing this operation, the corresponding file is saved in the root of the E drive. We first open the file with the open() function in binary write mode ("wb", since data is bytes), assigning the file handle to the variable fhandle; we then write the data with the write() method, and finally close the file with the close() method.

You can now find the 1.html file in the root directory. Opening this file in a browser shows the Baidu homepage, but the pictures are missing: we saved only the HTML itself, while images are separate resources that the page references by URL and that we did not download.

In addition to this method, Python also provides the urlretrieve() function in urllib.request, which writes the resource directly to a local file. The format is: urllib.request.urlretrieve(url, filename=local_file_path).
For example, we can use this method to write a web page directly to a local file:

filename = urllib.request.urlretrieve("http://edu.51cto.com", filename="d:/2.html")

After execution, "http://edu.51cto.com" is successfully saved locally; opening the saved file should show the page, confirming that the crawl succeeded.

# Remember: this is executed under Python 3.



If you want to return header information about the current crawl, we can use the info() method. For example, execute:
print(file.info())

The result is:
Date: Fri, Sep 08:35:19 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Set-Cookie: BAIDUID=24942564BED35EC70DF96034AB71D9A3:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=24942564BED35EC70DF96034AB71D9A3; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1504254919; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=1439_21104_17001_20928; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Cache-Control: private
Cxy_all: baidu+71a6cd1e799e7f19eaadfd7f1425610a
Expires: Fri, Sep 08:35:14 GMT
X-Powered-By: HPHP
Server: BWS/1.1
X-UA-Compatible: IE=edge,chrome=1
BDPAGETYPE: 1
BDQID: 0xd8add75f0004af5a
BDUSERID: 0
As you can see, the corresponding header information is output. The call format is "crawled web page object.info()"; the page we crawled was assigned to the variable file, so here it is called as file.info().


If you want to get the status code of the currently crawled web page, we can use getcode(): a return value of 200 means the request succeeded, and any other value indicates a problem. The call format is "crawled web page object.getcode()". In this example, we can execute:

print(file.getcode())
200

As you can see, a status code of 200 is returned, indicating that the response was successful.

If you want to get the URL of the page currently being crawled, we can use geturl(); the call format is "crawled web page object.geturl()". In this example, it can be obtained with the following code:

print(file.geturl())
http://www.baidu.com

As you can see, the source URL of the crawl, "http://www.baidu.com", is output.
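Putting the three inspection methods together, here is a short sketch (again assuming Python 3 and network access to the example site):

```python
import urllib.request

# Fetch a page; the response object carries metadata alongside the body
file = urllib.request.urlopen("http://www.baidu.com")

headers = file.info()    # response headers (an email.message.Message)
status = file.getcode()  # HTTP status code; 200 means success
url = file.geturl()      # the URL actually fetched (after redirects)

print(status, url)
```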

In general, the URL standard allows only a subset of ASCII characters, such as numbers, letters, and some symbols, while other characters, such as Chinese characters, do not conform to the URL standard. So using such non-conforming characters in a URL causes problems, and URL encoding is needed to resolve them. For example, Chinese text in a URL, or characters such as ":" or "&", needs to be encoded.
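As an illustration (in Python 3 these helpers live in urllib.parse, and urllib.request re-exports them), Chinese characters are percent-encoded as their UTF-8 bytes:

```python
import urllib.parse

# Chinese characters are encoded as their UTF-8 bytes, percent-escaped
print(urllib.parse.quote("电影"))            # %E7%94%B5%E5%BD%B1

# ":" and "&" are left alone by default; pass safe="" to encode them too
print(urllib.parse.quote("a&b:c", safe=""))  # a%26b%3Ac

# unquote reverses the encoding
print(urllib.parse.unquote("%E7%94%B5%E5%BD%B1"))  # 电影
```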

If we want to encode, we can use urllib.request.quote() (in Python 3 this function lives in urllib.parse and is re-exported by urllib.request). For example, to encode the URL "http://www.sina.com.cn", you can use the following code:

print(urllib.request.quote("http://www.sina.com.cn"))

The result is as follows:
http%3A//www.sina.com.cn

Correspondingly, where there is encoding there is also decoding, which can be implemented with urllib.request.unquote(), decoding the URL we just encoded:

print(urllib.request.unquote("http%3A//www.sina.com.cn"))

The result is as follows:
http://www.sina.com.cn

