#################################################
'''
Version: Python 2.7
Editor: PyCharm
Standard library: urllib
Typical page header (Headers) information:
Server: the server software, e.g. Apache on CentOS or Microsoft-IIS
Content-Type: text/html; charset=gbk
Last-Modified: time of the last update
'''
import urllib

# list the methods and variables that urllib provides
# print dir(urllib)
# use help() to view the urllib documentation
help(urllib)
help(urllib.urlopen)

url = "http://www.iplaypython.com/"
html = urllib.urlopen(url)
print html.read()
print html.info()
'''
Date: Fri, Sep 16:16:50 GMT           # current US time on the server
Server: Apache                        # Apache server on a Linux system
Last-Modified: Mon, 02:05:22 GMT      # time of the most recent update
ETag: "52316-112e9-557c6ba29dc80"     # entity tag: a search engine compares it
                                      # with the previously seen tag to decide
                                      # whether the page has been updated (SEO)
Accept-Ranges: bytes
Content-Length: 70377
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
'''
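# Individual fields can also be read from the message object returned by
# info(), instead of dumping the whole block; a minimal sketch (the field
# names are taken from the dump above):
headers = html.info()                      # a mimetools.Message object
print headers.getheader("Server")          # e.g. "Apache"
print headers.getheader("Content-Type")    # e.g. "text/html"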
# get the page's status code
print html.getcode()
# iplaypython.com is utf-8 encoded, so no transcoding is necessary
# test a website that is not utf-8 encoded
url1 = "http://www.163.com/"
html1 = urllib.urlopen(url1)
# print html1.read().decode('gbk').encode('utf-8')
# view the header information of 163.com
# print html1.info()
'''
The results are as follows:
Expires: Fri, Sep 16:12:59 GMT
Date: Fri, Sep 16:11:39 GMT           # current time in the server's region
Server: nginx                         # server type (Linux, Windows, etc.)
Content-Type: text/html; charset=gbk  # page encoding
Vary: Accept-Encoding,User-Agent,Accept
Cache-Control: max-age=80
X-Via: 1.1 shq150:6 (CDN cache server V2.0), 1.1 dunyidong72:1 (CDN cache server V2.0)
Connection: close
'''
# use the status code to judge whether the page is reachable:
# 200 means it can be accessed; anything else means it cannot.
print html1.getcode()
# get the URL that was actually opened
print html1.geturl()
'''
Common page status codes:
200: the page can be accessed normally
301: permanent redirect (for example, an old address forwarding to Baidu)
404: the page does not exist (easy to test by making up a URL)
403: access forbidden: a permissions issue, or the site blocks crawlers
500: the server is busy
Essential knowledge for web development.
'''
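# To observe a non-200 code, request a page that does not exist; the path
# below is made up purely for illustration:
missing = urllib.urlopen("http://www.iplaypython.com/no-such-page/")
print missing.getcode()   # expected 404 for a nonexistent page
print missing.geturl()    # after a 301 redirect, geturl() shows the final URL
missing.close()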
# once the page has been scraped, download it to disk
urllib.urlretrieve(url, 'e:\\abc.txt')   # saving as an .html file also works
html.close()
html1.close()
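# urlretrieve() also accepts a progress callback; a minimal sketch, where the
# hook signature comes from the urllib API and the file name is just an example:
def report(block_count, block_size, total_size):
    # called once per block; total_size comes from Content-Length (-1 if unknown)
    print "downloaded %d of %d bytes" % (block_count * block_size, total_size)

urllib.urlretrieve(url, 'e:\\abc_progress.html', report)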
########################################################################################
# 1. the 'ignore' argument of the decode() method
# import urllib
# url = "http://www.163.com/"
# html = urllib.urlopen(url)
# content = html.read().decode("gbk", 'ignore').encode("utf-8")
# print content
# 2. conditional statements: automated handling of scraped results
import urllib

url = "http://www.iplaypython.com/"
html = urllib.urlopen(url)
# shortcut forms of the calls below:
# content = urllib.urlopen(url).read()
# code = urllib.urlopen(url).getcode()
# print html.geturl()
# print html.info()
code = html.getcode()
if code == 200:
    print "web page is OK"
    print html.read()   # read() consumes the stream, so call it only once
    print html.info()
else:
    print "web page has problems"
###############################################################
import urllib

url = "http://www.iplaypython.com"
info = urllib.urlopen(url).info()
print info
'''
The result of the execution is:
Date: Sat, Sep 05:08:27 GMT
Server: Apache
Last-Modified: Mon, 02:05:22 GMT
ETag: "52316-112e9-557c6ba29dc80"
Accept-Ranges: bytes
Content-Length: 70377
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
'''
print info.getparam("charset")
# the result is None: some websites do not declare a charset in their headers
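# When a site omits the charset, fall back to a default; utf-8 here is an
# assumed default, not something the server declared:
charset = info.getparam("charset") or "utf-8"
print charset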
# now change the URL and run it again
url1= "Http://www.163.com"
Info1=urllib.urlopen (URL1). info ()
Print Info1
'''
Why pass a string parameter here? Because the return value comes from the
Content-Type field of the header:
Expires: Sat, Sep 05:09:47 GMT
Date: Sat, Sep 05:08:27 GMT
Server: nginx
Content-Type: text/html; charset=gbk
Vary: Accept-Encoding,User-Agent,Accept
Cache-Control: max-age=80
X-Via: 1.1 shq153:8 (CDN cache server V2.0), 1.1 dunyidong75:4 (CDN cache server V2.0)
Connection: close
From the header we can see that its Content-Type contains charset=gbk.
'''
print info1.getparam("charset")
# the result is: GBK
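# Putting it together: detect the charset from the headers, then transcode to
# utf-8; the 'ignore' flag and the utf-8 fallback are robustness assumptions:
charset1 = info1.getparam("charset") or "utf-8"
content1 = urllib.urlopen(url1).read().decode(charset1, "ignore").encode("utf-8")
print content1[:200]   # show only the first 200 bytes as a sanity check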