#################################################
'''
Version: Python 2.7
Editor: PyCharm
Standard library: urllib
Typical page header (Headers) information:
Server: the server software, e.g. Apache on CentOS or Microsoft-IIS
Content-Type: text/html; charset=gbk
Last-Modified: time of the last update
'''
import urllib

# list the methods and variables that urllib provides
# print dir(urllib)
# use help() to view the urllib documentation
help(urllib)
help(urllib.urlopen)

url = "http://www.iplaypython.com/"
html = urllib.urlopen(url)
print html.read()
print html.info()
'''
Date: Fri, Sep 16:16:50 GMT           # current US time on the server
Server: Apache                        # Apache server on a Linux system
Last-Modified: Mon, 02:05:22 GMT      # time of the most recent update
ETag: "52316-112e9-557c6ba29dc80"     # entity tag: a search engine compares it
                                      # with the previously seen tag to decide
                                      # whether the page has been updated (SEO)
Accept-Ranges: bytes
Content-Length: 70377
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
'''
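# Individual fields can also be read from the message object returned by
# info(), instead of dumping the whole block; a minimal sketch (the field
# names are taken from the dump above):
headers = html.info()                      # a mimetools.Message object
print headers.getheader("Server")          # e.g. "Apache"
print headers.getheader("Content-Type")    # e.g. "text/html"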
# get the page's status code
print html.getcode()
# iplaypython.com is utf-8 encoded, so no transcoding is necessary
# test a website that is not utf-8 encoded
url1 = "http://www.163.com/"
html1 = urllib.urlopen(url1)
# print html1.read().decode('gbk').encode('utf-8')
# view the header information of 163.com
# print html1.info()
'''
The results are as follows:
Expires: Fri, Sep 16:12:59 GMT
Date: Fri, Sep 16:11:39 GMT           # current time in the server's region
Server: nginx                         # server type (Linux, Windows, etc.)
Content-Type: text/html; charset=gbk  # page encoding
Vary: Accept-Encoding,User-Agent,Accept
Cache-Control: max-age=80
X-Via: 1.1 shq150:6 (CDN cache server V2.0), 1.1 dunyidong72:1 (CDN cache server V2.0)
Connection: close
'''
# use the status code to judge whether the page is reachable:
# 200 means it can be accessed; anything else means it cannot.
print html1.getcode()
# get the URL that was actually opened
print html1.geturl()
'''
Common page status codes:
200: the page can be accessed normally
301: permanent redirect (for example, an old address forwarding to Baidu)
404: the page does not exist (easy to test by making up a URL)
403: access forbidden: a permissions issue, or the site blocks crawlers
500: the server is busy
Essential knowledge for web development.
'''
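# To observe a non-200 code, request a page that does not exist; the path
# below is made up purely for illustration:
missing = urllib.urlopen("http://www.iplaypython.com/no-such-page/")
print missing.getcode()   # expected 404 for a nonexistent page
print missing.geturl()    # after a 301 redirect, geturl() shows the final URL
missing.close()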
# once the page has been scraped, download it to disk
urllib.urlretrieve(url, 'e:\\abc.txt')   # saving as an .html file also works
html.close()
html1.close()
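# urlretrieve() also accepts a progress callback; a minimal sketch, where the
# hook signature comes from the urllib API and the file name is just an example:
def report(block_count, block_size, total_size):
    # called once per block; total_size comes from Content-Length (-1 if unknown)
    print "downloaded %d of %d bytes" % (block_count * block_size, total_size)

urllib.urlretrieve(url, 'e:\\abc_progress.html', report)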
########################################################################################
# 1. the 'ignore' argument of the decode() method
# import urllib
# url = "http://www.163.com/"
# html = urllib.urlopen(url)
# content = html.read().decode("gbk", 'ignore').encode("utf-8")
# print content
# 2. conditional statements: automated handling of scraped results
import urllib

url = "http://www.iplaypython.com/"
html = urllib.urlopen(url)
# shortcut forms of the calls below:
# content = urllib.urlopen(url).read()
# code = urllib.urlopen(url).getcode()
# print html.geturl()
# print html.info()
code = html.getcode()
if code == 200:
    print "web page is OK"
    print html.read()   # read() consumes the stream, so call it only once
    print html.info()
else:
    print "web page has problems"
###############################################################
import urllib

url = "http://www.iplaypython.com"
info = urllib.urlopen(url).info()
print info
'''
The result of the execution is:
Date: Sat, Sep 05:08:27 GMT
Server: Apache
Last-Modified: Mon, 02:05:22 GMT
ETag: "52316-112e9-557c6ba29dc80"
Accept-Ranges: bytes
Content-Length: 70377
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
'''
print info.getparam("charset")
# the result is None: some websites do not declare a charset in their headers
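# When a site omits the charset, fall back to a default; utf-8 here is an
# assumed default, not something the server declared:
charset = info.getparam("charset") or "utf-8"
print charset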
# now change the URL and run it again
url1= "Http://www.163.com"
Info1=urllib.urlopen (URL1). info ()
Print Info1
'''
Why pass a string parameter here? Because the return value comes from the
Content-Type field of the header:
Expires: Sat, Sep 05:09:47 GMT
Date: Sat, Sep 05:08:27 GMT
Server: nginx
Content-Type: text/html; charset=gbk
Vary: Accept-Encoding,User-Agent,Accept
Cache-Control: max-age=80
X-Via: 1.1 shq153:8 (CDN cache server V2.0), 1.1 dunyidong75:4 (CDN cache server V2.0)
Connection: close
From the header we can see that its Content-Type contains charset=gbk.
'''
print info1.getparam("charset")
# the result is: GBK
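# Putting it together: detect the charset from the headers, then transcode to
# utf-8; the 'ignore' flag and the utf-8 fallback are robustness assumptions:
charset1 = info1.getparam("charset") or "utf-8"
content1 = urllib.urlopen(url1).read().decode(charset1, "ignore").encode("utf-8")
print content1[:200]   # show only the first 200 bytes as a sanity check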