Operation and Use of urllib (1)


#################################################
'''
Version: Python 2.7
Editor: PyCharm
Standard library: urllib

Common response header (Headers) fields:
Server: the server software, e.g. Apache on CentOS, or Microsoft-IIS
Content-Type: text/html; charset=gbk   (the page's encoding)
Last-Modified: when the page was last updated
'''
import urllib

# View the methods and variables available in urllib
# print dir(urllib)
# Use help() to view the urllib documentation
help(urllib)
help(urllib.urlopen)

url = "http://www.iplaypython.com/"
html = urllib.urlopen(url)
print html.read()
print html.info()


'''
Date: Fri, Sep 16:16:50 GMT        # the current time in the server's (US) region
Server: Apache                     # an Apache server on a Linux system
Last-Modified: Mon, 02:05:22 GMT   # the most recent update time
ETag: "52316-112e9-557c6ba29dc80"  # entity tag, relevant to search engine
                                   # optimization: a search engine compares this
                                   # tag (together with Last-Modified above) to
                                   # decide whether the page has been updated
Accept-Ranges: bytes
Content-Length: 70377
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
'''
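# Individual header fields can also be read one at a time. A minimal sketch
# (not from the original article): info() returns an httplib.HTTPMessage, whose
# getheader() looks up a single field and returns None when it is absent.
import urllib

resp = urllib.urlopen("http://www.iplaypython.com/")
headers = resp.info()
print headers.getheader("Server")         # e.g. "Apache"
print headers.getheader("Content-Type")   # e.g. "text/html"
print headers.getheader("Last-Modified")  # None if the server omits the field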


# Get the page's HTTP status code
print html.getcode()
# 玩蛇王 (iplaypython.com) is utf-8 encoded, so no transcoding is necessary
# Test a website that is not utf-8
url1 = "http://www.163.com/"
html1 = urllib.urlopen(url1)
# print html1.read().decode('gbk').encode('utf-8')
# View 163.com's header information
# print html1.info()


'''
The results are as follows:
Expires: Fri, Sep 16:12:59 GMT
Date: Fri, Sep 16:11:39 GMT           # the current time in the server's region
Server: nginx                         # the server type (Linux or Windows, etc.)
Content-Type: text/html; charset=gbk  # the page's encoding
Vary: Accept-Encoding,User-Agent,Accept
Cache-Control: max-age=80
X-Via: 1.1 shq150:6 (CDN cache server V2.0), 1.1 dunyidong72:1 (CDN cache server V2.0)
Connection: close
'''


# Judge whether the page is accessible from its status code:
# an output of 200 means it can be accessed, anything else means it cannot.
print html1.getcode()
# Get the URL that was actually opened
print html1.geturl()
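# An aside (a sketch, not from the original article): urlopen() follows HTTP
# redirects by itself, so geturl() reports the URL that was finally fetched,
# which can differ from the one passed in. Whether this particular host
# redirects is an assumption made for illustration.
import urllib

page = urllib.urlopen("http://iplaypython.com")  # deliberately without "www."
print page.getcode()   # 200 once any redirect has been followed
print page.geturl()    # e.g. "http://www.iplaypython.com/"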


'''
A quick rundown of common page status codes:
200: the page can be accessed normally
301: permanent redirect, e.g. an address that forwards to Baidu
404: the page does not exist  # easy to test by requesting a made-up URL
403: access forbidden: a permission issue, or crawlers are blocked
500: server busy / internal server error
These are must-know basics for web development. A checking helper is sketched below.
'''
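# A minimal sketch of automating the checks listed above (not from the original
# article). Note that urlopen() raises IOError when the host itself cannot be
# reached, so that case is handled separately from the status code.
import urllib

def page_status(url):
    try:
        page = urllib.urlopen(url)
    except IOError:
        return None            # host unreachable / connection refused
    return page.getcode()      # 200, 301, 403, 404, 500, ...

print page_status("http://www.iplaypython.com/")          # expect 200
print page_status("http://www.iplaypython.com/no-such/")  # made-up path, expect 404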


# After the page has been crawled, download (save) the web page
urllib.urlretrieve(url, 'e:\\abc.txt')  # it can also be saved as an .html file
html.close()
html1.close()
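# urlretrieve() also accepts an optional reporthook callback; a sketch of using
# it to print rough download progress (the hook receives the number of blocks
# fetched so far, the block size, and the total size from Content-Length):
import urllib

def progress(block_count, block_size, total_size):
    done = block_count * block_size
    if total_size > 0:
        print "downloaded %d of %d bytes" % (min(done, total_size), total_size)

urllib.urlretrieve("http://www.iplaypython.com/", "e:\\abc.html", progress)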


########################################################################################
# 1. The 'ignore' option of the decode() method
# import urllib
# url = "http://www.163.com/"
# html = urllib.urlopen(url)
# content = html.read().decode("gbk", 'ignore').encode("utf-8")
# print content
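# A tiny self-contained demo of what 'ignore' changes (my own example, not from
# the original article): the bytes below are "中" in GBK plus one truncated byte.
data = '\xd6\xd0\xce'
try:
    print data.decode('gbk')           # strict mode raises UnicodeDecodeError
except UnicodeDecodeError as e:
    print "strict decode failed:", e
print data.decode('gbk', 'ignore')     # the incomplete byte is silently dropped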



# 2. Conditional judgment statements: automated handling of crawl results
import urllib
url = "http://www.iplaypython.com/"

html = urllib.urlopen(url)
content = html.read()  # read once and keep it; the response can only be read once
print content
# Equivalent to:
#   content = urllib.urlopen(url).read()
#   print content

# print html.getcode()
# Equivalent to:
#   code = urllib.urlopen(url).getcode()
#   print code

# print html.geturl()
# print html.info()

code = html.getcode()
if code == 200:
    print "Web page OK"
    print content
    print html.info()
else:
    print "Webpage has problems"


###############################################################
import urllib
url = "http://www.iplaypython.com"
info = urllib.urlopen(url).info()
print info
'''
The result of the execution is:
Date: Sat, Sep 05:08:27 GMT
Server: Apache
Last-Modified: Mon, 02:05:22 GMT
ETag: "52316-112e9-557c6ba29dc80"
Accept-Ranges: bytes
Content-Length: 70377
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
'''

print info.getparam("charset")
# The result is None: from this we learn that some websites do not declare
# a charset in their header information.
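# When the headers declare no charset, the HTML <meta> tag is the usual
# fallback. A sketch (the helper name and the regular expression are my own,
# not part of the original article):
import re
import urllib

def detect_charset(url):
    page = urllib.urlopen(url)
    charset = page.info().getparam("charset")   # from the Content-Type header
    if charset is None:
        # look for <meta charset=...> or <meta ... content="...; charset=...">
        match = re.search(r'charset=["\']?([\w-]+)', page.read(2048), re.I)
        if match:
            charset = match.group(1)
    return charset

print detect_charset("http://www.iplaypython.com")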



# Let's change the URL and run it again
url1 = "http://www.163.com"
info1 = urllib.urlopen(url1).info()
print info1
'''
Why pass a string parameter ("charset") here? Because the return value is taken
from the Content-Type field of the response headers:
Expires: Sat, Sep 05:09:47 GMT
Date: Sat, Sep 05:08:27 GMT
Server: nginx
Content-Type: text/html; charset=gbk
Vary: Accept-Encoding,User-Agent,Accept
Cache-Control: max-age=80
X-Via: 1.1 shq153:8 (CDN cache server V2.0), 1.1 dunyidong75:4 (CDN cache server V2.0)
Connection: close
From the headers we can see that the Content-Type field contains charset=gbk.
'''
print info1.getparam("charset")
# The result is: gbk
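# Putting the pieces together (a sketch): transcode a page using whatever
# charset its headers declare, falling back to utf-8 when none is given.
import urllib

page = urllib.urlopen("http://www.163.com")
charset = page.info().getparam("charset") or "utf-8"
text = page.read().decode(charset, "ignore").encode("utf-8")
print text[:200]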
