Step one: GET request
# -*- coding: utf-8 -*-
# Date: 2018/5/15 19:39
# Author: Small mouse
from urllib import request

url = 'http://news.sina.com.cn/guide/'
response = request.urlopen(url)  # returns an http.client.HTTPResponse object
web_data = response.read().decode('utf-8')  # response content
web_status = response.status  # response status code
print(web_status, web_data)
POST request
from urllib import request, parse

url = 'http://news.sina.com.cn/guide/'
# fields the POST form submits
data = [
    ('name', 'Xiaoshubiao'),
    ('pwd', 'Xiaoshubiao'),
]
login_data = parse.urlencode(data).encode('utf-8')
response = request.urlopen(url, data=login_data)  # returns an http.client.HTTPResponse object
web_data = response.read().decode('utf-8')  # response content
web_status = response.status  # response status code
print(web_status, web_data)
Step two: Disguise as a browser
from urllib import request

url = 'http://news.sina.com.cn/guide/'
req = request.Request(url)
# add browser-like headers so the server does not see the default urllib user agent
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
response = request.urlopen(req)
web_data = response.read().decode('utf-8')  # response content
web_status = response.status  # response status code
print(web_status, web_data)
Step three: Using a proxy IP
from urllib import request

url = 'http://news.sina.com.cn/guide/'
req = request.Request(url)
# route HTTP traffic through a proxy IP
proxy = request.ProxyHandler({'http': '221.207.29.185:80'})
opener = request.build_opener(proxy, request.HTTPHandler)
request.install_opener(opener)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
response = request.urlopen(req)
web_data = response.read().decode('utf-8')  # response content
web_status = response.status  # response status code
print(web_status, web_data)
Step four: Content parsing
You can use a higher-level parser such as BeautifulSoup, or match the content with `re` regular expressions; the principle is the same either way.
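As a minimal sketch of the `re` approach, the snippet below extracts link targets and their text from a small hand-written HTML fragment standing in for the downloaded page (the fragment and the pattern are illustrative assumptions, not part of the original notes, and a regex like this is not robust against arbitrary HTML):

```python
import re

# hypothetical sample standing in for web_data fetched with urllib
web_data = ('<a href="http://news.sina.com.cn/china/">China</a>'
            '<a href="http://news.sina.com.cn/world/">World</a>')

# capture each link's href and visible text
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', web_data)
for href, text in links:
    print(text, href)
```

BeautifulSoup would do the same job with `soup.find_all('a')` and tolerate messier markup, at the cost of an extra dependency.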
Study notes: urllib