6. Python 3 Crawler's urllib Library


import urllib.request

# Send the request and read the response body (bytes)
response = urllib.request.urlopen("http://www.baidu.com")
html = response.read()

# Print the string; remember to call decode('utf-8'),
# otherwise the output is raw bytes full of \n escapes
print(html.decode('utf-8'))
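Besides read(), the response object returned by urlopen() has a few handy inspection methods. A minimal sketch using the same Baidu URL:

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")

print(response.geturl())   # the final URL, after any redirects
print(response.getcode())  # the HTTP status code, e.g. 200
print(response.info())     # the response headers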

Request

In our first example, the argument to urlopen() was simply a URL string.

However, if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(); the URL you want to access then becomes a parameter of the Request instance itself.

import urllib.request

url = "http://www.itcast.cn"

# Setting a User-Agent is the first step in the anti-crawler struggle
list_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

# The URL, together with the headers, makes up the Request;
# the request now carries a real browser's User-Agent
request = urllib.request.Request(url, headers=list_headers)

# Send this request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html.decode('utf-8'))

The browser is the universally accepted identity on the Internet. If we want our bot to look more like a real user, the first step is to masquerade as a recognized browser. Different browsers send different User-Agent headers with their requests.

# Add a specific header
import urllib.request

url = "http://www.itcast.cn"

# An IE 9.0 User-Agent
header = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}

request = urllib.request.Request(url, headers=header)

# A specific header can also be added/modified via Request.add_header()
request.add_header("Connection", "keep-alive")

# Header information can be inspected via Request.get_header()
# request.get_header(header_name="Connection")

response = urllib.request.urlopen(request)

# Check the response status code (200 on success)
print(response.code)

html = response.read()
print(html.decode('utf-8'))

Randomly Add/Modify the User-Agent
# Randomly add/modify the User-Agent
import urllib.request
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1;) Apple....",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)....",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X....",
    "Mozilla/5.0 (Macintosh; Intel Mac OS...."
]

user_agent = random.choice(ua_list)

request = urllib.request.Request(url)

# Add/modify a specific header via Request.add_header()
request.add_header("User-Agent", user_agent)

# Note: get_header() expects the first letter capitalized, the rest lowercase
request.get_header("User-agent")

response = urllib.request.urlopen(request)
html = response.read()
print(html.decode('utf-8'))
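If every request should get a fresh User-Agent, an alternative to calling add_header() on each Request is an opener whose addheaders are set per request. A minimal sketch, with open_with_random_ua as a hypothetical helper name:

import random
import urllib.request

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1;) Apple....",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)....",
]

def open_with_random_ua(url):
    # build_opener() returns an OpenerDirector; addheaders is the list of
    # (name, value) pairs sent with every request this opener makes
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", random.choice(ua_list))]
    return opener.open(url)

html = open_with_random_ua("http://www.itcast.cn").read()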

GET request with Chinese characters in the URL; after encoding, the URL is a str

from urllib import request, parse

url = "http://www.baidu.com/s"

# Setting a User-Agent is the first step in the anti-crawler struggle
list_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

keyword = input("Please enter the keyword you want to query: ")

wd = {
    'wd': keyword
}

# parse.urlencode() takes a dict and percent-encodes it
wd = parse.urlencode(wd)

# e.g. wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2
print(wd)

# wd is now of type str
print(type(wd))

fulurl = url + '?' + wd
print(fulurl)
# fulurl = http://www.baidu.com/s?wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2

req = request.Request(url=fulurl, headers=list_headers)
response = request.urlopen(req)
print(response.read().decode('utf-8'))
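parse.urlencode() works on dicts of key/value pairs; to encode a bare string (or decode one back), urllib also offers parse.quote() and parse.unquote(). A minimal sketch:

from urllib import parse

# quote() percent-encodes a single string (UTF-8 by default)
print(parse.quote("传智播客"))
# %E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2

# unquote() reverses the encoding
print(parse.unquote("%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2"))
# 传智播客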

Bulk-Crawling Baidu Tieba Page Data

First we create a Python file, tiebaspider.py. What we want to do is take a Baidu Tieba address as input, for example:

Baidu Tieba LOL bar, first page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0

Second page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50

Third page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100

Spot the pattern: the URLs for the different pages of a bar differ only in the final pn value; everything else is identical. We can exploit this regularity, as sketched below before the full spider.
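A minimal sketch of the pattern in isolation: page n starts at pn = (n - 1) * 50, so the three example URLs above can be generated in a loop:

# Reproduce the three example URLs from the pn pattern
for page in range(1, 4):
    pn = (page - 1) * 50
    print("http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=" + str(pn))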

import urllib.request
from urllib import parse


def loadPage(url, filename):
    """
    Purpose: send a request to a URL and get the server's response
    url: URL to crawl
    filename: name of the file being processed
    """
    print("Downloading " + filename)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
    }
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request).read()


def writePage(html, filename):
    """
    Purpose: write the HTML content to a local file
    html: the server's response content (bytes)
    filename: local file name
    """
    print("Saving " + filename)
    # Open the file in binary mode, since html is bytes
    with open(filename, "wb+") as f:
        f.write(html)
    print("-" * 30)


def tiebaSpider(url, beginPage, endPage):
    """
    Purpose: the crawler scheduler, responsible for building each page's URL
    url: the first part of the Tieba URL
    beginPage: start page
    endPage: end page
    """
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50
        filename = "page_" + str(page) + ".html"
        fullurl = url + "&pn=" + str(pn)
        html = loadPage(fullurl, filename)
        writePage(html, filename)
    print("Thanks for using")


if __name__ == "__main__":
    kw = input("Please enter the name of the bar you want to crawl: ")
    beginPage = int(input("Please enter the start page: "))
    endPage = int(input("Please enter the end page: "))
    url = "http://tieba.baidu.com/f?"
    key = parse.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
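The spider assumes every request succeeds; in practice urlopen() raises urllib.error.HTTPError or URLError on failure. A minimal sketch of a more defensive fetch, with load_page_safe as a hypothetical helper name:

import urllib.request
import urllib.error


def load_page_safe(url, headers):
    """Fetch a URL, returning None instead of raising on network errors."""
    request = urllib.request.Request(url, headers=headers)
    try:
        return urllib.request.urlopen(request, timeout=10).read()
    except urllib.error.HTTPError as e:
        # The server answered with an error status (404, 500, ...)
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # Network problem: DNS failure, refused connection, timeout, ...
        print("URL error:", e.reason)
    return None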

POST Request

When data has to be submitted to the server, pass it via the Request's data parameter; urlopen() then issues a POST instead of a GET. The Youdao translation interface is a typical example:

import urllib.request
from urllib import parse

# Target URL of the POST request
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"
headers = {"User-Agent": "Mozilla...."}

# formdata holds the POST parameters
formdata = {
    "type": "AUTO",
    "i": "I love Python",
    "doctype": "json",
    "xmlVersion": "1.8",
    "keyfrom": "fanyi.web",
    "ue": "UTF-8",
    "action": "FY_BY_ENTER",
    "typoResult": "true"
}

# The POSTed parameters must be bytes
data = bytes(parse.urlencode(formdata), encoding='utf-8')

request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())
The same pattern also fetches AJAX endpoints, such as Douban's movie chart interface:

from urllib import parse
import urllib.request

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}
formdata = {
    "start": "0",
    "limit": "20"
}
data = bytes(parse.urlencode(formdata), encoding='utf-8')
request = urllib.request.Request(url, data=data, headers=headers)
print(urllib.request.urlopen(request).read())
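Both POST examples print raw bytes, but the servers answer with JSON (the Youdao request even asks for doctype=json). A minimal sketch of decoding such a body with the standard json module, assuming the Douban endpoint returns a JSON array of movie dicts:

import json
import urllib.request
from urllib import parse

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {"User-Agent": "Mozilla/5.0"}
data = bytes(parse.urlencode({"start": "0", "limit": "20"}), encoding='utf-8')

request = urllib.request.Request(url, data=data, headers=headers)
body = urllib.request.urlopen(request).read().decode('utf-8')

# json.loads() turns the JSON text into Python lists/dicts
movies = json.loads(body)
for movie in movies:
    print(movie)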
