6. Python 3 Crawler's urllib Library


import urllib.request

# Send the request and read the response body (bytes)
response = urllib.request.urlopen("http://www.baidu.com")
html = response.read()

# Print the string; remember to call decode('utf-8'),
# otherwise the output is raw bytes full of \n escapes
print(html.decode('utf-8'))
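Besides read(), the response object returned by urlopen() has a few handy inspection methods. A minimal sketch using the same Baidu URL:

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")

print(response.geturl())   # the final URL, after any redirects
print(response.getcode())  # the HTTP status code, e.g. 200
print(response.info())     # the response headers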

Request

In our first example, the argument to urlopen() was simply a URL string.

However, if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(); the URL you want to access then becomes a parameter of the Request instance itself.

import urllib.request

url = "http://www.itcast.cn"

# Setting a User-Agent is the first step in the anti-crawler struggle
list_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

# The URL, together with the headers, makes up the Request;
# the request now carries a real browser's User-Agent
request = urllib.request.Request(url, headers=list_headers)

# Send this request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html.decode('utf-8'))

The browser is the universally accepted identity on the Internet. If we want our bot to look more like a real user, the first step is to masquerade as a recognized browser. Different browsers send different User-Agent headers with their requests.

# Add a specific header
import urllib.request

url = "http://www.itcast.cn"

# An IE 9.0 User-Agent
header = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}

request = urllib.request.Request(url, headers=header)

# A specific header can also be added/modified via Request.add_header()
request.add_header("Connection", "keep-alive")

# Header information can be inspected via Request.get_header()
# request.get_header(header_name="Connection")

response = urllib.request.urlopen(request)

# Check the response status code (200 on success)
print(response.code)

html = response.read()
print(html.decode('utf-8'))

Randomly Add/Modify the User-Agent
# Randomly add/modify the User-Agent
import urllib.request
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1;) Apple....",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)....",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X....",
    "Mozilla/5.0 (Macintosh; Intel Mac OS...."
]

user_agent = random.choice(ua_list)

request = urllib.request.Request(url)

# Add/modify a specific header via Request.add_header()
request.add_header("User-Agent", user_agent)

# Note: get_header() expects the first letter capitalized, the rest lowercase
request.get_header("User-agent")

response = urllib.request.urlopen(request)
html = response.read()
print(html.decode('utf-8'))
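If every request should get a fresh User-Agent, an alternative to calling add_header() on each Request is an opener whose addheaders are set per request. A minimal sketch, with open_with_random_ua as a hypothetical helper name:

import random
import urllib.request

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1;) Apple....",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)....",
]

def open_with_random_ua(url):
    # build_opener() returns an OpenerDirector; addheaders is the list of
    # (name, value) pairs sent with every request this opener makes
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", random.choice(ua_list))]
    return opener.open(url)

html = open_with_random_ua("http://www.itcast.cn").read()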

GET request with Chinese characters in the URL; after encoding, the URL is a str

from urllib import request, parse

url = "http://www.baidu.com/s"

# Setting a User-Agent is the first step in the anti-crawler struggle
list_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

keyword = input("Please enter the keyword you want to query: ")

wd = {
    'wd': keyword
}

# parse.urlencode() takes a dict and percent-encodes it
wd = parse.urlencode(wd)

# e.g. wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2
print(wd)

# wd is now of type str
print(type(wd))

fulurl = url + '?' + wd
print(fulurl)
# fulurl = http://www.baidu.com/s?wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2

req = request.Request(url=fulurl, headers=list_headers)
response = request.urlopen(req)
print(response.read().decode('utf-8'))
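parse.urlencode() works on dicts of key/value pairs; to encode a bare string (or decode one back), urllib also offers parse.quote() and parse.unquote(). A minimal sketch:

from urllib import parse

# quote() percent-encodes a single string (UTF-8 by default)
print(parse.quote("传智播客"))
# %E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2

# unquote() reverses the encoding
print(parse.unquote("%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2"))
# 传智播客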

Bulk-Crawling Baidu Tieba Page Data

First we create a Python file, tiebaspider.py. What we want to do is take a Baidu Tieba address as input, for example:

Baidu Tieba LOL bar, first page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0

Second page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50

Third page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100

Spot the pattern: the URLs for the different pages of a bar differ only in the final pn value; everything else is identical. We can exploit this regularity, as sketched below before the full spider.
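A minimal sketch of the pattern in isolation: page n starts at pn = (n - 1) * 50, so the three example URLs above can be generated in a loop:

# Reproduce the three example URLs from the pn pattern
for page in range(1, 4):
    pn = (page - 1) * 50
    print("http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=" + str(pn))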

import urllib.request
from urllib import parse


def loadPage(url, filename):
    """
    Purpose: send a request to a URL and get the server's response
    url: URL to crawl
    filename: name of the file being processed
    """
    print("Downloading " + filename)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
    }
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request).read()


def writePage(html, filename):
    """
    Purpose: write the HTML content to a local file
    html: the server's response content (bytes)
    filename: local file name
    """
    print("Saving " + filename)
    # Open the file in binary mode, since html is bytes
    with open(filename, "wb+") as f:
        f.write(html)
    print("-" * 30)


def tiebaSpider(url, beginPage, endPage):
    """
    Purpose: the crawler scheduler, responsible for building each page's URL
    url: the first part of the Tieba URL
    beginPage: start page
    endPage: end page
    """
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50
        filename = "page_" + str(page) + ".html"
        fullurl = url + "&pn=" + str(pn)
        html = loadPage(fullurl, filename)
        writePage(html, filename)
    print("Thanks for using")


if __name__ == "__main__":
    kw = input("Please enter the name of the bar you want to crawl: ")
    beginPage = int(input("Please enter the start page: "))
    endPage = int(input("Please enter the end page: "))
    url = "http://tieba.baidu.com/f?"
    key = parse.urlencode({"kw": kw})
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
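The spider assumes every request succeeds; in practice urlopen() raises urllib.error.HTTPError or URLError on failure. A minimal sketch of a more defensive fetch, with load_page_safe as a hypothetical helper name:

import urllib.request
import urllib.error


def load_page_safe(url, headers):
    """Fetch a URL, returning None instead of raising on network errors."""
    request = urllib.request.Request(url, headers=headers)
    try:
        return urllib.request.urlopen(request, timeout=10).read()
    except urllib.error.HTTPError as e:
        # The server answered with an error status (404, 500, ...)
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # Network problem: DNS failure, refused connection, timeout, ...
        print("URL error:", e.reason)
    return None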

POST Request

When data has to be submitted to the server, pass it via the Request's data parameter; urlopen() then issues a POST instead of a GET. The Youdao translation interface is a typical example:

import urllib.request
from urllib import parse

# Target URL of the POST request
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"
headers = {"User-Agent": "Mozilla...."}

# formdata holds the POST parameters
formdata = {
    "type": "AUTO",
    "i": "I love Python",
    "doctype": "json",
    "xmlVersion": "1.8",
    "keyfrom": "fanyi.web",
    "ue": "UTF-8",
    "action": "FY_BY_ENTER",
    "typoResult": "true"
}

# The POSTed parameters must be bytes
data = bytes(parse.urlencode(formdata), encoding='utf-8')

request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())
The same pattern also fetches AJAX endpoints, such as Douban's movie chart interface:

from urllib import parse
import urllib.request

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}
formdata = {
    "start": "0",
    "limit": "20"
}
data = bytes(parse.urlencode(formdata), encoding='utf-8')
request = urllib.request.Request(url, data=data, headers=headers)
print(urllib.request.urlopen(request).read())
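Both POST examples print raw bytes, but the servers answer with JSON (the Youdao request even asks for doctype=json). A minimal sketch of decoding such a body with the standard json module, assuming the Douban endpoint returns a JSON array of movie dicts:

import json
import urllib.request
from urllib import parse

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {"User-Agent": "Mozilla/5.0"}
data = bytes(parse.urlencode({"start": "0", "limit": "20"}), encoding='utf-8')

request = urllib.request.Request(url, data=data, headers=headers)
body = urllib.request.urlopen(request).read().decode('utf-8')

# json.loads() turns the JSON text into Python lists/dicts
movies = json.loads(body)
for movie in movies:
    print(movie)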
