Brief introduction: urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It provides a very simple interface in the form of the urlopen function, which makes it possible to fetch URLs over a variety of protocols. It also provides a slightly more complex interface for handling common situations such as basic authentication, cookies, proxies, and so on. These are handled by objects called openers and handlers.
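A minimal fetch with urlopen, as a sketch (http://www.example.com/ stands in for whatever URL you want to retrieve):

import urllib2

response = urllib2.urlopen('http://www.example.com/')   # returns a file-like response object
html = response.read()                                  # the raw body of the page
print html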
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# HTTP is a stateless protocol: the client's previous request has no relation to its
# next request to the server, so most scripts omit this step
r = urllib2.urlopen(req)

OpenerDirector automatically adds a User-Agent header to every request, so the second method is as follows (urllib2.build_opener returns an OpenerDirector object; more on urllib2.build_opener later):

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
Python crawler cookie usage
A cookie is data (usually encrypted) that certain websites store on the user's local machine in order to identify the user and track the session.
For example, some websites require you to log on before a given page can be accessed, and before you log on, crawling the content of that page is not allowed. We can use the urllib2 library to save the cookies from our login and then fetch the other pages, achieving our goal.
Before that, I would like to introduce the concept of an opener.
The following script downloads 480 pages of a Tianya forum thread and saves each one as a local HTML file:

import urllib2

url1 = "http://bbs.tianya.cn/post-16-835537-"
url3 = ".shtml#ty_vip_look[%E6%8B%89%E9%A3%8E%E7%86%8A%E7%8C%AB"

for i in range(1, 481):
    req = urllib2.Request(url1 + str(i) + url3)
    b = urllib2.urlopen(req)
    path = "D:/noval/Skyeye Uploader" + str(i) + ".html"
    c = open(path, "w+")
    code = b.read()
    c.write(code)
    c.close()   # the original wrote c.close without parentheses, so the file was never closed
    print "current number of downloaded pages:", i
In fact, the above code can call urlopen directly to achieve the same effect:

import urllib2

url1 = "http://bbs.tianya.cn/post-16-835537-"
url3 = ".shtml#ty_vip_look[%E6%8B%89%E9%A3%8E%E7%86%8A%E7%8C%AB"

for i in range(1, 481):
    b = urllib2.urlopen(url1 + str(i) + url3)   # pass the URL string directly instead of a Request
    # ... write the page to disk exactly as above ...
import urllib2

# 1. Build an HTTPPasswordMgrWithDefaultRealm() object that holds the user name and password to be processed
passwdmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# 2. Add the account information. The first parameter, realm, is realm information related to the
#    remote server; usually nobody cares about it, so just write None. The following three
#    parameters are the proxy server, user name, and password
passwdmgr.add_password(None, proxyserver, user, passwd)

# 3. Build a ProxyBasicAuthHandler handler object for proxy basic user name/password
#    authentication, whose parameter is the password manager object created above
proxyauth_handler = urllib2.ProxyBasicAuthHandler(passwdmgr)
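The snippet stops at step 3; to be usable, the handler still has to be wired into an opener. A minimal sketch of the remaining steps, assuming proxyserver, user, and passwd are defined elsewhere:

# 4. Build an opener from the authentication handler
opener = urllib2.build_opener(proxyauth_handler)

# 5. Open a URL through the authenticated proxy
response = opener.open("http://www.example.com/")
print response.read()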
Before you start, let's explain two methods in urllib2: info() and geturl(). The response object returned by urlopen (or an HTTPError instance) has two very useful methods, info() and geturl().

1. geturl():

This returns the real URL that was fetched, which is useful because urlopen (or the opener object used) may have followed a redirect: the URL you get may be different from the request URL. Taking a hyperlink on Renren as an example, let's build a urllib2_test10.py to compare the original URL with the one actually fetched:
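A minimal sketch of that comparison (the Renren URL from the original example is not preserved here, so http://www.example.com/ is a stand-in):

import urllib2

old_url = 'http://www.example.com/'
req = urllib2.Request(old_url)
response = urllib2.urlopen(req)
print 'Old url :', old_url
print 'Real url:', response.geturl()   # the final URL after any redirects
print response.info()                  # info() returns the response headers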
Simulating the command grep -rl "python" F:\xuyaping
# list the absolute paths of everything under the xuyaping folder
import os

g = os.walk("f:\\xuyaping")   # g is a generator
for i in g:
    # i is a (dirpath, dirnames, filenames) tuple
    for j in i[-1]:
        file_path = "%s\\%s" % (i[0], j)
        print(file_path)

Program output:

F:\xuyaping\xuyaping.txt.txt
f:\xuyaping\xuyaping1.txt.txt
f:\xuyaping\a\a.txt.txt
f:\xuyaping\a\a1\a1.txt.txt
f:\xuyaping\a\a1\a2\a2.txt.txt
f:\xuyaping\b\b.txt.txt

The code is as follows:

# simulate the command grep -rl "python" F:\xuyaping
import os, time
# init ...
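The snippet above cuts off right after its imports; here is a minimal sketch of the full grep -rl simulation (my completion, not the original author's code), which prints only the files whose content contains the pattern:

import os

def grep_rl(pattern, top):
    # walk the tree; print the path of every file whose content contains the pattern
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    if pattern in f.read():
                        print(path)
            except IOError:
                pass   # skip unreadable files

grep_rl("python", "f:\\xuyaping")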
Two important concepts in urllib2: Openers and Handlers
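Handlers each know how to deal with one aspect of a request (cookies, redirects, authentication, ...), while an opener chains handlers together and actually opens URLs. A minimal sketch of the relationship; the particular handlers chosen here are just illustrative:

import urllib2, cookielib

# each handler deals with one concern
cookie_handler = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
debug_handler = urllib2.HTTPHandler(debuglevel=1)    # prints the HTTP traffic to stdout

# build_opener chains the handlers into an OpenerDirector
opener = urllib2.build_opener(cookie_handler, debug_handler)

# install_opener makes plain urllib2.urlopen use this opener from now on
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.example.com/')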
Handling the exceptions urlopen can raise, first form:

from urllib2 import Request, urlopen, URLError, HTTPError

someurl = 'http://www.example.com/'   # any URL
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    print response.read()
Note: except HTTPError must come first, otherwise except URLError will also catch the HTTPError, since HTTPError is a subclass of URLError.
The second one:
from urllib2 import Request, urlopen, URLError

someurl = 'http://www.example.com/'   # any URL
req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    print response.read()
Sending data with GET: the URL-encoded values are appended to the URL after a question mark:

import urllib, urllib2

data = urllib.urlencode(values)   # values is the dict of form fields from the elided part of this example
url2 = url + '?' + data
response = urllib2.urlopen(url2)
the_page = response.read()
print the_page
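For comparison, passing the encoded data as the second argument of urlopen sends the same values as a POST request instead (a sketch reusing url and values from above):

import urllib, urllib2

data = urllib.urlencode(values)
response = urllib2.urlopen(url, data)   # supplying a data argument makes urllib2 issue a POST
the_page = response.read()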
The following example describes how to use cookies by simulating logon to Renren and then displaying the homepage content. The following is an example from this document; we will transform it to implement the functions we want.

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
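The transformed version logs on first and then requests the homepage. A minimal sketch; the login endpoint and the form-field names (email, password) are placeholders, not necessarily Renren's real ones:

# coding: utf-8
import cookielib, urllib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# POST the login form; the cookie jar captures the session cookies
login_data = urllib.urlencode({'email': 'you@example.com', 'password': 'secret'})
opener.open('http://www.renren.com/PLogin.do', login_data)   # hypothetical endpoint

# subsequent requests through the same opener carry the stored cookies automatically
print opener.open('http://www.renren.com/home').read()       # hypothetical homepage URL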
headers = {
    "Referer": "https://www.pixiv.net/login.php?return_to=0",   # key reconstructed; the extraction kept only the value
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0",
}

Because multiple pages of images are to be crawled, I use the cookie login method here; but because the cookie may change from run to run, you have to log on again every time:

import http.cookiejar
import urllib.request

cookie = http.cookiejar.MozillaCookieJar(".cookie")   # the cookie file is rewritten on every run
handler = urllib.request.HTTPCookieProcessor(cookie)
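Once the login has been performed through an opener built from this handler, the jar can be saved and reloaded, so in principle a later run could skip the login while the cookie is still valid. A minimal sketch, assuming the cookie and handler objects from above:

import urllib.request

opener = urllib.request.build_opener(handler)
# ... perform the login request through opener here ...

# persist the session cookies to the .cookie file
cookie.save(ignore_discard=True, ignore_expires=True)

# a later run can reload them instead of logging in again
cookie.load(".cookie", ignore_discard=True, ignore_expires=True)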
All along, new students in the technical group have been asking questions about urllib, urllib2, and cookielib. So I am going to summarize them here and avoid wasting resources by answering the same questions over and over again. This is a tutorial-style text; if you already know urllib2 and cookielib, please ignore this article. First, start with a piece of code:

# cookie
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
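A typical continuation (my sketch, not the original's next lines) opens a page through this opener and then inspects what landed in the jar:

response = opener.open('http://www.example.com/')   # any site that sets cookies
for item in cookie:
    print item.name, ':', item.value                # each item in the jar is a Cookie object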
Cookies are used for server-side sessions, user login, and status management. This article mainly introduces how to process cookies using Python; if you are interested, you can refer to it. In the previous article we learned about crawler exception handling, so next let's take a look at how to use cookies.
Ways to access web pages using Python: urllib, urllib2, httplib. urllib is relatively simple and relatively weak in functionality; httplib is simple and powerful, but does not seem to support sessions.

(1) The simplest page access:

res = urllib2.urlopen(url)
print res.read()

(2) Add GET or POST data:

data = {"name": "Hank", "passwd": "hjz"}
urllib2.urlopen(url, urllib.urlencode(data))

(3) Add an HTTP header (headers belong on a Request object, not on urlopen):

header = {"User-Agent": "Mozilla-Firefox5.0"}
req = urllib2.Request(url, urllib.urlencode(data), header)
urllib2.urlopen(req)
It always went wrong: after the file was output, an Excel process would stay open in the system, and even Quit would not close it; with heavy use inside a program there was nothing to be done about it. So I found XLSReadWriteII.
It is really good and easy to use; both read and write operations are easy. Below is the code from the official demo.
Write-file code, including customization of the format:

XLS.Filename := 'FormatSample.xls';