On most forums, you need to log in before you can crawl and analyze posts; otherwise the posts cannot be viewed.
HTTP is a stateless protocol, so how does the server know whether the user making the current request is already logged in? There are two ways of doing this:
- Use the Session ID explicitly in the URI;
- Use cookies. Roughly, when you log in, the website stores a cookie locally, and as you continue browsing the site, the browser sends that cookie along with each request.
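The two mechanisms above can be sketched roughly as follows. This is only an illustration using Python 3's standard library (the program in this post is Python 2); the session ID value, the parameter name `sid`, and the example.com URL are all made up and are not taken from any real forum.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode

# Hypothetical values, for illustration only.
SESSION_ID = "abc123"
BASE_URL = "http://example.com/thread.php"


def url_with_session(base_url, session_id):
    """Method 1: carry the session ID explicitly in the URI."""
    return base_url + "?" + urlencode({"sid": session_id})


def cookie_header_from_set_cookie(set_cookie_value):
    """Method 2: turn the value of a server's Set-Cookie header into the
    Cookie header value a browser would send back on later requests."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Only name=value pairs are sent back; attributes such as Path are not.
    return "; ".join("%s=%s" % (name, morsel.value)
                     for name, morsel in jar.items())


session_url = url_with_session(BASE_URL, SESSION_ID)
cookie_header = cookie_header_from_set_cookie("sid=abc123; Path=/")
```

In practice you rarely build the Cookie header by hand; a cookie-aware opener does this bookkeeping for you.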
Python provides a fairly rich set of modules, so this kind of network operation takes only a few lines. I use logging in to the QZZN forum as an example; in fact, the program below works for almost all phpwind-type forums.
    # -*- coding: gb2312 -*-
    from urllib import urlencode
    import cookielib, urllib2

    # Cookies
    cj = cookielib.LWPCookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    urllib2.install_opener(opener)

    # Login
    user_data = {'pwuser': 'your user name',
                 'pwpwd': 'your password',
                 'step': '2'}
    url_data = urlencode(user_data)
    login_r = opener.open("http://bbs.qzzn.com/login.php", url_data)
Some notes:
- urllib2 is somewhat more capable than the urllib module; among other things, it supports cookies.
- In urllib2, each client can be abstracted as an opener, and each opener can have multiple handlers added to extend its functionality.
- HTTPCookieProcessor is passed as a handler when the opener is constructed, so the opener supports cookies.
- After install_opener is called, this opener is also used by calls to urllib2.urlopen.
- If you do not need to save the cookies, the cj argument can be omitted.
- user_data holds the information needed to log in to the forum; it is sent along with the login request.
- The urlencode function encodes the dictionary user_data into the form "pwuser=username&pwpwd=password" (it does not add a leading "?"). Using a dictionary makes the program easier to read.
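For readers on Python 3, the notes above might translate roughly as follows. This is a sketch, not the original program: in Python 3, urllib2 became urllib.request, cookielib became http.cookiejar, and urlencode lives in urllib.parse. The helper names `build_cookie_opener` and `encode_login_form` and the credentials are placeholders of mine.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode


def build_cookie_opener(cookie_jar=None):
    """Return an opener whose HTTPCookieProcessor stores and resends cookies.

    If cookie_jar is None, HTTPCookieProcessor creates its own in-memory
    CookieJar, which is the "cj can be omitted" case.
    """
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))


def encode_login_form(username, password):
    """Encode the login fields as a POST body (Python 3 expects bytes)."""
    fields = {"pwuser": username, "pwpwd": password, "step": "2"}
    # urlencode percent-encodes non-ASCII characters, so ASCII is enough here.
    return urlencode(fields).encode("ascii")


cj = http.cookiejar.LWPCookieJar()
opener = build_cookie_opener(cj)
body = encode_login_form("alice", "secret")  # placeholder credentials

# The login itself is a network call, so it is left commented out here:
# login_r = opener.open("http://bbs.qzzn.com/login.php", body)
```

Keeping an explicit `cj` is mainly useful if you want to inspect the cookies afterwards or save them to a file; otherwise the default in-memory jar should be enough.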
The last question is where the names pwuser and pwpwd come from. To find them, you have to analyze the page that asks for the login. A login interface is generally a form; an excerpt follows:
    <form action="login.php?" method="post" name="login" onsubmit="this.submit.disabled = true;">
      <input type="hidden" value="" name="forward" />
      <input type="hidden" value="http://bbs.qzzn.com/index.php" name="jumpurl" />
      <input type="hidden" value="2" name="step" />
      ...
      <td width="20%" onclick="document.login.pwuser.focus();"><input type="radio" name="lgt" value="0" checked /> User name <input type="radio" name="lgt" value="1" />uid</td>
      <td><input class="input" type="text" maxlength="..." name="pwuser" size="..." tabindex="1" /> <a href="reg1ster.php"> Register now </a></td>
      <td> password </td>
      <td><input class="input" type="password" maxlength="..." name="pwpwd" size="..." tabindex="2" /> <a href="sendpwd.php" target="_blank"> Recover Password </a></td>
      ...
    </form>
As can be seen here, the user name and password we need to enter correspond to pwuser and pwpwd, and step corresponds to the login action (I found this one by trial and error).
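Instead of reading the page source by eye, the field names can also be listed with a small script. The following is a minimal sketch using Python 3's html.parser; the class name `InputNameCollector` is my own, and `SAMPLE_FORM` is a trimmed-down stand-in for the login form, not the full page.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect (name, type, value) for every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        # Self-closing tags such as <input ... /> also end up here, because
        # the default handle_startendtag calls handle_starttag.
        if tag != "input":
            return
        attr = dict(attrs)
        if "name" in attr:
            self.fields.append(
                (attr["name"], attr.get("type", ""), attr.get("value", "")))


# Trimmed-down stand-in for the login form, for illustration only.
SAMPLE_FORM = """
<form action="login.php?" method="post" name="login">
  <input type="hidden" value="2" name="step" />
  <input class="input" type="text" name="pwuser" tabindex="1" />
  <input class="input" type="password" name="pwpwd" tabindex="2" />
</form>
"""

collector = InputNameCollector()
collector.feed(SAMPLE_FORM)
field_names = [name for name, _type, _value in collector.fields]
```

Hidden inputs such as step already carry their value in the page, so a listing like this shows both which names to send and which values the server expects for them.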
Note that this forum's form uses the POST method. If it used GET, the approach in this article would need a small change: instead of passing the data directly to open, you should first build a Request and then open it. Please see the manual for more details ...
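The difference between the two methods can be sketched with Python 3's urllib.request.Request. This only builds the request objects and sends nothing; whether this forum would actually accept a GET login is not something I have checked, and the credentials are placeholders.

```python
import urllib.request
from urllib.parse import urlencode

LOGIN_URL = "http://bbs.qzzn.com/login.php"
fields = {"pwuser": "alice", "pwpwd": "secret", "step": "2"}  # placeholders
encoded = urlencode(fields)

# POST: the encoded fields travel in the request body (bytes in Python 3).
post_request = urllib.request.Request(LOGIN_URL, data=encoded.encode("ascii"))

# GET: no body; the encoded fields are appended to the URL as a query string.
get_request = urllib.request.Request(LOGIN_URL + "?" + encoded)

# Either request could then be sent through a cookie-aware opener, e.g.
# opener.open(post_request), which is a network call and is omitted here.
```

Supplying `data` is what makes the request a POST; a Request without it defaults to GET.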
(Reposted from: http://www.cnblogs.com/huangcong/archive/2011/08/30/2160083.html)
Sign in to a website using Python (repost)