Crawling a website often requires signing in first, which is where simulated login comes in. Python's urllib family of modules makes this straightforward. Below is a simple example: logging in to a school's educational administration system.
The first thing to understand is the role of cookies: data that some websites store on the user's machine in order to identify the user and track the session. We therefore use the cookielib module to hold on to the cookies the site gives us.
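As a minimal illustration of that role (using the Python 3 names, where `http.cookiejar` replaces the old `cookielib`; the URL and the session value `session=abc123` are invented, and `FakeResponse` is a stand-in so no network is needed), a `CookieJar` can capture a `Set-Cookie` header from one response and automatically attach it to the next request to the same site:

```python
import http.cookiejar
import urllib.request
from email import message_from_string

# Stand-in for an HTTP response carrying a Set-Cookie header
# (invented for the demo; a real response from opener.open() works the same).
class FakeResponse:
    def __init__(self, set_cookie):
        self._headers = message_from_string("Set-Cookie: %s\n" % set_cookie)

    def info(self):
        return self._headers

jar = http.cookiejar.CookieJar()

# The jar records the cookie the server set on the first request...
first = urllib.request.Request("http://example.com/login")
jar.extract_cookies(FakeResponse("session=abc123; Path=/"), first)

# ...and adds it back onto a later request to the same site,
# which is how the server recognizes the session.
second = urllib.request.Request("http://example.com/grades")
jar.add_cookie_header(second)
print(second.get_header("Cookie"))  # session=abc123
```

This is exactly what happens behind the scenes when the jar is bound to an opener: every response is fed through `extract_cookies` and every request through `add_cookie_header`.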
The login page is at http://202.115.80.153/ and the verification-code (captcha) image is at http://202.115.80.153/CheckCode.aspx.
Opening the captcha address shows that the code is regenerated on every request, and it is generally tied to the session cookie. Recognizing the captcha automatically would be a thankless task, so the plan is: first request the captcha page, save the captcha image, and obtain the session cookie; then have the user type in the code and POST the login data directly to the login address.
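The core of that plan is sharing one cookie jar between the captcha request and the login POST. A sketch of the setup (Python 3 equivalents of the article's `cookielib`/`urllib2`; the URLs are the ones from the article, and the actual network calls are left as comments):

```python
import http.cookiejar
import urllib.request

captcha_url = "http://202.115.80.153/CheckCode.aspx"
post_url = "http://202.115.80.153/default2.aspx"

# One jar, one opener: every request made through this opener shares
# the same cookies, so the captcha fetch and the login POST stay in
# a single session.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# The cookie processor is now part of the opener's handler chain.
assert any(isinstance(h, urllib.request.HTTPCookieProcessor)
           for h in opener.handlers)

# Usage (network access, not run here):
# picture = opener.open(captcha_url).read()   # sets the session cookie
# response = opener.open(post_url, data=...)  # same cookie goes along
```

If you instead called `urllib.request.urlopen` for one of the two requests, it would use a different (cookie-less) opener and the session would break.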
First, analyze the POST request and its headers with a packet-capture tool or the developer tools in Firefox or Chrome. Here we use Chrome as the example.
The capture shows that the URL to POST to is not the page you visit in the browser but http://202.115.80.153/default2.aspx, and that the form fields txtUserName and TextBox2 carry the user name and the password, respectively.
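The body the browser sends is just those fields URL-encoded into `key1=value1&key2=value2` form. A quick check, using the field names from the captured request (the values are invented):

```python
from urllib.parse import urlencode

form = {
    "txtUserName": "20150001",   # invented student number
    "TextBox2": "secret",        # invented password
    "RadioButtonList1": "Student",
}
body = urlencode(form)
print(body)  # txtUserName=20150001&TextBox2=secret&RadioButtonList1=Student
```

`urlencode` also percent-escapes any characters that are not safe in a form body, so it should always be used instead of concatenating the string by hand.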
Now straight to the key part: the code.
```python
# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import sys

"""Simulated login"""

reload(sys)
sys.setdefaultencoding("utf-8")  # prevent Chinese-encoding errors

captchaUrl = "http://202.115.80.153/CheckCode.aspx"
postUrl = "http://202.115.80.153/default2.aspx"  # captcha address and POST address

cookie = cookielib.CookieJar()
handler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(handler)  # bind the cookie jar to the opener; cookielib manages the cookies automatically

username = 'username'
password = 'password123'  # user name and password

picture = opener.open(captchaUrl).read()  # fetch the captcha with the opener, which stores the session cookie
local = open('e:/image.jpg', 'wb')
local.write(picture)
local.close()  # save the captcha locally

secretCode = raw_input('Enter verification code: ')  # open the saved captcha image and type in the code

postData = {
    '__VIEWSTATE': 'ddwyode2ntm0otg7oz6ph0twzk5t0lupp/tla1l+rml83g==',
    'txtUserName': username,
    'TextBox2': password,
    'txtSecretCode': secretCode,
    'RadioButtonList1': 'Student',
    'Button1': '',
    'lbLanguage': '',
    'hidPdrs': '',
    'hidsc': '',
}  # build the form according to the captured request

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36',
}  # build the headers according to the captured request

data = urllib.urlencode(postData)  # encode the form as key1=value1&key2=value2
request = urllib2.Request(postUrl, data, headers)  # construct the request

try:
    response = opener.open(request)  # the opener carries the cookie stored earlier
    result = response.read().decode('gb2312')  # the page is gb2312-encoded, so decode it
    print result  # print the page returned after login
except urllib2.HTTPError, e:
    print e.code
```
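One detail worth noting in the code above is the final `decode('gb2312')`: the server returns bytes in the GB2312 encoding, not UTF-8, so decoding with the wrong codec garbles the Chinese text or fails outright. A small, self-contained demonstration (the sample string is invented; Python 3 syntax):

```python
# "成绩查询" ("grade inquiry") is a typical string on such a page.
text = u"\u6210\u7ee9\u67e5\u8be2"

raw = text.encode("gb2312")  # what the server actually sends: GB2312 bytes
assert raw.decode("gb2312") == text  # decoding with the right codec round-trips

try:
    raw.decode("utf-8")  # the wrong codec does not
except UnicodeDecodeError:
    print("not valid UTF-8")
```

The same caution applies in reverse when posting Chinese form values: they should be encoded with the codec the server expects.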
After logging in successfully, you can use the same opener to access any other pages that require a logged-in session.