Hahaha, the simulated login finally works! La la la ~~~~~
Important things are supposed to be said three times, yet I still forget them = =
As we all know, many websites require you to log in before you can view their pages, so simulating a login is the first step in crawling that information. Once this step works, the rest is easy, so let's just do it!
OK, enough chatter, straight to the point:
First, you need to understand the website's login process and the information you have to POST. Take Douban as an example:
source: movie
redir: https://movie.douban.com/mine?status=collect
form_email: username
form_password: password
captcha-solution: dress
captcha-id: 6rp40cbjzngdjuqogm3y6wns:en
login: 登录 (Login)
This is the information you need to submit: the user name and password, plus the captcha text and the captcha ID. At this point you may wonder how you are supposed to know the captcha ID. Don't worry: it is already sent to the client when the page loads, so you can read it directly in the browser. Pretty cool, right?
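To make the point about the captcha ID concrete, here is a minimal sketch that pulls the captcha image address and the hidden captcha-id out of a page. It assumes Python 3 and uses only the standard library; the HTML excerpt, the example.com address, and the names CaptchaFinder and find_captcha are made up for illustration, and the real login page markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical excerpt of a login page; the real markup may differ.
SAMPLE_LOGIN_HTML = '''
<form method="post">
  <img id="captcha_image" src="https://example.com/captcha?id=abc123:en" alt="captcha"/>
  <input type="hidden" name="captcha-id" value="abc123:en"/>
  <input type="text" name="captcha-solution"/>
</form>
'''


class CaptchaFinder(HTMLParser):
    """Collects the captcha image URL and the hidden captcha-id value."""

    def __init__(self):
        super().__init__()
        self.image_src = None
        self.captcha_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == "captcha_image":
            self.image_src = attrs.get("src")
        elif tag == "input" and attrs.get("name") == "captcha-id":
            self.captcha_id = attrs.get("value")


def find_captcha(html):
    """Return (image_src, captcha_id) found in the given HTML text."""
    finder = CaptchaFinder()
    finder.feed(html)
    return finder.image_src, finder.captcha_id


image_src, captcha_id = find_captcha(SAMPLE_LOGIN_HTML)
print(image_src)
print(captcha_id)
```

Both values come straight from the HTML the server already sent, which is why nothing has to be guessed: the ID is posted back unchanged together with the text a human reads from the image.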
Second, you need to know a little about the requests library. requests removes a lot of the trouble of working with urllib and urllib2 and saves a lot of redundant code. As the official website puts it, "Requests: HTTP for Humans", so this one is meant for humans = =
Official website: Requests
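To give a feel for the boilerplate that requests removes, here is a rough sketch of preparing the same kind of POST with only the Python 3 standard library (urllib.request is the successor of urllib2). The e-mail address, password, and shortened User-Agent are placeholders, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder values; substitute your own account details.
form_data = {
    "form_email": "user@example.com",
    "form_password": "secret",
}
headers = {"User-Agent": "Mozilla/5.0"}

# With urllib the body has to be URL-encoded and converted to bytes by hand.
body = urlencode(form_data).encode("utf-8")
request = Request("http://accounts.douban.com/login", data=body, headers=headers)

print(request.get_method())
print(body)

# Sending it would take urllib.request.urlopen(request), plus manual decoding
# of the response. With requests the equivalent is a single call:
#     requests.post(url, data=form_data, headers=headers)
```

With requests, the encoding of the form, the handling of redirects, and the decoding of the response text are all done for you, which is the convenience the post relies on.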
If you already know re and bs4, go straight to coding!
Otherwise it is worth getting to know BeautifulSoup first, since it saves a lot of trouble. Documentation: BeautifulSoup
Talk is cheap, show me the code. Now it's showtime!
# -*- encoding: utf-8 -*-
##############################
# __author__ = "Andrewseu"
# __date__ = "2015/8/3"
##############################

import requests
from bs4 import BeautifulSoup
import urllib
import re

# Placeholders: fill in your own Douban account before running
username = 'your_email'
password = 'your_password'

loginUrl = 'http://accounts.douban.com/login'
formData = {
    'redir': 'http://movie.douban.com/mine?status=collect',
    'form_email': username,
    'form_password': password,
    'login': u'登录'  # value of the login button ("Login")
}
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36'}

# First POST: the response page contains the captcha
r = requests.post(loginUrl, data=formData, headers=headers)
page = r.text
# print r.url

# Get the captcha image
# Use bs4 to get the captcha address
soup = BeautifulSoup(page, "html.parser")
captchaAddr = soup.find('img', id='captcha_image')['src']
# Use a regular expression to get the captcha ID
reCaptchaID = r'<input type="hidden" name="captcha-id" value="(.*?)"/'
# re.findall returns a list, so take the first match
captchaID = re.findall(reCaptchaID, page)[0]
# print captchaID
# Save the image locally so a human can read it
urllib.urlretrieve(captchaAddr, "captcha.jpg")
captcha = raw_input('please input the captcha:')
formData['captcha-solution'] = captcha
formData['captcha-id'] = captchaID

# Second POST: this time with the captcha text and ID
r = requests.post(loginUrl, data=formData, headers=headers)
page = r.text
if r.url == 'http://movie.douban.com/mine?status=collect':
    print 'Login successfully!!!'
    print "Movies I have watched", '-' * 60
    # Get the movies I have watched
    soup = BeautifulSoup(page, "html.parser")
    result = soup.findAll('li', attrs={"class": "title"})
    # print result
    for item in result:
        print item.find('a').get_text()
else:
    print "failed!"
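The last step, picking the movie titles out of the page, can be tried offline. Below is a small sketch of the same selection (the first link inside each li with class "title"), written for Python 3 with only the standard library instead of bs4. The sample markup and the names TitleParser and extract_titles are invented for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the selectors used in the script above.
SAMPLE_HTML = '''
<ul>
  <li class="title"><a href="/subject/1/"><em>Movie One</em></a></li>
  <li class="intro">2015 / drama</li>
  <li class="title"><a href="/subject/2/">Movie Two</a></li>
</ul>
'''


class TitleParser(HTMLParser):
    """Collects the text of the first <a> inside every <li class="title">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title_li = False
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "title" in (attrs.get("class") or "").split():
            self._in_title_li = True
        elif tag == "a" and self._in_title_li and not self._in_link:
            self._in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.titles.append("".join(self._buffer).strip())
            self._in_link = False
            # Only the first link per item is wanted, like soup.find('a')
            self._in_title_li = False
        elif tag == "li":
            self._in_title_li = False

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)


def extract_titles(html):
    """Return the list of titles found in the given HTML text."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


titles = extract_titles(SAMPLE_HTML)
for title in titles:
    print(title)
```

In the real script bs4 does all of this in two lines, which is a good argument for learning it; the sketch is only meant to show what that selection actually does.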
If anything is unclear, feel free to get in touch with me!
Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
Python crawler: simulating a Douban login to get the movies you have watched recently