I usually have to visit the school's information portal to check my timetable and other information, so I wanted a crawler that could log in automatically and fetch all of it for me. Here is the crawler I wrote today:
First, open the sign-in page of the school's information portal: http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn
Then I enter an arbitrary account name and password to see what the browser does when I sign in. Here I use the Firefox browser with the HttpFox plug-in; Chrome also has excellent plug-ins for this, and for IE I recommend HttpWatch.
From HttpFox we can work out the rough process: the browser takes what was typed into the HTML form, submits it to a server address via POST, and the server checks whether the user name and password are correct, then responds accordingly. Next we need to know exactly what data the browser submitted, so we click on the first request in HttpFox:
You can see that the browser submitted the data to the current page and carried a cookie. Now let's look at what is in the POST data.
The POST data contains not only the user name and password but some other fields as well. If we write a Python crawler that submits only the user name and password, the server just returns the login page again, so we have to account for the lt and dllt fields in the postdata too. But what are they? I looked up some information: lt can be understood as a serial number that every user who wants to log in must hold. Only with a valid serial number issued by Webflow can the user show that it has entered the webflow process; without the serial number, Webflow assumes the user has not entered the process, restarts it, and the login page appears again.
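The lt value travels as a hidden field inside the login form, alongside the other postdata fields. As a minimal offline sketch (the form HTML below is a made-up stand-in for the real CAS page, with invented values), BeautifulSoup can collect every hidden field in one pass:

```python
from bs4 import BeautifulSoup

# A trimmed-down, invented copy of the CAS login form; the field names match
# what HttpFox showed, but the values here are placeholders for illustration.
sample_form = """
<form id="casLoginForm" method="post">
  <input type="text" name="username"/>
  <input type="password" name="password"/>
  <input type="hidden" name="lt" value="LT-12345-abcde"/>
  <input type="hidden" name="dllt" value="userNamePasswordLogin"/>
  <input type="hidden" name="execution" value="e1s1"/>
  <input type="hidden" name="_eventId" value="submit"/>
  <input type="hidden" name="rmShown" value="1"/>
</form>
"""

soup = BeautifulSoup(sample_form, 'html.parser')
# Collect every hidden input into a dict of name -> value.
hidden = {tag['name']: tag['value']
          for tag in soup.form.find_all('input', type='hidden')}
print(hidden['lt'])  # → LT-12345-abcde
```

The same dict can then be merged with the user name and password to build the full postdata.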
So how do we get this lt value? Go back to http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn and press F12.
We can easily find the two inputs for the user name and password; what we are looking for are the other input tags:
At the bottom of the form there are these hidden fields, so now we can get the serial number. But there is a problem: if I first send a GET to read the values of the hidden fields and then send a separate POST, the lt value I obtained earlier is no longer the current lt value. So we use the requests Session to keep the cookie unchanged: a Session lets all requests issued by the same instance share the same cookies.
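To see why a Session solves this, here is a small sketch showing that a cookie stored in a Session's jar is attached automatically to every later request the Session prepares (the cookie value is made up, and no network traffic is actually sent):

```python
import requests

# All requests issued through one Session share one cookie jar, so the cookie
# the server sets on the first GET is still presented on the later POST.
s = requests.Session()

# Pretend the server's first response carried "Set-Cookie: JSESSIONID=abc123".
s.cookies.set('JSESSIONID', 'abc123')

# Prepare (without sending) a follow-up POST and inspect its headers:
req = requests.Request('POST', 'http://cas.whu.edu.cn/authserver/login')
prepared = s.prepare_request(req)
print(prepared.headers.get('Cookie'))  # the cookie rides along automatically
```

Because the GET and the POST come from the same Session, the server sees one continuous conversation, and the lt value scraped from the GET is still valid when the POST arrives.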
The next task is to put this together in code:
# encoding: utf-8
"""
Created on October 15, 2016
@author: wanghui
@note: view things from Wuhan University
"""
import requests
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup


class WhuHelper(object):
    __loginURI = 'http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn'
    __loginDo = 'http://yjs.whu.edu.cn'

    # constructor: takes the account name and password
    def __init__(self, name='', password=''):
        # account name
        if not isinstance(name, str):
            raise TypeError('Please enter a string')
        else:
            self.name = name
        if isinstance(password, int):
            self.password = str(password)
        elif isinstance(password, str):
            self.password = password
        else:
            raise TypeError('Please enter a string')

    # returns a logged-in session
    def __getResponseAfterLogin(self):
        # simulate a browser header
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) '
                                'Gecko/20100101 Firefox/47.0'}
        # keep the cookie intact when we visit the page again
        s = requests.Session()
        # CookieJar helps us handle cookies automatically
        s.cookies = CookieJar()
        # get a response object; we are not logged in at this point
        r = s.get(self.__loginURI, headers=header)
        # the postdata needs lt, so parse the page with BeautifulSoup
        dic = {}
        lt = BeautifulSoup(r.text, 'html.parser')
        for line in lt.form.findAll('input'):
            if line.attrs.get('name') is not None:
                dic[line.attrs['name']] = line.attrs['value']
        params = {
            'username': self.name,
            'password': self.password,
            'dllt': 'userNamePasswordLogin',
            'lt': dic['lt'],
            'execution': dic['execution'],
            '_eventId': dic['_eventId'],
            'rmShown': dic['rmShown']}
        # log in again with the completed postdata to update the cookie
        s.post(self.__loginURI, data=params, headers=header)
        # return the session after login
        return s

    # get the HTML of the postgraduate information portal
    def __getHtmlOfPerson(self):
        s = self.__getResponseAfterLogin()
        personURI = 'http://yjs.whu.edu.cn/ssfw/index.do#'
        r = s.get(personURI)
        return r.text

    # get postgraduate personal information
    def getPersonInfo(self):
        s = self.__getResponseAfterLogin()
        bs = BeautifulSoup(self.__getHtmlOfPerson(), 'html.parser')
        dic = {}
        # get the GET url of the "Basic Information" page
        jbxxURI = self.__loginDo + bs.find('a', {'text': 'Basic Information'}).attrs['url']
        r = s.get(jbxxURI)
        bs = BeautifulSoup(r.text, 'html.parser')
        dic['School Number'] = bs.find('input', {'name': 'jbxx.xh'}).attrs['value']
        dic['name'] = bs.find('input', {'name': 'jbxx.xm'}).attrs['value']
        return dic

    # get the personal schedule
    def getClassInfo(self):
        # initialize the timetable: 13 periods a day, 7 days a week
        classInfo = []
        classTitle = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                      'Friday', 'Saturday', 'Sunday']
        for i in range(13):
            singleClass = []
            for j in range(7):
                singleClass.append('')
            classInfo.append(singleClass)
        # first get the logged-in session
        s = self.__getResponseAfterLogin()
        bs = BeautifulSoup(self.__getHtmlOfPerson(), 'html.parser')
        jbxxkbURI = self.__loginDo + bs.find('a', {'text': 'my schedule'}).attrs['url']
        r = s.get(jbxxkbURI)
        bs = BeautifulSoup(r.text, 'html.parser')
        # each row of the table holds one of the 13 daily periods
        trs = bs.find('table', {'class': 'table_con'}).findAll('tr', {'class': 't_con'})
        for i in range(len(trs)):
            tds = trs[i].findAll('td')
            # j indicates the day of the week
            j = 0
            for td in tds:
                # skip the row and column headings of the table;
                # by inspection, all headings contain a <b> tag
                if td.find('b') is not None:
                    continue
                # BeautifulSoup parses &nbsp; into \xa0, so transcode first
                classInfo[i][j] = str(td.get_text()).encode('gbk', 'ignore').decode('gbk')
                j = j + 1
        classInfo.insert(0, classTitle)
        return classInfo
Of course, this class is not complete; I only wanted to see my schedule. If you need other information, you can have requests send further requests and then parse the responses with BeautifulSoup4.
Here we can test:
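A minimal way to try the class, assuming the code above is saved as whu_helper.py (the module name and the credentials below are placeholders, not real values). The small formatting helper is independent of the network and simply lines up the timetable cells returned by getClassInfo():

```python
def format_timetable(class_info):
    """Render the timetable (a list of rows of strings) as aligned columns."""
    width = max(len(cell) for row in class_info for cell in row) + 2
    return '\n'.join(''.join(cell.ljust(width) for cell in row)
                     for row in class_info)


if __name__ == '__main__':
    # Hypothetical usage; requires network access to the portal.
    from whu_helper import WhuHelper
    helper = WhuHelper('2016xxxxxx', 'your-password')  # placeholder credentials
    print(helper.getPersonInfo())
    print(format_timetable(helper.getClassInfo()))
```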
Reference articles:
https://my.oschina.net/u/1177799/blog/491645
http://beautifulsoup.readthedocs.io/zh_CN/latest/
http://m.blog.csdn.net/article/details?id=51628649
http://docs.python-requests.org/en/latest/user/advanced/#session-objects