Login Web crawler (keep cookies intact)

Source: Internet
Author: User
Tags: cas

I usually have to go to the school information portal to check my timetable and other information, so I wanted a crawler that could log in automatically and fetch this information for me. Today I wrote one:

First sign in to the school's information portal: http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn

Then I enter a random account name and password to see what the browser does when I sign in. Here I use the Firefox browser with the HttpFox plug-in; if you use Chrome, Google also has a great plug-in, and for IE I recommend HttpWatch.

From HttpFox we can work out the approximate process: the browser takes the content entered in the HTML form and submits it to a server address via POST; the server then determines whether the user name and password are correct and responds accordingly. Next we need to know exactly what data the browser is submitting, so we click on the first step in HttpFox:
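As a sketch of that POST step, we can prepare (without actually sending) such a request with `requests` and inspect the body the browser would transmit; the credentials below are placeholders:

```python
import requests

# Prepare, without sending, the kind of POST the browser makes when the
# login form is submitted. The credentials here are placeholders.
req = requests.Request(
    'POST',
    'http://cas.whu.edu.cn/authserver/login',
    data={'username': 'student', 'password': 'secret'},
)
prepared = req.prepare()
print(prepared.body)  # username=student&password=secret
```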

You can see that the browser has submitted the data to the current page, along with a cookie. Now let's see what's in the POST data.

In the POST data we can see not only the user name and password but also some other fields. If we write a crawler in Python that sends only the user name and password, the server just returns the login page to us again, so we have to account for the `lt` and `dllt` fields in the POST data. But what are they? I looked up some information: `lt` can be understood as a serial number issued to every user who wants to log in. Only with a valid serial number issued by Webflow can the user show that they have entered the webflow process; otherwise Webflow assumes the user has not entered the process, re-enters it, and shows the login page again.
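As a sketch, these extra fields are hidden inputs in the login form and can be pulled out with BeautifulSoup. The HTML below is a made-up stand-in for the real CAS page, which carries similar hidden inputs (`lt`, `execution`, `_eventId`, `rmShown`):

```python
from bs4 import BeautifulSoup

# A made-up stand-in for the CAS login form; the real page carries
# hidden inputs with similar names and server-generated values.
html = """
<form id="casLoginForm" method="post">
  <input type="text" name="username"/>
  <input type="password" name="password"/>
  <input type="hidden" name="lt" value="LT-1234-abcdef"/>
  <input type="hidden" name="execution" value="e1s1"/>
  <input type="hidden" name="_eventId" value="submit"/>
</form>
"""

soup = BeautifulSoup(html, 'html.parser')
# collect every hidden field into a dict of name -> value
hidden = {tag['name']: tag['value']
          for tag in soup.form.find_all('input', {'type': 'hidden'})}
print(hidden['lt'])  # LT-1234-abcdef
```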

So how do we get this `lt` value? We go back to http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn, press F12,

We can easily find the two inputs for the user name and password; now we look at the rest of the input tags:

At the bottom of the form there are these hidden fields, so now we can get the serial number. But there is a problem: I first send a GET to obtain the values of these hidden fields, and then I need to send a POST. By that time, the `lt` value we obtained earlier is no longer the current `lt` value. So we use the `Session` object from requests to keep the cookies unchanged: all requests issued by the same `Session` instance share the same cookies.
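A minimal sketch of this cookie persistence, with a cookie set by hand to stand in for the one the server would set on the first GET:

```python
import requests

s = requests.Session()
# simulate a cookie the server would set on the first GET
s.cookies.set('JSESSIONID', 'abc123')

# any request prepared through the same session carries that cookie
req = requests.Request('GET', 'http://cas.whu.edu.cn/authserver/login')
prepared = s.prepare_request(req)
print(prepared.headers['Cookie'])  # JSESSIONID=abc123
```

In the real crawler we never set the cookie ourselves; the session records whatever `Set-Cookie` headers arrive with the first GET and replays them on the POST.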

The code that does all of this:

```python
# encoding=utf-8
"""
Created on October 15, 2016
@author: wanghui
@note: view things from Wuhan University
"""
import requests
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup


class WhuHelper(object):
    __loginUri = 'http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn'
    __loginDo = 'http://yjs.whu.edu.cn'

    # constructor, taking the account name and password
    def __init__(self, name='', password=''):
        # account name
        if not isinstance(name, str):
            raise TypeError('Please enter a string')
        else:
            self.name = name
        if isinstance(password, int):
            self.password = str(password)
        elif isinstance(password, str):
            self.password = password
        else:
            raise TypeError('Please enter a string')

    # returns a session after a successful login
    def __getResponseAfterLogin(self):
        # simulate a browser header
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) '
                                'Gecko/20100101 Firefox/47.0'}
        # keep the cookies intact, then visit this page again
        s = requests.Session()
        # CookieJar can help us automatically process cookies
        s.cookies = CookieJar()
        # get a response object; we are not logged in at this point
        r = s.get(self.__loginUri, headers=header)
        # the postdata should contain lt;
        # use BeautifulSoup to parse the hidden fields out of the form
        dic = {}
        lt = BeautifulSoup(r.text, 'html.parser')
        for line in lt.form.findAll('input'):
            if line.attrs.get('name') is not None:
                dic[line.attrs['name']] = line.attrs.get('value', '')
        params = {
            'username': self.name,
            'password': self.password,
            'dllt': 'userNamePasswordLogin',
            'lt': dic['lt'],
            'execution': dic['execution'],
            '_eventId': dic['_eventId'],
            'rmShown': dic['rmShown'],
        }
        # log in with the collected postdata, updating the cookies
        r = s.post(self.__loginUri, data=params, headers=header)
        # return the session after login
        return s

    # get the HTML of the postgraduate information portal
    def __getHtmlOfPerson(self):
        s = self.__getResponseAfterLogin()
        personUri = 'http://yjs.whu.edu.cn/ssfw/index.do#'
        r = s.get(personUri)
        return r.text

    # access to postgraduate personal information
    def getPersonInfo(self):
        s = self.__getResponseAfterLogin()
        bs = BeautifulSoup(self.__getHtmlOfPerson(), 'html.parser')
        dic = {}
        # URL of the "Basic Information" page, fetched with GET
        jbxxUri = self.__loginDo + bs.find('a', {'text': 'Basic Information'}).attrs['url']
        r = s.get(jbxxUri)
        bs = BeautifulSoup(r.text, 'html.parser')
        dic['School Number'] = bs.find('input', {'name': 'jbxx.xh'}).attrs['value']
        dic['name'] = bs.find('input', {'name': 'jbxx.xm'}).attrs['value']
        return dic

    # get a personal class schedule
    def getClassInfo(self):
        # initialize the timetable: 13 periods x 7 days
        classInfo = []
        classTitle = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                      'Friday', 'Saturday', 'Sunday']
        for i in range(13):
            singleClass = []
            for j in range(7):
                singleClass.append('')
            classInfo.append(singleClass)
        # first get the session after login
        s = self.__getResponseAfterLogin()
        bs = BeautifulSoup(self.__getHtmlOfPerson(), 'html.parser')
        jbxxkb = self.__loginDo + bs.find('a', {'text': 'my schedule'}).attrs['url']
        r = s.get(jbxxkb)
        bs = BeautifulSoup(r.text, 'html.parser')
        # get the 13 lessons per day
        trs = bs.find('table', {'class': 'table_con'}).findAll('tr', {'class': 't_con'})
        for i in range(len(trs)):
            tds = trs[i].findAll('td')
            # j indicates the day of the week
            j = 0
            for td in tds:
                # first skip row and column headings of the table;
                # by observation, all headings contain a <b> tag
                if td.find('b') is not None:
                    continue
                # BeautifulSoup parses &nbsp; into \xa0, so encode then decode
                classInfo[i][j] = str(td.get_text()).encode('GBK', 'ignore').decode('GBK')
                j = j + 1
        classInfo.insert(0, classTitle)
        return classInfo
```

Of course, this class is not complete; I just wanted to see my schedule. If you need other information, you can have requests send another request through the session and then parse the result with BeautifulSoup4.
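For instance, parsing a table out of another page follows the same pattern as the schedule code above. The HTML here is a made-up stand-in for whatever page you request with the logged-in session:

```python
from bs4 import BeautifulSoup

# A made-up stand-in for a page fetched through the logged-in session.
html = """
<table class="table_con">
  <tr class="t_con"><td><b>Period</b></td><td>Maths</td><td>Physics</td></tr>
  <tr class="t_con"><td><b>Period</b></td><td></td><td>English</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.find('table', {'class': 'table_con'}).find_all('tr', {'class': 't_con'}):
    # skip heading cells, which contain a <b> tag
    cells = [td.get_text() for td in tr.find_all('td') if td.find('b') is None]
    rows.append(cells)
print(rows)  # [['Maths', 'Physics'], ['', 'English']]
```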

Here we can test it.

Reference articles:

https://my.oschina.net/u/1177799/blog/491645

http://beautifulsoup.readthedocs.io/zh_CN/latest/

http://m.blog.csdn.net/article/details?id=51628649

http://docs.python-requests.org/en/latest/user/advanced/#session-objects
