Login Web crawler (keep cookies intact)

Source: Internet
Author: User
Tags: cas

I usually have to go to the school information portal to check my timetable and other information, so I wanted a crawler that could log in automatically and fetch this information for me. Today I wrote one:

First sign in to the school's information portal: http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn

Then I enter a random account name and password to see what the browser does when I sign in. Here I use the Firefox browser with the HttpFox plug-in; if you use Chrome, Google also has a great plug-in, and for IE I recommend HttpWatch.

From HttpFox we can work out the approximate process: the browser takes the content entered in the HTML form and submits it to a server address via POST; the server then determines whether the user name and password are correct and responds accordingly. Next we need to know exactly what data the browser is submitting, so we click on the first step in HttpFox:
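As a sketch of that POST step, we can prepare (without actually sending) such a request with `requests` and inspect the body the browser would transmit; the credentials below are placeholders:

```python
import requests

# Prepare, without sending, the kind of POST the browser makes when the
# login form is submitted. The credentials here are placeholders.
req = requests.Request(
    'POST',
    'http://cas.whu.edu.cn/authserver/login',
    data={'username': 'student', 'password': 'secret'},
)
prepared = req.prepare()
print(prepared.body)  # username=student&password=secret
```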

You can see that the browser has submitted the data to the current page, along with a cookie. Now let's see what's in the POST data.

In the POST data we can see not only the user name and password but also some other fields. If we write a crawler in Python that sends only the user name and password, the server just returns the login page to us again, so we have to account for the `lt` and `dllt` fields in the POST data. But what are they? I looked up some information: `lt` can be understood as a serial number issued to every user who wants to log in. Only with a valid serial number issued by Webflow can the user show that they have entered the webflow process; otherwise Webflow assumes the user has not entered the process, re-enters it, and shows the login page again.
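As a sketch, these extra fields are hidden inputs in the login form and can be pulled out with BeautifulSoup. The HTML below is a made-up stand-in for the real CAS page, which carries similar hidden inputs (`lt`, `execution`, `_eventId`, `rmShown`):

```python
from bs4 import BeautifulSoup

# A made-up stand-in for the CAS login form; the real page carries
# hidden inputs with similar names and server-generated values.
html = """
<form id="casLoginForm" method="post">
  <input type="text" name="username"/>
  <input type="password" name="password"/>
  <input type="hidden" name="lt" value="LT-1234-abcdef"/>
  <input type="hidden" name="execution" value="e1s1"/>
  <input type="hidden" name="_eventId" value="submit"/>
</form>
"""

soup = BeautifulSoup(html, 'html.parser')
# collect every hidden field into a dict of name -> value
hidden = {tag['name']: tag['value']
          for tag in soup.form.find_all('input', {'type': 'hidden'})}
print(hidden['lt'])  # LT-1234-abcdef
```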

So how do we get this `lt` value? We go back to http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn, press F12,

We can easily find the two inputs for the user name and password; now we look at the rest of the input tags:

At the bottom of the form there are these hidden fields, so now we can get the serial number. But there is a problem: I first send a GET to obtain the values of these hidden fields, and then I need to send a POST. By that time, the `lt` value we obtained earlier is no longer the current `lt` value. So we use the `Session` object from requests to keep the cookies unchanged: all requests issued by the same `Session` instance share the same cookies.
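A minimal sketch of this cookie persistence, with a cookie set by hand to stand in for the one the server would set on the first GET:

```python
import requests

s = requests.Session()
# simulate a cookie the server would set on the first GET
s.cookies.set('JSESSIONID', 'abc123')

# any request prepared through the same session carries that cookie
req = requests.Request('GET', 'http://cas.whu.edu.cn/authserver/login')
prepared = s.prepare_request(req)
print(prepared.headers['Cookie'])  # JSESSIONID=abc123
```

In the real crawler we never set the cookie ourselves; the session records whatever `Set-Cookie` headers arrive with the first GET and replays them on the POST.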

The code that does all of this:

```python
# encoding=utf-8
"""
Created on October 15, 2016
@author: wanghui
@note: view things from Wuhan University
"""
import requests
from http.cookiejar import CookieJar
from bs4 import BeautifulSoup


class WhuHelper(object):
    __loginUri = 'http://cas.whu.edu.cn/authserver/login?service=http://my.whu.edu.cn'
    __loginDo = 'http://yjs.whu.edu.cn'

    # constructor, taking the account name and password
    def __init__(self, name='', password=''):
        # account name
        if not isinstance(name, str):
            raise TypeError('Please enter a string')
        else:
            self.name = name
        if isinstance(password, int):
            self.password = str(password)
        elif isinstance(password, str):
            self.password = password
        else:
            raise TypeError('Please enter a string')

    # returns a session after a successful login
    def __getResponseAfterLogin(self):
        # simulate a browser header
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) '
                                'Gecko/20100101 Firefox/47.0'}
        # keep the cookies intact, then visit this page again
        s = requests.Session()
        # CookieJar can help us automatically process cookies
        s.cookies = CookieJar()
        # get a response object; we are not logged in at this point
        r = s.get(self.__loginUri, headers=header)
        # the postdata should contain lt;
        # use BeautifulSoup to parse the hidden fields out of the form
        dic = {}
        lt = BeautifulSoup(r.text, 'html.parser')
        for line in lt.form.findAll('input'):
            if line.attrs.get('name') is not None:
                dic[line.attrs['name']] = line.attrs.get('value', '')
        params = {
            'username': self.name,
            'password': self.password,
            'dllt': 'userNamePasswordLogin',
            'lt': dic['lt'],
            'execution': dic['execution'],
            '_eventId': dic['_eventId'],
            'rmShown': dic['rmShown'],
        }
        # log in with the collected postdata, updating the cookies
        r = s.post(self.__loginUri, data=params, headers=header)
        # return the session after login
        return s

    # get the HTML of the postgraduate information portal
    def __getHtmlOfPerson(self):
        s = self.__getResponseAfterLogin()
        personUri = 'http://yjs.whu.edu.cn/ssfw/index.do#'
        r = s.get(personUri)
        return r.text

    # access to postgraduate personal information
    def getPersonInfo(self):
        s = self.__getResponseAfterLogin()
        bs = BeautifulSoup(self.__getHtmlOfPerson(), 'html.parser')
        dic = {}
        # URL of the "Basic Information" page, fetched with GET
        jbxxUri = self.__loginDo + bs.find('a', {'text': 'Basic Information'}).attrs['url']
        r = s.get(jbxxUri)
        bs = BeautifulSoup(r.text, 'html.parser')
        dic['School Number'] = bs.find('input', {'name': 'jbxx.xh'}).attrs['value']
        dic['name'] = bs.find('input', {'name': 'jbxx.xm'}).attrs['value']
        return dic

    # get a personal class schedule
    def getClassInfo(self):
        # initialize the timetable: 13 periods x 7 days
        classInfo = []
        classTitle = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                      'Friday', 'Saturday', 'Sunday']
        for i in range(13):
            singleClass = []
            for j in range(7):
                singleClass.append('')
            classInfo.append(singleClass)
        # first get the session after login
        s = self.__getResponseAfterLogin()
        bs = BeautifulSoup(self.__getHtmlOfPerson(), 'html.parser')
        jbxxkb = self.__loginDo + bs.find('a', {'text': 'my schedule'}).attrs['url']
        r = s.get(jbxxkb)
        bs = BeautifulSoup(r.text, 'html.parser')
        # get the 13 lessons per day
        trs = bs.find('table', {'class': 'table_con'}).findAll('tr', {'class': 't_con'})
        for i in range(len(trs)):
            tds = trs[i].findAll('td')
            # j indicates the day of the week
            j = 0
            for td in tds:
                # first skip row and column headings of the table;
                # by observation, all headings contain a <b> tag
                if td.find('b') is not None:
                    continue
                # BeautifulSoup parses &nbsp; into \xa0, so encode then decode
                classInfo[i][j] = str(td.get_text()).encode('GBK', 'ignore').decode('GBK')
                j = j + 1
        classInfo.insert(0, classTitle)
        return classInfo
```

Of course, this class is not complete; I just wanted to see my schedule. If you need other information, you can have requests send another request through the session and then parse the result with BeautifulSoup4.
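For instance, parsing a table out of another page follows the same pattern as the schedule code above. The HTML here is a made-up stand-in for whatever page you request with the logged-in session:

```python
from bs4 import BeautifulSoup

# A made-up stand-in for a page fetched through the logged-in session.
html = """
<table class="table_con">
  <tr class="t_con"><td><b>Period</b></td><td>Maths</td><td>Physics</td></tr>
  <tr class="t_con"><td><b>Period</b></td><td></td><td>English</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.find('table', {'class': 'table_con'}).find_all('tr', {'class': 't_con'}):
    # skip heading cells, which contain a <b> tag
    cells = [td.get_text() for td in tr.find_all('td') if td.find('b') is None]
    rows.append(cells)
print(rows)  # [['Maths', 'Physics'], ['', 'English']]
```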

Here we can test it.

Reference articles:

https://my.oschina.net/u/1177799/blog/491645

http://beautifulsoup.readthedocs.io/zh_CN/latest/

http://m.blog.csdn.net/article/details?id=51628649

http://docs.python-requests.org/en/latest/user/advanced/#session-objects
