First, open the campus network's online teaching platform: http://eol.zhbit.com/homepage/common/
Find the code for the login form.
The name of the user name field is ipt_loginusername
The name of the password field is ipt_loginpassword
The form is submitted to http://eol.zhbit.com/homepage/common/login.jsp
Capturing the request in the browser confirms that only these two fields are submitted.
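As a small illustration of what "only these two fields" means on the wire, the form body can be built and inspected without contacting the server. This is a Python 3 sketch (the program in this post uses Python 2's urllib and urllib2); the field names are taken from the program's listing and the credential values are placeholders.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder credentials; the field names are the ones used in the post's program.
form_fields = {
    "ipt_loginusername": "your-student-number",
    "ipt_loginpassword": "your-password",
}

# urlencode builds the application/x-www-form-urlencoded body;
# urllib.request expects a POST body as bytes.
post_body = urlencode(form_fields).encode("ascii")

# Parsing the body back shows exactly which fields would be sent.
decoded = parse_qs(post_body.decode("ascii"))
print(sorted(decoded))
```

Comparing the printed field names with the request captured in the browser is a quick way to check that nothing (for example a hidden input) has been missed.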
When the submission succeeds, the page changes to show a welcome message.
Click to enter.
The address is now http://eol.zhbit.com/main.jsp.
So once the crawler's simulated login succeeds, it needs to request this address.
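Requesting main.jsp only works if the second request carries the session cookie set at login, so both requests have to go through the same cookie-aware opener. A minimal Python 3 sketch of that idea follows (the post's own program does the same with Python 2's cookielib and urllib2); the function name login_then_open_main is made up for this illustration, and nothing is sent until it is called.

```python
import http.cookiejar
import urllib.request

# The CookieJar keeps whatever cookies the server sets at login,
# and the opener sends them back on every later request.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

LOGIN_URL = "http://eol.zhbit.com/homepage/common/login.jsp"
MAIN_URL = "http://eol.zhbit.com/main.jsp"


def login_then_open_main(post_body: bytes) -> bytes:
    """POST the login form, then fetch main.jsp with the same opener.

    Hypothetical helper for illustration; post_body is the urlencoded form.
    """
    opener.open(LOGIN_URL, data=post_body).close()
    with opener.open(MAIN_URL) as response:
        return response.read()
```

Using a plain urllib.request.urlopen for the second request would drop the cookie, and the server would most likely treat it as a visitor who has not logged in.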
On this page we see a number of courses (database, object-oriented programming, analog electronics).
Each course has its own link.
We find the corresponding code for these links as well.
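The markup itself is not reproduced in this post, so here is a hypothetical snippet shaped to fit the regular expression the program uses, together with what the expression extracts from it. The HTML below is an assumption made for illustration, not a copy of the real page; the sketch is Python 3.

```python
import re

# Hypothetical markup: an <a> whose onclick carries the course path,
# with the title attribute on the following line.
sample_html = """<a href="javascript:void(0)" onclick="changecourse('/stu_left_course_menu.jsp?lid=24063')"
   title="Database Application Technology [02102100]">course</a>
<a href="javascript:void(0)" onclick="changecourse('/stu_left_course_menu.jsp?lid=26510')"
   title="Object-Oriented Programming (C++) [02120011]">course</a>
"""

# Same structure as the pattern in the post: group 1 is the path passed to
# changecourse(...), group 2 is the title on the next line.
pattern = re.compile(
    r'<a href="javascript:void\(0\)".*?'
    + r'onclick="changecourse\(\'(.*?)\'\)"'
    + r'.*?\n.*?'
    + r'title="(.*?)">'
)

courses = pattern.findall(sample_html)
for path, name in courses:
    print(path, name)
```

Because the pattern has two groups, findall returns a list of (path, name) tuples, which is why the program indexes each result as m[0] and m[1].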
Based on the analysis above, we can write the following program:
#coding=gbk
import re
import urllib
import urllib2
import cookielib

# Set up a cookie jar so the login session is kept between requests
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Form data
postdata = urllib.urlencode({
    'ipt_loginusername': 'Your student number',  # student number
    'ipt_loginpassword': 'Your password'         # password
})

# Target address of the form
req = urllib2.Request(
    url='http://eol.zhbit.com/homepage/common/login.jsp',
    data=postdata
)
result = opener.open(req)
htmlFlag = result.read()

# Match the "Welcome login!" text to verify that the login succeeded
pattern = re.compile(r'Welcome login!')
search = pattern.search(htmlFlag)
if search:
    # After a successful login, go to 'http://eol.zhbit.com/main.jsp'
    result = opener.open('http://eol.zhbit.com/main.jsp')
    # Read the page's HTML and store it in homeHtml
    homeHtml = result.read()
    # Regular expression that matches the content we want
    pattern = re.compile(r'<a href="javascript:void\(0\)".*?'
                         + r'onclick="changecourse\(\'(.*?)\'\)"'
                         + r'.*?\n.*?'
                         + r'title="(.*?)">')
    mainList = pattern.findall(homeHtml)
    for m in mainList:
        print m[0].decode('GBK') + "  " + m[1].decode('GBK')
else:
    print "Login Error"
Output after a successful run (these are my courses):
/stu_left_course_menu.jsp?lid=24063  Database Application Technology [02102100]
/stu_left_course_menu.jsp?lid=26510  Object-Oriented Programming (C++) [02120011]
/stu_left_course_menu.jsp?lid=26512  Analog Electronic Technology [02120021]
/stu_left_course_menu.jsp?lid=26902  College English (B) 2 [10120111]
/stu_left_course_menu.jsp?lid=22771  University Physics Experiment 1 [12110180]
/stu_left_course_menu.jsp?lid=27185  University Physics (D) 1 [12120041]
/stu_left_course_menu.jsp?lid=27195  Advanced Mathematics (B) 2 [12120290]
/stu_left_course_menu.jsp?lid=23231  Wilderness Survival Skills [13120300]
/stu_left_course_menu.jsp?lid=27275  Military Training and Military Theory Education [21120001]
/stu_left_course_menu.jsp?lid=27350  General Theory of Western Civilization [39100300]
The program prints m[0] and m[1], which are the course's address and the course's name respectively. The address is a relative path.
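Since the address is relative, it has to be combined with the site's base address before it can be requested. One way to do this, sketched in Python 3, is urljoin from the standard library; BASE_URL is simply the host used throughout this post.

```python
from urllib.parse import urljoin

BASE_URL = "http://eol.zhbit.com/"

# Relative path as printed by the program for the database course.
relative_path = "/stu_left_course_menu.jsp?lid=24063"

# urljoin resolves the path against the base, the same way a browser would.
absolute_url = urljoin(BASE_URL, relative_path)
print(absolute_url)  # http://eol.zhbit.com/stu_left_course_menu.jsp?lid=24063
```

Plain string concatenation also works here, as long as the base has no trailing slash; urljoin just removes the need to think about that.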
Now we open the database course.
# Address of the database course page
dataBaseUrl = "http://eol.zhbit.com/stu_left_course_menu.jsp?lid=24063"
result = opener.open(dataBaseUrl)
Open the database course page in a browser
and find the link to the course assignments.
# Address of the course assignments page
result = opener.open('http://eol.zhbit.com/common/hw/student/hwtask.jsp')
# Read the HTML of the assignments page and store it in dataBaseHtml
dataBaseHtml = result.read()
# Assignment titles
pattern = re.compile(r'<td align="left"><a href="hwtask.view.jsp.*?class="infolist">(.*?)</a></td>')
hw = pattern.findall(dataBaseHtml)
# Due dates (year, month, day)
pattern2 = re.compile(r'<td>(.*?)Years(.*?)Month(.*?)Day</td>')
ti = pattern2.findall(dataBaseHtml)
t = 0
for h in hw:
    print(h.decode('GBK') + ' Due date '.decode('GBK')
          + ti[t][0].decode('GBK') + ' year '.decode('GBK')
          + ti[t][1].decode('GBK') + ' month '.decode('GBK')
          + ti[t][2].decode('GBK') + ' Day '.decode('GBK'))
    t += 1
Output:
/stu_left_course_menu.jsp?lid=24063  Database Application Technology [02102100]
/stu_left_course_menu.jsp?lid=26510  Object-Oriented Programming (C++) [02120011]
/stu_left_course_menu.jsp?lid=26512  Analog Electronic Technology [02120021]
/stu_left_course_menu.jsp?lid=26902  College English (B) 2 [10120111]
/stu_left_course_menu.jsp?lid=22771  University Physics Experiment 1 [12110180]
/stu_left_course_menu.jsp?lid=27185  University Physics (D) 1 [12120041]
/stu_left_course_menu.jsp?lid=27195  Advanced Mathematics (B) 2 [12120290]
/stu_left_course_menu.jsp?lid=23231  Wilderness Survival Skills [13120300]
/stu_left_course_menu.jsp?lid=27275  Military Training and Military Theory Education [21120001]
/stu_left_course_menu.jsp?lid=27350  General Theory of Western Civilization [39100300]
2014-2015-2 Computer Experiment 7 - Macro  Due date: July 7, 2015
2014-2015-2 Computer Experiment 6 - Report  Due date: June 22, 2015
2014-2015-2 Computer Experiment 5  Due date: June 6, 2015
Final exam related materials upload  Due date: July 6, 2015
2014-2015-2 Computer Experiment 4  Due date: May 18, 2015
2014-2015-2 Computer Experiment 3  Due date: May 11, 2015
2014-2015-2 Computer Experiment 2  Due date: May 4, 2015
2014-2015-2 Computer Experiment 1  Due date: April 30, 2015
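The program pairs each assignment title with its due date by keeping a manual counter t. A possible alternative, sketched here in Python 3, is to walk both lists together with zip. The two sample lists stand in for what the two findall calls return, with values taken from the output above; format_assignments is a name made up for this sketch.

```python
# Stand-ins for the results of the two findall calls:
# titles is a list of strings, dates a list of (year, month, day) tuples.
titles = [
    "2014-2015-2 Computer Experiment 7 - Macro",
    "2014-2015-2 Computer Experiment 6 - Report",
]
dates = [("2015", "7", "7"), ("2015", "6", "22")]


def format_assignments(titles, dates):
    """Pair each title with its due date; hypothetical helper for illustration."""
    lines = []
    # zip stops at the shorter list, so a missing date cannot cause an IndexError.
    for title, (year, month, day) in zip(titles, dates):
        lines.append(f"{title}  due {year}-{int(month):02d}-{int(day):02d}")
    return lines


for line in format_assignments(titles, dates):
    print(line)
```

The trade-off is that zip silently drops unmatched entries, so if the two patterns ever get out of step the mismatch will not raise an error; comparing len(titles) with len(dates) first would catch that.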
Complete code for fetching the database course's assignments:
#coding=gbk
import re
import urllib
import urllib2
import cookielib

# Set up a cookie jar so the login session is kept between requests
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Form data
postdata = urllib.urlencode({
    'ipt_loginusername': 'Your student number',  # student number
    'ipt_loginpassword': 'Your password'         # password
})

# Target address of the form, with a browser-like User-Agent header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
req = urllib2.Request(
    url='http://eol.zhbit.com/homepage/common/login.jsp',
    data=postdata,
    headers=headers
)
result = opener.open(req)
htmlFlag = result.read()

# Match the "Welcome login!" text to verify that the login succeeded
pattern = re.compile(r'Welcome login!')
search = pattern.search(htmlFlag)
if search:
    # After a successful login, go to 'http://eol.zhbit.com/main.jsp'
    result = opener.open('http://eol.zhbit.com/main.jsp')
    # Read the page's HTML and store it in homeHtml
    homeHtml = result.read()
    # Regular expression that matches the content we want
    pattern = re.compile(r'<a href="javascript:void\(0\)".*?'
                         + r'onclick="changecourse\(\'(.*?)\'\)"'
                         + r'.*?\n.*?'
                         + r'title="(.*?)">')
    mainList = pattern.findall(homeHtml)
    for m in mainList:
        print m[0].decode('GBK') + "  " + m[1].decode('GBK')

    # Address of the database course page
    dataBaseUrl = "http://eol.zhbit.com/stu_left_course_menu.jsp?lid=24063"
    result = opener.open(dataBaseUrl)
    result = opener.open('http://eol.zhbit.com/common/hw/student/hwtask.jsp')
    # Read the HTML of the assignments page and store it in dataBaseHtml
    dataBaseHtml = result.read()
    pattern = re.compile(r'<td align="left"><a href="hwtask.view.jsp.*?class="infolist">(.*?)</a></td>')
    hw = pattern.findall(dataBaseHtml)
    pattern2 = re.compile(r'<td>(.*?)Years(.*?)Month(.*?)Day</td>')
    ti = pattern2.findall(dataBaseHtml)
    t = 0
    for h in hw:
        print(h.decode('GBK') + ' Due date '.decode('GBK')
              + ti[t][0].decode('GBK') + ' year '.decode('GBK')
              + ti[t][1].decode('GBK') + ' month '.decode('GBK')
              + ti[t][2].decode('GBK') + ' Day '.decode('GBK'))
        t += 1
else:
    print "Login Error"
Note: .decode('GBK') is appended to the Chinese strings because Chinese is output as UTF-8 on Linux, while the campus network pages are GBK-encoded, so the text has to be decoded. This is not needed on Windows!
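To see why the decoding step matters, it helps to look at how the same Chinese text is represented in the two encodings. The Python 3 sketch below uses a sample string chosen for illustration (the Chinese for "database application technology"); it is not taken from the real page.

```python
# Sample text chosen for this illustration (7 Chinese characters).
text = "数据库应用技术"

# The same text as it would arrive from a GBK page and from a UTF-8 page.
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# The byte sequences differ: 2 bytes per character in GBK, 3 in UTF-8.
print(len(gbk_bytes), len(utf8_bytes))

# Decoding with the encoding the server actually used recovers the text.
decoded = gbk_bytes.decode("gbk")
print(decoded == text)
```

Printing the raw GBK bytes on a UTF-8 terminal is what produces garbled characters; decoding them first lets the runtime re-encode the text for whatever the terminal expects.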
A small modification to the program fetches the assignments for all courses:
#coding=gbk
import re
import urllib
import urllib2
import cookielib

# Set up a cookie jar so the login session is kept between requests
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Form data
postdata = urllib.urlencode({
    'ipt_loginusername': 'Your student number',  # student number
    'ipt_loginpassword': 'Your password'         # password
})

# Target address of the form, with a browser-like User-Agent header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
req = urllib2.Request(
    url='http://eol.zhbit.com/homepage/common/login.jsp',
    data=postdata,
    headers=headers
)
result = opener.open(req)
htmlFlag = result.read()

# Match the "Welcome login!" text to verify that the login succeeded
pattern = re.compile(r'Welcome login!')
search = pattern.search(htmlFlag)
if search:
    # After a successful login, go to 'http://eol.zhbit.com/main.jsp'
    result = opener.open('http://eol.zhbit.com/main.jsp')
    # Read the page's HTML and store it in homeHtml
    homeHtml = result.read()
    # Regular expression that matches the content we want
    pattern = re.compile(r'<a href="javascript:void\(0\)".*?'
                         + r'onclick="changecourse\(\'(.*?)\'\)"'
                         + r'.*?\n.*?'
                         + r'title="(.*?)">')
    mainList = pattern.findall(homeHtml)
    for m in mainList:
        print m[1].decode('GBK')

        # Address of this course's page (m[0] is the relative path)
        dataBaseUrl = "http://eol.zhbit.com" + m[0]
        result = opener.open(dataBaseUrl)
        result = opener.open('http://eol.zhbit.com/common/hw/student/hwtask.jsp')
        # Read the HTML of the assignments page and store it in dataBaseHtml
        dataBaseHtml = result.read()
        pattern = re.compile(r'<td align="left"><a href="hwtask.view.jsp.*?class="infolist">(.*?)</a></td>')
        hw = pattern.findall(dataBaseHtml)
        pattern2 = re.compile(r'<td>(.*?)Years(.*?)Month(.*?)Day</td>')
        ti = pattern2.findall(dataBaseHtml)
        t = 0
        for h in hw:
            print(h.decode('GBK') + ' Due date '.decode('GBK')
                  + ti[t][0].decode('GBK') + ' year '.decode('GBK')
                  + ti[t][1].decode('GBK') + ' month '.decode('GBK')
                  + ti[t][2].decode('GBK') + ' Day '.decode('GBK'))
            t += 1
        print "\n"
else:
    print "Login Error"
Output:
Database Application Technology [02102100]
2014-2015-2 Computer Experiment 7 - Macro  Due date: July 7, 2015
2014-2015-2 Computer Experiment 6 - Report  Due date: June 22, 2015
2014-2015-2 Computer Experiment 5  Due date: June 6, 2015
Final exam related materials upload  Due date: July 6, 2015
2014-2015-2 Computer Experiment 4  Due date: May 18, 2015
2014-2015-2 Computer Experiment 3  Due date: May 11, 2015
2014-2015-2 Computer Experiment 2  Due date: May 4, 2015
2014-2015-2 Computer Experiment 1  Due date: April 30, 2015

Object-Oriented Programming (C++) [02120011]
Lab class roll call 7  Due date: July 3, 2015
Experiment report 8  Due date: July 10, 2015
Experiment report 7  Due date: July 5, 2015
Experiment report 6  Due date: June 23, 2015
Lab class roll call 6  Due date: June 5, 2015
Experiment report 5  Due date: June 12, 2015
Lab class roll call 5  Due date: May 22, 2015
Experiment report 4  Due date: May 21, 2015
Lab class roll call 4  Due date: May 8, 2015
Lab class roll call 3  Due date: April 24, 2015
Experiment report 3  Due date: May 5, 2015
Lab class roll call 2  Due date: April 10, 2015
Experiment report 2  Due date: April 21, 2015
Lab class roll call 1  Due date: March 27, 2015
Experiment report 1  Due date: March 31, 2015

Analog Electronic Technology [02120021]
Experiment 6  Due date: July 12, 2015
Experiment 5  Due date: July 12, 2015
Project case experiment  Due date: July 12, 2015
Experiment 3  Due date: July 12, 2015
Experiment 2  Due date: July 12, 2015
Experiment  Due date: July 12, 2015
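The per-course loop relies on an ordering detail: the assignments address (hwtask.jsp) contains no course id, so the program opens the course page first and the server presumably remembers the selected course in the session. The Python 3 sketch below isolates that ordering. The names collect_homework_pages, fetch and fake_fetch are made up for this illustration, and the fake fetcher only records which addresses would be requested, so nothing touches the network.

```python
from urllib.parse import urljoin

BASE_URL = "http://eol.zhbit.com/"
HOMEWORK_URL = urljoin(BASE_URL, "/common/hw/student/hwtask.jsp")


def collect_homework_pages(courses, fetch):
    """Return {course name: assignments page body} (hypothetical helper).

    courses: (relative_path, name) pairs, as produced by the course regex.
    fetch:   any callable taking a URL and returning the response body,
             for example lambda url: opener.open(url).read().
    """
    pages = {}
    for relative_path, name in courses:
        # Select the course first; the assignments URL has no course id.
        fetch(urljoin(BASE_URL, relative_path))
        pages[name] = fetch(HOMEWORK_URL)
    return pages


# A fake fetcher that records the requested addresses instead of sending them.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html></html>"


pages = collect_homework_pages(
    [("/stu_left_course_menu.jsp?lid=24063",
      "Database Application Technology [02102100]")],
    fake_fetch,
)
print(calls)
```

Passing the fetcher in as a parameter keeps the ordering logic testable without logging in; with the real cookie-aware opener, the same function would issue the two requests per course in the order the program above does.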
Crawler: simulating a campus network login and scraping assignments