Simulating a Campus Network Login with a Crawler and Scraping Assignments

Source: Internet
Author: User
Tags: eol

First, open the campus network's online teaching platform: http://eol.zhbit.com/homepage/common/


Find the login form in the page source.

The name of the username field is ipt_loginusername.

The name of the password field is ipt_loginpassword.

The form is submitted to http://eol.zhbit.com/homepage/common/login.jsp

Capturing the request in the browser confirms that only these two fields are submitted.
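As a rough illustration of what gets submitted, the sketch below encodes the two fields the way a browser would for a form POST. It is written for Python 3, where urlencode lives in urllib.parse, and the credential values are placeholders rather than real data.

```python
from urllib.parse import urlencode, parse_qs

# Field names as given in the article; the values are placeholders.
form_fields = {
    "ipt_loginusername": "your-student-number",
    "ipt_loginpassword": "your-password",
}

# urlencode() builds the application/x-www-form-urlencoded body
# that would be sent as the POST payload.
post_body = urlencode(form_fields)
print(post_body)

# parse_qs() reverses the encoding, which is handy for checking the payload.
decoded = parse_qs(post_body)
print(sorted(decoded))
```

The field names must match the form's name attributes exactly, including case, so it is worth copying them straight from the page source.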


When the submission succeeds, the page changes to show a welcome message.

Click to enter.

The address is now http://eol.zhbit.com/main.jsp.

So this is the address our crawler needs to open once its simulated login succeeds.
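Opening main.jsp only makes sense if the server still recognizes the crawler as logged in, which presumably relies on a session cookie set during login. A minimal Python 3 sketch of a cookie-aware opener follows; build_session_opener is a hypothetical helper name, and no request is actually sent here.

```python
import http.cookiejar
import urllib.request


def build_session_opener():
    """Return an opener that remembers cookies, plus the jar holding them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


opener, jar = build_session_opener()

# No request has been made yet, so the jar starts out empty.
print(len(jar))
```

Every request made through the same opener shares the jar, so cookies received from the login response are sent along with the later request for main.jsp.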


On this page we see several courses (Database, Object-Oriented Programming, Analog Electronic Technology).

Each course has its own link.

We locate the corresponding markup in the page source as well.
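To make the extraction step concrete, here is a small Python 3 sketch that runs a regular expression of the same shape as the one in the article's program against made-up markup. The sample HTML is an assumption, written only to fit that expression; the real page may differ, including in the casing of the onclick handler name.

```python
import re

# Hypothetical markup, shaped only to resemble what the regex expects.
sample_html = (
    '<a href="javascript:void(0)" onclick="changecourse(\'/stu_left_course_menu.jsp?lid=24063\')"\n'
    '   title="Database Application Technology [02102100]">Database</a>\n'
    '<a href="javascript:void(0)" onclick="changecourse(\'/stu_left_course_menu.jsp?lid=26510\')"\n'
    '   title="Object-Oriented Programming (C++) [02120011]">OOP</a>\n'
)

# Group 1 captures the relative course path, group 2 the course title.
course_link = re.compile(
    r'<a href="javascript:void\(0\)".*?'
    r'onclick="changecourse\(\'(.*?)\'\)"'
    r'.*?\n.*?title="(.*?)">'
)

courses = course_link.findall(sample_html)
for path, title in courses:
    print(path, title)
```

Because the expression has two groups, findall returns a list of (path, title) tuples, which is why the article's program indexes each match as m[0] and m[1].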


Based on this analysis, we can write the following program (Python 2):

    # coding=gbk
    import re
    import urllib
    import urllib2
    import cookielib

    # Keep cookies between requests, as a browser would
    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

    # Form data
    postdata = urllib.urlencode({
        'ipt_loginusername': 'your student number',  # student number
        'ipt_loginpassword': 'your password'         # password
    })

    # Target address of the form
    req = urllib2.Request(
        url='http://eol.zhbit.com/homepage/common/login.jsp',
        data=postdata)
    result = opener.open(req)
    htmlFlag = result.read()

    # Match the welcome message to verify that the login succeeded
    # ('Welcome login!' stands in for the Chinese text on the live page)
    pattern = re.compile(r'Welcome login!')
    search = pattern.search(htmlFlag)
    if search:
        # After a successful login, go to 'http://eol.zhbit.com/main.jsp'
        result = opener.open('http://eol.zhbit.com/main.jsp')
        # Read the page's HTML into homeHtml
        homeHtml = result.read()
        # Regular expression matching the content we want:
        # group 1 is the course address, group 2 the course name
        pattern = re.compile(r'<a href="javascript:void\(0\)".*?' +
                             r'onclick="changecourse\(\'(.*?)\'\)"' +
                             r'.*?\n.*?' + r'title="(.*?)">')
        mainlist = pattern.findall(homeHtml)
        for m in mainlist:
            print m[0].decode('GBK') + " " + m[1].decode('GBK')
    else:
        print "Login Error"

Output after a successful run (these are my courses; names translated from Chinese):

    /stu_left_course_menu.jsp?lid=24063 Database Application Technology [02102100]
    /stu_left_course_menu.jsp?lid=26510 Object-Oriented Programming (C++) [02120011]
    /stu_left_course_menu.jsp?lid=26512 Analog Electronic Technology [02120021]
    /stu_left_course_menu.jsp?lid=26902 College English (B) 2 [10120111]
    /stu_left_course_menu.jsp?lid=22771 University Physics Experiment 1 [12110180]
    /stu_left_course_menu.jsp?lid=27185 University Physics (D) 1 [12120041]
    /stu_left_course_menu.jsp?lid=27195 Advanced Mathematics (B) 2 [12120290]
    /stu_left_course_menu.jsp?lid=23231 Wilderness Survival Skills [13120300]
    /stu_left_course_menu.jsp?lid=27275 Military Training and Military Theory Education [21120001]
    /stu_left_course_menu.jsp?lid=27350 General Theory of Western Civilization [39100300]

The program prints m[0] and m[1], which are the course address and the course name respectively. The course address is a relative path.
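Since the course address is relative, it has to be combined with the site root before it can be requested. The article's own code does this by string concatenation; the Python 3 sketch below swaps in urljoin from the standard library as an alternative. BASE_URL is a name introduced here for illustration.

```python
from urllib.parse import urljoin

# Site root taken from the URLs used in the article.
BASE_URL = "http://eol.zhbit.com/"

# A relative course path as printed by the program.
relative_path = "/stu_left_course_menu.jsp?lid=24063"

# urljoin() resolves the relative path against the base, the way a browser would.
course_url = urljoin(BASE_URL, relative_path)
print(course_url)
```

Plain concatenation gives the same result here; urljoin simply avoids doubled or missing slashes when the pieces change.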

Now let's open the Database course.

    # Open the Database course page
    dataBaseUrl = "http://eol.zhbit.com/stu_left_course_menu.jsp?lid=24063"
    result = opener.open(dataBaseUrl)


Open the Database course page in a browser.


Find the link to the course assignments.

    # Open the course assignments page
    result = opener.open('http://eol.zhbit.com/common/hw/student/hwtask.jsp')
    # Read the HTML of the assignments page into dataBaseHtml
    dataBaseHtml = result.read()
    # Assignment titles
    pattern = re.compile(r'<td align="left"><a href="hwtask.view.jsp.*?class="infolist">(.*?)</a></td>')
    hw = pattern.findall(dataBaseHtml)
    # Due dates; 'Year', 'Month' and 'Day' stand in for the
    # Chinese date characters on the live page
    pattern2 = re.compile(r'<td>(.*?)Year(.*?)Month(.*?)Day</td>')
    ti = pattern2.findall(dataBaseHtml)
    t = 0
    for h in hw:
        print (h.decode('GBK') +
               '  Due date'.decode('GBK') + ti[t][0].decode('GBK') +
               'year'.decode('GBK') + ti[t][1].decode('GBK') +
               'month'.decode('GBK') + ti[t][2].decode('GBK') +
               'day'.decode('GBK'))
        t += 1

Run results

    /stu_left_course_menu.jsp?lid=24063 Database Application Technology [02102100]
    /stu_left_course_menu.jsp?lid=26510 Object-Oriented Programming (C++) [02120011]
    /stu_left_course_menu.jsp?lid=26512 Analog Electronic Technology [02120021]
    /stu_left_course_menu.jsp?lid=26902 College English (B) 2 [10120111]
    /stu_left_course_menu.jsp?lid=22771 University Physics Experiment 1 [12110180]
    /stu_left_course_menu.jsp?lid=27185 University Physics (D) 1 [12120041]
    /stu_left_course_menu.jsp?lid=27195 Advanced Mathematics (B) 2 [12120290]
    /stu_left_course_menu.jsp?lid=23231 Wilderness Survival Skills [13120300]
    /stu_left_course_menu.jsp?lid=27275 Military Training and Military Theory Education [21120001]
    /stu_left_course_menu.jsp?lid=27350 General Theory of Western Civilization [39100300]
    2014-2015-2 Computer Lab 7 - Macros  Due date July 7, 2015
    2014-2015-2 Computer Lab 6 - Reports  Due date June 22, 2015
    2014-2015-2 Computer Lab 5  Due date June 6, 2015
    Final exam materials upload  Due date July 6, 2015
    2014-2015-2 Computer Lab 4  Due date May 18, 2015
    2014-2015-2 Computer Lab 3  Due date May 11, 2015
    2014-2015-2 Computer Lab 2  Due date May 4, 2015
    2014-2015-2 Computer Lab 1  Due date April 30, 2015
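The program pairs each assignment title with its due date by walking two parallel lists with a manual counter. The same pairing can be sketched with zip, shown here in Python 3 on made-up table rows. The sample HTML, the ?id= query string and the English date words are all assumptions standing in for the real page content.

```python
import re

# Hypothetical rows; English words stand in for the page's date markers.
sample_html = (
    '<td align="left"><a href="hwtask.view.jsp?id=1" class="infolist">Lab 1</a></td>'
    '<td>2015 Year 4 Month 30 Day</td>\n'
    '<td align="left"><a href="hwtask.view.jsp?id=2" class="infolist">Lab 2</a></td>'
    '<td>2015 Year 5 Month 4 Day</td>\n'
)

title_re = re.compile(
    r'<td align="left"><a href="hwtask\.view\.jsp.*?class="infolist">(.*?)</a></td>'
)
date_re = re.compile(r'<td>(\d+) Year (\d+) Month (\d+) Day</td>')

titles = title_re.findall(sample_html)
dates = date_re.findall(sample_html)

# zip() pairs the n-th title with the n-th date and stops at the shorter list,
# so a missing date cannot raise an IndexError.
assignments = list(zip(titles, dates))
for title, (year, month, day) in assignments:
    print("%s  due %s-%s-%s" % (title, year, month, day))
```

Both approaches assume the titles and dates appear in the same order on the page; if a row lacked a date, the pairs after it would be misaligned either way.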


Complete code for fetching the Database course assignments

    # coding=gbk
    import re
    import urllib
    import urllib2
    import cookielib

    # Keep cookies between requests, as a browser would
    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

    # Form data
    postdata = urllib.urlencode({
        'ipt_loginusername': 'your student number',  # student number
        'ipt_loginpassword': 'your password'         # password
    })

    # Target address of the form, sent with a browser-like User-Agent
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url='http://eol.zhbit.com/homepage/common/login.jsp',
                          data=postdata, headers=headers)
    result = opener.open(req)
    htmlFlag = result.read()

    # Match the welcome message to verify that the login succeeded
    # ('Welcome login!' stands in for the Chinese text on the live page)
    pattern = re.compile(r'Welcome login!')
    search = pattern.search(htmlFlag)
    if search:
        # After a successful login, go to 'http://eol.zhbit.com/main.jsp'
        result = opener.open('http://eol.zhbit.com/main.jsp')
        # Read the page's HTML into homeHtml
        homeHtml = result.read()
        # Regular expression matching the content we want
        pattern = re.compile(r'<a href="javascript:void\(0\)".*?' +
                             r'onclick="changecourse\(\'(.*?)\'\)"' +
                             r'.*?\n.*?' + r'title="(.*?)">')
        mainlist = pattern.findall(homeHtml)
        for m in mainlist:
            print m[0].decode('GBK') + " " + m[1].decode('GBK')

        # Open the Database course page
        dataBaseUrl = "http://eol.zhbit.com/stu_left_course_menu.jsp?lid=24063"
        result = opener.open(dataBaseUrl)
        result = opener.open('http://eol.zhbit.com/common/hw/student/hwtask.jsp')
        # Read the HTML of the assignments page into dataBaseHtml
        dataBaseHtml = result.read()
        pattern = re.compile(r'<td align="left"><a href="hwtask.view.jsp.*?class="infolist">(.*?)</a></td>')
        hw = pattern.findall(dataBaseHtml)
        # 'Year', 'Month' and 'Day' stand in for the Chinese date characters
        pattern2 = re.compile(r'<td>(.*?)Year(.*?)Month(.*?)Day</td>')
        ti = pattern2.findall(dataBaseHtml)
        t = 0
        for h in hw:
            print (h.decode('GBK') +
                   '  Due date'.decode('GBK') + ti[t][0].decode('GBK') +
                   'year'.decode('GBK') + ti[t][1].decode('GBK') +
                   'month'.decode('GBK') + ti[t][2].decode('GBK') +
                   'day'.decode('GBK'))
            t += 1
    else:
        print "Login Error"

Note: .decode('GBK') is appended to the Chinese strings because Linux outputs Chinese as UTF-8, while the campus network pages are GBK-encoded, so the text has to be decoded first. On Windows this is not needed.
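The encoding difference can be seen in isolation with a short Python 3 sketch. Here raw_bytes is built from a literal to play the role of bytes read from a GBK page; in Python 3 the read() call on a response returns bytes in the same way.

```python
# Bytes as a GBK-encoded page would deliver them (built from a literal here).
raw_bytes = "数据库".encode("gbk")

# Decoding with the page's encoding recovers the text.
text = raw_bytes.decode("gbk")

# The same text occupies a different byte sequence in UTF-8.
utf8_bytes = text.encode("utf-8")

print(len(text), len(raw_bytes), len(utf8_bytes))
```

Each of the three characters takes two bytes in GBK and three in UTF-8, which is why printing the raw GBK bytes on a UTF-8 terminal produces garbage.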


With a small change, the program can fetch the assignments for every course.

    # coding=gbk
    import re
    import urllib
    import urllib2
    import cookielib

    # Keep cookies between requests, as a browser would
    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

    # Form data
    postdata = urllib.urlencode({
        'ipt_loginusername': 'your student number',  # student number
        'ipt_loginpassword': 'your password'         # password
    })

    # Target address of the form, sent with a browser-like User-Agent
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url='http://eol.zhbit.com/homepage/common/login.jsp',
                          data=postdata, headers=headers)
    result = opener.open(req)
    htmlFlag = result.read()

    # Match the welcome message to verify that the login succeeded
    # ('Welcome login!' stands in for the Chinese text on the live page)
    pattern = re.compile(r'Welcome login!')
    search = pattern.search(htmlFlag)
    if search:
        # After a successful login, go to 'http://eol.zhbit.com/main.jsp'
        result = opener.open('http://eol.zhbit.com/main.jsp')
        # Read the page's HTML into homeHtml
        homeHtml = result.read()
        # Regular expression matching the content we want
        pattern = re.compile(r'<a href="javascript:void\(0\)".*?' +
                             r'onclick="changecourse\(\'(.*?)\'\)"' +
                             r'.*?\n.*?' + r'title="(.*?)">')
        mainlist = pattern.findall(homeHtml)
        for m in mainlist:
            print m[1].decode('GBK')
            # Build this course's page address from its relative path
            dataBaseUrl = "http://eol.zhbit.com" + m[0]
            result = opener.open(dataBaseUrl)
            result = opener.open('http://eol.zhbit.com/common/hw/student/hwtask.jsp')
            # Read the HTML of the assignments page into dataBaseHtml
            dataBaseHtml = result.read()
            pattern = re.compile(r'<td align="left"><a href="hwtask.view.jsp.*?class="infolist">(.*?)</a></td>')
            hw = pattern.findall(dataBaseHtml)
            # 'Year', 'Month' and 'Day' stand in for the Chinese date characters
            pattern2 = re.compile(r'<td>(.*?)Year(.*?)Month(.*?)Day</td>')
            ti = pattern2.findall(dataBaseHtml)
            t = 0
            for h in hw:
                print (h.decode('GBK') +
                       '  Due date'.decode('GBK') + ti[t][0].decode('GBK') +
                       'year'.decode('GBK') + ti[t][1].decode('GBK') +
                       'month'.decode('GBK') + ti[t][2].decode('GBK') +
                       'day'.decode('GBK'))
                t += 1
            print "\n"
    else:
        print "Login Error"



Run results

    Database Application Technology [02102100]
    2014-2015-2 Computer Lab 7 - Macros  Due date July 7, 2015
    2014-2015-2 Computer Lab 6 - Reports  Due date June 22, 2015
    2014-2015-2 Computer Lab 5  Due date June 6, 2015
    Final exam materials upload  Due date July 6, 2015
    2014-2015-2 Computer Lab 4  Due date May 18, 2015
    2014-2015-2 Computer Lab 3  Due date May 11, 2015
    2014-2015-2 Computer Lab 2  Due date May 4, 2015
    2014-2015-2 Computer Lab 1  Due date April 30, 2015

    Object-Oriented Programming (C++) [02120011]
    Lab roll call 7  Due date July 3, 2015
    Lab report 8  Due date July 10, 2015
    Lab report 7  Due date July 5, 2015
    Lab report 6  Due date June 23, 2015
    Lab roll call 6  Due date June 5, 2015
    Lab report 5  Due date June 12, 2015
    Lab roll call 5  Due date May 22, 2015
    Lab report 4  Due date May 21, 2015
    Lab roll call 4  Due date May 8, 2015
    Lab roll call 3  Due date April 24, 2015
    Lab report 3  Due date May 5, 2015
    Lab roll call 2  Due date April 10, 2015
    Lab report 2  Due date April 21, 2015
    Lab roll call 1  Due date March 27, 2015
    Lab report 1  Due date March 31, 2015

    Analog Electronic Technology [02120021]
    Experiment 6  Due date July 12, 2015
    Experiment 5  Due date July 12, 2015
    Project case experiment  Due date July 12, 2015
    Experiment 3  Due date July 12, 2015
    Experiment 2  Due date July 12, 2015
    Test  Due date July 12, 2015

