Python crawler: simulating a login to the academic affairs office and saving the data locally

Source: Internet
Author: User
I have only just started with Python, and after watching so many people play with crawlers I wanted to try one myself. The first thing many people do with a web crawler is simulate a login; the harder step is fetching data after logging in. There are few Python 3.x simulated-login demos online to refer to, and I don't know much HTML, so this first crawler was hard to write, but the final result is satisfactory. The learning process is summarized below.


Tools

  • System: Windows 7, 64-bit

  • Browser: Chrome

  • Python version: Python 3.5 64-bit

  • IDE: JetBrains PyCharm (this seems to be used by many people)

My target is our Office of Academic Affairs: the crawler fetches exam scores from it and saves them to an Excel sheet. The address is http://jwc.ecjtu.jx.cn/. To see scores we normally open the site, click the score query, enter the shared public account and password, and then enter the relevant information to get the score table. Conveniently, the login here has no verification code, which saves some work. So we first go to the login page of the score query system and look at how the login works. Press F12 in Chrome to open the developer panel:


View form data

Three parameters need to be passed here: user, pass, and Submit; the names are self-explanatory. That gives us the first step of the code, simulating the login to the academic affairs office. The code can be run directly:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests

url = 'http://jwc.ecjtu.jx.cn/mis_o/login.php'
datas = {'user': 'jwc',
         'pass': 'jwc',
         'Submit': '%CC%E1%BD%BB'
         }
headers = {'Referer': 'http://jwc.ecjtu.jx.cn/mis_o/login.htm',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'Accept-Language': 'zh-CN,zh;q=0.8',
           }
sessions = requests.session()
response = sessions.post(url, headers=headers, data=datas)
print(response.status_code)

Code output:

200

A 200 status code means the server accepted the request, which we take here as a successful simulated login (strictly speaking, it only shows that the request went through, not that the credentials were checked). We use the Requests module; if you have not used it before, see its documentation, which is also available in Chinese. It describes itself as "HTTP for Humans" and really is easy to use: we only pass in the URL, build the request headers, and supply the data for the POST method to simulate a browser login. A session is used so that the login state is kept for the score query that follows. Next, look at the score query request:
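One detail worth checking is the Submit value, which was copied from the browser already percent-escaped. The sketch below uses only the standard library to show what happens when such a string is form-encoded a second time. It assumes the page is GBK-encoded (suggested by the two-byte escapes, but not stated anywhere above) and that Requests form-encodes a dict in the same way as urllib.parse.urlencode, which is my understanding rather than something verified against this site.

```python
from urllib.parse import unquote, urlencode

captured = '%CC%E1%BD%BB'  # Submit value as shown in Chrome's form data

# Decode the percent-escapes as GBK (assumption) to recover the button label.
label = unquote(captured, encoding='gbk')

# Encoding the already-escaped string escapes each '%' sign again.
double_encoded = urlencode({'Submit': captured})

# Encoding the raw GBK bytes reproduces what the browser sent.
as_browser = urlencode({'Submit': label.encode('gbk')})

print(double_encoded)  # Submit=%25CC%25E1%25BD%25BB
print(as_browser)      # Submit=%CC%E1%BD%BB
```

If the server ever starts checking this field, passing label.encode('gbk') as the dict value may be the safer choice.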


View post data

Analysis of the request shows that only the student ID is filled in and the other fields are left empty, so the query code can be written as:

    score_healders = {'Connection': 'keep-alive',
                      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
                      'Content-Type': 'application/x-www-form-urlencoded',
                      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                      'Content-Length': '69',
                      'Host': 'jwc.ecjtu.jx.cn',
                      'Referer': 'http://jwc.ecjtu.jx.cn/mis_o/main.php',
                      'Upgrade-Insecure-Requests': '1',
                      'Accept-Language': 'zh-CN,zh;q=0.8'
                      }
    score_url = 'http://jwc.ecjtu.jx.cn/mis_o/query.php?start=' + str(
        pagenum) + '&job=see&=&Name=&Course=&ClassID=&Term=&StuID=' + num
    score_data = {'Name': '',
                  'StuID': num,
                  'Course': '',
                  'Term': '',
                  'ClassID': '',
                  'Submit': '%B2%E9%D1%AF'
                  }
    score_response = sessions.post(score_url, data=score_data, headers=score_healders)
    content = score_response.content

A note on the code above: score_url is not the address shown in the browser's address bar. To get the real address, right-click the page in Chrome, choose "View page source", and find this line:

a href=query.php?start=1&job=see&=&Name=&Course=&ClassID=&Term=&StuID=xxxxxxx

This is the real address, and clicking it leads to the real results page. Because there is a lot of score data, the results are paginated: start=1 denotes the first page, so this parameter varies and has to be passed in. StuID is the student ID we entered. The URL is therefore built as:

score_url = 'http://jwc.ecjtu.jx.cn/mis_o/query.php?start=' + str(pagenum) + '&job=see&=&Name=&Course=&ClassID=&Term=&StuID=' + num
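As an alternative to string concatenation, the same query string can be assembled from name/value pairs. This is only a sketch: build_score_url and the sample student ID are made up for illustration, and the empty-named parameter is kept only because it appears in the link found in the page source.

```python
from urllib.parse import urlencode

BASE_URL = 'http://jwc.ecjtu.jx.cn/mis_o/query.php'

def build_score_url(num, pagenum):
    """Build the paginated query URL for one student ID (hypothetical helper)."""
    params = [
        ('start', pagenum),  # page selector
        ('job', 'see'),
        ('', ''),            # the link in the page source contains a bare '&=&'
        ('Name', ''),
        ('Course', ''),
        ('ClassID', ''),
        ('Term', ''),
        ('StuID', num),
    ]
    return BASE_URL + '?' + urlencode(params)

url = build_score_url('20130000000000', 1)  # made-up 14-digit ID
print(url)
```

A list of pairs is used instead of a dict so that the parameter order matches the original link exactly.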

Similarly, the Post method is used to pass data and obtain the response content:

score_response = sessions.post(score_url, data=score_data, headers=score_healders)
content = score_response.content

Beautiful Soup 4.2.0 is used to parse the returned content, since what we want is the scores. On the score query page, the results are displayed as a table:

Observe the webpage source code of the table:

Term | Student ID | Name | Course | Course requirements | Credits | Score | Retake 1 | Retake 2

Only the first row is shown as an example. Although I don't know much HTML, it is clear that <tr> represents a row and <td> represents each column within that row. So we can fetch every row, split it into columns, and print the output to check the result:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
# Find each row
target = soup.find_all('tr')
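To see the row and column structure in isolation, here is a small offline sketch. It swaps in the standard library's html.parser for Beautiful Soup so that it runs without extra packages, and the sample table and the TableRows class are invented for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._in_cell = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell belongs to the table data.
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()

SAMPLE = """
<table>
  <tr><th>Term</th><th>Course</th><th>Score</th></tr>
  <tr><td>2013.1</td><td>Calculus</td><td>85</td></tr>
  <tr><td>2013.2</td><td>Physics</td><td>78</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
data_rows = parser.rows[1:]  # skip the header row, like target[1:]
print(data_rows)
```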

Take care when splitting out the columns: the table spans three pages, with at most 30 records per page. Only students with a complete set of scores are handled here, by default fourth-year students; students with fewer records are not accounted for. Two variables, i and j, track the row and the column respectively:

# Note: the print calls here only echo the results to the PyCharm console for verification.
i = 0
j = 0
for tag in target[1:]:
    tds = tag.find_all('td')
    # start again from the first column on every row
    j = 0
    # term
    semester = str(tds[0].string)
    if semester == 'None':
        break
    else:
        print(semester.ljust(6) + '\t\t', end='')
    # student ID
    studentid = tds[1].string
    print(studentid.ljust(14) + '\t\t', end='')
    j += 1
    # name
    name = tds[2].string
    print(name.ljust(3) + '\t\t', end='')
    j += 1
    # course
    course = tds[3].string
    print(course.ljust(20, ' ') + '\t\t', end='')
    j += 1
    # course requirements
    requirments = tds[4].string
    print(requirments.ljust(10, ' ') + '\t\t', end='')
    j += 1
    # credits
    scredit = tds[5].string
    print(scredit.ljust(2, ' ') + '\t\t', end='')
    j += 1
    # score
    achievement = tds[6].string
    print(achievement.ljust(2) + '\t\t', end='')
    j += 1
    # retake 1
    reexaminef = tds[7].string
    print(reexaminef.ljust(2) + '\t\t', end='')
    j += 1
    # retake 2
    reexamines = tds[8].string
    print(reexamines.ljust(2) + '\t\t')
    j += 1
    i += 1

Many other blogs I read use regular expressions to split out the data. I tried, but I am not good at writing regular expressions and could not get one to work, so I fell back on this method. If you have a tested regular expression that works, please let me know.
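For readers who want to try the regular-expression route, one possible sketch follows. It assumes simple, non-nested <tr> and <td> markup, has only been run against the made-up sample string below rather than the real page, and is generally more fragile than an HTML parser. The names ROW_RE, CELL_RE, TAG_RE, and extract_rows are hypothetical.

```python
import re

ROW_RE = re.compile(r'<tr\b[^>]*>(.*?)</tr>', re.S | re.I)
CELL_RE = re.compile(r'<td\b[^>]*>(.*?)</td>', re.S | re.I)
TAG_RE = re.compile(r'<[^>]+>')  # strips any inline tags left inside a cell

def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    rows = []
    for row_html in ROW_RE.findall(html):
        cells = [TAG_RE.sub('', cell).strip() for cell in CELL_RE.findall(row_html)]
        if cells:  # rows without <td> cells (such as a <th> header) are skipped
            rows.append(cells)
    return rows

sample = ('<table><tr><th>Term</th><th>Score</th></tr>'
          '<tr><td>2013.1</td><td> 85 </td></tr>'
          '<tr><td>2013.2</td><td><b>78</b></td></tr></table>')
rows = extract_rows(sample)
print(rows)
```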

Save data to Excel

Now that we understand how the site lays out the scores, we can save the data as each row is parsed in the loop. xlwt is used to write the data to Excel. Because the default style that xlwt writes is too narrow to read comfortably, a helper is added to control the style written to the Excel file:

import xlwt

file = xlwt.Workbook(encoding='utf-8')
table = file.add_sheet('achieve')

# Set the Excel cell style
def set_style(name, height, bold=False):
    style = xlwt.XFStyle()  # initialize the style
    font = xlwt.Font()  # create a font for the style
    font.name = name  # e.g. 'Times New Roman'
    font.bold = bold
    font.colour_index = 4  # xlwt spells this attribute "colour_index"
    font.height = height
    style.font = font
    return style

Apply to code:

for tag in target[1:]:
    tds = tag.find_all('td')
    j = 0
    # term
    semester = str(tds[0].string)
    if semester == 'None':
        break
    else:
        print(semester.ljust(6) + '\t\t', end='')
        table.write(i, j, semester, set_style('Arial', 220))
    # student ID
    studentid = tds[1].string
    print(studentid.ljust(14) + '\t\t', end='')
    j += 1
    table.write(i, j, studentid, set_style('Arial', 220))
    # widen the student ID column (indexed by column j, not row i)
    table.col(j).width = 256 * 16
    # name
    name = tds[2].string
    print(name.ljust(3) + '\t\t', end='')
    j += 1
    table.write(i, j, name, set_style('Arial', 220))
    # course
    course = tds[3].string
    print(course.ljust(20, ' ') + '\t\t', end='')
    j += 1
    table.write(i, j, course, set_style('Arial', 220))
    # course requirements
    requirments = tds[4].string
    print(requirments.ljust(10, ' ') + '\t\t', end='')
    j += 1
    table.write(i, j, requirments, set_style('Arial', 220))
    # credits
    scredit = tds[5].string
    print(scredit.ljust(2, ' ') + '\t\t', end='')
    j += 1
    table.write(i, j, scredit, set_style('Arial', 220))
    # score
    achievement = tds[6].string
    print(achievement.ljust(2) + '\t\t', end='')
    j += 1
    table.write(i, j, achievement, set_style('Arial', 220))
    # retake 1
    reexaminef = tds[7].string
    print(reexaminef.ljust(2) + '\t\t', end='')
    j += 1
    table.write(i, j, reexaminef, set_style('Arial', 220))
    # retake 2
    reexamines = tds[8].string
    print(reexamines.ljust(2) + '\t\t')
    j += 1
    table.write(i, j, reexamines, set_style('Arial', 220))
    i += 1
file.save('demo.xls')
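The nine near-identical blocks above could also be collapsed into a loop over the cells. The sketch below is not the article's code: write_row, COLUMNS, and fake_write are hypothetical names, the sample row is invented, and a plain dict stands in for the worksheet so the example runs without xlwt installed.

```python
COLUMNS = ['term', 'student ID', 'name', 'course', 'requirements',
           'credits', 'score', 'retake 1', 'retake 2']

def write_row(write, row_index, values):
    """Write one parsed row; write(row, col, value) stands in for table.write."""
    for col_index, value in enumerate(values[:len(COLUMNS)]):
        write(row_index, col_index, value)

# A dict keyed by (row, col) plays the role of the worksheet here.
sheet = {}

def fake_write(row, col, value):
    sheet[(row, col)] = value

write_row(fake_write, 0, ['2013.1', '20130000000000', 'Zhang San', 'Calculus',
                          'Required', '4', '85', '', ''])
print(sheet[(0, 3)])
```

With xlwt, the callback could be something like lambda r, c, v: table.write(r, c, v, set_style('Arial', 220)), which keeps the styling in one place.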

Finally, wrap everything up in a single function:

# Fetch the scores.
# num is the student ID entered and pagenum is the page number: there are 76 records
# in total at 30 per page, so three pages altogether.
def getScore(num, pagenum, i, j):
    score_healders = {'Connection': 'keep-alive',
                      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
                      'Content-Type': 'application/x-www-form-urlencoded',
                      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                      'Content-Length': '69',
                      'Host': 'jwc.ecjtu.jx.cn',
                      'Referer': 'http://jwc.ecjtu.jx.cn/mis_o/main.php',
                      'Upgrade-Insecure-Requests': '1',
                      'Accept-Language': 'zh-CN,zh;q=0.8'
                      }
    score_url = 'http://jwc.ecjtu.jx.cn/mis_o/query.php?start=' + str(
        pagenum) + '&job=see&=&Name=&Course=&ClassID=&Term=&StuID=' + num
    score_data = {'Name': '',
                  'StuID': num,
                  'Course': '',
                  'Term': '',
                  'ClassID': '',
                  'Submit': '%B2%E9%D1%AF'
                  }
    score_response = sessions.post(score_url, data=score_data, headers=score_healders)
    # dump the raw response to a text file
    with open('text.txt', 'wb') as f:
        f.write(score_response.content)
    content = score_response.content
    soup = BeautifulSoup(content, 'html.parser')
    target = soup.find_all('tr')
    try:
        for tag in target[1:]:
            tds = tag.find_all('td')
            j = 0
            # term
            semester = str(tds[0].string)
            if semester == 'None':
                break
            else:
                print(semester.ljust(6) + '\t\t', end='')
                table.write(i, j, semester, set_style('Arial', 220))
            # student ID
            studentid = tds[1].string
            print(studentid.ljust(14) + '\t\t', end='')
            j += 1
            table.write(i, j, studentid, set_style('Arial', 220))
            # widen the student ID column (indexed by column j, not row i)
            table.col(j).width = 256 * 16
            # name
            name = tds[2].string
            print(name.ljust(3) + '\t\t', end='')
            j += 1
            table.write(i, j, name, set_style('Arial', 220))
            # course
            course = tds[3].string
            print(course.ljust(20, ' ') + '\t\t', end='')
            j += 1
            table.write(i, j, course, set_style('Arial', 220))
            # course requirements
            requirments = tds[4].string
            print(requirments.ljust(10, ' ') + '\t\t', end='')
            j += 1
            table.write(i, j, requirments, set_style('Arial', 220))
            # credits
            scredit = tds[5].string
            print(scredit.ljust(2, ' ') + '\t\t', end='')
            j += 1
            table.write(i, j, scredit, set_style('Arial', 220))
            # score
            achievement = tds[6].string
            print(achievement.ljust(2) + '\t\t', end='')
            j += 1
            table.write(i, j, achievement, set_style('Arial', 220))
            # retake 1
            reexaminef = tds[7].string
            print(reexaminef.ljust(2) + '\t\t', end='')
            j += 1
            table.write(i, j, reexaminef, set_style('Arial', 220))
            # retake 2
            reexamines = tds[8].string
            print(reexamines.ljust(2) + '\t\t')
            j += 1
            table.write(i, j, reexamines, set_style('Arial', 220))
            i += 1
    except Exception:
        print('A little bug occurred')
    file.save('demo.xls')

After the simulated login, add a check:

import re

# Check whether the login succeeded and the student ID looks valid
def isLogin(num):
    return_code = response.status_code
    if return_code == 200:
        if re.match(r"^\d{14}$", num):
            print('Please wait a moment')
            return True
        # a malformed ID should not go on to the query
        print('Enter the correct student ID')
        return False
    else:
        return False
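The ID check can also be tried on its own. The sketch below is a hypothetical standalone helper (is_valid_student_id is not part of the article's code) and uses re.fullmatch, which is slightly stricter than re.match with "^...$" because "$" also matches just before a trailing newline.

```python
import re

STUDENT_ID_RE = re.compile(r'\d{14}')

def is_valid_student_id(num):
    """Return True only if num consists of exactly 14 digits."""
    return STUDENT_ID_RE.fullmatch(num) is not None

print(is_valid_student_id('20130000000000'))    # made-up 14-digit ID
print(is_valid_student_id('2013'))              # too short
print(is_valid_student_id('20130000000000\n'))  # trailing newline is rejected
```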

Finally, the main block calls it like this:

if __name__ == '__main__':
    num = input('Enter your student ID: ')
    if isLogin(num):
        getScore(num, pagenum=0, i=0, j=0)
        getScore(num, pagenum=1, i=31, j=0)
        getScore(num, pagenum=2, i=62, j=0)
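The page numbers and starting rows are hard-coded in the three calls. As a sketch of how they could be derived instead, the hypothetical helper below computes one (pagenum, first row) pair per page from the record count. It assumes every page except the last is full, and it uses contiguous offsets (0, 30, 60), whereas the calls in the article start at rows 31 and 62, which would leave a blank row between pages.

```python
import math

ROWS_PER_PAGE = 30  # page size mentioned in the article

def page_plan(total_records, rows_per_page=ROWS_PER_PAGE):
    """Return (pagenum, first_sheet_row) pairs covering all records."""
    pages = math.ceil(total_records / rows_per_page)
    return [(pagenum, pagenum * rows_per_page) for pagenum in range(pages)]

plan = page_plan(76)  # 76 records, as in the article
print(plan)  # [(0, 0), (1, 30), (2, 60)]
```

Each pair could then be passed on in a loop, for example getScore(num, pagenum=p, i=row, j=0).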

In PyCharm, press the Alt+Shift+X shortcut to run the program:


Final result obtained

And with that, the crawler works.
