I have some free time this summer vacation, so my first project is a Django-based grade query system for Yangtze University.
The topics covered in this article include: Python crawlers, MySQL databases, HTML/CSS/JS basics, selenium and phantomjs basics, the MVC design pattern, the Django framework (a Python web development framework), and basic operation of the Apache server and Linux (CentOS 7 as the example). It is therefore best suited to readers who already have these foundations.
Statement: This blog post is intended purely for technical exchange, and sensitive information has been filtered out. (I accept no responsibility for any problems that arise on the website of the Academic Affairs Office of Yangtze University, whatever the cause.)
Implementation: the Academic Affairs Office does not provide a data interface (for student information security), so the only option is to write a crawler that simulates a login to the Academic Affairs Office site and then crawls the data. To guard against the crawler failing when the Academic Affairs Office website crashes, the crawled data is cached, so that next time it can be read directly from our own database. What remains is to update the data regularly and keep it synchronized with the Academic Affairs Office.
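The caching idea above can be sketched roughly as follows. This is only a minimal illustration in Python 3 (the code later in this post is Python 2.7): the function name `get_grades`, the `fetch_live` callback, the dict used as a cache, and the 24-hour freshness window are all assumptions made for the example, not part of the actual project.

```python
import time

DAY = 24 * 3600  # assumed refresh interval; pick whatever sync period suits you


def get_grades(student_id, fetch_live, cache, max_age=DAY, now=time.time):
    """Return (grades, source) where source is 'cache', 'live' or 'stale-cache'.

    cache maps student_id -> (saved_at, grades); a dict stands in for the
    database table here. fetch_live is a hypothetical crawler callback.
    """
    entry = cache.get(student_id)
    # Serve from our own database while the cached copy is still fresh
    if entry is not None and now() - entry[0] < max_age:
        return entry[1], "cache"
    try:
        grades = fetch_live(student_id)
    except Exception:
        # The remote site is down: fall back to the old copy if we have one
        if entry is not None:
            return entry[1], "stale-cache"
        raise
    cache[student_id] = (now(), grades)
    return grades, "live"


if __name__ == "__main__":
    demo_cache = {}
    print(get_grades("s1", lambda sid: {"Databases": 88}, demo_cache))
    print(get_grades("s1", lambda sid: {"Databases": 88}, demo_cache))
```

Passing `now` as a parameter is just a convenience so the freshness logic can be exercised without waiting for real time to pass.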
Technical architecture: CentOS 7 + Apache 2.4 + MariaDB 5.5 + Python 2.7.5 + mod_wsgi 3.4 + Django 1.11
------------------------------------------------------------------------
I. Python crawler:
1. First, inspect the login page.
Here we use Firefox to capture and analyze the requests. The login turns out to be a POST with 7 parameters plus a verification code. There are two ways to handle the verification code: one is the currently popular approach of using deep learning for image recognition, and the other is to let users type it in themselves. The first is relatively costly; try it if you have the time. I recall that Python has a library named Pillow (PIL) that can help with the image processing, and I may try TensorFlow over the summer vacation. The second approach is low-tech.
2. There is also a more sophisticated way that sidesteps the verification code entirely. We will not elaborate on it here. Let's simulate the login:
# coding: utf8
from bs4 import BeautifulSoup
import urllib
import urllib2
import requests
import sys

reload(sys)
sys.setdefaultencoding('gbk')

loginURL = ""
cjcxURL = "http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx"

# Fetch the login page and read the hidden ASP.NET form fields
html = urllib2.urlopen(loginURL)
soup = BeautifulSoup(html, "lxml")
__VIEWSTATE = soup.find(id="__VIEWSTATE")["value"]
__EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")["value"]

data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__EVENTVALIDATION": __EVENTVALIDATION,
    "txtUid": "Account",
    "btLogin": "%B5%C7%C2%BC",
    "txtPwd": "password",
    "selKind": "1"
}
header = {
    # "Host": "jwc2.yangtzeu.edu.cn:8080",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0;... Gecko/20100101 Firefox/54.0",
    "Accept": "text/html,application/xhtml+x...lication/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    "Content-Type": "application/x-www-form-urlencoded",
    # "Content-Length": "644",
    "Referer": "http://jwc2.yangtzeu.edu.cn:8080/login.aspx",
    # "Cookie": "ASP.NET_SessionId=3zjuqi0cnk5514l241csejgx",
    # "Connection": "keep-alive",
    # "Upgrade-Insecure-Requests": "1",
}

UserSession = requests.session()
# Pass headers by keyword: the third positional argument of post() is json, not headers
Request = UserSession.post(loginURL, data, headers=header)
Response = UserSession.get(cjcxURL, cookies=Request.cookies, headers=header)
soup = BeautifulSoup(Response.content, "lxml")
print soup
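One detail of the form data above is worth a closer look. The btLogin value %B5%C7%C2%BC is the GBK percent-encoding of the button label 登录, copied from the captured request. As far as I can tell, form encoders escape dict values again, so a string that is already percent-encoded may reach the server double-encoded. The sketch below (Python 3, standard library only, not the original script) shows the difference using `urllib.parse.urlencode`; whether this is what made the login misbehave on the real site is untested.

```python
from urllib.parse import quote, urlencode

label = "登录"             # login button text, inferred from the captured value
raw = label.encode("gbk")  # the site appears to use GBK rather than UTF-8

# What Firefox showed in the captured request
captured = quote(raw)

# Handing the already-encoded string to the encoder escapes each '%' again
double_encoded = urlencode({"btLogin": captured})

# Handing over the raw GBK bytes lets the encoder escape them exactly once
single_encoded = urlencode({"btLogin": raw})

print(captured)
print(double_encoded)
print(single_encoded)
```

If this turns out to matter, the same idea in the Python 2 script would be to pass the GBK-encoded byte string as the value instead of the percent-encoded text.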
Next we can see:
Post again (this code continues from the code above):
# The page just fetched carries fresh hidden fields; read them again
__VIEWSTATE2 = soup.find(id="__VIEWSTATE")["value"]
__EVENTVALIDATION2 = soup.find(id="__EVENTVALIDATION")["value"]

AllcjData = {
    "__EVENTTARGET": "btAllcj",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": __VIEWSTATE2,
    "__EVENTVALIDATION": __EVENTVALIDATION2,
    "selYear": "2017",
    "selTerm": "1",
    # "Button2": "%B1%D8%D0%DE%BF%CE%B3%C9%BC%A8"
}
AllcjHeader = {
    # "Host": "jwc2.yangtzeu.edu.cn:8080",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0;… Gecko/20100101 Firefox/54.0",
    "Accept": "text/html,application/xhtml+x…lication/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    "Content-Type": "application/x-www-form-urlencoded",
    # "Content-Length": "644",
    "Referer": "http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx",
    # "Cookie": ,
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Pass headers by keyword: the third positional argument of post() is json, not headers
Request1 = UserSession.post(cjcxURL, AllcjData, headers=AllcjHeader)
Response1 = UserSession.get(cjcxURL, cookies=Request.cookies, headers=AllcjHeader)
soup = BeautifulSoup(Response1.content, "lxml")
print soup
No luck. The page returned by this GET is still the original page. I can think of two possible reasons for the POST failing: first, the __VIEWSTATE and __EVENTVALIDATION variables of ASP.NET cause the POST to be rejected; second, the form has several buttons and uses JS to decide which one was pressed, so the crawler fails. For dynamically loaded pages, an ordinary crawler still does not work.
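To make the first suspected cause more concrete, here is a small sketch of how an ASP.NET postback payload is usually assembled: every hidden input on the page is echoed back, and `__EVENTTARGET` names the control that was "clicked". It is written in Python 3 with only the standard library and runs against a made-up HTML sample; the helper names `HiddenFieldParser` and `build_postback` and the sample values are assumptions, and whether the real site would accept such a payload is untested.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback(page_html, event_target, extra=None):
    """Echo back all hidden fields and name the control that fired the event."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = ""
    payload.update(extra or {})
    return payload


# Made-up page fragment, only shaped like the real one
SAMPLE = """
<form id="form1" method="post" action="cjcx.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="button" name="btAllcj" value="All" />
</form>
"""

payload = build_postback(SAMPLE, "btAllcj", {"selYear": "2017", "selTerm": "1"})
print(payload)
```

Collecting all hidden inputs, rather than naming two of them by hand, has the advantage that any extra state field the page adds later is carried along automatically.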
3. Use selenium (a web automation testing tool that can simulate mouse clicks) + phantomjs (a browser without a GUI, which is faster than Chrome and Firefox)
Selenium installation: pip install selenium
Install phantomjs:
(1) Download address: http://phantomjs.org/download.html (I downloaded the Linux 64-bit build)
(2) Extract: tar -jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2 -C /usr/share/
(3) Install dependencies: yum install fontconfig freetype libfreetype.so.6 libfontconfig.so.1
(4) Configure the environment variable: export PATH=$PATH:/usr/share/phantomjs-2.1.1-linux-x86_64/bin
(5) Type phantomjs in the shell. If you land in its interactive prompt, the installation succeeded.
Please ignore the commented-out attempts:
# coding: utf8
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import urllib
import urllib2
import sys

reload(sys)
sys.setdefaultencoding('utf8')

driver = webdriver.PhantomJS()
driver.get("")
driver.find_element_by_name('txtUid').send_keys('account')
driver.find_element_by_name('txtPwd').send_keys('Password')
driver.find_element_by_id('btLogin').click()
cookie = driver.get_cookies()
driver.get("http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx")
#print driver.page_source
#driver.find_element_by_xpath("//input[@name='btAllcj'][@type='button']")
#js = "document.getElementById('btAllcj').onclick=function(){__doPostBack('btAllcj','')}"
#js = "var ob; ob=document.getElementById('btAllcj');ob.focus();ob.click();)"
#driver.execute_script("document.getElementById('btAllcj').click();")
#time.sleep(2)  # Let the operation pause for a moment
#driver.find_element_by_link_text("all scores").click()  # Find the 'login' button and click
#time.sleep(2)
#js1 = "document.form1.__EVENTTARGET.value='btAllcj';"
#js2 = "document.form1.__EVENTARGUMENT.value='';"
#driver.execute_script(js1)
#driver.execute_script(js2)
#driver.find_element_by_name('__EVENTTARGET').send_keys('btAllcj')
#driver.find_element_by_name('__EVENTARGUMENT').send_keys('')
#js = "var input = document.createElement('input'); input.setAttribute('type','hidden'); input.setAttribute('name','__EVENTTARGET'); input.setAttribute('value',''); document.getElementById('form1').appendChild(input); var input = document.createElement('input'); input.setAttribute('type','hidden'); input.setAttribute('name','__EVENTARGUMENT'); input.setAttribute('value',''); document.getElementById('form1').appendChild(input); var theForm = document.forms['form1']; if (!theForm) { theForm = document.form1; } function __doPostBack(eventTarget, eventArgument) { if (!theForm.onsubmit || (theForm.onsubmit() != false)) { theForm.__EVENTTARGET.value = eventTarget; theForm.__EVENTARGUMENT.value = eventArgument; theForm.submit(); } } __doPostBack('btAllcj', '')"
#js = "var script = document.createElement('script'); script.type = 'text/javascript'; script.text = 'if (!theForm) { theForm = document.form1; } function __doPostBack(eventTarget, eventArgument) { if (!theForm.onsubmit || (theForm.onsubmit() != false)) { theForm.__EVENTTARGET.value = eventTarget; theForm.__EVENTARGUMENT.value = eventArgument; theForm.submit(); } }'; document.body.appendChild(script);"
#driver.execute_script(js)

# Click the required-course scores button, then parse the resulting page
driver.find_element_by_name("Button2").click()
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
print soup

# Print the first three cells of every table row
tables = soup.findAll("table")
for tab in tables:
    for tr in tab.findAll("tr"):
        print "--------------------"
        for td in tr.findAll("td")[0:3]:
            print td.getText()
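The table-walking loop at the end of the script can be tried offline, without a browser or the real site. The sketch below is Python 3 and uses only the standard library's `html.parser` in place of BeautifulSoup, so it is an illustration of the same idea rather than the script's own code. The sample table, its column meanings, and the names `TableRows` and `first_columns` are made up for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def first_columns(page_html, count=3):
    """Return the first `count` cells of each row that has any <td> cells."""
    parser = TableRows()
    parser.feed(page_html)
    return [row[:count] for row in parser.rows if row]


# Hypothetical grade table; the real page layout may differ
SAMPLE = """
<table>
  <tr><th>Course</th><th>Credit</th><th>Grade</th><th>Term</th></tr>
  <tr><td>Databases</td><td>3.0</td><td>88</td><td>2017-1</td></tr>
  <tr><td>Networks</td><td>2.5</td><td>91</td><td>2017-1</td></tr>
</table>
"""

for cells in first_columns(SAMPLE):
    print(cells)
```

Returning a list of rows instead of printing inside the loop makes it straightforward to hand the same data to the database layer later.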
For now, only the required-course scores can be retrieved, because the "all scores" button is triggered by JS generated by ASP.NET instead of being a direct submit. I am still looking for a solution. In the meantime, let's start designing our database.
II. MariaDB student database design. Here we borrow the material from our SQL Server database lab sessions.
My database creation statement:
create database jwc character set utf8;
use jwc;

create table Student(
    Sno char(9) primary key,
    Sname varchar(20) unique,
    Sdept char(20),
    Spwd char(20)
);

create table Course(
    Cno char(2) primary key,
    Cname varchar(30) unique,
    Credit numeric(2,1)
);

create table SC(
    Sno char(9) not null,
    Cno char(2) not null,
    Grade int check(Grade>=0 and Grade<=100),
    primary key(Sno,Cno),
    foreign key(Sno) references Student(Sno),
    foreign key(Cno) references Course(Cno)
);
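To get a feel for how these three tables fit together, the schema can be exercised from Python without a MariaDB server. The sketch below (Python 3) uses the standard library's `sqlite3` with an in-memory database purely as a stand-in; SQLite is not MariaDB, and the sample student, course, and the helper names `open_demo_db` and `transcript` are invented for the example. One difference to be aware of: SQLite enforces the CHECK constraint on Grade, whereas MariaDB 5.5, as far as I know, parses CHECK clauses but does not enforce them.

```python
import sqlite3

SCHEMA = """
create table Student(
    Sno char(9) primary key,
    Sname varchar(20) unique,
    Sdept char(20),
    Spwd char(20)
);
create table Course(
    Cno char(2) primary key,
    Cname varchar(30) unique,
    Credit numeric(2,1)
);
create table SC(
    Sno char(9) not null,
    Cno char(2) not null,
    Grade int check(Grade>=0 and Grade<=100),
    primary key(Sno,Cno),
    foreign key(Sno) references Student(Sno),
    foreign key(Cno) references Course(Cno)
);
"""


def open_demo_db():
    """Create the three tables in a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
    conn.executescript(SCHEMA)
    return conn


def transcript(conn, sno):
    """Return (course name, credit, grade) rows for one student."""
    return conn.execute(
        "select Course.Cname, Course.Credit, SC.Grade "
        "from SC join Course on Course.Cno = SC.Cno "
        "where SC.Sno = ? order by Course.Cno",
        (sno,),
    ).fetchall()


conn = open_demo_db()
# Made-up sample rows
conn.execute("insert into Student values ('201500001', 'Test Student', 'CS', 'secret')")
conn.execute("insert into Course values ('01', 'Databases', 2.5)")
conn.execute("insert into SC values ('201500001', '01', 88)")
print(transcript(conn, "201500001"))
```

The join in `transcript` is the query shape the grade page will most likely need: one student's rows in SC, decorated with the course name and credit.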
III. Python web environment setup (LAMP):
1. Because the chosen HTTP server is Apache, you need to install mod_wsgi (WSGI is the Python Web Server Gateway Interface) so that Apache can talk to Python programs. If you use nginx instead, install and configure uwsgi. This plays a role similar to servlets for Java and php-fpm for PHP.
Install: yum install mod_wsgi
Configuration: vim /etc/httpd/conf/httpd.conf
This configuration cost me a lot of time and thought, because many of the examples online contain errors. What follows is a fairly standard Apache configuration for a Django project; feel free to take it.
#config python web
LoadModule wsgi_module modules/mod_wsgi.so
<VirtualHost *:8080>
    ServerAdmin root@Vito-Yan
    ServerName www.yuol.online
    ServerAlias yuol.online
    Alias /media/ /var/www/html/jwc/media/
    Alias /static/ /var/www/html/jwc/static/
    <Directory /var/www/html/jwc/static/>
        Require all granted
    </Directory>
    WSGIScriptAlias / /var/www/html/jwc/jwc/wsgi.py
    # DocumentRoot "/var/www/html/jwc/jwc"
    ErrorLog "logs/www.yuol.online-error_log"
    CustomLog "logs/www.yuol.online-access_log" common
    <Directory "/var/www/html/jwc/jwc">
        <Files wsgi.py>
            AllowOverride All
            Options Indexes FollowSymLinks Includes ExecCGI
            Require all granted
        </Files>
    </Directory>
</VirtualHost>
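For context on what WSGIScriptAlias points at: mod_wsgi loads the given Python file and looks for a callable named application, which it invokes once per request. The sketch below (Python 3, standard library only) shows the smallest form such a callable can take and calls it directly, with no Apache involved. It is a hand-written illustration, not what Django generates; the helper `call_app` and the path `/grades/` are made up for the example.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Smallest possible WSGI app: one status line, two headers, one body chunk."""
    body = ("Hello from " + environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, path):
    """Invoke a WSGI app roughly the way a server would, without any network."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call_app(application, "/grades/"))
```

Django's `get_wsgi_application()` returns an object with this same calling convention, which is why the server only needs to know where wsgi.py lives.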
2. Next, install Django: pip install django. Done.
Check the Django version: python -m django --version
Official site: https://www.djangoproject.com
Create a project: django-admin startproject jwc (created under /var/www/html, the root directory of my Apache site)
3. Apache configuration: I won't paste it again. Change jwc in the configuration above to jwc2, change the port to 9000, and add Listen 9000. (Why 9000? The first project, jwc, already uses 8080. Django's built-in server, started with python manage.py runserver, listens on port 8000 by default, so 8000 is avoided to prevent a conflict. The Tomcat server for my JSP project uses port 9090, so that one is best avoided too. Port 9000 is a common choice and is free on this machine.)
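Before settling on a port, it can help to check which candidates are actually free on the machine. The following is a small sketch in Python 3 using only the standard library; the helper name `port_is_free` is made up, and it only checks the loopback address by default, so treat the result as a hint rather than a guarantee.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        # Typically "address already in use": some other server holds the port
        return False
    else:
        return True
    finally:
        sock.close()


if __name__ == "__main__":
    # Ports mentioned in this post: Django dev server, jwc, jwc2, Tomcat
    for candidate in (8000, 8080, 9000, 9090):
        print(candidate, "free" if port_is_free(candidate) else "in use")
```

On CentOS the same information is also available from the system tools, but a check like this is easy to drop into a deployment script.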
4. settings.py configuration:
DEBUG = True  # debug mode enabled
ALLOWED_HOSTS = ['192.168.47.128']  # add the host's address
5. Configure wsgi.py. Don't ask me why; I'm not sure either. This change is needed when the Django project is started through the Apache server. If you use the server that comes with Django, you don't need to change anything.
"""
WSGI config for jwc2 project.

It exposes the WSGI callable as a module-level variable named ``application``.

For more information on this file, see
https://docs.djangoproject.com/en/1.11/howto/deployment/wsgi/
"""
# Original generated content, kept for reference:
#import os
#from django.core.wsgi import get_wsgi_application
#os.environ.setdefault("DJANGO_SETTINGS_MODULE", "jwc2.settings")
#application = get_wsgi_application()

import os
from os.path import join, dirname, abspath
import sys

# Add the project root (the directory that contains manage.py) to sys.path
# so that jwc2.settings can be imported when Apache loads this file
PROJECT_DIR = dirname(dirname(abspath(__file__)))
sys.path.insert(0, PROJECT_DIR)

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "jwc2.settings")

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()
With that, it works. The Python web environment is complete.
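As for why the wsgi.py change seems to be needed: my understanding is that python manage.py runserver starts from the project directory, so the jwc2 package is importable, whereas Apache loads wsgi.py by its file path and the project root is not on sys.path unless something adds it. The sketch below (Python 3, standard library only) just replays the path arithmetic from wsgi.py. The location /var/www/html/jwc2/jwc2/wsgi.py is an assumption based on the layout described earlier, and the helper names are made up.

```python
import posixpath


def project_dir_for(wsgi_file):
    """Two levels up from wsgi.py: the directory that holds the project package."""
    return posixpath.dirname(posixpath.dirname(posixpath.abspath(wsgi_file)))


def settings_file_for(project_dir, dotted_name):
    """Where Python would look for a dotted module name under project_dir."""
    return posixpath.join(project_dir, *dotted_name.split(".")) + ".py"


# Assumed location of the second project's wsgi.py
wsgi_file = "/var/www/html/jwc2/jwc2/wsgi.py"

project_dir = project_dir_for(wsgi_file)
settings_file = settings_file_for(project_dir, "jwc2.settings")

print(project_dir)
print(settings_file)
```

In other words, once the outer project directory is on sys.path, the name jwc2.settings resolves to the settings.py that sits next to wsgi.py.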
IV. Start our first Django application...