[Python] A Crawler for Backing Up Code from the NJUPT OJ

Last updated: 2018-04-11. Source: Internet.

Uploaded by: User


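I once read some advice on learning Python that recommended a project-driven approach.

Thinking it over, most people who pick up Python already know another language and have a rough grasp of basic programming logic and syntax. Python is a scripting language without much unusual syntax, so with that background you can start using it right away.

I had been reading the Concise Python Tutorial (Python簡明教程) and made little progress in half a day. Then a Python crawler project came along, and diving straight in turned out to be quick and convenient.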

Site: http://acm.njupt.edu.cn/welcome.do?method=index

The site happened to be undergoing a system upgrade, so I wrote a crawler to back up my code.

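Python libraries used

urllib:

This library wraps the methods for communicating with a web server through a URL, including HTTP requests and responses. In Python 2 it is used together with urllib2, which the code below relies on. So far it has been enough for my needs.

re:

The regular expression library, used to search for information in HTML documents.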

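Basic architecture

First, send a request to the target server over HTTP and receive the response.

Then extract what we need from the response: the user's cookie and the code. Finally, create local files and write the code into them.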

Detailed steps

A. HTTP request

Python really is a quick, concise language. A few lines are enough:

    import urllib2

    myUrl = "http://acm.njupt.edu.cn/acmhome/index"  # target page
    req = urllib2.Request(myUrl)  # build a Request object from the URL
    myResponse = urllib2.urlopen(req)  # send the request with urlopen() and get the response
    myPage = myResponse.read()  # read the page content from the response
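The snippet above targets Python 2, where `urllib2` exists. As a minimal sketch for Python 3, where the same functionality lives in `urllib.request`, the request could be built as shown below. The helper names `build_request` and `fetch_page` are chosen here for illustration and do not come from the article. The network call is kept inside `fetch_page`, so nothing is sent until that function is called.

```python
import urllib.request

MY_URL = "http://acm.njupt.edu.cn/acmhome/index"  # target page from the article


def build_request(url):
    """Build a Request object; without a data argument it is sent as a GET."""
    return urllib.request.Request(url)


def fetch_page(url, timeout=10):
    """Send the request and return the raw page bytes (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


req = build_request(MY_URL)
print(req.get_method(), req.full_url)
```

Note that `read()` returns bytes in Python 3, so the page has to be decoded with the site's encoding before applying string regular expressions to it.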

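B. Obtaining login access

Analyzing the page shows that backing up code requires submitting a username and password, and that the form is submitted to the login page rather than the home page. From what we know of HTTP, this means sending a POST request that carries the form data. Inspecting the submission with the Chrome developer tools shows that the form contains the userName and password fields.

This is also easy to do in Python: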

    myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"  # the URL is now the login page
    # {} is a Python dict holding the username and password;
    # urlencode() encodes the dict as a form-data string
    self.postdata = urllib.urlencode({
        'userName': self.userName,
        'password': self.passWord})
    # pass the URL and the encoded data to Request
    req = urllib2.Request(
        url=myUrl,
        data=self.postdata
    )
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()
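For comparison, here is a hedged Python 3 sketch of the same login request. It only builds the request and does not send it. The function name `build_login_request` and the credentials are placeholders invented for this example; the field names `userName` and `password` are the ones used in the article's code.

```python
import urllib.parse
import urllib.request

LOGIN_URL = "http://acm.njupt.edu.cn/acmhome/login.do"  # login page from the article


def build_login_request(user_name, password):
    """Encode the form fields and attach them as the request body."""
    form = urllib.parse.urlencode({"userName": user_name, "password": password})
    # In Python 3 the body must be bytes; supplying data makes the request a POST.
    return urllib.request.Request(LOGIN_URL, data=form.encode("ascii"))


# "alice" / "secret" are placeholder credentials for illustration only.
login_req = build_login_request("alice", "secret")
print(login_req.get_method(), login_req.data)
```

The point to notice is that no method is named explicitly: attaching `data` is what turns the request into a POST.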

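C. Handling cookies

The steps so far overlook one thing: after logging in, visiting other pages of the site requires the login cookie.

Requests made without any special setup do not seem to keep cookies in Python, so we need to set up HTTP communication that retains them.

First, a few concepts in Python:

opener: the object used for communication. The urllib2.urlopen() calls in the earlier code use the default opener, which amounts to calling open() on that default opener.

handler: an opener contains multiple handlers, each dealing with one aspect of the communication, cookies included.

So the way to handle cookies is to build a new opener and give it a handler that can process cookies.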

    import urllib2
    import cookielib

    cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())  # create the cookie handler
    opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)  # build an opener that uses the cookie handler
    urllib2.install_opener(opener)  # install it as the default opener

With this in place, we can log in and stay logged in on later requests.
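The opener and handler idea carries over to Python 3, where `cookielib` is named `http.cookiejar`. The following is a small sketch under that assumption. `make_cookie_opener` is a name invented here, and the example only inspects the opener rather than contacting the site.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that stores cookies, plus the jar it writes them to."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()
# Calling opener.open(...) directly avoids changing the process-wide default
# that install_opener() would replace.
handler_names = [type(h).__name__ for h in opener.handlers]
print("HTTPCookieProcessor" in handler_names, len(jar))
```

Keeping a reference to the jar is a design choice: after a login request made through `opener.open(...)`, the jar can be inspected to confirm that a session cookie was actually received.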

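D. Locating the code pages

We start from the home page and need to find the pages that show the code. Ordinarily we could just take the URL directly.

However, the URL of a code page turns out to contain an unknown encoding of the time and login information, which has probably been escaped. The solution here is to obtain the URL from a known page instead of typing it by hand.

After analyzing the pages, a code page can be reached as follows:

Home --> User info --> Accepted code --> the hyperlink in the 'G++|GCC|JAVA' column

So we parse the fetched HTML to get the hyperlinks: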

    myItem = re.findall('<a href=\"/acmhome/solutionCode\.do\?id\=.*?\"\ ', myPage, re.S)
    for item in myItem:
        # the first 37 characters are '<a href="/acmhome/solutionCode.do?id='; the last 2 are '" '
        url = 'http://acm.njupt.edu.cn/acmhome/solutionCode.do?id=' + item[37:len(item)-2]
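The slicing with fixed offsets works but is easy to break if the pattern changes. As an alternative sketch, not the author's method, a capture group can return the id directly. The sample HTML below is a hypothetical fragment invented for the example, not a copy of the real page.

```python
import re

BASE = "http://acm.njupt.edu.cn/acmhome/solutionCode.do?id="

# A capture group returns just the id, so no index arithmetic is needed.
LINK_RE = re.compile(r'<a href="/acmhome/solutionCode\.do\?id=(.*?)"')


def solution_urls(page):
    """Return the full URL of every solution link found in the page."""
    return [BASE + solution_id for solution_id in LINK_RE.findall(page)]


# Hypothetical fragment of the status page, invented for this example.
sample = (
    '<td><a href="/acmhome/solutionCode.do?id=abc123" target="_blank">G++</a></td>'
    '<td><a href="/acmhome/solutionCode.do?id=def456" target="_blank">GCC</a></td>'
)
print(solution_urls(sample))
```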

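E. Extracting the text

Judging from the page, the site apparently stores the code as XML and escapes it for HTML. We therefore have to replace the escaped entities in the text to recover the normal source: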

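    import re

    class Tool:
        # Entity patterns written in their escaped form; "&#39;" for the
        # apostrophe is an assumption based on its replacement below.
        A = re.compile("&nbsp\;")
        B = re.compile("\<BR\>")
        C = re.compile("&lt\;")
        D = re.compile("&gt\;")
        E = re.compile("&quot\;")
        F = re.compile("&amp;")
        G = re.compile("Times\ New\ Roman\"\>")
        H = re.compile("\</font\>")
        I = re.compile("&#39;")
        # "語言" ("language") is literal page text and must match the site's own characters
        J = re.compile("語言:(.*)?face=\"", re.DOTALL)

        def replace_char(self, x):
            x = self.A.sub(" ", x)
            x = self.B.sub("\r", x)
            x = self.C.sub("<", x)
            x = self.D.sub(">", x)
            x = self.E.sub("\"", x)
            x = self.F.sub("&", x)
            x = self.G.sub("", x)
            x = self.H.sub("", x)
            x = self.I.sub("\'", x)
            x = self.J.sub("", x)
            return x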

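*Note: Python strings have a replace method, str.replace(old, new). To replace based on a regular expression, however, you need re.sub() from the re module; replace does not accept regular expressions.

*Also, replace returns the modified string; the original string is left unchanged.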
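To illustrate both notes, here is a short Python 3 sketch. It differs from the class above in two ways that should be read as assumptions: it uses the standard library's `html.unescape` instead of one pattern per entity, and it turns `<BR>` into `\n` rather than `\r`. The input fragment is invented for the example.

```python
import html
import re

BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)  # line breaks in the page
TAG_RE = re.compile(r"</?font[^>]*>", re.IGNORECASE)  # leftover font tags


def clean_source(fragment):
    """Turn an HTML-escaped code fragment back into plain source text."""
    text = BR_RE.sub("\n", fragment)   # re.sub: pattern-based replacement
    text = TAG_RE.sub("", text)
    text = html.unescape(text)         # decodes &lt; &gt; &quot; &amp; &nbsp; ...
    return text.replace("\xa0", " ")   # str.replace: literal old -> new


# Hypothetical fragment, invented for this example.
raw = '<font face="Times New Roman">if&nbsp;(a&lt;b&amp;&amp;c&gt;0)<BR>puts(&quot;ok&quot;);</font>'
print(clean_source(raw))

original = "a<b"
changed = original.replace("<", "&lt;")  # returns a new string
print(original, changed)
```

Tags are stripped before entities are decoded so that a literal `<font>` written inside the submitted code, which arrives as `&lt;font&gt;`, is not removed by mistake.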

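F. Saving to a file

First we need the problem's Chinese title to use in the file name. It does not appear on the code page itself, so we have to look it up on the problem page. Once there, grabbing the content of <title> gives the name to use.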

    tname = re.findall('title\>.*?\</title', p, re.S)
    f = open(tname[0][6:len(tname[0])-7] + '_' + sname[8:len(sname)-8] + '.txt', 'w+')
    f.write(self.mytool.replace_char(mytem[0]))
    f.close()
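A page title can contain characters that are not allowed in file names, which the snippet above does not guard against. The following Python 3 sketch shows one possible way to handle that. The names `backup_name` and `save_code`, the set of replaced characters, and the sample page are all assumptions made for this example.

```python
import os
import re
import tempfile

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)
UNSAFE_RE = re.compile(r'[\\/:*?"<>|\s]+')  # characters unsuitable for file names


def backup_name(problem_page, problem_id):
    """Build '<title>_<id>.txt' from the problem page's <title> element."""
    match = TITLE_RE.search(problem_page)
    title = match.group(1).strip() if match else "untitled"
    return UNSAFE_RE.sub("_", title) + "_" + problem_id + ".txt"


def save_code(directory, name, code):
    """Write the code as UTF-8; the with-block closes the file even on error."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(code)
    return path


# Hypothetical page and id, invented for this example.
page = "<html><head><title>A+B Problem</title></head></html>"
name = backup_name(page, "1001")
with tempfile.TemporaryDirectory() as tmp:
    saved = save_code(tmp, name, "int main() { return 0; }\n")
    with open(saved, encoding="utf-8") as f:
        round_trip = f.read()
print(name)
```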

Final program

    # -*- coding: cp936 -*-
    # copyright by B08020129
    # Python 2 script (urllib2, cookielib, raw_input, print statement)
    import urllib2
    import urllib
    import re
    import cookielib

    # install an opener that keeps cookies between requests
    cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
    opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)


    class Tool:
        # Entity patterns written in their escaped form; "&#39;" for the
        # apostrophe is an assumption based on its replacement below.
        A = re.compile("&nbsp\;")
        B = re.compile("\<BR\>")
        C = re.compile("&lt\;")
        D = re.compile("&gt\;")
        E = re.compile("&quot\;")
        F = re.compile("&amp;")
        G = re.compile("Times\ New\ Roman\"\>")
        H = re.compile("\</font\>")
        I = re.compile("&#39;")
        # "語言" ("language") is literal page text and must match the site's own characters
        J = re.compile("語言:(.*)?face=\"", re.DOTALL)

        def replace_char(self, x):
            x = self.A.sub(" ", x)
            x = self.B.sub("\r", x)
            x = self.C.sub("<", x)
            x = self.D.sub(">", x)
            x = self.E.sub("\"", x)
            x = self.F.sub("&", x)
            x = self.G.sub("", x)
            x = self.H.sub("", x)
            x = self.I.sub("\'", x)
            x = self.J.sub("", x)
            return x


    class HTML_Model:
        def __init__(self, u, p):
            self.userName = u
            self.passWord = p
            self.mytool = Tool()
            self.page = 1
            self.postdata = urllib.urlencode({
                'userName': self.userName,
                'password': self.passWord})

        def GetPage(self):
            # log in first so the installed opener stores the session cookie
            myUrl = "http://acm.njupt.edu.cn/acmhome/login.do"
            req = urllib2.Request(
                url=myUrl,
                data=self.postdata
            )
            myResponse = urllib2.urlopen(req)
            myPage = myResponse.read()
            flag = True
            while flag:
                myUrl = "http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName=" + self.userName + "&result=1&language=&page=" + str(self.page)
                # print(myUrl)
                myResponse = urllib2.urlopen(myUrl)
                myPage = myResponse.read()
                # continue only while the status page still contains a G++ link
                st = "\<a\ href\=.*?G\+\+"
                flag = re.search(st, myPage) is not None
                myItem = re.findall('<a href=\"/acmhome/solutionCode\.do\?id\=.*?\"\ ', myPage, re.S)
                for item in myItem:
                    # print(item)
                    url = 'http://acm.njupt.edu.cn/acmhome/solutionCode.do?id=' + item[37:len(item)-2]
                    # print(url)
                    myResponse = urllib2.urlopen(url)
                    myPage = myResponse.read()
                    mytem = re.findall('語言.*?</font>.*?Times New Roman\"\>.*?\</font\>', myPage, re.S)
                    # print(mytem)
                    sName = re.findall('源碼--.*?</strong', myPage, re.S)
                    for sname in sName:
                        # print(sname[2:len(sname)-8])
                        name = "http://acm.njupt.edu.cn/acmhome/problemdetail.do?&method=showdetail&id=" + sname[8:len(sname)-8]
                        # print(name)
                        p = urllib2.urlopen(name).read()
                        # print(p)
                        tname = re.findall('title\>.*?\</title', p, re.S)
                        print(tname[0][6:len(tname[0])-7] + '_' + sname[8:len(sname)-8])
                        f = open(tname[0][6:len(tname[0])-7] + '_' + sname[8:len(sname)-8] + '.txt', 'w+')
                        f.write(self.mytool.replace_char(mytem[0]))
                        f.close()
                        print('done!')
                self.page = self.page + 1


    print u'plz input the name'
    u = raw_input()
    print u'plz input password'
    p = raw_input()
    myModel = HTML_Model(u, p)
    myModel.GetPage()
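The paging loop in the program can be isolated and exercised without the network. The Python 3 sketch below takes the fetch function as a parameter so that a fake can stand in for `urlopen`. Two things differ from the program above and are assumptions of this sketch: it stops at the first page with no solution link at all (rather than no G++ link), and it adds a `max_pages` safety limit. The fake pages are invented for the example.

```python
import urllib.parse

STATUS_URL = "http://acm.njupt.edu.cn/acmhome/showstatus.do"
LINK_MARK = "/acmhome/solutionCode.do?id="


def status_url(user_name, page):
    """Build the status-page URL with the same query fields as the program above."""
    query = urllib.parse.urlencode({
        "problemId": "null", "contestId": "null", "userName": user_name,
        "result": 1, "language": "", "page": page,
    })
    return STATUS_URL + "?" + query


def iter_status_pages(fetch, user_name, max_pages=1000):
    """Yield status pages in order, stopping at the first one without solution links."""
    for page in range(1, max_pages + 1):
        html_text = fetch(status_url(user_name, page))
        if LINK_MARK not in html_text:
            break
        yield html_text


# A fake fetcher standing in for the network; its pages are invented for this example.
fake_site = {1: '<a href="/acmhome/solutionCode.do?id=a1" >G++</a>',
             2: '<a href="/acmhome/solutionCode.do?id=b2" >GCC</a>'}


def fake_fetch(url):
    page = int(urllib.parse.parse_qs(urllib.parse.urlparse(url).query)["page"][0])
    return fake_site.get(page, "<html>no more rows</html>")


pages = list(iter_status_pages(fake_fetch, "alice"))
print(len(pages))
```

Building the query with `urlencode` also escapes a user name that contains special characters, which plain string concatenation does not.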

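Results

Running the program produces one text file per solution, and each file contains the code in readable form.

The next step is to find a better way to register and submit all of the code with one click.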
