[Python] NJUPT OJ code backup crawler

Source: Internet
Author: User
Tags: urlencode, Chrome developer tools

I once read a piece of advice about learning Python: learn through projects.

My own take: most people who pick up Python already have a foundation in some other language, so they have a rough grasp of basic programming logic and syntax. A scripting language like Python is not especially unusual grammatically, and with that background you can start using it directly.

Earlier I had tried working through a concise Python tutorial and made no progress in half a day. Then I came across this Python crawler project and just started coding, which was quick and convenient.

Site: http://acm.njupt.edu.cn/welcome.do?method=index. Because the system is being updated, I wrote a crawler to back up my submitted code.

Python libraries used

The urllib library:

This library wraps communication with a web server through a URL, including HTTP requests and responses. For now it is basically all we need.

The re library:

That is, regular expressions, Python's built-in pattern-matching library, used here to extract information from HTML documents.

Basic framework

First, submit a request to the target server over HTTP, then accept the response.

Then extract what we need from the response: the login cookie and the code. Finally, create local files and write the code into them.
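In modern Python 3 terms (urllib2 was folded into urllib.request), this framework can be sketched as three small functions; the regex and link format are the ones used later in this article, and the function names are my own:

```python
import re
import urllib.request

def fetch(url):
    # request -> response -> page text
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='replace')

def extract_code_links(page_html):
    # pull the per-solution links out of the answer with a regex
    return re.findall(r'/acmhome/solutionCode\.do\?id=\d+', page_html)

def save(filename, text):
    # create a local file and put the code in it
    with open(filename, 'w') as f:
        f.write(text)
```

This is only a sketch of the flow; the rest of the article fills in login, cookies, and text cleanup.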

Detailed steps

A. HTTP request

Python makes this a fast-track affair; the whole exchange takes just a few lines:

```python
myurl = "http://acm.njupt.edu.cn/acmhome/index"  # target page
req = urllib2.Request(myurl)           # build a Request object from the URL
myResponse = urllib2.urlopen(req)      # send the request via urlopen() and get the response
myPage = myResponse.read()             # read the page content from the response
```
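The same lines on Python 3, where urllib2 became urllib.request (the actual fetch is commented out so the sketch needs no network access):

```python
import urllib.request

myurl = "http://acm.njupt.edu.cn/acmhome/index"   # target page
req = urllib.request.Request(myurl)               # build a Request object from the URL
# myResponse = urllib.request.urlopen(req)        # send the request, get the response
# myPage = myResponse.read()                      # read the page bytes
```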

B. Obtaining login permission

Analyzing the page, we see that backing up the code requires submitting a username and password, and that the form is submitted not to the home page but to the login page. Per basic HTTP knowledge, a POST request carrying the form data is required. Inspecting the request with the Chrome developer tools shows the form fields that get submitted, as below.



This is also easy to implement in Python:

```python
myurl = "http://acm.njupt.edu.cn/acmhome/login.do"  # URL changed to the login page
self.postdata = urllib.urlencode({
    'userName': self.username,
    'password': self.password
})
# The {} literal is Python's dictionary structure, holding userName and password.
# urlencode() encodes the dictionary into form-encoded request data.
req = urllib2.Request(
    url=myurl,
    data=self.postdata
)  # give the Request both the URL and the encoded data
myResponse = urllib2.urlopen(req)
myPage = myResponse.read()
```
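On Python 3 the encoder lives in urllib.parse and the body must be bytes; note that merely supplying `data` switches the request to POST. The credentials below are placeholders:

```python
import urllib.parse
import urllib.request

postdata = urllib.parse.urlencode({
    'userName': 'b08020129',   # placeholder username
    'password': 'secret',      # placeholder password
}).encode('ascii')             # Python 3 requires a bytes body

req = urllib.request.Request(
    url="http://acm.njupt.edu.cn/acmhome/login.do",
    data=postdata,             # the presence of data makes this a POST
)
```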

C. Handling cookies

There was one thing I had not considered before: visiting other pages of the site after logging in requires the login cookie.

By default, accesses in Python do not appear to retain cookies, so we have to set up HTTP communication that preserves them.

First, a few concepts in Python's urllib2:

Opener: the object used for communication. urllib.urlopen() in the earlier code uses the system default opener, equivalent to default_opener.urlopen().

Handler: an opener contains multiple handlers, which deal with the various sub-problems of a communication, including cookies.

So, the way to handle cookies is to build a new opener and give it a handler that manages cookies.

```python
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())  # build a cookie handler
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)   # build an opener from the cookie handler
urllib2.install_opener(opener)                                       # install it as the default opener
```
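The Python 3 spelling of the same three lines (cookielib is now http.cookiejar):

```python
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(jar)   # new cookie handler
opener = urllib.request.build_opener(cookie_support)       # opener built from the handler
urllib.request.install_opener(opener)                      # make it the default opener
```

After install_opener(), every subsequent urlopen() call shares the same cookie jar, which is exactly what the login session needs.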

With this in place, we can make logged-in requests.

D. Navigating to the code pages

We start from the home page and find our way to the code pages. It would have been nice to fetch each code page by its URL directly, but the code-page URL embeds the time and login information in an unknown encoding that would have to be escaped correctly. The workaround is to obtain the URL from a page we already have instead of constructing it by hand.

After analyzing the pages, a code page is reached as follows:

Home → user info → accepted code → the hyperlink in the "G++|GCC|JAVA" field, as below.

So, parse the fetched HTML and extract the hyperlinks:

```python
myItem = re.findall('<a href="/acmhome/solutionCode\.do\?id=.*?"', myPage, re.S)
for item in myItem:
    url = 'http://acm.njupt.edu.cn/acmhome/solutionCode.do?id=' + item[37:len(item) - 2]
```
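The slicing arithmetic can be checked on a sample anchor tag: the fixed prefix `<a href="/acmhome/solutionCode.do?id=` is exactly 37 characters long, which is where the magic number 37 comes from. The id below is made up:

```python
import re

sample = '<a href="/acmhome/solutionCode.do?id=12345"'   # hypothetical matched anchor
prefix = '<a href="/acmhome/solutionCode.do?id='
items = re.findall(r'<a href="/acmhome/solutionCode\.do\?id=.*?"', sample, re.S)
item = items[0]
# slice off the 37-character prefix and the trailing quote to isolate the id
solution_id = item[len(prefix):-1]
url = 'http://acm.njupt.edu.cn/acmhome/solutionCode.do?id=' + solution_id
```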

E. Extracting the text

As seen above, the site stores the code as XML escaped into HTML entities. So we replace the escaped markup in the text to recover normal source code:

```python
class Tool:
    A = re.compile("&nbsp\;")
    B = re.compile("\<br\>")
    C = re.compile("&lt\;")
    D = re.compile("&gt\;")
    E = re.compile("&quot\;")
    F = re.compile("&amp\;")
    G = re.compile("Times\ New\ Roman\"\>")
    H = re.compile("\</font\>")
    I = re.compile("&#39\;")
    J = re.compile("Language:(.*)?face=\"", re.DOTALL)

    def replace_char(self, x):
        x = self.A.sub("", x)     # drop non-breaking spaces
        x = self.B.sub("\r", x)   # <br> -> line break
        x = self.C.sub("<", x)
        x = self.D.sub(">", x)
        x = self.E.sub("\"", x)
        x = self.F.sub("&", x)
        x = self.G.sub("", x)     # strip the site's font markup
        x = self.H.sub("", x)
        x = self.I.sub("'", x)
        x = self.J.sub("", x)     # strip the leading "Language: ..." header
        return x
```
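For what it's worth, on Python 3 the standard library's html.unescape() covers most of this hand-rolled entity table in one call (a sketch; the Tool class above additionally strips the site-specific font markup, which unescape() does not touch):

```python
import html

escaped = '&lt;iostream&gt; &quot;hello&quot; &amp;&amp; &#39;x&#39;'
text = html.unescape(escaped)   # decodes &lt; &gt; &quot; &amp; &#39; and many more
# <br> is a tag, not an entity, so it still needs its own replacement
text = text.replace('<br>', '\n')
```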

* Note that Python strings have a substitution method, str.replace(old, new). But if you want to substitute by pattern, you need the re.sub() function from the re module; str.replace cannot handle regular expressions.

* Also note that str.replace returns the replaced string; the original string is not changed at all.
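A small self-contained illustration of both notes (the variable names are mine):

```python
import re

s = 'a&lt;b'
t = s.replace('&lt;', '<')        # literal substitution; returns a NEW string
u = re.sub(r'&[a-z]+;', '?', s)   # pattern substitution needs re.sub
# s itself is untouched by both calls
```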

F. Writing the files

First we need the Chinese title of the problem to use as the file name. It cannot be seen on the code page itself, only on the problem's page, so we fetch that page, find the <title> field, and use it in the file name:

```python
tName = re.findall('title\>.*?\</title', p, re.S)
f = open(tName[0][6:len(tName[0]) - 7] + '_' + sname[8:len(sname) - 8] + '.txt', 'w+')
f.write(self.mytool.replace_char(myTem[0]))
f.close()
```
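The title slicing can be verified in isolation; here is a sketch with a made-up problem page and submission id:

```python
import re

page = '<html><head><title>1001_A+B Problem</title></head></html>'  # hypothetical problem page
tName = re.findall('title\>.*?\</title', page, re.S)
# tName[0] is 'title>1001_A+B Problem</title': drop the 6-char 'title>'
# prefix and the 7-char '</title' suffix to keep just the title text
title = tName[0][6:len(tName[0]) - 7]
filename = title + '_' + '54321' + '.txt'   # '54321' stands in for the submission id
```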
Finally, the full program:

```python
# -*- coding: cp936 -*-
# copyright by B08020129
import urllib2
import urllib
import re
import thread
import time
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)


class Tool:
    A = re.compile("&nbsp\;")
    B = re.compile("\<br\>")
    C = re.compile("&lt\;")
    D = re.compile("&gt\;")
    E = re.compile("&quot\;")
    F = re.compile("&amp\;")
    G = re.compile("Times\ New\ Roman\"\>")
    H = re.compile("\</font\>")
    I = re.compile("&#39\;")
    J = re.compile("Language:(.*)?face=\"", re.DOTALL)

    def replace_char(self, x):
        x = self.A.sub("", x)
        x = self.B.sub("\r", x)
        x = self.C.sub("<", x)
        x = self.D.sub(">", x)
        x = self.E.sub("\"", x)
        x = self.F.sub("&", x)
        x = self.G.sub("", x)
        x = self.H.sub("", x)
        x = self.I.sub("'", x)
        x = self.J.sub("", x)
        return x


class HTML_Model:
    def __init__(self, u, p):
        self.username = u
        self.password = p
        self.mytool = Tool()
        self.page = 1
        self.postdata = urllib.urlencode({
            'userName': self.username,
            'password': self.password})

    def GetPage(self):
        # log in first so the default opener's cookie jar holds the session
        myurl = "http://acm.njupt.edu.cn/acmhome/login.do"
        req = urllib2.Request(url=myurl, data=self.postdata)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        flag = True
        while flag:
            # page through the list of accepted submissions
            myurl = "http://acm.njupt.edu.cn/acmhome/showstatus.do?problemId=null&contestId=null&userName=" \
                    + self.username + "&result=1&language=&page=" + str(self.page)
            myResponse = urllib2.urlopen(myurl)
            myPage = myResponse.read()
            st = "\<a\ href\=.*?g\+\+"
            next = re.search(st, myPage)
            if next:
                flag = True
            else:
                flag = False
            myItem = re.findall('<a href="/acmhome/solutionCode\.do\?id=.*?"', myPage, re.S)
            for item in myItem:
                url = 'http://acm.njupt.edu.cn/acmhome/solutionCode.do?id=' + item[37:len(item) - 2]
                myResponse = urllib2.urlopen(url)
                myPage = myResponse.read()
                myTem = re.findall('Language.*?</font>.*?Times New Roman\"\>.*?\</font\>', myPage, re.S)
                sName = re.findall('Source--.*?</strong', myPage, re.S)
                for sname in sName:
                    # fetch the problem page to use its title as the file name
                    name = "http://acm.njupt.edu.cn/acmhome/problemdetail.do?&method=showdetail&id=" \
                           + sname[8:len(sname) - 8]
                    p = urllib2.urlopen(name).read()
                    tName = re.findall('title\>.*?\</title', p, re.S)
                    print(tName[0][6:len(tName[0]) - 7] + '_' + sname[8:len(sname) - 8])
                    f = open(tName[0][6:len(tName[0]) - 7] + '_' + sname[8:len(sname) - 8] + '.txt', 'w+')
                    f.write(self.mytool.replace_char(myTem[0]))
                    f.close()
                    print('done!')
            self.page = self.page + 1

print u'plz input the name'
u = raw_input()
print u'plz input password'
p = raw_input()
myModel = HTML_Model(u, p)
myModel.GetPage()
```

The resulting files:


And the normal code in the file:


The next step is to try one-click batch submission, resubmitting all of this code in a better way.


