In Leetcode wrote 105 high-profile film branch, consider relocating to their own GitHub, to make a problem-solving question Bank, the interview can also show a
But! But!
Leetcode Online IDE function not too comfortable, I directly on the line a a lot of problems, local no code, unless there is a problem debugging a half-day a does not come, local to save code
So I thought, just use Python to crawl the AC code on the Leetcode, and then throw it into the local GitHub folder, then a synchronous Dafa
About the knowledge involved:
0. Cookies
1, the structure of the website analysis
2. Script Login
3. Script Crawl
--------------------------------------------------------------------------------------------------------------- -------------------------------
First, automatic login
Python's Cookielib + urllib2 + urllib, and then leetcode this site has a Django bird code, when visiting the homepage will be sent as a cookie, and on the login page need to submit this code at the same time, pay attention to first visit the homepage, Extract this code and then visit the login page and submit it together.
Then there is to modify the header, I changed the Referer, has been 403,wtf.
Code:
Import urllib2
Import cookielib
Import urllib
Mydir = R ' C:/users/user/documents/github/leetcode /'
Myhost = R ' https://oj.leetcode.com '
cookie = cookielib. Cookiejar ()
handler = Urllib2. Httpcookieprocessor (cookie)
Urlopener = Urllib2.build_opener (handler)
Urlopener.open (' https:// oj.leetcode.com/')
Csrftoken = ""
for CK in cookie:
Csrftoken = ck.value
Login = "SHADOWMYDX"
MyPwd = "**********" # password
values = {' Csrfmiddlewaretoken ': csrftoken, ' login ': login, ' password ': mypwd, ' Remember ': ' on '}
values = Urllib.urlencode (values)
headers = {' user-agent ': ' mozilla/5.0 (Windows; U Windows NT 6.1; En-us; rv:1.9.1.6) gecko/20091201 firefox/3.5.6 ', \
' Origin ': ' https://oj.leetcode.com ', ' Referer ': ' https:// Oj.leetcode.com/accounts/login/'}
request = Urllib2. Request ("https://oj.leetcode.com/accounts/login/", values,headers=headers)
URL = urlopener.open (request)
page = Url.read ()
Second, climb the station
Cut into several sub-problems. First, find the title of the AC address, second, find the code address of the AC, and finally, crawl the AC code into the local GitHub project folder.
Since Leetcode IDE is the dynamic page of JS implementation, it is not possible to use Firebug to examine elements directly, but to capture the AC code from the JS code sent over. This means that a dictionary is required to convert special characters
def savecode (Code,title):
Global Mydir
f = Open (Mydir + title + '. cpp ', ' W ')
F.write (Code)
def downloadcode (Refer,codeadd,title):
Global headers
Global Urlopener
Global Myhost
headers[' Referer '] = refer
Request = Urllib2. Request (Codeadd,headers=headers)
url = urlopener.open (request)
all = Url.read ()
tar = "storage.put (' cpp ',"
index = All.find (tar,0)
Start = All.find (' class solution ', index)
Finis = All.find ("');", start)
Code = All[start:finis]
Tocpp = {' \u000d ': ' \ n ', ' \u000a ': ', ' \u003b ': '; ', ' \u003c ': ' < ', ' \u003e ': ' > ', ' \u003d ': ' = ', \
' \u0026 ': ' & ', ' \u002d ': '-', ' \u0022 ': ' ' ', ' \u0009 ': ' \ t ', ' \u0027 ': ' ', ' \u005c ': ' \ \ '}
For key in Tocpp.keys ():
Code = Code.replace (Key,tocpp[key])
Savecode (Code,title)
def findcode (Address,title):
Global headers
Global Urlopener
Global Myhost
headers[' Referer ' = Address
Address + = ' submissions/'
print ' now is dealing ' + address + ': ' + title
Request = Urllib2. Request (Address,headers=headers)
url = urlopener.open (request)
all = Url.read ()
tar = ' class= ' text-danger status-accepted "'
index = All.find (tar,0)
Start = All.find (' href= "', index)
Finis = All.find (' > ', start)
Downloadcode (address,myhost + all[start + 6:finis],title)
def findadd (page):
index = 0
While 1:
index = page.find (' class= ' AC ', index)
If Index! =-1:
Index + = 1
Start = Page.find (' <td><a href= "', index)
Finis = Page.find (' > ', start)
Tmpfin = Page.find (' < ', finis)
title = Page[finis + 2:tmpfin]
Findcode (myhost + page[start + 13:finis],title)
Else
Break
Finally, call Findadd (page), done
PostScript: The first idea is to do a multi-threaded version, and then think about the first to realize the function, or else add a rotten tail toys.
"Original" uses Python to crawl Leetcode's AC code to GitHub