Goal: crawl a website's match schedule. The page is rendered dynamically, so you need to find the corresponding Ajax request (for specific reference: 53399949).
# -*- coding: utf-8 -*-
import re
import urllib.request

link = "https://***"
r = urllib.request.Request(link)
r.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36')
html = urllib.request.urlopen(r, timeout=500).read()
html = bytes.decode(html, encoding="gbk")

# The response is a large JSON blob; pull out every string the regex matches
js = re.findall('"n":"(.*?)"', html)
i = 0
# Loop and print the match information
try:
    while 1:
        # Decode the \uXXXX escapes to Chinese and print
        print(js[i].encode('utf-8').decode('unicode_escape'),
              js[i+1].encode('utf-8').decode('unicode_escape'),
              "VS",
              js[i+2].encode('utf-8').decode('unicode_escape'))
        i = i + 3
# When the whole schedule has been printed, an "IndexError: list index out of range"
# is raised, so use exception handling as the loop's exit condition
except IndexError:
    print("finished")
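Since the Ajax response is itself JSON, a regex is somewhat fragile; as an alternative, here is a sketch of the same extraction using the standard json module. The payload string and its "matches" key are hypothetical stand-ins — the real response's structure must be inspected in the browser's developer tools.

```python
import json

# Hypothetical payload imitating the Ajax response; the real endpoint's
# structure (keys, nesting) will differ.
payload = '{"matches": [{"n": "\\u4e3b\\u961f"}, {"n": "2018-03-01"}, {"n": "\\u5ba2\\u961f"}]}'

data = json.loads(payload)           # json.loads decodes the \uXXXX escapes for us
names = [item["n"] for item in data["matches"]]

# Same grouping as the regex version: three "n" fields per match
for a, b, c in zip(names[0::3], names[1::3], names[2::3]):
    print(a, b, "VS", c)
```

Note that json.loads turns the Unicode escapes into Chinese automatically, so no encode/decode round trip is needed.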
A summary of the points to note:
1. Python 3 uses import urllib.request, because Python 2's urllib and urllib2 were merged into the urllib package in Python 3.
2. Converting Unicode escape sequences to Chinese is written differently in Python 3 and Python 2:
python3: print(string.encode('utf-8').decode('unicode_escape'))
python2: print string.decode('unicode_escape')
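A minimal Python 3 illustration: the JSON text arrives with literal backslash-u escapes, and the encode/decode round trip turns them into readable Chinese.

```python
# The raw text contains literal \uXXXX sequences (12 ASCII characters here)
raw = '\\u4e2d\\u56fd'

# Encode to bytes, then let the unicode_escape codec interpret the escapes
decoded = raw.encode('utf-8').decode('unicode_escape')
print(decoded)  # 中国
```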
3. re.findall()
For the rules governing this function's output, see my earlier article: http://www.cnblogs.com/4wheel/p/8497121.html
The expression "n":"(.*?)" outputs only the (.*?) part, because findall returns just the captured group (again, see the earlier article). The added question mark makes the match non-greedy rather than greedy.
While we are at it, a note on greedy versus non-greedy matching.
Summary: non-greedy mode matches as little as possible while still satisfying the regular expression; greedy mode, by contrast, matches as much as possible.
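The difference is easy to see on a tiny made-up string:

```python
import re

s = '"n":"TeamA","x":"1"'

# Greedy: .* runs as far as it can, swallowing everything up to the last quote
print(re.findall('"n":"(.*)"', s))   # ['TeamA","x":"1']

# Non-greedy: .*? stops at the first closing quote it reaches
print(re.findall('"n":"(.*?)"', s))  # ['TeamA']
```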
Python Crawl soccer Match schedule notes