Sina News loads its content dynamically via AJAX. By capturing the network traffic, the following rule emerges:
Each time the next page is requested, a new URL appears among the JS requests:
"Http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
"|| =gatxw| | =zs-pl| | =mtjj&level==1| | =2&show_ext=1&show_all=1&show_num=22&tag=1& "
"format=json&page=1&callback=newsloadercallback"
"Http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
"|| =gatxw| | =zs-pl| | =mtjj&level==1| | =2&show_ext=1&show_all=1&show_num=22&tag=1& "
"format=json&page=3&callback=newsloadercallback"
The only difference between the two requests is the value of the page= parameter.
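Before writing the full crawler, the rule is easy to verify by hand. The sketch below is not part of the original post: it assumes the third-party requests library (pip install requests), fetches one page of the endpoint, unwraps the JSONP response, and counts the articles returned.

# Hedged sketch (assumes `requests` is installed); fetches one page of the
# endpoint above and unwraps the JSONP response newsloadercallback({...});
import json
import re
import requests

API = ("http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
       "||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&"
       "format=json&page={}&callback=newsloadercallback")

resp = requests.get(API.format(1), timeout=10)
body = re.search(r'\((.*)\)', resp.text, re.S).group(1)  # text inside callback(...)
body = body.replace("'", '"')  # the payload may use single quotes; normalize for json.loads
data = json.loads(body)
print(len(data["result"]["data"]), "article URLs on page 1")

Changing page=1 to page=2, page=3, and so on walks backwards through older news.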
Python implementation code:
# -*- coding: utf-8 -*-
__author__ = 'Administrator'

import re
from bs4 import BeautifulSoup
import urllib.request
import jieba
import string
import urllib.parse
from urllib.error import HTTPError, URLError
import json


def get_page(num):
    # Build the rolling-news API URL for page `num`.
    return ("http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
            "||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&"
            "format=json&page={}&callback=newsloadercallback").format(str(num))


def get_url(page_url):
    # Percent-encode Chinese or other special characters in the request path.
    page_url = urllib.parse.quote(page_url, safe=string.printable)
    # print(page_url)
    url_list = []
    try:
        res = urllib.request.urlopen(page_url)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code:', e.code)
        return url_list
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
        return url_list
    else:
        if res.getcode() == 200:
            jsdata = res.read().decode("utf-8")
            """
            Method one: extract the article URLs with a regex
            """
            # result = re.findall('"url":"http.*?\.s?html"', jsdata)  # the ? after .* makes it non-greedy
            # for url in result:
            #     url = url.split(":", maxsplit=1)[1]
            #     url = url.replace('\\', "")
            #     url_list.append(url)
            """
            Method two: strip the JSONP wrapper and parse the payload as JSON
            """
            data = jsdata[21:-2]  # drop the callback prefix and the trailing ");"
            data = re.sub('\'', '\"', data)   # single -> double quotes so json.loads accepts it
            data = re.sub(r"\\u", "", data)   # drop stray \u sequences that would break json.loads
            jsondata = json.loads(data)
            for dat in jsondata["result"]["data"]:
                url_list.append(dat["url"])
        return url_list


def get_context(new_url):
    # Percent-encode Chinese or other special characters in the request path.
    httpurl = urllib.parse.quote(new_url, safe=string.printable)
    # print(httpurl)
    # print(type(httpurl))
    try:
        html = urllib.request.urlopen(httpurl)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code:', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
    else:
        if html.getcode() == 200:
            res = html.read().decode("utf-8")
            # print(res)
            soup = BeautifulSoup(res, 'html.parser')
            # print(soup.prettify())
            result = {}
            # The article body lives in <div id="artibody">; the last <p> is boilerplate.
            result["article"] = ''.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
            context = result['article']
            # Split on Chinese and English punctuation to get one clause per line.
            pattern = r'，|。|“|”|？|！|：|《|》|、|；|·|——|‘|’|,|\?|\.|\!|`|~|\@|\#|\$|%|\^|\&|\*|（|）|\(|\)|-|\_|\+|=|\[|\]|\{|\}|"|\'|\<|\>|\|'
            li = re.split(pattern, context)
            # print(li)
            with open(r".\traindata.txt", 'a', encoding='utf-8') as file:
                for l in li:
                    if l != "":
                        sentence = " ".join(jieba.cut(l))
                        file.write(sentence + '\n')


if __name__ == "__main__":
    for i in range(1, 1001):
        print("Page %d" % i)
        page_url = get_page(i)
        url_list = get_url(page_url)
        # print(url_list)  # method one yields items like '"http://news.sina.com.cn/c/nd/2017-06-11/doc-ifyfzhac1171724.shtml"'
        #                  # (double quotes wrapped in an extra layer of single quotes)
        if url_list:
            for url in url_list:
                # print(eval(url))
                # print(type(url))
                # get_context(eval(url))  # for URLs extracted by method one
                get_context(url)          # for URLs extracted by method two
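One fragile spot worth calling out: jsdata[21:-2] hard-codes the length of the callback prefix, so it breaks if Sina ever changes the callback name. A regex-based unwrapping, sketched below, avoids that; this helper is an addition of mine, not part of the original script.

# Hypothetical helper, not in the original post: strip a JSONP wrapper of the
# form callback({...}); without assuming the callback name's exact length.
import re

def strip_jsonp(text):
    match = re.search(r'\(\s*(\{.*\})\s*\)\s*;?\s*$', text, re.S)
    return match.group(1) if match else text

With it, data = jsdata[21:-2] would become data = strip_jsonp(jsdata); the quote and \u fixups that follow stay the same.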
Python crawler: crawl Sina News content (from the present back over a chosen time span) and segment it with jieba, producing training data for your own word segmentation model.
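For reference, traindata.txt ends up with one clause per line and tokens separated by spaces, a plain-text format that many segmentation and embedding trainers expect. A minimal illustration (the sample sentence is the standard jieba demo, not text from the crawl):

import jieba

# One clause in, space-joined tokens out -- the same transformation the
# crawler applies to every clause before appending it to traindata.txt.
print(" ".join(jieba.cut("我来到北京清华大学")))
# -> 我 来到 北京 清华大学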