Python crawler: crawl Sina News content (from the current time back over a chosen period), segment it with Jieba, and use the result to train your own word segmentation model

Source: Internet
Author: User

Sina News loads its article lists dynamically via AJAX. By capturing the network packets, the following rule emerges:

Each time the next page is requested, a new URL appears in the browser's JS/XHR panel:

"Http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
"|| =gatxw| | =zs-pl| | =mtjj&level==1| | =2&show_ext=1&show_all=1&show_num=22&tag=1& "
"format=json&page=1&callback=newsloadercallback"
"Http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
"|| =gatxw| | =zs-pl| | =mtjj&level==1| | =2&show_ext=1&show_all=1&show_num=22&tag=1& "
"format=json&page=3&callback=newsloadercallback"

The only difference between the two requests is the page= parameter.
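Since only page changes and the payload is JSONP (the JSON wrapped in a newsloadercallback(...) call), the rule is easy to verify by hand. A minimal sketch, assuming the requests library is installed and that the result/data/url structure matches what the full script below parses:

import json
import requests  # assumed available; urllib.request works equally well

API = ("http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
       "||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&"
       "format=json&page={}&callback=newsloadercallback")

text = requests.get(API.format(1)).text
# The body is JSONP, i.e. newsloadercallback({...}); — cut between the outermost parentheses
payload = text[text.index('(') + 1 : text.rindex(')')]
for item in json.loads(payload)['result']['data']:
    print(item['url'])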

Python implementation code:
# -*- coding: utf-8 -*-
__author__ = 'Administrator'

import re
import json
import string
import urllib.request
import urllib.parse
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup
import jieba


def get_page(num):
    return ("http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
            "||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&"
            "format=json&page={}&callback=newsloadercallback").format(str(num))


def get_url(page_url):
    # Percent-encode Chinese or other special characters in the request path
    page_url = urllib.parse.quote(page_url, safe=string.printable)
    url_list = []
    try:
        res = urllib.request.urlopen(page_url)
    except HTTPError as e:
        print("The server couldn't fulfill the request.")
        print('Error code:', e.code)
        return url_list
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
        return url_list
    else:
        if res.getcode() == 200:
            jsdata = res.read().decode('utf-8')

            # Method one: extract URLs with a regex
            # (the ? after .* makes the match non-greedy)
            # result = re.findall(r'"url":"http.*?\.s?html"', jsdata)
            # for url in result:
            #     url = url.split(':', maxsplit=1)[1]
            #     url = url.replace('\\', '')
            #     url_list.append(url)

            # Method two: strip the JSONP wrapper and parse the JSON
            data = jsdata[21:-2]             # drop "newsloadercallback(" and the trailing ");"
            data = re.sub('\'', '\"', data)  # JSON requires double quotes
            data = re.sub(r'\\u', '', data)  # drop stray unicode-escape prefixes
            jsondata = json.loads(data)
            for dat in jsondata['result']['data']:
                url_list.append(dat['url'])
        return url_list


def get_context(new_url):
    # Percent-encode Chinese or other special characters in the request path
    httpurl = urllib.parse.quote(new_url, safe=string.printable)
    try:
        html = urllib.request.urlopen(httpurl)
    except HTTPError as e:
        print("The server couldn't fulfill the request.")
        print('Error code:', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
    else:
        if html.getcode() == 200:
            res = html.read().decode('utf-8')
            soup = BeautifulSoup(res, 'html.parser')
            result = {}
            # Join the text of every <p> in the article body (#artibody), skipping the last one
            result['article'] = ''.join(p.text.strip() for p in soup.select('#artibody p')[:-1])
            context = result['article']
            # Split the article into sentences on Chinese and ASCII punctuation
            pattern = (r',|。|“|”|?|!|:|、|;|……|——|‘|’|,|\?|\.|\!|`|~|\@|\#|\$|%|\^|\&|\*'
                       r'|(|)|\(|\)|-|\_|\+|=|\[|\]|\{|\}|"|\'|\<|\>|\|')
            li = re.split(pattern, context)
            with open(r'.\traindata.txt', 'a', encoding='utf-8') as file:
                for l in li:
                    if l != '':
                        sentence = ' '.join(jieba.cut(l))
                        file.write(sentence + '\n')


if __name__ == '__main__':
    for i in range(1, 1001):
        print('page %d' % i)
        page_url = get_page(i)
        url_list = get_url(page_url)
        # print(url_list)  # method one yields items like
        # '"http://news.sina.com.cn/c/nd/2017-06-11/doc-ifyfzhac1171724.shtml"',
        # i.e. an extra layer of double quotes inside the single quotes
        if url_list:
            for url in url_list:
                # get_context(eval(url))  # for URLs extracted with method one
                get_context(url)          # for URLs extracted with method two
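For reference, each line written to traindata.txt is one punctuation-free sentence with Jieba tokens joined by single spaces, which is the usual input format for training a segmentation model. A quick way to preview that format (the sample sentence here is made up for illustration, and the exact token boundaries depend on Jieba's dictionary):

import jieba

sample = "新浪新闻的内容是通过ajax动态加载的"
print(' '.join(jieba.cut(sample)))
# prints something like: 新浪 新闻 的 内容 是 通过 ajax 动态 加载 的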
