Sina News loads its content dynamically via AJAX. By capturing the network traffic, the following rule emerges:
Each time the next page is requested, a new URL appears among the JS requests:
"Http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
"|| =gatxw| | =zs-pl| | =mtjj&level==1| | =2&show_ext=1&show_all=1&show_num=22&tag=1& "
"format=json&page=1&callback=newsloadercallback"
"Http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
"|| =gatxw| | =zs-pl| | =mtjj&level==1| | =2&show_ext=1&show_all=1&show_num=22&tag=1& "
"format=json&page=3&callback=newsloadercallback"
The only difference between the two requests is the value of the page= parameter.
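Before writing the full crawler, the rule is easy to verify by hand. The sketch below is not part of the original post: it assumes the third-party requests library (pip install requests), fetches one page of the endpoint, unwraps the JSONP response, and counts the articles returned.

# Hedged sketch (assumes `requests` is installed); fetches one page of the
# endpoint above and unwraps the JSONP response newsloadercallback({...});
import json
import re
import requests

API = ("http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
       "||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&"
       "format=json&page={}&callback=newsloadercallback")

resp = requests.get(API.format(1), timeout=10)
body = re.search(r'\((.*)\)', resp.text, re.S).group(1)  # text inside callback(...)
body = body.replace("'", '"')  # the payload may use single quotes; normalize for json.loads
data = json.loads(body)
print(len(data["result"]["data"]), "article URLs on page 1")

Changing page=1 to page=2, page=3, and so on walks backwards through older news.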
Python implementation code:
# -*- coding: utf-8 -*-
__author__ = 'Administrator'

import re
from bs4 import BeautifulSoup
import urllib.request
import jieba
import string
import urllib.parse
from urllib.error import HTTPError, URLError
import json


def get_page(num):
    # Build the rolling-news API URL for page `num`.
    return ("http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1"
            "||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&"
            "format=json&page={}&callback=newsloadercallback").format(str(num))


def get_url(page_url):
    # Percent-encode Chinese or other special characters in the request path.
    page_url = urllib.parse.quote(page_url, safe=string.printable)
    # print(page_url)
    url_list = []
    try:
        res = urllib.request.urlopen(page_url)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code:', e.code)
        return url_list
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
        return url_list
    else:
        if res.getcode() == 200:
            jsdata = res.read().decode("utf-8")
            """
            Method one: extract the article URLs with a regex
            """
            # result = re.findall('"url":"http.*?\.s?html"', jsdata)  # the ? after .* makes it non-greedy
            # for url in result:
            #     url = url.split(":", maxsplit=1)[1]
            #     url = url.replace('\\', "")
            #     url_list.append(url)
            """
            Method two: strip the JSONP wrapper and parse the payload as JSON
            """
            data = jsdata[21:-2]  # drop the callback prefix and the trailing ");"
            data = re.sub('\'', '\"', data)   # single -> double quotes so json.loads accepts it
            data = re.sub(r"\\u", "", data)   # drop stray \u sequences that would break json.loads
            jsondata = json.loads(data)
            for dat in jsondata["result"]["data"]:
                url_list.append(dat["url"])
        return url_list


def get_context(new_url):
    # Percent-encode Chinese or other special characters in the request path.
    httpurl = urllib.parse.quote(new_url, safe=string.printable)
    # print(httpurl)
    # print(type(httpurl))
    try:
        html = urllib.request.urlopen(httpurl)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code:', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
    else:
        if html.getcode() == 200:
            res = html.read().decode("utf-8")
            # print(res)
            soup = BeautifulSoup(res, 'html.parser')
            # print(soup.prettify())
            result = {}
            # The article body lives in <div id="artibody">; the last <p> is boilerplate.
            result["article"] = ''.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
            context = result['article']
            # Split on Chinese and English punctuation to get one clause per line.
            pattern = r'，|。|“|”|？|！|：|《|》|、|；|·|——|‘|’|,|\?|\.|\!|`|~|\@|\#|\$|%|\^|\&|\*|（|）|\(|\)|-|\_|\+|=|\[|\]|\{|\}|"|\'|\<|\>|\|'
            li = re.split(pattern, context)
            # print(li)
            with open(r".\traindata.txt", 'a', encoding='utf-8') as file:
                for l in li:
                    if l != "":
                        sentence = " ".join(jieba.cut(l))
                        file.write(sentence + '\n')


if __name__ == "__main__":
    for i in range(1, 1001):
        print("Page %d" % i)
        page_url = get_page(i)
        url_list = get_url(page_url)
        # print(url_list)  # method one yields items like '"http://news.sina.com.cn/c/nd/2017-06-11/doc-ifyfzhac1171724.shtml"'
        #                  # (double quotes wrapped in an extra layer of single quotes)
        if url_list:
            for url in url_list:
                # print(eval(url))
                # print(type(url))
                # get_context(eval(url))  # for URLs extracted by method one
                get_context(url)          # for URLs extracted by method two
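One fragile spot worth calling out: jsdata[21:-2] hard-codes the length of the callback prefix, so it breaks if Sina ever changes the callback name. A regex-based unwrapping, sketched below, avoids that; this helper is an addition of mine, not part of the original script.

# Hypothetical helper, not in the original post: strip a JSONP wrapper of the
# form callback({...}); without assuming the callback name's exact length.
import re

def strip_jsonp(text):
    match = re.search(r'\(\s*(\{.*\})\s*\)\s*;?\s*$', text, re.S)
    return match.group(1) if match else text

With it, data = jsdata[21:-2] would become data = strip_jsonp(jsdata); the quote and \u fixups that follow stay the same.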
Python crawler: crawl Sina News content (from the present back over a chosen time span) and segment it with jieba, producing training data for your own word segmentation model.
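For reference, traindata.txt ends up with one clause per line and tokens separated by spaces, a plain-text format that many segmentation and embedding trainers expect. A minimal illustration (the sample sentence is the standard jieba demo, not text from the crawl):

import jieba

# One clause in, space-joined tokens out -- the same transformation the
# crawler applies to every clause before appending it to traindata.txt.
print(" ".join(jieba.cut("我来到北京清华大学")))
# -> 我 来到 北京 清华大学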