Use WebCollector 2.x together with its companion project WeiboHelper to crawl data directly from Sina Weibo (no need to obtain cookies manually)
1. Import all the jar packages of WebCollector 2.x and WeiboHelper
Project addresses:
http://git.oschina.net/webcollector/WebCollector
http://git.oschina.net/webcollector/WeiboHelper
2. Sample code:
package cn.edu.hfut.dmic.webcollector.weiboapi;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 *
 * @author hu
 */
public class WeiboCrawler extends DeepCrawler {

    public WeiboCrawler(String crawlPath) throws Exception {
        super(crawlPath);
        /* Obtain the Sina Weibo cookie. The username and password are sent
           in plain text, so use a throwaway account. */
        String cookie = WeiboCN.getSinaCookie("Weibo username", "Weibo password");
        HttpRequesterImpl myRequester = (HttpRequesterImpl) this.getHttpRequester();
        myRequester.setCookie(cookie);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        /* Extract the Weibo posts on this page */
        Elements weibos = page.getDoc().select("div.c");
        for (Element weibo : weibos) {
            System.out.println(weibo.text());
        }
        /* To crawl comments as well, extract the URLs of the comment pages
           here and return them */
        return null;
    }

    public static void main(String[] args) throws Exception {
        WeiboCrawler crawler = new WeiboCrawler("/home/hu/data/weibo");
        crawler.setThreads(3);
        /* Crawl the first 5 pages of one user's Weibo posts */
        for (int i = 0; i < 5; i++) {
            crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page=" + i);
        }
        crawler.start(1);
    }

}
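The sample returns null from the visit method, so only the seed pages are fetched. The comment in the code notes that comment pages could be crawled by extracting their URLs and returning them. Below is a minimal sketch of the URL-extraction step only, written against the Java standard library so that it runs without the WebCollector or jsoup jars. The class name CommentLinkExtractor, the method extractCommentLinks, and the assumption that comment links contain "/comment/" in their href are all illustrative; none of them comes from WebCollector or WeiboHelper, and the example URLs are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentLinkExtractor {

    // Matches href="..." attributes whose value contains "/comment/".
    // ASSUMPTION: comment pages are linked with URLs of this shape.
    private static final Pattern COMMENT_HREF =
            Pattern.compile("href=\"([^\"]*/comment/[^\"]*)\"");

    /** Returns the distinct comment-page URLs found in the HTML, in order of appearance. */
    public static List<String> extractCommentLinks(String html) {
        Set<String> unique = new LinkedHashSet<>();
        Matcher m = COMMENT_HREF.matcher(html);
        while (m.find()) {
            // '&' is written as "&amp;" inside HTML attributes; undo that escaping.
            unique.add(m.group(1).replace("&amp;", "&"));
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Hypothetical page fragment: two comment links (one repeated) and one other link.
        String html =
                "<div class=\"c\">first post "
              + "<a href=\"http://weibo.cn/comment/EXAMPLE1?vt=4&amp;page=1\">comments</a> "
              + "<a href=\"http://weibo.cn/attitude/EXAMPLE1\">likes</a> "
              + "<a href=\"http://weibo.cn/comment/EXAMPLE1?vt=4&amp;page=1\">comments</a></div>"
              + "<div class=\"c\">second post "
              + "<a href=\"http://weibo.cn/comment/EXAMPLE2\">comments</a></div>";

        for (String url : extractCommentLinks(html)) {
            System.out.println(url);
        }
    }
}
```

In the crawler itself, the same filter would more naturally be applied to the anchors selected from page.getDoc() with jsoup, and the resulting URLs returned in the Links object instead of null. The depth passed to crawler.start would presumably also need to be greater than 1 so that the returned links are actually visited.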
Output:
As 2015 begins, I wish my Weibo friends and 360 users a happy New Year! By the way, here is a report on 360's progress in technology in 2014: as of December 31, 2014, 360 filed a total of 1999 patent applications over the past year, including 1570 domestic invention patent applications, 212 design and utility model patents, and 217 overseas patent applications. The total number of patent applications has passed 4,000. Likes [1422] Reposts [221] Comments [446] Favorite January 1, 00:09 from a mobile phone

Reposted Xu Xian's Weibo: "Shandong Province Civil Affairs Department commitment: injured former Kuomintang anti-Japanese war veterans can enjoy the same treatment as the army" It makes clear that former Kuomintang anti-Japanese war veterans who live in rural areas and towns, have no work unit, and are in hardship will be given hardship relief; the relief standard can follow the one applied to demobilized soldiers of the anti-Japanese armed forces in the townships, and the required funds are to be raised through self-raised funds, social donations and other channels. http://t.cn/RZYVNR3 Original [373] original [2246] comments [399] Repost reason: The Shandong Civil Affairs Department did well, praise! @Sun Chunlong #寻找你身边的抗战老兵# Likes [582] Reposts [274] Comments [303] Favorite 2014-12-31 11:13:02 from a mobile phone

Reposted 360 Antivirus's Weibo: It is the last day of 2014 and my followers are still fewer than 100,000. So worrying. What to do? The editor just got 5 360 children's watches, so a giveaway is the best option. With only these 100,000 followers, the odds of winning are still quite high. So, as long as you ① follow my Weibo, ② repost this post, and ③ @ a friend, you have a chance to win! Repost away, Amitabha: repost for a big fortune, don't repost for a small one! [2 photos in total] Original [94] original [2531] comments [1076] Repost reason: Not only I this low winning rate of the specialist, and wish you a rich New Year @360 Security defender Likes [313] Reposts [464] Comments [656] Favorite 2014-12-31 14:51:13 from 360 Security browser

Reposted Augo's Weibo: Baidu Antivirus deleted. @Zhou Hongyi @360 Security defender @360 Customer service, you three listen to me: yesterday I had your engineer 167 remotely help me delete Baidu Antivirus. After half a day he said it was fixed, but today I went home and booted up, and the damned thing had come back from the dead. Don't use the "safest" slogan if you can't fix it for me. Baidu Antivirus has been haunting my computer for a long time, and deleting it n times was useless. Also: @Baidu anti-virus, go die [4 photos in total] Original [102] original [542] comments [173] Repost reason: The blogger deleted Baidu Antivirus on April 24 this year and so turned to 360 for help; @360 Security defender, did you finally solve the user's problem? Oh, Baidu's wolf nature is really strong //@guancheng small bright: Anyway i uninstall Baidu Antivirus, or use 360 software housekeeper, otherwise go not gener [black line] Likes [530] Reposts [428] Comments [966] Favorite 2014-12-30 19:28:37 from a OnePlus phone
WebCollector crawler official website: https://github.com/CrawlScript/WebCollector
Technical discussion group: 250108697