Before diving in, let's define the topic: this article is about crawling web pages in Java, that is, writing a simple web crawler, and it covers only plain text and link extraction. There are essentially two ways to issue HTTP requests in Java: use the JDK's built-in HttpURLConnection, or use an encapsulating library or framework such as Apache HttpClient or OkHttp. For fetching and testing page content I used the Jsoup toolkit, because it not only wraps HTTP access but also has powerful HTML parsing capabilities; a detailed tutorial is available at http://www.open-open.com/jsoup/.
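For comparison, here is a minimal sketch of the raw HttpURLConnection route mentioned above. The timeout values and User-Agent string are arbitrary choices for the sketch, not anything this article prescribes; calling conn.getInputStream() afterwards would actually send the request.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawHttpSketch {

    /**
     * Configures (but does not yet send) a GET request for the given URL.
     * Reading conn.getInputStream() later would perform the actual request.
     */
    public static HttpURLConnection buildGet(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000); // arbitrary timeouts chosen for this sketch
            conn.setReadTimeout(5000);
            // many sites reject the default Java user agent, so set a browser-like one
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

This is the amount of boilerplate Jsoup hides behind its one-line connect(...).get() call.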
The first step is to fetch the target web page:

Document doc = Jsoup.connect("http://bbs.my0511.com/f152b").get();
The second step is to select the desired content with Jsoup's selectors (more convenient than hand-written regular expressions), based on the specific elements that hold it. In this example the target page is a forum, and all we need to crawl is the title and link address of every post on the forum's front page.
First open the target URL in Chrome and inspect the structure of the page to find the elements holding the content we want, as shown in the following figure.
Then select that area:

Elements links = doc.getElementsByAttributeValue("id", "lphymodelsub");
Next, extract the contents of the selection and save them to a list:
for (Element link : links) {
    CatchModel c = new CatchModel();
    String linkHref = "http://bbs.my0511.com" + link.parent().attr("href");
    String linkText = link.text();
    c.setText(linkText);
    c.setUrl(linkHref);
    firstCatchList.add(c);
}
That is all it takes for a simple crawl.
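The crawl loop stores each result in a CatchModel object that the article never defines. A minimal sketch, assuming it only needs to hold a post's title and URL (the accessor names beyond setText/setUrl are my guesses):

```java
/** Minimal holder for one crawled post: its title text and its absolute URL. */
public class CatchModel {
    private String text;
    private String url;

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}
```

A plain java.util.ArrayList<CatchModel> then serves as the firstCatchList the loop appends to.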
The next step is to crawl Sina Weibo. Plain HTTP access to the main weibo.com site yields very little usable HTML, because the weibo.com homepage is generated dynamically with JavaScript and requires several HTTP requests and verifications before access succeeds. So to keep the data capture simple, we go through a back door: the mobile site, weibo.cn. One problem remains, though: Sina Weibo forces login verification no matter which site you access, so our HTTP requests must carry a cookie to authenticate the user. After searching for a long time I settled on WebCollector, an open-source crawler framework that is simple and efficient to use; below we will see how to use this framework.
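Carrying a cookie in an HTTP request is just a matter of setting the Cookie header. A standard-library sketch (the cookie value shown is a made-up placeholder, not a real Weibo session):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieRequestSketch {

    /** Prepares a GET request that carries the given cookie string in its Cookie header. */
    public static HttpURLConnection withCookie(String url, String cookie) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            // the server sees this exactly as it would a browser-sent cookie header
            conn.setRequestProperty("Cookie", cookie);
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

WebCollector does the same thing for us via its HttpRequest.setCookie, as shown later in WeiboCrawler.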
First you need to import the dependencies: the WebCollector jar and the Selenium jar.
Download Address: http://download.csdn.net/detail/u013407099/9409372
Use Selenium to obtain cookies for a logged-in weibo.cn session (WeiboCN.java)
Crawl Sina Weibo and extract data using WebCollector and the acquired cookies (WeiboCrawler.java)
WeiboCN.java
import java.util.Set;

import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

/**
 * Uses Selenium to obtain cookies for a logged-in weibo.cn session.
 * @author hu
 */
public class WeiboCN {

    /**
     * Obtains a Sina Weibo cookie. This method works for weibo.cn only, not for weibo.com.
     * weibo.cn transmits credentials in plain text, so please use a throwaway account.
     * @param username Sina Weibo username
     * @param password Sina Weibo password
     * @return the cookie string for the logged-in session
     * @throws Exception if the login fails
     */
    public static String getSinaCookie(String username, String password) throws Exception {
        StringBuilder sb = new StringBuilder();
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get("http://login.weibo.cn/login/");

        WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
        mobile.sendKeys(username);
        WebElement pass = driver.findElementByCssSelector("input[name^=password]");
        pass.sendKeys(password);
        WebElement rem = driver.findElementByCssSelector("input[name=remember]");
        rem.click();
        WebElement submit = driver.findElementByCssSelector("input[name=submit]");
        submit.click();

        Set<Cookie> cookieSet = driver.manage().getCookies();
        driver.close();
        for (Cookie cookie : cookieSet) {
            sb.append(cookie.getName()).append("=").append(cookie.getValue()).append(";");
        }
        String result = sb.toString();
        if (result.contains("gsid_CTandWM")) {
            return result;
        } else {
            throw new Exception("Weibo login failed");
        }
    }
}
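getSinaCookie flattens all browser cookies into a single "name=value;" string and checks for the session cookie that marks a successful weibo.cn login. A small standard-library sketch of that format and check (the cookie names and values here are invented samples):

```java
public class CookieStringSketch {

    /** Joins name/value pairs into the "name=value;name=value;" form that getSinaCookie builds. */
    public static String join(String... namesAndValues) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            sb.append(namesAndValues[i]).append('=').append(namesAndValues[i + 1]).append(';');
        }
        return sb.toString();
    }

    /** Mirrors the login check: a successful weibo.cn login yields a gsid_CTandWM cookie. */
    public static boolean loggedIn(String cookieString) {
        return cookieString.contains("gsid_CTandWM");
    }
}
```

The joined string can be sent verbatim as the value of an HTTP Cookie request header.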
WeiboCrawler.java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequest;
import cn.edu.hfut.dmic.webcollector.net.HttpResponse;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Crawls Sina Weibo and extracts data using WebCollector and the acquired cookie.
 * @author hu
 */
public class WeiboCrawler extends BreadthCrawler {

    String cookie;

    public WeiboCrawler(String crawlPath, boolean autoParse) throws Exception {
        super(crawlPath, autoParse);
        /* obtain the Sina Weibo cookie; the account and password are transmitted in plain text, use a throwaway account */
        cookie = WeiboCN.getSinaCookie("your username", "your password");
    }

    @Override
    public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception {
        HttpRequest request = new HttpRequest(crawlDatum);
        request.setCookie(cookie);
        return request.getResponse();
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        int pageNum = Integer.valueOf(page.getMetaData("pageNum"));
        /* extract the microblog posts */
        Elements weibos = page.select("div.c");
        for (Element weibo : weibos) {
            System.out.println("page " + pageNum + "\t" + weibo.text());
        }
    }

    public static void main(String[] args) throws Exception {
        WeiboCrawler crawler = new WeiboCrawler("weibo_crawler", false);
        crawler.setThreads(3);
        /* crawl the first 5 pages of a user's microblog; the seed URL below is a
           placeholder, substitute the weibo.cn profile you actually want to crawl */
        for (int i = 1; i <= 5; i++) {
            crawler.addSeed(new CrawlDatum("http://weibo.cn/some_user?vt=4&page=" + i)
                    .putMetaData("pageNum", i + ""));
        }
        crawler.start(1);
    }
}