Java Web Crawler: Crawling Personal Microblog Records from Sina Weibo (Java)

Source: Internet
Author: User
Tags: java, web

Before getting to the topic, let's first look at how to crawl specific content from a web page in Java, which is what a web crawler does; this article only covers crawling simple text and links. There are essentially two ways to make HTTP requests in Java: use the JDK's built-in HttpURLConnection, or use an encapsulated library or framework such as HttpClient or OkHttp. While testing page crawling, I used the Jsoup toolkit, because it not only encapsulates HTTP access but also provides powerful HTML parsing; a detailed tutorial is available at http://www.open-open.com/jsoup/.
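For contrast, here is a minimal sketch of the "raw" HttpURLConnection approach; the target URL and charset are assumptions for illustration only:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawFetch {
    public static void main(String[] args) throws Exception {
        // Open a plain GET connection to the forum page (URL assumed for illustration)
        URL url = new URL("http://bbs.my0511.com/f152b");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Read the raw HTML line by line; parsing it is entirely up to you
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
            System.out.println(html);
        }
    }
}

This is exactly the boilerplate that Jsoup hides, which is why the examples below stick with Jsoup.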

The first step is to access the target web page:

Document doc = Jsoup.connect("http://bbs.my0511.com/f152b").get();
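In practice it usually helps to set a user agent and a timeout on the connection; Jsoup supports both (the values below are arbitrary):

// Same request with an explicit user agent and timeout; values are illustrative
Document doc = Jsoup.connect("http://bbs.my0511.com/f152b")
        .userAgent("Mozilla/5.0")
        .timeout(10000)   // milliseconds
        .get();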

The second step is to use Jsoup selectors (more efficient than regular expressions) to locate the elements that hold the desired content. In this example the target page is a forum, and all we need to do is crawl the title and link address of every post on the forum's front page.

First open the target URL and use Chrome's developer tools to inspect the page and find the structure wrapping the content we want. (The original article showed a screenshot of the inspected page structure here.)

Then select that region:

Elements links = doc.getElementsByAttributeValue("id", "lphymodelsub");
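The same selection can also be written with Jsoup's CSS selector syntax, which many find more readable:

// Equivalent CSS selection by id; "#lphymodelsub" targets id="lphymodelsub"
Elements links = doc.select("#lphymodelsub");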

Next, extract the text and link from each selected element and save them to a list:

for (Element link : links) {
    CatchModel c = new CatchModel();
    String linkHref = "http://bbs.my0511.com" + link.parent().attr("href");
    String linkText = link.text();
    c.setText(linkText);
    c.setUrl(linkHref);
    firstCatchList.add(c);
}

That completes the simple crawl.
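For reference, here is a minimal self-contained version of the whole forum crawl. The CatchModel class is not shown in the original, so the POJO below is an assumed stand-in with just the two fields the loop uses:

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ForumCrawl {

    // Assumed shape of CatchModel: a simple holder for a post title and its URL
    static class CatchModel {
        private String text;
        private String url;
        public void setText(String text) { this.text = text; }
        public void setUrl(String url) { this.url = url; }
        public String getText() { return text; }
        public String getUrl() { return url; }
    }

    public static void main(String[] args) throws Exception {
        List<CatchModel> firstCatchList = new ArrayList<>();
        // Fetch and parse the forum front page
        Document doc = Jsoup.connect("http://bbs.my0511.com/f152b").get();
        // Select the elements that wrap each post title
        Elements links = doc.getElementsByAttributeValue("id", "lphymodelsub");
        for (Element link : links) {
            CatchModel c = new CatchModel();
            // The href lives on the parent element; prepend the site root
            String linkHref = "http://bbs.my0511.com" + link.parent().attr("href");
            c.setText(link.text());
            c.setUrl(linkHref);
            firstCatchList.add(c);
        }
        for (CatchModel c : firstCatchList) {
            System.out.println(c.getText() + " -> " + c.getUrl());
        }
    }
}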

The next step is to crawl Sina Weibo. Plain HTTP access to the main Sina Weibo site is not simple HTML: the weibo.com homepage is generated dynamically with JavaScript, and access only succeeds after a number of HTTP requests and verification steps. So to keep the data capture simple, we take a back door and crawl Sina Weibo's mobile site, weibo.cn, instead. One problem remains: Sina Weibo requires a login no matter which endpoint you access, so we need to carry a cookie in the HTTP request to authenticate the user. After searching online for a long time, I found the open-source crawler framework WebCollector, which makes this simple and efficient; below we'll see how to use it.
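To make the cookie idea concrete, here is a sketch of how a cookie string can be attached to a plain Jsoup request; the value is a placeholder, and obtaining the real one is the job of WeiboCN.java below:

// Attach a previously obtained cookie string so weibo.cn sees a logged-in session.
// The value here is a placeholder; see WeiboCN.getSinaCookie(...) below.
String cookie = "gsid_CTandWM=...;";
Document doc = Jsoup.connect("http://weibo.cn")
        .header("Cookie", cookie)
        .get();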

First you need to import the dependencies: the WebCollector jar and the Selenium jar.

Download Address: http://download.csdn.net/detail/u013407099/9409372

1. Use Selenium to obtain a login cookie for Sina Weibo's weibo.cn (WeiboCN.java)
2. Use WebCollector with the acquired cookie to crawl Sina Weibo and extract data (WeiboCrawler.java)


WeiboCN.java

import java.util.Set;

import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

/**
 * Uses Selenium to obtain a login cookie for Sina Weibo's weibo.cn
 * @author hu
 */
public class WeiboCN {

    /**
     * Obtains the Sina Weibo cookie. This method works for weibo.cn only, not weibo.com.
     * weibo.cn transmits data in clear text, so please use a throwaway account.
     * @param username Sina Weibo username
     * @param password Sina Weibo password
     * @return the cookie string
     * @throws Exception if login fails
     */
    public static String getSinaCookie(String username, String password) throws Exception {
        StringBuilder sb = new StringBuilder();
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);

        // Fill in and submit the mobile-site login form
        driver.get("http://login.weibo.cn/login/");
        WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
        mobile.sendKeys(username);
        WebElement pass = driver.findElementByCssSelector("input[name^=password]");
        pass.sendKeys(password);
        WebElement rem = driver.findElementByCssSelector("input[name=remember]");
        rem.click();
        WebElement submit = driver.findElementByCssSelector("input[name=submit]");
        submit.click();

        // Collect the session cookies and join them into a single header value
        Set<Cookie> cookieSet = driver.manage().getCookies();
        driver.close();
        for (Cookie cookie : cookieSet) {
            sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
        }
        String result = sb.toString();
        if (result.contains("gsid_CTandWM")) {
            return result;
        } else {
            throw new Exception("Weibo login failed");
        }
    }

}
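A quick smoke test for the class above might look like this (the credentials are placeholders; remember to use a throwaway account):

// Hypothetical smoke test for WeiboCN; replace the placeholders with a throwaway account
public class CookieTest {
    public static void main(String[] args) throws Exception {
        String cookie = WeiboCN.getSinaCookie("your username", "your password");
        System.out.println(cookie);   // should contain gsid_CTandWM on success
    }
}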

WeiboCrawler.java

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequest;
import cn.edu.hfut.dmic.webcollector.net.HttpResponse;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Crawls Sina Weibo and extracts data using WebCollector and the acquired cookie
 * @author hu
 */
public class WeiboCrawler extends BreadthCrawler {

    String cookie;

    public WeiboCrawler(String crawlPath, boolean autoParse) throws Exception {
        super(crawlPath, autoParse);
        /* obtain the Sina Weibo cookie; the account and password are sent in
           clear text, so use a throwaway account */
        cookie = WeiboCN.getSinaCookie("your username", "your password");
    }

    @Override
    public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception {
        HttpRequest request = new HttpRequest(crawlDatum);
        request.setCookie(cookie);
        return request.getResponse();
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        int pageNum = Integer.valueOf(page.getMetaData("pageNum"));
        /* extract the microblog posts */
        Elements weibos = page.select("div.c");
        for (Element weibo : weibos) {
            System.out.println("page " + pageNum + "\t" + weibo.text());
        }
    }

    public static void main(String[] args) throws Exception {
        WeiboCrawler crawler = new WeiboCrawler("weibo_crawler", false);
        crawler.setThreads(3);
        /* crawl the first 5 pages of a person's microblog
           (seed URL as in the original WebCollector demo) */
        for (int i = 1; i <= 5; i++) {
            crawler.addSeed(new CrawlDatum("http://weibo.cn/zhouhongyi?vt=4&page=" + i)
                    .putMetaData("pageNum", i + ""));
        }
        crawler.start(1);
    }
}
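Two details are worth noting in this class: overriding getResponse is what injects the login cookie into every request the crawler makes, and the pageNum metadata attached to each seed travels with its CrawlDatum, which is how visit knows which page a post came from.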
