Before diving in, let's define the topic: this article is about crawling web pages in Java, that is, writing a simple web crawler, and it covers only plain text and link extraction. There are essentially two ways to issue HTTP requests in Java: use the JDK's built-in HttpURLConnection, or use an encapsulating library or framework such as Apache HttpClient or OkHttp. For fetching and testing page content I used the Jsoup toolkit, because it not only wraps HTTP access but also has powerful HTML parsing capabilities; a detailed tutorial is available at http://www.open-open.com/jsoup/.
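For comparison, here is a minimal sketch of the raw HttpURLConnection route mentioned above. The timeout values and User-Agent string are arbitrary choices for the sketch, not anything this article prescribes; calling conn.getInputStream() afterwards would actually send the request.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawHttpSketch {

    /**
     * Configures (but does not yet send) a GET request for the given URL.
     * Reading conn.getInputStream() later would perform the actual request.
     */
    public static HttpURLConnection buildGet(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(5000); // arbitrary timeouts chosen for this sketch
            conn.setReadTimeout(5000);
            // many sites reject the default Java user agent, so set a browser-like one
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

This is the amount of boilerplate Jsoup hides behind its one-line connect(...).get() call.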
The first step is to fetch the target web page:

Document doc = Jsoup.connect("http://bbs.my0511.com/f152b").get();
The second step is to select the desired content with Jsoup's selectors (more convenient than hand-written regular expressions), based on the specific elements that hold it. In this example the target page is a forum, and all we need to crawl is the title and link address of every post on the forum's front page.
First open the target URL in Chrome and inspect the structure of the page to find the elements holding the content we want, as shown in the following figure.
Then select that area:

Elements links = doc.getElementsByAttributeValue("id", "lphymodelsub");
Next, extract the contents of the selection and save them to a list:
for (Element link : links) {
    CatchModel c = new CatchModel();
    String linkHref = "http://bbs.my0511.com" + link.parent().attr("href");
    String linkText = link.text();
    c.setText(linkText);
    c.setUrl(linkHref);
    firstCatchList.add(c);
}
That is all it takes for a simple crawl.
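The crawl loop stores each result in a CatchModel object that the article never defines. A minimal sketch, assuming it only needs to hold a post's title and URL (the accessor names beyond setText/setUrl are my guesses):

```java
/** Minimal holder for one crawled post: its title text and its absolute URL. */
public class CatchModel {
    private String text;
    private String url;

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}
```

A plain java.util.ArrayList<CatchModel> then serves as the firstCatchList the loop appends to.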
The next step is to crawl Sina Weibo. Plain HTTP access to the main weibo.com site yields very little usable HTML, because the weibo.com homepage is generated dynamically with JavaScript and requires several HTTP requests and verifications before access succeeds. So to keep the data capture simple, we go through a back door: the mobile site, weibo.cn. One problem remains, though: Sina Weibo forces login verification no matter which site you access, so our HTTP requests must carry a cookie to authenticate the user. After searching for a long time I settled on WebCollector, an open-source crawler framework that is simple and efficient to use; below we will see how to use this framework.
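Carrying a cookie in an HTTP request is just a matter of setting the Cookie header. A standard-library sketch (the cookie value shown is a made-up placeholder, not a real Weibo session):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieRequestSketch {

    /** Prepares a GET request that carries the given cookie string in its Cookie header. */
    public static HttpURLConnection withCookie(String url, String cookie) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            // the server sees this exactly as it would a browser-sent cookie header
            conn.setRequestProperty("Cookie", cookie);
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

WebCollector does the same thing for us via its HttpRequest.setCookie, as shown later in WeiboCrawler.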
First you need to import the dependencies: the WebCollector jar and the Selenium jar.
Download Address: http://download.csdn.net/detail/u013407099/9409372
Use Selenium to obtain cookies for a logged-in weibo.cn session (WeiboCN.java)
Crawl Sina Weibo and extract data using WebCollector and the acquired cookies (WeiboCrawler.java)
WeiboCN.java
import java.util.Set;

import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

/**
 * Uses Selenium to obtain cookies for a logged-in weibo.cn session.
 * @author hu
 */
public class WeiboCN {

    /**
     * Obtains a Sina Weibo cookie. This method works for weibo.cn only, not for weibo.com.
     * weibo.cn transmits credentials in plain text, so please use a throwaway account.
     * @param username Sina Weibo username
     * @param password Sina Weibo password
     * @return the cookie string for the logged-in session
     * @throws Exception if the login fails
     */
    public static String getSinaCookie(String username, String password) throws Exception {
        StringBuilder sb = new StringBuilder();
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get("http://login.weibo.cn/login/");

        WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
        mobile.sendKeys(username);
        WebElement pass = driver.findElementByCssSelector("input[name^=password]");
        pass.sendKeys(password);
        WebElement rem = driver.findElementByCssSelector("input[name=remember]");
        rem.click();
        WebElement submit = driver.findElementByCssSelector("input[name=submit]");
        submit.click();

        Set<Cookie> cookieSet = driver.manage().getCookies();
        driver.close();
        for (Cookie cookie : cookieSet) {
            sb.append(cookie.getName()).append("=").append(cookie.getValue()).append(";");
        }
        String result = sb.toString();
        if (result.contains("gsid_CTandWM")) {
            return result;
        } else {
            throw new Exception("Weibo login failed");
        }
    }
}
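getSinaCookie flattens all browser cookies into a single "name=value;" string and checks for the session cookie that marks a successful weibo.cn login. A small standard-library sketch of that format and check (the cookie names and values here are invented samples):

```java
public class CookieStringSketch {

    /** Joins name/value pairs into the "name=value;name=value;" form that getSinaCookie builds. */
    public static String join(String... namesAndValues) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            sb.append(namesAndValues[i]).append('=').append(namesAndValues[i + 1]).append(';');
        }
        return sb.toString();
    }

    /** Mirrors the login check: a successful weibo.cn login yields a gsid_CTandWM cookie. */
    public static boolean loggedIn(String cookieString) {
        return cookieString.contains("gsid_CTandWM");
    }
}
```

The joined string can be sent verbatim as the value of an HTTP Cookie request header.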
WeiboCrawler.java
import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequest;
import cn.edu.hfut.dmic.webcollector.net.HttpResponse;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Crawls Sina Weibo and extracts data using WebCollector and the acquired cookie.
 * @author hu
 */
public class WeiboCrawler extends BreadthCrawler {

    String cookie;

    public WeiboCrawler(String crawlPath, boolean autoParse) throws Exception {
        super(crawlPath, autoParse);
        /* obtain the Sina Weibo cookie; the account and password are transmitted in plain text, use a throwaway account */
        cookie = WeiboCN.getSinaCookie("your username", "your password");
    }

    @Override
    public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception {
        HttpRequest request = new HttpRequest(crawlDatum);
        request.setCookie(cookie);
        return request.getResponse();
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        int pageNum = Integer.valueOf(page.getMetaData("pageNum"));
        /* extract the microblog posts */
        Elements weibos = page.select("div.c");
        for (Element weibo : weibos) {
            System.out.println("page " + pageNum + "\t" + weibo.text());
        }
    }

    public static void main(String[] args) throws Exception {
        WeiboCrawler crawler = new WeiboCrawler("weibo_crawler", false);
        crawler.setThreads(3);
        /* crawl the first 5 pages of a user's microblog; the seed URL below is a
           placeholder, substitute the weibo.cn profile you actually want to crawl */
        for (int i = 1; i <= 5; i++) {
            crawler.addSeed(new CrawlDatum("http://weibo.cn/some_user?vt=4&page=" + i)
                    .putMetaData("pageNum", i + ""));
        }
        crawler.start(1);
    }
}