For the Java final exam, the teacher unexpectedly announced there would be no written test; instead, our programs would be graded... I was caught completely off guard...
Anyway, I decided to write him a Baidu Tieba crawler, using Jsoup to make the parsing and crawling easy.
I used our school's forum (Guilin University of Technology) for the experiment. This is just a simple test, so please don't flame it.
First, use Jsoup to connect to and parse the page:
Document doc = Jsoup.connect("http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D") // the forum index URL
        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
        .timeout(3000) // set the connection timeout
        .get(); // fetch the URL with an HTTP GET
Open the forum's front page in Chrome, press F12, and inspect a post, as shown in the picture:
<a href="/p/4201959797"
On the forum index page we need to find the hyperlink to each post; the "/p/4201959797" above is the data we are after.
It can be found with Elements baidupost = doc.select("a.j_th_tit");. Here a matches <a> hyperlink tags and .j_th_tit is the class inside the tag, so doc.select("a.j_th_tit") finds every <a> tag that carries the j_th_tit class.
Elements baidupost = doc.select("a.j_th_tit"); // post URLs
System.out.println(baidupost.attr("href"));
The output is as follows (it actually matches every post; I only print one here):
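To make it concrete what that selector matches, here is a tiny stand-alone sketch that pulls the href out of a j_th_tit anchor with a plain regular expression. The HTML snippet is a simplified stand-in for the real page, and the regex is only an illustration of what the selector finds, not a substitute for doc.select:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SelectorSketch {
    public static void main(String[] args) {
        // A simplified snippet of the forum index HTML, as seen in the browser inspector.
        String html = "<a rel=\"noreferrer\" href=\"/p/4201959797\" title=\"...\" class=\"j_th_tit\">post title</a>";
        // Jsoup's doc.select("a.j_th_tit") matches <a> tags carrying the j_th_tit class;
        // this regex approximates that for the single snippet above.
        Pattern p = Pattern.compile("<a[^>]*href=\"([^\"]+)\"[^>]*class=\"j_th_tit\"[^>]*>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints /p/4201959797
        }
    }
}
```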
Once we have this data, we prepend http://tieba.baidu.com to each matched URL and visit the post. The code is as follows:
public static Document getHtml(String url, int page) throws IOException {
    // page n of a post looks like http://tieba.baidu.com/p/4201959797?pn=2
    Document doc = Jsoup.connect("http://tieba.baidu.com" + url + "?pn=" + page)
            .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
            .timeout(3000)
            .get();
    try {
        Thread.sleep(3000); // pause 3 seconds; crawl too fast and your IP gets blocked
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return doc;
}
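The detail worth noticing in getHtml is how the paged URL is assembled: the relative href scraped from the index is glued between the host and a ?pn= query parameter. A minimal sketch of just that string concatenation, using the example post from above:

```java
public class PageUrlSketch {
    public static void main(String[] args) {
        String host = "http://tieba.baidu.com";
        String href = "/p/4201959797"; // relative URL scraped from the index page
        int page = 2;
        // Same concatenation as in getHtml(): host + relative href + page query parameter.
        String url = host + href + "?pn=" + page;
        System.out.println(url); // prints http://tieba.baidu.com/p/4201959797?pn=2
    }
}
```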
Now we can enter the post itself. Here we only collect user names, post content, and floor numbers.
To match the title, use getElementsByClass to look up the core_title_txt class, which is the class of the title element:
Elements resultTitle = doc1.getElementsByClass("core_title_txt"); // title
Content and user name:
Elements results = doc1.getElementsByClass("d_post_content_main"); // elements whose class contains the content
Elements usernames = doc1.getElementsByClass("icon"); // elements whose class contains the user name
for (int i = 0; i < results.size(); i++) {
    Element result = results.get(i);
    Element links = result.getElementsByTag("div").get(0); // content
    Element username = usernames.get(i);
    Elements name = username.getElementsByTag("img"); // user name
    System.out.println(i + " " + name.attr("username") + " " + links.text());
}
With this we can basically crawl everything we need from one page of a post. Of course, long posts span several pages, so we also have to find out how many pages a post has. The code is as follows:
public static int getPage(Document doc) {
    Elements resultPage = doc.getElementsByClass("l_reply_num");
    Element result0 = resultPage.get(0);
    Element result1 = result0.getElementsByTag("span").get(1);
    System.out.println(result1.text());
    String number = result1.text();
    int page = Integer.parseInt(number); // convert the string to an int
    return page; // number of pages
}
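One caveat: Integer.parseInt throws a NumberFormatException if the scraped <span> text is not a plain number, so it can be worth guarding the conversion. A small sketch of that guard (the sample inputs are made up, and falling back to 1 page is just one possible choice):

```java
public class PageCountSketch {
    // Parse the reply-count text scraped from the l_reply_num element;
    // fall back to 1 page when the text is not a plain number.
    static int parsePageCount(String text) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePageCount("7"));   // prints 7
        System.out.println(parsePageCount("abc")); // prints 1
    }
}
```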
All right, everything we need is in place. Set up a loop to crawl; the complete code is as follows.
Teste.java
package com.wuxin.main;

import java.io.IOException;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.wuxin.data.PostBar;

public class Teste {
    public static void main(String[] args) {
        try {
            int page = 1;
            Document doc = PostBar.getHtml(); // visit the forum index
            Elements baiduposts = doc.select("a.j_th_tit"); // post URLs
            System.out.println(baiduposts.attr("href"));
            for (Element baidupost : baiduposts) {
                page = 1;
                System.out.println(baidupost.attr("href"));
                Document doc0 = Page.getHtml(baidupost.attr("href"), page); // post HTML
                int wu = Page.getPage(doc0); // number of pages in the post
                do {
                    Document doc1 = Page.getHtml(baidupost.attr("href"), page);
                    Elements resultTitle = doc1.getElementsByClass("core_title_txt"); // title
                    System.out.println("Title: " + resultTitle.text());
                    System.out.println("Page " + page);
                    Elements results = doc1.getElementsByClass("d_post_content_main"); // content
                    Elements usernames = doc1.getElementsByClass("icon"); // user names
                    for (int i = 0; i < results.size(); i++) {
                        Element result = results.get(i);
                        Element links = result.getElementsByTag("div").get(0);
                        Element username = usernames.get(i);
                        Elements name = username.getElementsByTag("img");
                        System.out.println(i + " " + name.attr("username") + " " + links.text());
                    }
                } while (wu != page++);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
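The loop condition while (wu != page++) in Teste.java is terse: it compares the current page number against the total page count wu first, then increments page, so pages 1 through wu are each fetched exactly once. A sketch of just that loop logic, with the fetching replaced by a counter:

```java
public class PaginationSketch {
    public static void main(String[] args) {
        int wu = 3;   // total number of pages, as getPage() would return it
        int page = 1; // current page
        int fetched = 0;
        do {
            fetched++; // stands in for getHtml(href, page) and the scraping work
        } while (wu != page++); // post-increment: compare first, then advance
        System.out.println(fetched); // prints 3
    }
}
```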
Page.java
package com.wuxin.main;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Page {
    public static Document getHtml(String url, int page) throws IOException {
        Document doc = Jsoup.connect("http://tieba.baidu.com" + url + "?pn=" + page)
                .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
                .timeout(3000)
                .get();
        try {
            Thread.sleep(3000); // pause 3 seconds; crawl too fast and your IP gets blocked
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return doc;
    }

    public static int getPage(Document doc) {
        Elements resultPage = doc.getElementsByClass("l_reply_num");
        Element result0 = resultPage.get(0);
        Element result1 = result0.getElementsByTag("span").get(1);
        System.out.println(result1.text());
        int page = Integer.parseInt(result1.text()); // convert the string to an int
        return page; // number of pages
    }
}
PostBar.java
package com.wuxin.data;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PostBar {
    public static Document getHtml() throws IOException {
        Document doc = Jsoup.connect("http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D") // change this URL to crawl a different forum
                .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
                .timeout(5000)
                .get();
        return doc;
    }
}
The results of a run are shown in the figure.
The code is simple: it only crawls three attributes from the forum. If you also want images, floor numbers, and so on, it is not hard to extend, and you can build on this to store the crawled data in a database (I hear big data is worth a lot these days; well, maybe this data could be sold too, haha).
To crawl a different forum, just replace the URL in PostBar.java; I tested two others and both worked.
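The kw parameter in that URL is just the forum name, percent-encoded as UTF-8. If you prefer to build the URL from a plain forum name rather than pasting the encoded form, java.net.URLEncoder can produce it; the name below appears to be the one encoded in the PostBar.java URL (written with Unicode escapes to keep the source ASCII):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ForumUrlSketch {
    public static void main(String[] args) {
        // "\u90d1\u79c0\u598d" is the forum name; percent-encode it for the kw parameter.
        String name = "\u90d1\u79c0\u598d";
        String kw = URLEncoder.encode(name, StandardCharsets.UTF_8);
        // prints http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D
        System.out.println("http://tieba.baidu.com/f?ie=utf-8&kw=" + kw);
    }
}
```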
PS: The weather is cold and I am feeling a bit lazy, so this crawler only crawls the first 50 posts on the index page. If you want more, look at how the index page itself paginates and work out the page turning yourself.
Source download