Java Crawler: Crawling Baidu Tieba (Post Bar)


When our Java course reached the final exam, the teacher unexpectedly announced there would be no written test; instead, we would each write a program for our grade... I was caught completely off guard.

Anyway, I decided to hand in a Baidu Tieba crawler, using Jsoup to make fetching and parsing easy.

I ran the experiment on our school's bar (Guilin University of Technology). This is just a simple test, so please go easy on it.

First, connect and parse with Jsoup:

Document doc = Jsoup.connect("http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D") // the bar's front-page URL
        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)") // pretend to be a browser
        .timeout(3000) // set the connection timeout
        .get();        // fetch the URL with a GET request

Open the bar's front page in Chrome, press F12 to open the developer tools, and inspect a post link, as shown in the picture:

<a href="/p/4201959797"

We need the hyperlink to each post on the bar's front page; the "/p/4201959797" in the tag above is the data we are after.

It can be found with the following code: Elements baidupost = doc.select("a.j_th_tit");. In the selector, "a" matches <a> anchor tags and ".j_th_tit" matches the class attribute, so doc.select("a.j_th_tit") finds every <a> tag whose class is j_th_tit.

	Elements baidupost = doc.select("a.j_th_tit"); // the post links
	System.out.println(baidupost.attr("href"));

The output is as follows (the selector actually matches every post, but Elements.attr only returns the href of the first match, so only one is printed here):
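If you want every matched href rather than just the first, a short loop over the Elements does it (a small sketch reusing the baidupost variable from above):

	for (Element post : baidupost) {
		System.out.println(post.attr("href")); // one href per matched post link
	}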


Post URLs all start with http://tieba.baidu.com, so after matching an href we prepend that prefix and fetch the resulting URL. The code is as follows:

	public static Document getHtml(String url, int page) throws IOException {

		// page selects the page number, e.g. http://tieba.baidu.com/p/4201959797?pn=2
		Document doc = Jsoup.connect("http://tieba.baidu.com" + url + "?pn=" + page)
		        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
		        .timeout(3000)
		        .get();

		try {
			Thread.sleep(3000); // pause 3 seconds; crawling too fast will get your IP blocked
		} catch (InterruptedException e) {
			e.printStackTrace();
		}

		return doc;
	}
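As a quick usage example, fetching page 2 of the post found above would look like this:

	Document doc0 = getHtml("/p/4201959797", 2); // resolves to http://tieba.baidu.com/p/4201959797?pn=2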

That takes us into the post itself, where we collect only the user name, the content, and the floor number.

To match the title, use getElementsByClass to find the core_title_txt class, which is the class of the title element:

	Elements resultTitle = doc1.getElementsByClass("core_title_txt"); // the title

Then the content and the user name:

	Elements results = doc1.getElementsByClass("d_post_content_main"); // the class containing the post content
	Elements usernames = doc1.getElementsByClass("icon");              // the class containing the user name

	for (int i = 0; i < results.size(); i++) {
		Element result = results.get(i);
		Element links = result.getElementsByTag("div").get(0); // the content
		Element username = usernames.get(i);
		Elements name = username.getElementsByTag("img");      // the user name is stored on the img tag
		System.out.println(i + "  " + name.attr("username") + "  " + links.text());
	}
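One caveat the loop above does not handle: if a page yields different numbers of d_post_content_main and icon elements, usernames.get(i) throws an IndexOutOfBoundsException. A defensive variant (my addition, not in the original) caps the loop at the shorter list:

	int count = Math.min(results.size(), usernames.size()); // guard against mismatched lists
	for (int i = 0; i < count; i++) {
		// ... same loop body as above ...
	}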
This crawls everything we need from one page of a post. Some posts are long enough to span several pages, so we also have to find out how many pages a post has. The code is as follows:

	public static int getPage(Document doc) {
		Elements resultPage = doc.getElementsByClass("l_reply_num");
		Element result0 = resultPage.get(0);
		Element result1 = result0.getElementsByTag("span").get(1);
		System.out.println(result1.text());
		String number = result1.text();
		int page = Integer.parseInt(number); // convert the string to an int

		return page; // the number of pages
	}
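Integer.parseInt throws a NumberFormatException if Tieba ever changes its markup and the selected span no longer holds a number. A hedged variant (getPageSafe is my own hypothetical helper, not part of the original code) falls back to a single page instead of crashing:

	public static int getPageSafe(Document doc) {
		try {
			Elements resultPage = doc.getElementsByClass("l_reply_num");
			String number = resultPage.get(0).getElementsByTag("span").get(1).text();
			return Integer.parseInt(number);
		} catch (RuntimeException e) { // covers IndexOutOfBounds and NumberFormat
			return 1; // assume a single page if the markup has changed
		}
	}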
That's everything we need. Put it all in a loop and crawl. The complete code is as follows.

Teste.java

package com.wuxin.main;

import java.io.IOException;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.wuxin.data.PostBar;

public class Teste {
	public static void main(String[] args) {
		try {
			int page = 1;
			Document doc = PostBar.getHtml(); // fetch the bar's front page
			Elements baidupost = doc.select("a.j_th_tit"); // the post links
			System.out.println(baidupost.attr("href"));
			for (Element post : baidupost) {
				page = 1;
				System.out.println(post.attr("href"));
				Document doc0 = Page.getHtml(post.attr("href"), page); // the post's HTML
				int wu = Page.getPage(doc0); // the post's page count
				do {
					Document doc1 = Page.getHtml(post.attr("href"), page);
					Elements resultTitle = doc1.getElementsByClass("core_title_txt"); // the title
					System.out.println("title: " + resultTitle.text());
					System.out.println("page " + page);
					Elements results = doc1.getElementsByClass("d_post_content_main"); // the content
					Elements usernames = doc1.getElementsByClass("icon"); // the user names
					for (int i = 0; i < results.size(); i++) {
						Element result = results.get(i);
						Element links = result.getElementsByTag("div").get(0);
						Element username = usernames.get(i);
						Elements name = username.getElementsByTag("img");
						System.out.println(i + "  " + name.attr("username") + "  " + links.text());
					}
				} while (wu != page++);
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Page.java

package com.wuxin.main;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Page {

	public static Document getHtml(String url, int page) throws IOException {
		// page selects the page number, e.g. http://tieba.baidu.com/p/4201959797?pn=2
		Document doc = Jsoup.connect("http://tieba.baidu.com" + url + "?pn=" + page)
		        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
		        .timeout(3000)
		        .get();

		try {
			Thread.sleep(3000); // pause 3 seconds; crawling too fast will get your IP blocked
		} catch (InterruptedException e) {
			e.printStackTrace();
		}

		return doc;
	}

	public static int getPage(Document doc) {
		Elements resultPage = doc.getElementsByClass("l_reply_num");
		Element result0 = resultPage.get(0);
		Element result1 = result0.getElementsByTag("span").get(1);
		System.out.println(result1.text());
		String number = result1.text();
		return Integer.parseInt(number); // the string converted to an int: the page count
	}
}


PostBar.java

package com.wuxin.data;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PostBar {

	public static Document getHtml() throws IOException {
		Document doc = Jsoup.connect("http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D") // to crawl a different bar, change this URL
		        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
		        .timeout(5000)
		        .get();

		return doc;
	}
}


The results of a run are shown in the figure.


The code is simple; it only crawls three attributes from each post. If you also want the images or the floor details, it isn't hard to extend. You could likewise build on this to store the crawled data in a database (I hear big data is valuable these days, so perhaps this data can even be sold, haha).
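For the database idea, a minimal JDBC sketch might look like the following. Everything here is illustrative: the MySQL database tieba, the table posts(username, content, floor), and the credentials are all assumptions to replace with your own:

	import java.sql.Connection;
	import java.sql.DriverManager;
	import java.sql.PreparedStatement;
	import java.sql.SQLException;

	public class PostStore {
		// hypothetical connection string -- point this at your own database
		private static final String URL = "jdbc:mysql://localhost:3306/tieba";

		public static void save(String username, String content, int floor) throws SQLException {
			try (Connection conn = DriverManager.getConnection(URL, "user", "password");
			     PreparedStatement ps = conn.prepareStatement(
			             "INSERT INTO posts (username, content, floor) VALUES (?, ?, ?)")) {
				ps.setString(1, username);
				ps.setString(2, content);
				ps.setInt(3, floor);
				ps.executeUpdate();
			}
		}
	}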

To crawl a different bar, just replace the URL in PostBar.java. I tested two other bars and both worked.

PS: The weather is cold and I'm feeling lazy, so this crawler only reaches the first 50 posts on the bar's front page. If you want more, work out how the front page itself paginates; a sketch follows.
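A sketch of front-page paging, assuming the list URL accepts a pn parameter that advances in steps of 50 (check the address bar in your browser to confirm):

	// hypothetical helper: fetch page n (0-based) of a bar's post list,
	// assuming the front page advances with &pn= in steps of 50 -- verify this first
	public static Document getListPage(String kw, int n) throws IOException {
		return Jsoup.connect("http://tieba.baidu.com/f?ie=utf-8&kw=" + kw + "&pn=" + (n * 50))
		        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
		        .timeout(5000)
		        .get();
	}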

