For the Java final exam, the teacher unexpectedly announced there would be no written test; instead, our programs would be graded... I was caught completely off guard...
Anyway, I decided to write him a Baidu Tieba crawler, using Jsoup to make the parsing and crawling easy.
I used our school's forum (Guilin University of Technology) for the experiment. This is just a simple test, so please don't flame it.
First, use Jsoup to connect to and parse the page:
Document doc = Jsoup.connect("http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D") // the forum index URL
        .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
        .timeout(3000) // set the connection timeout
        .get(); // fetch the URL with an HTTP GET
Open the forum's front page in Chrome, press F12, and inspect a post, as shown in the picture:
<a href="/p/4201959797"
On the forum index page we need to find the hyperlink to each post; the "/p/4201959797" above is the data we are after.
It can be found with Elements baidupost = doc.select("a.j_th_tit");. Here a matches <a> hyperlink tags and .j_th_tit is the class inside the tag, so doc.select("a.j_th_tit") finds every <a> tag that carries the j_th_tit class.
Elements baidupost = doc.select("a.j_th_tit"); // post URLs
System.out.println(baidupost.attr("href"));
The output is as follows (it actually matches every post; I only print one here):
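To make it concrete what that selector matches, here is a tiny stand-alone sketch that pulls the href out of a j_th_tit anchor with a plain regular expression. The HTML snippet is a simplified stand-in for the real page, and the regex is only an illustration of what the selector finds, not a substitute for doc.select:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SelectorSketch {
    public static void main(String[] args) {
        // A simplified snippet of the forum index HTML, as seen in the browser inspector.
        String html = "<a rel=\"noreferrer\" href=\"/p/4201959797\" title=\"...\" class=\"j_th_tit\">post title</a>";
        // Jsoup's doc.select("a.j_th_tit") matches <a> tags carrying the j_th_tit class;
        // this regex approximates that for the single snippet above.
        Pattern p = Pattern.compile("<a[^>]*href=\"([^\"]+)\"[^>]*class=\"j_th_tit\"[^>]*>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints /p/4201959797
        }
    }
}
```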
Once we have this data, we prepend http://tieba.baidu.com to each matched URL and visit the post. The code is as follows:
public static Document getHtml(String url, int page) throws IOException {
    // page n of a post looks like http://tieba.baidu.com/p/4201959797?pn=2
    Document doc = Jsoup.connect("http://tieba.baidu.com" + url + "?pn=" + page)
            .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
            .timeout(3000)
            .get();
    try {
        Thread.sleep(3000); // pause 3 seconds; crawl too fast and your IP gets blocked
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return doc;
}
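The detail worth noticing in getHtml is how the paged URL is assembled: the relative href scraped from the index is glued between the host and a ?pn= query parameter. A minimal sketch of just that string concatenation, using the example post from above:

```java
public class PageUrlSketch {
    public static void main(String[] args) {
        String host = "http://tieba.baidu.com";
        String href = "/p/4201959797"; // relative URL scraped from the index page
        int page = 2;
        // Same concatenation as in getHtml(): host + relative href + page query parameter.
        String url = host + href + "?pn=" + page;
        System.out.println(url); // prints http://tieba.baidu.com/p/4201959797?pn=2
    }
}
```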
Now we can enter the post itself. Here we only collect user names, post content, and floor numbers.
To match the title, use getElementsByClass to look up the core_title_txt class, which is the class of the title element:
Elements resultTitle = doc1.getElementsByClass("core_title_txt"); // title
Content and user name:
Elements results = doc1.getElementsByClass("d_post_content_main"); // elements whose class contains the content
Elements usernames = doc1.getElementsByClass("icon"); // elements whose class contains the user name
for (int i = 0; i < results.size(); i++) {
    Element result = results.get(i);
    Element links = result.getElementsByTag("div").get(0); // content
    Element username = usernames.get(i);
    Elements name = username.getElementsByTag("img"); // user name
    System.out.println(i + " " + name.attr("username") + " " + links.text());
}
With this we can basically crawl everything we need from one page of a post. Of course, long posts span several pages, so we also have to find out how many pages a post has. The code is as follows:
public static int getPage(Document doc) {
    Elements resultPage = doc.getElementsByClass("l_reply_num");
    Element result0 = resultPage.get(0);
    Element result1 = result0.getElementsByTag("span").get(1);
    System.out.println(result1.text());
    String number = result1.text();
    int page = Integer.parseInt(number); // convert the string to an int
    return page; // number of pages
}
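One caveat: Integer.parseInt throws a NumberFormatException if the scraped <span> text is not a plain number, so it can be worth guarding the conversion. A small sketch of that guard (the sample inputs are made up, and falling back to 1 page is just one possible choice):

```java
public class PageCountSketch {
    // Parse the reply-count text scraped from the l_reply_num element;
    // fall back to 1 page when the text is not a plain number.
    static int parsePageCount(String text) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePageCount("7"));   // prints 7
        System.out.println(parsePageCount("abc")); // prints 1
    }
}
```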
All right, everything we need is in place. Set up a loop to crawl; the complete code is as follows.
Teste.java
package com.wuxin.main;

import java.io.IOException;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.wuxin.data.PostBar;

public class Teste {
    public static void main(String[] args) {
        try {
            int page = 1;
            Document doc = PostBar.getHtml(); // visit the forum index
            Elements baiduposts = doc.select("a.j_th_tit"); // post URLs
            System.out.println(baiduposts.attr("href"));
            for (Element baidupost : baiduposts) {
                page = 1;
                System.out.println(baidupost.attr("href"));
                Document doc0 = Page.getHtml(baidupost.attr("href"), page); // post HTML
                int wu = Page.getPage(doc0); // number of pages in the post
                do {
                    Document doc1 = Page.getHtml(baidupost.attr("href"), page);
                    Elements resultTitle = doc1.getElementsByClass("core_title_txt"); // title
                    System.out.println("Title: " + resultTitle.text());
                    System.out.println("Page " + page);
                    Elements results = doc1.getElementsByClass("d_post_content_main"); // content
                    Elements usernames = doc1.getElementsByClass("icon"); // user names
                    for (int i = 0; i < results.size(); i++) {
                        Element result = results.get(i);
                        Element links = result.getElementsByTag("div").get(0);
                        Element username = usernames.get(i);
                        Elements name = username.getElementsByTag("img");
                        System.out.println(i + " " + name.attr("username") + " " + links.text());
                    }
                } while (wu != page++);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
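The loop condition while (wu != page++) in Teste.java is terse: it compares the current page number against the total page count wu first, then increments page, so pages 1 through wu are each fetched exactly once. A sketch of just that loop logic, with the fetching replaced by a counter:

```java
public class PaginationSketch {
    public static void main(String[] args) {
        int wu = 3;   // total number of pages, as getPage() would return it
        int page = 1; // current page
        int fetched = 0;
        do {
            fetched++; // stands in for getHtml(href, page) and the scraping work
        } while (wu != page++); // post-increment: compare first, then advance
        System.out.println(fetched); // prints 3
    }
}
```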
Page.java
package com.wuxin.main;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Page {
    public static Document getHtml(String url, int page) throws IOException {
        Document doc = Jsoup.connect("http://tieba.baidu.com" + url + "?pn=" + page)
                .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
                .timeout(3000)
                .get();
        try {
            Thread.sleep(3000); // pause 3 seconds; crawl too fast and your IP gets blocked
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return doc;
    }

    public static int getPage(Document doc) {
        Elements resultPage = doc.getElementsByClass("l_reply_num");
        Element result0 = resultPage.get(0);
        Element result1 = result0.getElementsByTag("span").get(1);
        System.out.println(result1.text());
        int page = Integer.parseInt(result1.text()); // convert the string to an int
        return page; // number of pages
    }
}
PostBar.java
package com.wuxin.data;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PostBar {
    public static Document getHtml() throws IOException {
        Document doc = Jsoup.connect("http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D") // change this URL to crawl a different forum
                .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
                .timeout(5000)
                .get();
        return doc;
    }
}
The results of a run are shown in the figure.
The code is simple: it only crawls three attributes from the forum. If you also want images, floor numbers, and so on, it is not hard to extend, and you can build on this to store the crawled data in a database (I hear big data is worth a lot these days; well, maybe this data could be sold too, haha).
To crawl a different forum, just replace the URL in PostBar.java; I tested two others and both worked.
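The kw parameter in that URL is just the forum name, percent-encoded as UTF-8. If you prefer to build the URL from a plain forum name rather than pasting the encoded form, java.net.URLEncoder can produce it; the name below appears to be the one encoded in the PostBar.java URL (written with Unicode escapes to keep the source ASCII):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ForumUrlSketch {
    public static void main(String[] args) {
        // "\u90d1\u79c0\u598d" is the forum name; percent-encode it for the kw parameter.
        String name = "\u90d1\u79c0\u598d";
        String kw = URLEncoder.encode(name, StandardCharsets.UTF_8);
        // prints http://tieba.baidu.com/f?ie=utf-8&kw=%E9%83%91%E7%A7%80%E5%A6%8D
        System.out.println("http://tieba.baidu.com/f?ie=utf-8&kw=" + kw);
    }
}
```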
PS: The weather is cold and I am feeling a bit lazy, so this crawler only crawls the first 50 posts on the index page. If you want more, look at how the index page itself paginates and work out the page turning yourself.
Source download