[Java] Using HttpClient to implement a simple crawler that grabs sister images from Jandan

Source: Internet
Author: User

This is my first article, so I'll start with a simple crawler.

The crawler's job is very simple: fetch Jandan's "OOXX" pages (http://jandan.net/ooxx/page-1537), parse out the sister images, and save them locally.

Results first:

In terms of procedure, the work falls into three stages:

1. Initiate an HTTP request and get the response content;

2. Parse the content and extract the URLs of the valid images;

3. Download the images from those URLs and save them locally.

Now the detailed walkthrough.

Preparation: the HttpClient jar package, which you can download yourself from http://hc.apache.org/.
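If you use Maven instead of downloading the jar by hand, the HttpClient 4.x dependency can be declared like this (the version number here is one released 4.5.x and is only an example; pick whichever 4.x release suits you):

```xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>
```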

Main program content:

public class SimpleSpider {
    // Start page
    private static final int page = 1538;

    public static void main(String[] args) {
        // HttpClient timeout configuration
        RequestConfig globalConfig = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.STANDARD)
                .setConnectionRequestTimeout(6000)
                .setConnectTimeout(6000)
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultRequestConfig(globalConfig)
                .build();
        System.out.println("After 5 seconds, start grabbing Jandan ...");
        for (int i = page; i > 0; i--) {
            // Create a GET request
            HttpGet httpGet = new HttpGet("http://jandan.net/ooxx/page-" + i);
            httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36");
            httpGet.addHeader("Cookie", "_gat=1; nsfw-click-load=off; gif-click-load=on; _ga=GA1.2.1861846600.1423061484");
            try {
                // Don't dare crawl too fast
                Thread.sleep(5000);
                // Send the request and execute it
                CloseableHttpResponse response = httpClient.execute(httpGet);
                InputStream in = response.getEntity().getContent();
                String html = Utils.convertStreamToString(in);
                // Parse the page content in a separate thread
                new Thread(new JiandanHtmlParser(html, i)).start();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

HttpClient is a very powerful tool from an Apache project. If all you need is a default HttpClient instance, the code is simple, and the official manual covers it in detail.

You can see that request headers are added when the GET request is created. The first, User-Agent, identifies the browser being used. Some sites insist on knowing exactly which browser the user runs, others don't; my personal guess is that some sites render differently for different browsers. Jandan does require this header. The second, Cookie, carries some user settings and can be omitted. You can inspect both clearly with Chrome's developer tools. If the site uses HTTPS encryption, a dedicated packet-capture tool is required.
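The same header-setting can be sketched with the JDK's built-in HttpURLConnection, with no HttpClient jar on the classpath. This is only an illustration; the class name HeaderDemo and the trimmed Cookie value are invented, while the User-Agent string mirrors the one in the article. Nothing is actually sent, because connect() is never called:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderDemo {
    // Build a GET request carrying the same kind of User-Agent and Cookie
    // headers the article sets via HttpClient; no network I/O happens here.
    static HttpURLConnection buildRequest(String pageUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36");
        conn.setRequestProperty("Cookie", "nsfw-click-load=off; gif-click-load=on");
        return conn;
    }

    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = buildRequest("http://jandan.net/ooxx/page-1537");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Request properties can be read back before the connection is opened, which is handy for checking header setup without touching the network.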

Page content parsing

public class JiandanHtmlParser implements Runnable {
    private String html;
    private int page;

    public JiandanHtmlParser(String html, int page) {
        this.html = html;
        this.page = page;
    }

    @Override
    public void run() {
        System.out.println("========== page " + page + " ============");
        List<String> list = new ArrayList<String>();
        // Skip ahead to the comment list, where the images live
        html = html.substring(html.indexOf("commentlist"));
        String[] images = html.split("li>");
        for (String image : images) {
            String[] ss = image.split("br");
            for (String s : ss) {
                if (s.indexOf("<img src=\"") > 0) {
                    try {
                        // Take everything between src=" and the closing quote
                        int i = s.indexOf("<img src=\"") + "<img src=\"".length();
                        list.add(s.substring(i, s.indexOf("\"", i + 1)));
                    } catch (Exception e) {
                        System.out.println(s);
                    }
                }
            }
        }
        for (String imageUrl : list) {
            if (imageUrl.indexOf("sina") > 0) {
                new Thread(new JiandanImageCreator(imageUrl, page)).start();
            }
        }
    }
}

This piece of code looks messy, but it's actually very simple. In short, it parses the HTML string returned in the response, cuts it apart, finds the real content (the image URLs), and stores them in a temporary container.
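The cutting-apart can be seen on a tiny, self-contained sample. The HTML snippet and the class name ParseDemo below are invented for illustration, but the split/indexOf/substring extraction follows the same pattern as the parser above:

```java
import java.util.ArrayList;
import java.util.List;

public class ParseDemo {
    // Extract src attributes the way the parser does: split the HTML into
    // fragments, find the marker, then read up to the closing quote.
    static List<String> extractImageUrls(String html) {
        List<String> list = new ArrayList<String>();
        String marker = "<img src=\"";
        String[] parts = html.split("br");
        for (String s : parts) {
            int start = s.indexOf(marker);
            if (start >= 0) {
                int i = start + marker.length();
                list.add(s.substring(i, s.indexOf("\"", i + 1)));
            }
        }
        return list;
    }

    public static void main(String[] args) {
        String html = "<li><img src=\"http://ww1.sinaimg.cn/a.jpg\" /><br/>"
                    + "<img src=\"http://ww2.sinaimg.cn/b.jpg\" /></li>";
        System.out.println(extractImageUrls(html));
        // [http://ww1.sinaimg.cn/a.jpg, http://ww2.sinaimg.cn/b.jpg]
    }
}
```

String surgery like this is fragile if the page markup changes; a real HTML parser such as jsoup would be sturdier, but the point here is to show the technique the article uses.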

The image-writing class

public class JiandanImageCreator implements Runnable {
    private static int count = 0;
    private String imageUrl;
    private int page;
    // Storage path; customize as needed
    private static final String basePath = "E:/jiandan";

    public JiandanImageCreator(String imageUrl, int page) {
        this.imageUrl = imageUrl;
        this.page = page;
    }

    @Override
    public void run() {
        File dir = new File(basePath);
        if (!dir.exists()) {
            dir.mkdirs();
            System.out.println("The images are stored in the " + basePath + " directory");
        }
        String imageName = imageUrl.substring(imageUrl.lastIndexOf("/") + 1);
        try {
            File file = new File(basePath + "/" + page + "--" + imageName);
            OutputStream os = new FileOutputStream(file);
            // Create a URL object
            URL url = new URL(imageUrl);
            InputStream is = url.openStream();
            byte[] buff = new byte[1024];
            while (true) {
                int readed = is.read(buff);
                if (readed == -1) {
                    break;
                }
                byte[] temp = new byte[readed];
                System.arraycopy(buff, 0, temp, 0, readed);
                // Write to file
                os.write(temp);
            }
            System.out.println("No. " + (count++) + " sister image: " + file.getAbsolutePath());
            is.close();
            os.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Create a URL object from each image's src address, then use a byte stream to write it into a local file.
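The byte-copy loop can be exercised without any network access. In this sketch a ByteArrayInputStream stands in for url.openStream(), and the class name CopyDemo is invented; the loop also uses the three-argument write(buff, 0, readed) overload, which avoids the temporary-array copy the article's version makes but behaves identically:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyDemo {
    // Same 1024-byte buffer loop as JiandanImageCreator: read until EOF,
    // writing only the bytes actually read on each pass.
    static void copy(InputStream is, OutputStream os) throws IOException {
        byte[] buff = new byte[1024];
        int readed;
        while ((readed = is.read(buff)) != -1) {
            os.write(buff, 0, readed);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = new byte[3000]; // stand-in for downloaded image bytes
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(fake), out);
        System.out.println(out.size()); // 3000
    }
}
```

The last pass through the loop reads only 952 bytes (3000 = 2 × 1024 + 952), which is exactly why the write must be bounded by readed rather than the buffer length.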

This program is fairly simple and purely for fun. If it sparks some interest in HttpClient among those who don't yet know the library, so much the better.

GitHub address: https://github.com/nbsa/SimpleSpider
