First article: let's start with a simple crawler.
The crawler's job is very simple: fetch the Jiandan "OOXX" page (http://jandan.net/ooxx/page-1537), parse out the "sister" pictures, and save them locally.
Results First:
In terms of procedure, the program runs in three stages:
1. Initiate an HTTP request and get the returned response content;
2. Parse the content and extract the URLs of the valid images;
3. Download the images from those URLs and save them locally.
Now for the detailed walkthrough.
Preparation: the HttpClient jar package, which you can download yourself from http://hc.apache.org/ (or pull in the org.apache.httpcomponents:httpclient dependency if you use Maven).
Main program content:
import java.io.InputStream;

import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class SimpleSpider {
    // Start page
    private static final int page = 1538;

    public static void main(String[] args) {
        // HttpClient timeout configuration
        RequestConfig globalConfig = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.STANDARD)
                .setConnectionRequestTimeout(6000)
                .setConnectTimeout(6000)
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultRequestConfig(globalConfig)
                .build();
        System.out.println("After 5 seconds, start grabbing Jiandan...");
        for (int i = page; i > 0; i--) {
            // Create a GET request
            HttpGet httpGet = new HttpGet("http://jandan.net/ooxx/page-" + i);
            httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36");
            httpGet.addHeader("Cookie", "_gat=1; nsfw-click-load=off; gif-click-load=on; _ga=GA1.2.1861846600.1423061484");
            try {
                // Don't crawl too fast
                Thread.sleep(5000);
                // Send the request and execute it
                CloseableHttpResponse response = httpClient.execute(httpGet);
                InputStream in = response.getEntity().getContent();
                String html = Utils.convertStreamToString(in);
                // Parse the page content in a separate thread
                new Thread(new JianDanHtmlParser(html, i)).start();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
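The main method uses a small Utils.convertStreamToString helper that isn't listed here; the real one is in the GitHub project linked at the end of the article. A minimal sketch of such a helper (an assumed implementation, reading the stream as UTF-8) could look like this:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class Utils {
    // Reads the whole response stream into a String, assuming UTF-8 encoding
    public static String convertStreamToString(InputStream in) throws Exception {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        reader.close();
        return sb.toString();
    }
}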
HttpClient is a very powerful HTTP tool from the Apache HttpComponents project. If you only need a default HttpClient instance, the code is very simple, and the official manual describes it in detail.
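For readers who have never used it, here is a minimal sketch of that default-instance usage (assuming HttpClient 4.x; the URL is just the page used in this article):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class DefaultClientDemo {
    public static void main(String[] args) throws Exception {
        // Default client: no custom timeouts, cookie policy or headers
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet get = new HttpGet("http://jandan.net/ooxx/page-1537");
        CloseableHttpResponse response = client.execute(get);
        try {
            String body = EntityUtils.toString(response.getEntity(), "UTF-8");
            System.out.println("Received " + body.length() + " characters");
        } finally {
            response.close();
            client.close();
        }
    }
}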
You can see that request headers are added when the GET request is created. The first, User-Agent, identifies the browser being used. Some sites insist on knowing which browser the visitor uses, others don't care; my guess is that some sites render differently depending on the browser. Jiandan requires this header to be present. The second, Cookie, carries some user settings and can be omitted. You can inspect both clearly with Chrome's developer tools. If the site uses HTTPS, a dedicated packet-capture tool is needed to see the traffic.
Page content parsing:
import java.util.ArrayList;
import java.util.List;

public class JianDanHtmlParser implements Runnable {
    private String html;
    private int page;

    public JianDanHtmlParser(String html, int page) {
        this.html = html;
        this.page = page;
    }

    @Override
    public void run() {
        System.out.println("========== page " + page + " ============");
        List<String> list = new ArrayList<String>();
        // Keep only the comment list, which is where the images live
        html = html.substring(html.indexOf("commentlist"));
        String[] images = html.split("li>");
        for (String image : images) {
            String[] ss = image.split("br");
            for (String s : ss) {
                // Look for an <img src="..."> fragment and cut out the URL between the quotes
                if (s.indexOf("<img src=\"") > 0) {
                    try {
                        int i = s.indexOf("<img src=\"") + "<img src=\"".length();
                        list.add(s.substring(i, s.indexOf("\"", i + 1)));
                    } catch (Exception e) {
                        System.out.println(s);
                    }
                }
            }
        }
        for (String imageUrl : list) {
            if (imageUrl.indexOf("sina") > 0) {
                new Thread(new JianDanImageCreator(imageUrl, page)).start();
            }
        }
    }
}
This code looks messy, but it is actually very simple: it parses the HTML string returned in the response, cuts it apart, finds the real content (the image URLs), and stores them in a temporary list.
The class that saves the images:
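As a side note (this is not what the parser above does, just an alternative), the same extraction could also be written with a regular expression; a small sketch, assuming the img tags keep the simple src="..." form seen on the page:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {
    // Captures the value of the src attribute of every <img> tag
    private static final Pattern IMG_SRC = Pattern.compile("<img\\s+src=\"([^\"]+)\"");

    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}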
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class JianDanImageCreator implements Runnable {
    private static int count = 0;
    private String imageUrl;
    private int page;
    // Storage path; change it to whatever you like
    private static final String basePath = "E:/jiandan";

    public JianDanImageCreator(String imageUrl, int page) {
        this.imageUrl = imageUrl;
        this.page = page;
    }

    @Override
    public void run() {
        File dir = new File(basePath);
        if (!dir.exists()) {
            dir.mkdirs();
            System.out.println("Images will be stored in the " + basePath + " directory");
        }
        String imageName = imageUrl.substring(imageUrl.lastIndexOf("/") + 1);
        try {
            File file = new File(basePath + "/" + page + "--" + imageName);
            OutputStream os = new FileOutputStream(file);
            // Create a URL object and open a stream to the remote image
            URL url = new URL(imageUrl);
            InputStream is = url.openStream();
            byte[] buff = new byte[1024];
            while (true) {
                int readed = is.read(buff);
                if (readed == -1) {
                    break;
                }
                byte[] temp = new byte[readed];
                System.arraycopy(buff, 0, temp, 0, readed);
                // Write this chunk to the local file
                os.write(temp);
            }
            System.out.println("Saved image No." + (count++) + ": " + file.getAbsolutePath());
            is.close();
            os.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
A URL object is created from each image's src address, and a byte stream copies the remote image into a local file.
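Incidentally (this is not the code above, just an alternative for readers on Java 7 or later), the manual byte-buffer loop can be replaced with java.nio.file.Files.copy; a brief sketch:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ImageDownloader {
    // Downloads one image URL to the given local file path, e.g. "E:/jiandan/1537--foo.jpg"
    public static void download(String imageUrl, String targetFile) throws Exception {
        Path target = Paths.get(targetFile);
        // Make sure the parent directory exists
        Files.createDirectories(target.getParent());
        InputStream in = new URL(imageUrl).openStream();
        try {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            in.close();
        }
    }
}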
This program is fairly simple and purely for entertainment. If it sparks some interest in HttpClient among readers who haven't used the library yet, so much the better.
GitHub address: https://github.com/nbsa/SimpleSpider