A Simple, Crude Crawler
1. Bing's picture of the day
Bing search shows a new, beautiful background picture every day.
Bing pictures of the day: http://bing.plmeizi.com/ (more than a year of daily pictures is collected here). Collector: http://leil.plmeizi.com/
The site has 47 pages.
The page URL format is http://bing.plmeizi.com/?page=*
Clicking a picture opens its detail page, which contains the image and the name we want.
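As a small illustration of the URL format above, the listing pages can be enumerated by appending the page number to the base URL. This is only a sketch: the class name PageUrls and the method build are made up for this example, and the page count of 47 is the one stated above.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrls {
    // Base URL taken from the article; the page number is appended to it.
    static final String BASE = "http://bing.plmeizi.com/?page=";

    // Build the listing-page URLs for pages first..last (inclusive).
    static List<String> build(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            urls.add(BASE + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = build(1, 47);
        System.out.println(urls.size() + " pages, first: " + urls.get(0));
    }
}
```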
2. Start coding
Crawling with Jsoup is simple and easy to understand.
HtmlUtil
package util;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlUtil {
    // Fetch the page at the given URL and return it as a parsed Document
    public Document getHtmlTextByUrl(String url) {
        Document doc = null;
        try {
            // doc = Jsoup.connect(url).timeout(5000000).get();
            // Intended as a random delay so the site is less likely to block us;
            // note that this countdown loop finishes almost instantly
            int i = (int) (Math.random() * 1000);
            while (i != 0) {
                i--;
            }
            doc = Jsoup.connect(url).data("query", "Java").userAgent("Mozilla").cookie("auth", "token").timeout(300000)
                    .post();
        } catch (IOException e) {
            e.printStackTrace();
            // If the POST fails, retry once with a plain GET
            try {
                doc = Jsoup.connect(url).timeout(5000000).get();
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }
        return doc;
    }
}
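The countdown loop in HtmlUtil is meant as a random delay, but a loop of at most 1000 iterations takes almost no time. If a real pause between requests is wanted, one possible approach is Thread.sleep with a random duration. The sketch below is not part of the original code; the class name PoliteDelay and the 200 to 1000 ms range are assumptions chosen for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

class PoliteDelay {
    // Sleep for a random duration in [minMillis, maxMillis] and return the time slept.
    static long pause(long minMillis, long maxMillis) throws InterruptedException {
        // nextLong(origin, bound) excludes the bound, hence the + 1
        long millis = ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
        Thread.sleep(millis);
        return millis;
    }

    public static void main(String[] args) throws InterruptedException {
        long slept = pause(200, 1000);
        System.out.println("slept " + slept + " ms");
    }
}
```

Calling something like this before each request would slow the crawl down, so the running time reported at the end of this post would no longer apply.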
GetPhoto
This code mainly involves analyzing the HTML attributes and retrieving elements and their values by those attributes.
First, get the detail-page URL of each picture.
Then open the detail page and get the image URL. The image name is cut off at the first "(".
Finally, save the image to local disk.
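The name-truncation step described above can be sketched on its own. This version uses indexOf instead of split so that a title consisting only of "(" does not cause an exception, and it also replaces characters that are not valid in file names. Both choices are additions for illustration, as are the class name ImageNames and the sample title.

```java
class ImageNames {
    // Keep the part of the title before the first "(" and trim surrounding whitespace.
    static String baseName(String title) {
        int cut = title.indexOf('(');
        String base = cut >= 0 ? title.substring(0, cut) : title;
        return base.trim();
    }

    // Turn a title into a .jpg file name, replacing characters that are
    // not allowed in file names on common file systems.
    static String fileName(String title) {
        return baseName(title).replaceAll("[\\\\/:*?\"<>|]", "_") + ".jpg";
    }

    public static void main(String[] args) {
        // Hypothetical title, for illustration only.
        System.out.println(fileName("Sunrise over the lake (Example Photographer)"));
    }
}
```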
package bing;

import java.io.DataInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import util.HtmlUtil;

/**
 *
 * @author loveincode
 * @date Sep 29, 2017 1:15:00 PM
 */
public class GetPhoto {

    public static void go(int startpage, int endpage) throws IOException {

        HtmlUtil htmlutil = new HtmlUtil();
        // Listing-page URL; the page number is appended to it
        String url = "http://bing.plmeizi.com/?page=";
        for (int i = startpage; i <= endpage; i++) {
            String gourl = url + i + "";
            Document dochtml = htmlutil.getHtmlTextByUrl(gourl);
            // Each element with class "item" links to one detail page
            Elements elements_a = dochtml.getElementsByClass("item");
            for (int x = 0; x < elements_a.size(); x++) {
                String pyotopage = elements_a.get(x).attr("href");
                Document dochtml_photo = htmlutil.getHtmlTextByUrl(pyotopage);
                Element elements_picurl = dochtml_photo.getElementById("picurl");
                String picurl = elements_picurl.attr("href");
                Element elements_searchlink = dochtml_photo.getElementById("searchlink");
                String name = elements_searchlink.getElementsByTag("span").get(0).html();
                // Keep only the part of the name before the first "("
                name = name.split("\\(")[0];

                if (picurl.contains("jpg")) {
                    // Download the image
                    URL url_pic = new URL(picurl);
                    DataInputStream dataInputStream = new DataInputStream(url_pic.openStream());
                    String imageName = name + ".jpg";
                    FileOutputStream fileOutputStream = new FileOutputStream(new File("bing_pic/" + imageName));
                    byte[] buffer = new byte[1024];
                    int length;
                    while ((length = dataInputStream.read(buffer)) > 0) {
                        fileOutputStream.write(buffer, 0, length);
                    }
                    dataInputStream.close();
                    fileOutputStream.close();
                }
            }
        }

    }

    public static void main(String[] args) throws IOException {
        System.out.println("test");
        go(1, 1);
    }

}
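The download step in GetPhoto closes its streams by hand, so they stay open if an exception is thrown partway through, and it relies on the bing_pic directory already existing. A possible alternative, sketched here under those observations, is try-with-resources together with Files.copy. The class name ImageSaver and the method save are hypothetical, and the example feeds in stand-in bytes rather than a real download so that it runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ImageSaver {
    // Copy everything from 'in' into dir/fileName, creating 'dir' if needed.
    // Returns the number of bytes written.
    static long save(InputStream in, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        try (InputStream source = in) { // closed even if the copy fails
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in the crawler the stream would come from new URL(picurl).openStream().
        byte[] fake = {1, 2, 3, 4};
        Path dir = Files.createTempDirectory("bing_pic");
        long written = save(new ByteArrayInputStream(fake), dir, "example.jpg");
        System.out.println(written + " bytes saved to " + dir.resolve("example.jpg"));
    }
}
```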
Mythread
package bing;

import java.io.IOException;

public class Mythread extends Thread {

    private int startpage;

    private int endpage;

    public Mythread(int startpage, int endpage) {
        this.startpage = startpage;
        this.endpage = endpage;
    }

    @Override
    public void run() {
        try {
            // go() is static, so it is called on the class directly
            GetPhoto.go(startpage, endpage);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
RUN
Several threads are started, each with its own range of pages, so that images are crawled concurrently.
package bing;

import java.io.IOException;

/**
 *
 * @author loveincode
 * @date Sep 29, 2017 1:55:57 PM
 */
public class RUN {

    public static void main(String[] args) throws IOException {

        long startTime = System.currentTimeMillis(); // get the start time

        // Ten threads, each responsible for a range of the 47 pages
        Mythread a1 = new Mythread(1, 5);
        Mythread a2 = new Mythread(6, 10);
        Mythread a3 = new Mythread(11, 15);
        Mythread a4 = new Mythread(16, 20);
        Mythread a5 = new Mythread(21, 25);
        Mythread a6 = new Mythread(26, 30);
        Mythread a7 = new Mythread(31, 35);
        Mythread a8 = new Mythread(36, 40);
        Mythread a9 = new Mythread(41, 45);
        Mythread a10 = new Mythread(46, 47);

        a1.start();
        a2.start();
        a3.start();
        a4.start();
        a5.start();
        a6.start();
        a7.start();
        a8.start();
        a9.start();
        a10.start();

        // Poll until every thread has finished, then print the elapsed time
        while (true) {
            if (!a1.isAlive() && !a2.isAlive() && !a3.isAlive() && !a4.isAlive()
                    && !a5.isAlive() && !a6.isAlive() && !a7.isAlive() && !a8.isAlive()
                    && !a9.isAlive() && !a10.isAlive()) {
                long endTime = System.currentTimeMillis(); // get the end time
                System.out.println("program running time: " + (endTime - startTime) / 1000.0 + "s");
                break;
            }
        }
    }

}
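The polling loop in RUN keeps a CPU core busy while it waits. A possible alternative is to compute the page ranges in a loop and wait with Thread.join(), which blocks until each thread has finished. The sketch below is an illustration, not the original code: the class name PageRanges is made up, and each thread only prints its range as a placeholder for new Mythread(start, end), so the example runs without the crawler classes.

```java
import java.util.ArrayList;
import java.util.List;

class PageRanges {
    // Split pages 1..totalPages into consecutive [start, end] ranges of at most 'size' pages.
    static List<int[]> split(int totalPages, int size) {
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += size) {
            ranges.add(new int[] {start, Math.min(start + size - 1, totalPages)});
        }
        return ranges;
    }

    public static void main(String[] args) throws InterruptedException {
        long startTime = System.currentTimeMillis();
        List<Thread> threads = new ArrayList<>();
        for (int[] r : split(47, 5)) {
            // Placeholder task; in the crawler this would be new Mythread(r[0], r[1]).
            Thread t = new Thread(() -> System.out.println("pages " + r[0] + "-" + r[1]));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join(); // blocks until this thread finishes, without spinning
        }
        System.out.println("program running time: " + (System.currentTimeMillis() - startTime) / 1000.0 + " s");
    }
}
```

With 47 pages and a chunk size of 5 this yields the same ten ranges as the hand-written list, from 1 to 5 through 46 to 47.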
Running it
Downloading the images to local disk took 76.962 s.
Success.
Result:
The image quality is very high.