A Simple and Crude Crawler

Source: Internet
Author: User

1. Bing's daily picture

Bing's search page shows a new background picture every day.

These daily pictures are archived at http://bing.plmeizi.com/ (the site has collected more than a year of them). The collector's site is http://leil.plmeizi.com/

The archive has 47 pages.

The page URL format is http://bing.plmeizi.com/?page=* where * is the page number.
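The URL pattern above can be turned into a list of pages to visit. The following is a minimal sketch, assuming the query parameter is the lowercase `page` and that pages are numbered from 1 to 47; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrls {
    // Base URL from the article; the lowercase "page" parameter is an assumption.
    static final String BASE = "http://bing.plmeizi.com/?page=";

    // Build the list-page URL for one page number.
    static String pageUrl(int page) {
        return BASE + page;
    }

    // Build the URLs for an inclusive range of pages.
    static List<String> pageUrls(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int p = first; p <= last; p++) {
            urls.add(pageUrl(p));
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = pageUrls(1, 47);
        System.out.println(urls.size() + " pages, first: " + urls.get(0));
    }
}
```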

 

Clicking a picture opens its detail page, which has the image and the name we want.

2. Start coding

The crawler uses Jsoup, which is simple and easy to understand.

HtmlUtil
package util;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlUtil {
    // Fetch the page at the given URL and return it as a parsed Document
    public Document getHtmlTextByUrl(String url) {
        Document doc = null;
        try {
            // doc = Jsoup.connect(url).timeout(5000000).get();
            int i = (int) (Math.random() * 1000); // random countdown, meant as a delay so the site does not block us
            while (i != 0) {
                i--;
            }
            doc = Jsoup.connect(url).data("query", "Java").userAgent("Mozilla").cookie("auth", "token").timeout(300000)
                    .post();
        } catch (IOException e) {
            e.printStackTrace();
            try {
                // If the POST request fails, fall back to a plain GET
                doc = Jsoup.connect(url).timeout(5000000).get();
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }
        return doc;
    }
}
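The countdown loop in HtmlUtil runs at most 1000 iterations, so it finishes almost immediately and adds little real delay. If an actual pause between requests is wanted, one option is `Thread.sleep` with a random duration. This is only a sketch: the class name and the 200 to 1000 ms bounds are illustrative assumptions, not values from the article.

```java
import java.util.concurrent.ThreadLocalRandom;

public class PoliteDelay {
    // Pick a delay in [minMillis, maxMillis); the bounds are illustrative.
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis);
    }

    // Sleep for a random time; restore the interrupt flag if interrupted.
    static void pause(long minMillis, long maxMillis) {
        try {
            Thread.sleep(randomDelayMillis(minMillis, maxMillis));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        pause(200, 1000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("paused for about " + elapsedMs + " ms");
    }
}
```

Calling `PoliteDelay.pause(200, 1000)` before each request would slow the crawl down noticeably, so the bounds are a trade-off between speed and load on the site.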
GetPhoto

The main work here is analyzing the page's HTML and selecting elements, and their values, by attribute.

First, get the detail-page URL of each picture from the list page.

Then open the detail page and get the image URL and the picture name, cutting the name off at the first parenthesis.

Finally, save the picture to the local disk.

package bing;

import java.io.DataInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import util.HtmlUtil;

/**
 *
 * @author loveincode
 * @date Sep 29, 2017 1:15:00 PM
 */
public class GetPhoto {

    public static void go(int startpage, int endpage) throws IOException {

        HtmlUtil htmlutil = new HtmlUtil();
        // Base URL of the list pages
        String url = "http://bing.plmeizi.com/?page=";
        for (int i = startpage; i <= endpage; i++) {
            String gourl = url + i + "";
            Document dochtml = htmlutil.getHtmlTextByUrl(gourl);
            // Each element with class "item" links to one picture's detail page
            Elements elements_a = dochtml.getElementsByClass("item");
            for (int x = 0; x < elements_a.size(); x++) {
                String pyotopage = elements_a.get(x).attr("href");
                Document dochtml_photo = htmlutil.getHtmlTextByUrl(pyotopage);
                Element elements_picurl = dochtml_photo.getElementById("picurl");
                String picurl = elements_picurl.attr("href");
                Element elements_searchlink = dochtml_photo.getElementById("searchlink");
                String name = elements_searchlink.getElementsByTag("span").get(0).html();
                // Keep only the part of the name before the first "("
                name = name.split("\\(")[0];

                if (picurl.contains("jpg")) {
                    // Download the image
                    URL url_pic = new URL(picurl);
                    DataInputStream dataInputStream = new DataInputStream(url_pic.openStream());
                    String imageName = name + ".jpg";
                    FileOutputStream fileOutputStream = new FileOutputStream(new File("bing_pic/" + imageName));
                    byte[] buffer = new byte[1024];
                    int length;
                    while ((length = dataInputStream.read(buffer)) > 0) {
                        fileOutputStream.write(buffer, 0, length);
                    }
                    dataInputStream.close();
                    fileOutputStream.close();
                }
            }
        }

    }

    public static void main(String[] args) throws IOException {
        System.out.println("test");
        go(1, 1);
    }

}
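The name-trimming and save steps in GetPhoto can also be written with `java.nio.file`, which creates the target directory and closes the file for us. The sketch below is one possible variant, not the author's code: the class name, the rule for replacing unsafe file-name characters, and the sample title are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ImageSaver {
    // Keep the text before the first "(" and replace characters that are
    // unsafe in file names; the replacement rule is an assumption.
    static String fileNameFor(String title) {
        String base = title.split("\\(")[0].trim();
        return base.replaceAll("[\\\\/:*?\"<>|]", "_") + ".jpg";
    }

    // Copy the stream into dir/fileName, creating the directory if needed.
    static Path save(InputStream in, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeImage = {1, 2, 3}; // stands in for downloaded image bytes
        Path dir = Files.createTempDirectory("bing_pic");
        Path saved = save(new ByteArrayInputStream(fakeImage), dir,
                fileNameFor("Autumn lake (hypothetical caption)"));
        System.out.println(saved.getFileName() + ", " + Files.size(saved) + " bytes");
    }
}
```

In the crawler, the `InputStream` would come from `new URL(picurl).openStream()` and should be closed after the copy, for example with try-with-resources.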
Mythread
package bing;

import java.io.IOException;

public class Mythread extends Thread {

    private int startpage;

    private int endpage;

    public Mythread(int startpage, int endpage) {
        this.startpage = startpage;
        this.endpage = endpage;
    }

    @SuppressWarnings("static-access")
    @Override
    public void run() {
        GetPhoto getPhoto = new GetPhoto();
        try {
            getPhoto.go(startpage, endpage);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
RUN

Several threads are started so that different page ranges are crawled at the same time.

package bing;

import java.io.IOException;

/**
 *
 * @author loveincode
 * @date Sep 29, 2017 1:55:57 PM
 */
public class RUN {

    public static void main(String[] args) throws IOException {

        long startTime = System.currentTimeMillis(); // get start time

        // Ten threads, each responsible for a range of list pages
        Mythread a1 = new Mythread(1, 5);
        Mythread a2 = new Mythread(6, 10);
        Mythread a3 = new Mythread(11, 15);
        Mythread a4 = new Mythread(16, 20);
        Mythread a5 = new Mythread(21, 25);
        Mythread a6 = new Mythread(26, 30);
        Mythread a7 = new Mythread(31, 35);
        Mythread a8 = new Mythread(36, 40);
        Mythread a9 = new Mythread(41, 45);
        Mythread a10 = new Mythread(46, 47);

        a1.start();
        a2.start();
        a3.start();
        a4.start();
        a5.start();
        a6.start();
        a7.start();
        a8.start();
        a9.start();
        a10.start();

        // Poll until every thread has finished, then print the elapsed time
        while (true) {
            if (a1.isAlive() == false && a2.isAlive() == false && a3.isAlive() == false && a4.isAlive() == false
                    && a5.isAlive() == false && a6.isAlive() == false && a7.isAlive() == false && a8.isAlive() == false
                    && a9.isAlive() == false && a10.isAlive() == false) {
                long endTime = System.currentTimeMillis(); // get the end time
                System.out.println("program running time: " + (endTime - startTime) / 1000.0 + "s");
                break;
            }
        }
    }

}
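The ten hand-written threads and the polling loop in RUN could also be expressed with a thread pool that computes the page ranges and waits for completion. This is a sketch of an alternative, not the author's method: the class name, the pool size, and the counting task are assumptions, and the task body stands in for a call such as `GetPhoto.go(start, end)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PagePool {
    // Split pages first..last into consecutive {start, end} ranges
    // of at most `size` pages each.
    static List<int[]> ranges(int first, int last, int size) {
        List<int[]> out = new ArrayList<>();
        for (int start = first; start <= last; start += size) {
            out.add(new int[] {start, Math.min(start + size - 1, last)});
        }
        return out;
    }

    // Run one task per range on a fixed pool and wait for all of them.
    // The task only counts pages; a real crawler would download them here.
    static int crawlAll(int first, int last, int size, int threads) throws InterruptedException {
        AtomicInteger pages = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int[] r : ranges(first, last, size)) {
            pool.execute(() -> pages.addAndGet(r[1] - r[0] + 1));
        }
        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // block until tasks finish
        return pages.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        int done = crawlAll(1, 47, 5, 10);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(done + " pages in " + elapsed / 1000.0 + " s");
    }
}
```

With 47 pages and a range size of 5, `ranges` yields the same ten ranges used in RUN (1 to 5 through 46 to 47), and `awaitTermination` replaces the busy `while (true)` loop so the main thread does not spin while waiting.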
Run

Downloading the pictures to the local disk took 76.962 s.

The run succeeded.

Result: the pictures are very high quality.

 
