A Simple and Crude Crawler

Last Update:2017-09-30 Source: Internet

Author: User

1. Bing's picture of the day

Bing search shows a new, beautiful background picture every day.

The site http://bing.plmeizi.com/ has collected more than a year of Bing's daily pictures. Collector: http://leil.plmeizi.com/

The collection spans 47 pages.

The page URL format is http://bing.plmeizi.com/?page=*

Clicking an entry opens its detail page, which has the image and the name we want.
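The page count and URL format above are enough to enumerate every listing page. The following is a minimal sketch, assuming the query parameter is a lowercase `page` and that pages are numbered from 1; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PageUrls {
    // Base address from the URL format above; the parameter name is an assumption.
    static final String BASE = "http://bing.plmeizi.com/?page=";

    // Builds the listing-page URLs for pages first..last (both inclusive).
    static List<String> between(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            urls.add(BASE + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = between(1, 47);
        System.out.println(urls.size() + " pages, first: " + urls.get(0));
    }
}
```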

2. Start Coding

Jsoup makes crawling simple, and the resulting code is easy to understand.

HtmlUtil

package util;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlUtil {
    // Fetch the page at the given url and return it as a parsed Document
    public Document getHtmlTextByUrl(String url) {
        Document doc = null;
        try {
            // doc = Jsoup.connect(url).timeout(5000000).get();
            // Intended as a random delay so the site does not block us. This
            // countdown finishes almost instantly; Thread.sleep would give a real pause.
            int i = (int) (Math.random() * 1000);
            while (i != 0) {
                i--;
            }
            doc = Jsoup.connect(url).data("query", "Java").userAgent("Mozilla").cookie("auth", "token").timeout(300000)
                    .post();
        } catch (IOException e) {
            e.printStackTrace();
            // If the POST fails, retry once with a plain GET
            try {
                doc = Jsoup.connect(url).timeout(5000000).get();
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }
        return doc; // null if both attempts failed
    }
}
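The countdown loop in HtmlUtil is meant as a random delay, but counting down from at most 999 takes almost no time. One possible alternative, shown here only as a sketch, is to sleep for a random number of milliseconds before each request. The class name, method names, and the 200 to 1000 ms range are assumptions for illustration, not part of the original code.

```java
import java.util.concurrent.ThreadLocalRandom;

class PoliteDelay {
    // Picks a pause length in [minMillis, maxMillis], both inclusive.
    static long pick(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    // Sleeps for a random time in the given range and returns the time chosen.
    static long pause(long minMillis, long maxMillis) {
        long millis = pick(minMillis, maxMillis);
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Restore the flag so callers can still see the interruption.
            Thread.currentThread().interrupt();
        }
        return millis;
    }

    public static void main(String[] args) {
        long waited = pause(200, 1000);
        System.out.println("paused for " + waited + " ms");
    }
}
```

Calling `PoliteDelay.pause(200, 1000)` at the top of `getHtmlTextByUrl` would slow each request by up to a second, so total run time grows with the number of pages fetched.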

GetPhoto

The main work here is analyzing the page's HTML and selecting elements, and their values, by attribute.

First, get the detail-page URL of each image from the listing page.

Then open the detail page and get the image URL. The image name is trimmed at the first "(".

Finally, save the image to the local disk.
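The name-trimming step can be sketched on its own. The example below assumes the caption has the form "title (credit)", which is what splitting on "(" suggests; the sample captions, the class name, and the extra replacement of characters that are invalid in file names are illustrative additions, not something the original code does.

```java
class ImageNames {
    // Keeps the text before the first "(" and trims surrounding spaces.
    static String titleOnly(String caption) {
        int paren = caption.indexOf('(');
        String title = paren >= 0 ? caption.substring(0, paren) : caption;
        return title.trim();
    }

    // Replaces characters that many file systems reject, then adds the extension.
    static String toFileName(String caption) {
        return titleOnly(caption).replaceAll("[\\\\/:*?\"<>|]", "_") + ".jpg";
    }

    public static void main(String[] args) {
        // Hypothetical caption, used only to show the trimming.
        System.out.println(toFileName("Autumn lake (© Example Photo)"));
    }
}
```

Using `indexOf` rather than `split` avoids an empty result array when the caption consists only of "(".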

package bing;

import java.io.DataInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import util.HtmlUtil;

/**
 *
 * @author loveincode
 * @date Sep 29, 2017 1:15:00 PM
 */
public class GetPhoto {

    public static void go(int startpage, int endpage) throws IOException {

        HtmlUtil htmlutil = new HtmlUtil();
        // Listing-page URL; the page number is appended below
        String url = "http://bing.plmeizi.com/?page=";
        for (int i = startpage; i <= endpage; i++) {
            String gourl = url + i + "";
            Document dochtml = htmlutil.getHtmlTextByUrl(gourl);
            // Each element with class "item" links to one detail page
            Elements elements_a = dochtml.getElementsByClass("item");
            for (int x = 0; x < elements_a.size(); x++) {
                String pyotopage = elements_a.get(x).attr("href");
                Document dochtml_photo = htmlutil.getHtmlTextByUrl(pyotopage);
                // "picurl" holds the image link and "searchlink" the caption
                // (lookup by id is assumed here)
                Element elements_picurl = dochtml_photo.getElementById("picurl");
                String picurl = elements_picurl.attr("href");
                Element elements_searchlink = dochtml_photo.getElementById("searchlink");
                String name = elements_searchlink.getElementsByTag("span").get(0).html();
                // Keep only the part of the caption before the first "("
                name = name.split("\\(")[0];

                if (picurl.contains("jpg")) {
                    // Download the image
                    URL url_pic = new URL(picurl);
                    DataInputStream dataInputStream = new DataInputStream(url_pic.openStream());
                    String imageName = name + ".jpg";
                    // The bing_pic directory must already exist
                    FileOutputStream fileOutputStream = new FileOutputStream(new File("bing_pic/" + imageName));
                    byte[] buffer = new byte[1024];
                    int length;
                    while ((length = dataInputStream.read(buffer)) > 0) {
                        fileOutputStream.write(buffer, 0, length);
                    }
                    dataInputStream.close();
                    fileOutputStream.close();
                }
            }
        }

    }

    public static void main(String[] args) throws IOException {
        System.out.println("test");
        go(1, 1);
    }

}
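The download step in GetPhoto copies the stream by hand and closes it only if nothing fails along the way. As a hedged alternative, the copy can be delegated to `java.nio.file.Files.copy`, which also makes it easy to create the target directory first. The class name `ImageSaver` and the in-memory demo data are assumptions made for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ImageSaver {
    // Copies everything from in to dir/fileName, creating dir if needed.
    // Returns the number of bytes written.
    static long save(InputStream in, Path dir, String fileName) {
        try {
            Files.createDirectories(dir);
            return Files.copy(in, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes instead of a real download, so the demo runs offline.
        byte[] fake = {1, 2, 3, 4};
        Path dir = Files.createTempDirectory("bing_pic");
        try (InputStream in = new ByteArrayInputStream(fake)) {
            System.out.println(save(in, dir, "demo.jpg") + " bytes written to " + dir);
        }
    }
}
```

In the crawler, the stream would come from `new URL(picurl).openStream()` opened in a try-with-resources block, so that it is closed even when the copy fails.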

Mythread

package bing;

import java.io.IOException;

public class Mythread extends Thread {

    private int startpage;

    private int endpage;

    public Mythread(int startpage, int endpage) {
        this.startpage = startpage;
        this.endpage = endpage;
    }

    @Override
    public void run() {
        try {
            // go is static, so it is called on the class directly
            GetPhoto.go(startpage, endpage);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

RUN

Several threads are started so that images are crawled concurrently.

package bing;

import java.io.IOException;

/**
 *
 * @author loveincode
 * @date Sep 29, 2017 1:55:57 PM
 */
public class RUN {

    public static void main(String[] args) throws IOException {

        long startTime = System.currentTimeMillis(); // get the start time

        Mythread a1 = new Mythread(1, 5);
        Mythread a2 = new Mythread(6, 10);
        Mythread a3 = new Mythread(11, 15);
        Mythread a4 = new Mythread(16, 20);
        Mythread a5 = new Mythread(21, 25);
        Mythread a6 = new Mythread(26, 30);
        Mythread a7 = new Mythread(31, 35);
        Mythread a8 = new Mythread(36, 40);
        Mythread a9 = new Mythread(41, 45);
        Mythread a10 = new Mythread(46, 47);

        a1.start();
        a2.start();
        a3.start();
        a4.start();
        a5.start();
        a6.start();
        a7.start();
        a8.start();
        a9.start();
        a10.start();

        // Poll until every thread has finished, then report the elapsed time
        while (true) {
            if (!a1.isAlive() && !a2.isAlive() && !a3.isAlive() && !a4.isAlive()
                    && !a5.isAlive() && !a6.isAlive() && !a7.isAlive() && !a8.isAlive()
                    && !a9.isAlive() && !a10.isAlive()) {
                long endTime = System.currentTimeMillis(); // get the end time
                System.out.println("program running time:" + (endTime - startTime) / 1000.0 + "s");
                break;
            }
        }
    }

}
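The ten hard-coded page ranges and the polling loop in RUN could also be expressed with a thread pool. The sketch below is one possible variant, not the author's method: it computes the ranges from the page count and waits for completion with `awaitTermination` instead of spinning on `isAlive`. `PagedCrawl` and `RangeTask` are hypothetical names, and the demo task only counts pages rather than downloading anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PagedCrawl {
    // Hypothetical callback; in the crawler it would wrap GetPhoto.go.
    interface RangeTask {
        void crawl(int startPage, int endPage);
    }

    // Splits pages 1..totalPages into consecutive {start, end} ranges
    // of at most chunk pages each.
    static List<int[]> ranges(int totalPages, int chunk) {
        if (chunk < 1) {
            throw new IllegalArgumentException("chunk must be positive");
        }
        List<int[]> out = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += chunk) {
            out.add(new int[] {start, Math.min(start + chunk - 1, totalPages)});
        }
        return out;
    }

    // Runs one task per range on a fixed pool, blocks until all have
    // finished, and returns the elapsed time in seconds.
    static double runAll(List<int[]> ranges, RangeTask task) {
        long begin = System.nanoTime();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, ranges.size()));
        for (int[] r : ranges) {
            pool.submit(() -> task.crawl(r[0], r[1]));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - begin) / 1e9;
    }

    public static void main(String[] args) {
        AtomicInteger pages = new AtomicInteger();
        double seconds = runAll(ranges(47, 5), (s, e) -> pages.addAndGet(e - s + 1));
        System.out.println(pages.get() + " pages in " + seconds + " s");
    }
}
```

With 47 pages and a chunk size of 5, `ranges` yields the same ten ranges that RUN lists by hand, from 1 to 5 through 46 to 47.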

Running RUN downloads the images to the local disk in 76.962 s, and the run completes successfully.

Result: the image quality is very high.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
