Capture all specific images of a website (Java)

Last Update:2018-12-06 Source: Internet

Author: User

Tags wubi


1. Purpose

When typing with Wubi, if you run into a character you do not know how to decompose, you have to fall back on pinyin. To avoid that, I looked for Wubi reverse lookup tools on the Internet and eventually found a good website: it gives not only the Wubi code for each character but also its root chart. Unfortunately, it is a website, so every query requires an Internet connection. Naturally, you may want to save the Wubi codes and the corresponding root charts from the website to your local computer, and then write a query program to make it a local version >_<

2. Preparation: Web Page Feature Analysis

The website (http://www.wb86.com/wbcx) provides two ways to query: one is to enter the character you want to look up, and the other is to browse one page after another. Because I was too lazy to find a character list, I chose the second method. The URL of the first page is http://www.wb86.com/wbcx/index5.asp?page=1, the URL of the second page is http://www.wb86.com/wbcx/index5.asp?page=2, and the URL of the third page is http://www.wb86.com/wbcx/index5.asp?page=3. From the URLs of the first three pages, there is good reason to believe that the URL of page x is http://www.wb86.com/wbcx/index5.asp?page=x.
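
The URL pattern can be captured in a small helper. This is only a sketch: the class name PageUrls and the method pageUrl are made up for illustration, and it assumes the query string has the form ?page=N with no spaces, as inferred from the first three pages.

```java
class PageUrls {
    // Assumed base URL, inferred from the first three listing pages.
    static final String PREFIX = "http://www.wb86.com/wbcx/index5.asp?page=";

    /** Returns the listing URL for a 1-based page number. */
    static String pageUrl(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1: " + page);
        }
        return PREFIX + page;
    }

    public static void main(String[] args) {
        // Print the URLs of the first three pages.
        for (int page = 1; page <= 3; page++) {
            System.out.println(pageUrl(page));
        }
    }
}
```
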
With the URL question solved, we need to work out how to obtain the required data from a single web page. Looking at the source code of the first page, we find that the label "86 five encodings" (the heading for the 86-edition Wubi code) appears only once and is immediately followed by the Wubi code we need. Therefore, after receiving the content sent back by the server, locate "86 five encodings" to get the corresponding Wubi code. The URL of the root chart appears after the Wubi code and starts with "http://www.wb86.com/gif-82133". Therefore, in the content after "86 five encodings", find the first URL starting with "http://www.wb86.com/gif-82133".
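
The marker-based extraction described above might look like the following sketch. The class MarkerExtract, its two methods, and the sample HTML fragment are all hypothetical; the real page markup may differ, so the marker and URL prefix would need to be replaced with the exact text found in the page source.

```java
class MarkerExtract {
    /** Returns the text between the end of marker and the next terminator, or null if either is missing. */
    static String textAfter(String page, String marker, char terminator) {
        int start = page.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = page.indexOf(terminator, start);
        return end < 0 ? null : page.substring(start, end);
    }

    /** Returns the first URL starting with prefix at or after fromIndex, up to the closing double quote. */
    static String urlFrom(String page, String prefix, int fromIndex) {
        int start = page.indexOf(prefix, fromIndex);
        if (start < 0) {
            return null;
        }
        int end = page.indexOf('"', start);
        return end < 0 ? null : page.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical fragment standing in for the real page source.
        String page = "<td>CODE:abcd</td><img src=\"http://example.com/img/x.gif\">";
        int markerAt = page.indexOf("CODE:");
        System.out.println(textAfter(page, "CODE:", '<'));
        System.out.println(urlFrom(page, "http://example.com/img/", markerAt));
    }
}
```

Returning null when a marker is missing lets the caller record the page as failed instead of crashing on an out-of-range index.
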

3. Algorithm Flow

for (first page to last page) {
    Obtain the source code of the page
    Extract the Wubi code and the URL of the root chart from the source code
    Download the root chart
}

4. Source Code

import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.LinkedList;

import javax.imageio.ImageIO;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultHttpClient;


public class Clawler {
    private static final int END_PAGE = 6764;
    private static final String PREFIX = "http://www.wb86.com/wbcx/index5.asp?page=";
    private static final String CODE_SAVE_PATH = "d:\\wubi\\wubicode.txt";
    private static final String IMG_SAVE_PATH_PREFIX = "d:\\wubi\\img\\";

    // Page numbers whose page or image could not be downloaded
    private static LinkedList<Integer> queue = new LinkedList<Integer>();

    // URL of the root chart on the page most recently parsed by getWubiCode
    private static String m_imgUri;

    public static void main(String[] args) throws IOException {
        HttpClient httpclient = new DefaultHttpClient();

        FileWriter fw = null;
        fw = new FileWriter(CODE_SAVE_PATH);

        for (int i = 1; i <= END_PAGE; ++i) {
            HttpUriRequest request = new HttpGet(PREFIX + i);

            try {
                HttpResponse response = httpclient.execute(request);

                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // Collect the raw bytes first so that only the bytes actually read
                    // are kept and multi-byte characters are not split across reads
                    ByteArrayOutputStream body = new ByteArrayOutputStream();
                    InputStream is = entity.getContent();
                    try {
                        byte[] tmp = new byte[2048];
                        int n;
                        while ((n = is.read(tmp)) != -1) {
                            body.write(tmp, 0, n);
                        }
                    }
                    finally {
                        is.close();
                    }

                    // Decoded with the platform default charset, as in the original code
                    fw.write(getWubiCode(body.toString(), i));
                    downloadImg(m_imgUri, IMG_SAVE_PATH_PREFIX + i + ".gif", i);
                }
            }
            catch (Exception e) { // includes pages where the expected markers are missing
                queue.addLast(i);
                e.printStackTrace();
            }

            if (i % 100 == 0) {
                fw.flush();
            }
        }

        System.out.println("\nMissing pages:");
        while (!queue.isEmpty()) { // pages that failed to download
            System.out.println(queue.element());
            queue.removeFirst();
        }

        System.out.print("all done");
        fw.close();
        httpclient.getConnectionManager().shutdown();
    }

    public static String getWubiCode(String page, int number) { // extract the Wubi code and the URL of the root chart
        StringBuilder save = new StringBuilder();
        // Label that precedes the Wubi code; it must match the text in the page source exactly
        page = page.substring(page.indexOf("86 five encodings"));

        int index = 7; // skip past the label (offset kept from the original code)
        while (page.charAt(index) != '<') save.append(page.charAt(index++));
        save.append(System.getProperty("line.separator"));

        index = 0;
        StringBuilder imgPath = new StringBuilder();
        page = page.substring(page.indexOf("http://www.wb86.com/gif-82"));
        while (page.charAt(index) != '\"') imgPath.append(page.charAt(index++));
        m_imgUri = imgPath.toString();

        // Prefix the line with the character just before ".gif" in the image URL, then a space
        save.insert(0, imgPath.charAt(imgPath.length() - 5));
        save.insert(1, ' ');

        return save.toString();
    }

    public static void downloadImg(String url, String path, int number) { // download the image
        try {
            File out = new File(path);
            BufferedImage buffer = ImageIO.read(new URL(url));
            if (buffer == null) { // no image reader could decode the response
                queue.addLast(number);
                System.out.println(number + " " + url);
            }
            else {
                ImageIO.write(buffer, "gif", out);
            }
        }
        catch (IOException e) {
            queue.addLast(number);
            System.out.println(url);
            System.out.println(e.getMessage());
        }
    }
}
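
The local query program mentioned in the Purpose section is not shown in this article. A minimal sketch of one possible approach follows, assuming each line of wubicode.txt holds a one-character key, a space, and then the code, which is the layout getWubiCode appears to write. The class WubiLookup, the method parse, and the sample entries are hypothetical placeholders, not real Wubi codes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WubiLookup {
    /** Parses lines of the assumed form "<key> <code>" into a lookup table. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> codes = new HashMap<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space <= 0) {
                continue; // skip blank or malformed lines
            }
            codes.put(line.substring(0, space), line.substring(space + 1).trim());
        }
        return codes;
    }

    public static void main(String[] args) {
        // Placeholder entries standing in for the contents of wubicode.txt.
        Map<String, String> codes = parse(Arrays.asList("A aaaa", "B bbbb"));
        System.out.println(codes.get("A"));
    }
}
```

For real use, the lines could be read with java.nio.file.Files.readAllLines, passing the same charset the crawler used when writing the file.
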

5. References
A. HttpClient 4.1 getting started tutorial (Chinese version)
http://wenku.baidu.com/view/0a027c5e804d2b160b4ec029.html
B. Implementation of a forum image crawler
http://www.iteye.com/topic/1044289
C. The simplest search engine concept
http://www.iteye.com/topic/1055424

