Crawling All the Target Images from a Website (Java)

Source: Internet
Author: User
Tags: wubi

1. Purpose

When typing with Wubi, if you run into a character you cannot decompose, you have to fall back on pinyin. That gets tiresome, so I looked for Wubi reverse-lookup tools on the Internet. Eventually I found a good website: it lists not only the Wubi code for each character but also an image showing how the character breaks down into roots. Unfortunately, it is a website, so every query needs an Internet connection. Naturally, I wanted to save the Wubi codes and the matching root images to my local computer, and then write a query program to turn it into a local version >_<

 

2. Preparation: Web Page Feature Analysis

The website (http://www.wb86.com/wbcx) offers two ways to query: enter the character you want to look up, or browse the entries one page at a time. Since I was too lazy to track down a character list, I chose the second method. The URL of the first page is http://www.wb86.com/wbcx/index5.asp?page=1, the URL of the second page is http://www.wb86.com/wbcx/index5.asp?page=2, and the URL of the third page is http://www.wb86.com/wbcx/index5.asp?page=3. Judging from the first three pages, there is good reason to believe that the URL of page x is http://www.wb86.com/wbcx/index5.asp?page=x.
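The URL pattern can be captured in a few lines. This is only a sketch: the class and method names are made up for illustration, and the lowercase spelling of the "page" query parameter is an assumption.

```java
// Sketch: build the URL of page x from the pattern seen on the first three pages.
class PageUrls {
    // Base URL from the article; the exact spelling of the "page" parameter is assumed.
    static final String PREFIX = "http://www.wb86.com/wbcx/index5.asp?page=";

    static String pageUrl(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page numbers start at 1");
        }
        return PREFIX + page;
    }

    public static void main(String[] args) {
        // Print the URLs of the first three pages.
        for (int i = 1; i <= 3; i++) {
            System.out.println(pageUrl(i));
        }
    }
}
```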
With the URLs sorted out, the next step is to work out how to pull the required data from a single page. Looking at the source of the first page, the text "86五笔编码" (86-version Wubi code) appears exactly once and is immediately followed by the Wubi code we need. So, after receiving the content sent back by the server, we locate "86五笔编码" and read the code that follows it. The URL of the root image appears after the Wubi code and starts with "http://www.wb86.com/gif-82133". So, in the content after "86五笔编码", we look for the first URL starting with "http://www.wb86.com/gif-82133".
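The extraction step described above can be tried offline on a small string. The sketch below makes several assumptions: the HTML fragment is invented (the real page markup may differ), the class and method names are hypothetical, and the marker is written with \u escapes to keep the source ASCII. The escapes spell 86五笔编码, which is assumed to be the marker text the article refers to, followed on the page by one separator character.

```java
// Sketch: find the marker, read the code up to the next '<', then take the first image URL after it.
class ExtractDemo {
    // "86" + U+4E94 U+7B14 U+7F16 U+7801 (assumed marker text on the page)
    static final String MARKER = "86\u4e94\u7b14\u7f16\u7801";
    // Shared prefix of the root image URLs mentioned in the article
    static final String IMG_PREFIX = "http://www.wb86.com/gif-82";

    /** Returns {code, imageUrl}, or null when either piece cannot be found. */
    static String[] extract(String html) {
        int marker = html.indexOf(MARKER);
        if (marker < 0) return null;
        int start = marker + MARKER.length() + 1; // skip the marker and one separator character
        int end = html.indexOf('<', start);
        if (end < 0) return null;
        String code = html.substring(start, end).trim();

        int urlStart = html.indexOf(IMG_PREFIX, end);
        if (urlStart < 0) return null;
        int urlEnd = html.indexOf('"', urlStart); // the URL ends at the closing quote
        if (urlEnd < 0) return null;
        return new String[] { code, html.substring(urlStart, urlEnd) };
    }

    public static void main(String[] args) {
        // Invented fragment with a placeholder code and file name.
        String html = "<td>" + MARKER + "\uff1awqvb</td>"
                + "<td><img src=\"http://www.wb86.com/gif-82133/demo.gif\"></td>";
        String[] result = extract(html);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

Returning null instead of throwing makes it easy for a caller to record the page as failed and move on.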

 

3. Algorithm Flow

for (first page to last page) {
    Fetch the source of the page
    Extract the Wubi code and the URL of the root image from the source
    Download the root image
}
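The flow above, including the bookkeeping for pages that fail, might be sketched as follows. PageTask is a hypothetical hook that stands in for "fetch, extract, download", and the failure in main is simulated so the sketch runs without network access.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: visit every page once and remember the pages that failed.
class CrawlLoop {
    /** Hypothetical hook: does all the work for one page and throws on any failure. */
    interface PageTask {
        void process(int page) throws Exception;
    }

    /** Runs the task for pages first..last and returns the page numbers that failed. */
    static List<Integer> run(int first, int last, PageTask task) {
        List<Integer> failed = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            try {
                task.process(page);
            } catch (Exception e) {
                failed.add(page); // keep going; the page can be retried later
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Stand-in task: pretend that every third page fails.
        List<Integer> failed = run(1, 10, page -> {
            if (page % 3 == 0) {
                throw new java.io.IOException("simulated failure on page " + page);
            }
        });
        System.out.println("failed pages: " + failed);
    }
}
```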

 

4. Source Code

import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.LinkedList;

import javax.imageio.ImageIO;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultHttpClient;


public class Clawler {
    private static final int END_PAGE = 6764;
    private static final String PREFIX = "http://www.wb86.com/wbcx/index5.asp?page=";
    private static final String CODE_SAVE_PATH = "D:\\wubi\\wubicode.txt";
    private static final String IMG_SAVE_PATH_PREFIX = "D:\\wubi\\img\\";

    // Numbers of the pages whose code or image could not be fetched
    private static LinkedList<Integer> queue = new LinkedList<Integer>();

    // URL of the root image found by the most recent call to getWubiCode
    private static String m_imgUri;

    public static void main(String[] args) throws IOException {
        HttpClient httpClient = new DefaultHttpClient();

        FileWriter fw = new FileWriter(CODE_SAVE_PATH);

        for (int i = 1; i <= END_PAGE; ++i) {
            HttpUriRequest request = new HttpGet(PREFIX + i);

            try {
                HttpResponse response = httpClient.execute(request);

                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    InputStream is = entity.getContent();
                    ByteArrayOutputStream body = new ByteArrayOutputStream();
                    byte[] tmp = new byte[2048];
                    int n;
                    while ((n = is.read(tmp)) != -1) {
                        body.write(tmp, 0, n); // keep only the bytes actually read
                    }
                    is.close();

                    // Decoded with the platform default charset, which must match the page encoding
                    String page = body.toString();

                    fw.write(getWubiCode(page, i));
                    downloadImg(m_imgUri, IMG_SAVE_PATH_PREFIX + i + ".gif", i);
                }

            }
            catch (Exception e) {
                queue.addLast(i);
                e.printStackTrace();
            }

            if (i % 100 == 0) {
                fw.flush();
            }
        }

        System.out.println("\nmissing code");
        while (!queue.isEmpty()) { // pages that failed to download
            System.out.println(queue.element());
            queue.removeFirst();
        }

        System.out.print("all done");
        fw.close();
        httpClient.getConnectionManager().shutdown();
    }

    // Extracts the Wubi code and the URL of the root image from the page source.
    // Returns one output line of the form "<character> <code>".
    public static String getWubiCode(String page, int number) {
        StringBuilder save = new StringBuilder();
        page = page.substring(page.indexOf("86五笔编码"));

        int index = 7; // first character after the marker and the separator that follows it
        while (page.charAt(index) != '<') save.append(page.charAt(index++));
        save.append(System.getProperty("line.separator"));

        index = 0;
        StringBuilder imgPath = new StringBuilder();
        page = page.substring(page.indexOf("http://www.wb86.com/gif-82"));
        while (page.charAt(index) != '"') imgPath.append(page.charAt(index++));
        m_imgUri = imgPath.toString();

        // The character just before the 4-character extension in the image URL becomes the key of the line
        save.insert(0, imgPath.charAt(imgPath.length() - 5));
        save.insert(1, ' ');

        return save.toString();
    }

    // Downloads the image at url to path; records the page number on failure.
    public static void downloadImg(String url, String path, int number) {
        try {
            File out = new File(path);
            BufferedImage buffer = ImageIO.read(new URL(url));
            if (buffer == null) { // ImageIO could not decode the image
                queue.addLast(number);
                System.out.println(number + " " + url);
            }
            else {
                ImageIO.write(buffer, "gif", out);
            }
        }
        catch (IOException e) {
            queue.addLast(number);
            System.out.println(url);
            System.out.println(e.getMessage());
        }
    }
}
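Once the codes are saved, the local query program mentioned in the Purpose section could start from something like the sketch below. It assumes each saved line has the form "character, space, code", which is what getWubiCode produces; the class name, method name and sample data are placeholders, and the characters are written as \u escapes to keep the source ASCII.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: load "<character> <code>" lines into a map for offline lookup.
class WubiLookup {
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> codes = new HashMap<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space <= 0) {
                continue; // skip blank or malformed lines
            }
            codes.put(line.substring(0, space), line.substring(space + 1).trim());
        }
        return codes;
    }

    public static void main(String[] args) {
        // Placeholder data with made-up codes. For a real run, the lines could come from
        // java.nio.file.Files.readAllLines on the saved code file.
        List<String> lines = Arrays.asList("\u4e94 abcd", "\u7b14 efgh");
        Map<String, String> codes = parse(lines);
        System.out.println(codes.get("\u4e94"));
    }
}
```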

 

 

5. References
A. HttpClient 4.1 getting started tutorial (Chinese version)
http://wenku.baidu.com/view/0a027c5e804d2b160b4ec029.html
B. Implementation of a forum image crawler
http://www.iteye.com/topic/1044289
C. The simplest search engine concept
http://www.iteye.com/topic/1055424

 

 

 
