1. Purpose
When typing with Wubi, a character you cannot decompose has to be entered with pinyin instead. To deal with such cases, I searched the Internet for Wubi reverse-lookup tools and eventually found a good website: it gives not only the Wubi code for each character but also its root decomposition chart. Unfortunately, it is a website, so every query needs an Internet connection. Naturally, the next idea is to save the Wubi codes and the corresponding root charts from the site to the local machine, and then write a query program on top of them to make a local version >_<
2. Preparation: Web Page Analysis
The website (http://www.wb86.com/wbcx) offers two ways to look things up: entering the character you want to query, or browsing the entries page by page. Since I was too lazy to track down a character list, I chose the second. The URL of the first page is http://www.wb86.com/wbcx/index5.asp?page=1, the second is http://www.wb86.com/wbcx/index5.asp?page=2, and the third is http://www.wb86.com/wbcx/index5.asp?page=3. From these three it is reasonable to assume that the URL of page x is http://www.wb86.com/wbcx/index5.asp?page=x.
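The URL pattern above can be captured in a small helper. This is only a sketch: the class name PageUrls and the method pageUrl are made up for illustration, and the prefix is written without the stray spaces.

```java
// Sketch: build the URL of a given result page from the observed pattern.
final class PageUrls {
    // Prefix shared by the first three page URLs.
    static final String PREFIX = "http://www.wb86.com/wbcx/index5.asp?page=";

    // Returns the URL of the given page; pages are numbered from 1.
    static String pageUrl(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page numbers start at 1");
        }
        return PREFIX + page;
    }

    public static void main(String[] args) {
        // Print the URLs of the first three pages.
        for (int page = 1; page <= 3; page++) {
            System.out.println(pageUrl(page));
        }
    }
}
```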
With the URL pattern settled, the next question is how to pull the required data out of a single page. In the source of the first page, the label "86 five encodings" (the page's label for the 86-version Wubi code) appears exactly once and is immediately followed by the Wubi code we need. So, after receiving the content sent back by the server, locating that label gives the code. The URL of the root chart image appears after the code and starts with "http://www.wb86.com/gif-82133". So, in the content after the label, the first URL starting with that prefix is the address of the image.
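The two lookups described above can be sketched with plain indexOf calls. The class name WubiExtractor, the shortened image prefix, and the HTML fragment in main are all assumptions made for illustration; the marker string is the label as written in this article, so the exact text should be taken from the real page source.

```java
// Sketch: find the code after the label, then the first image URL after the code.
final class WubiExtractor {
    // Label as written in this article; replace with the exact text from the page.
    static final String CODE_MARKER = "86 five encodings";
    // Assumed common prefix of the root chart image URLs.
    static final String IMG_PREFIX = "http://www.wb86.com/gif-82";

    // Returns {code, imageUrl}, or null if the page does not have the expected shape.
    static String[] extract(String html) {
        int markerAt = html.indexOf(CODE_MARKER);
        if (markerAt < 0) return null;
        int codeStart = markerAt + CODE_MARKER.length();
        int codeEnd = html.indexOf('<', codeStart);   // the code runs up to the next tag
        if (codeEnd < 0) return null;
        String code = html.substring(codeStart, codeEnd).trim();

        int imgStart = html.indexOf(IMG_PREFIX, codeEnd);
        if (imgStart < 0) return null;
        int imgEnd = html.indexOf('"', imgStart);     // the URL ends at the closing quote
        if (imgEnd < 0) return null;
        return new String[] { code, html.substring(imgStart, imgEnd) };
    }

    public static void main(String[] args) {
        // Made-up fragment shaped like the description, not real page content.
        String html = "<td>86 five encodings ggll</td>"
                + "<img src=\"http://www.wb86.com/gif-82133/x.gif\">";
        String[] result = extract(html);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

Returning null instead of throwing lets the caller record the page number as failed and move on to the next page.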
3. Algorithm Flow
for (page = first to last) {
    fetch the source of this page
    extract the Wubi code and the URL of the root chart image from the source
    download the root chart image
}
4. Source Code
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.util.LinkedList;
import javax.imageio.ImageIO;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class Clawler {
    private static final int END_PAGE = 6764;
    private static final String PREFIX = "http://www.wb86.com/wbcx/index5.asp?page=";
    private static final String CODE_SAVE_PATH = "D:\\wubi\\wubicode.txt";
    private static final String IMG_SAVE_PATH_PREFIX = "D:\\wubi\\img\\";
    // Label that precedes the code, as written in this article; use the exact text from the page source.
    private static final String CODE_MARKER = "86 five encodings";
    // Page numbers whose code or image could not be fetched.
    private static final LinkedList<Integer> queue = new LinkedList<Integer>();
    private static String m_imgUri;

    public static void main(String[] args) throws IOException {
        HttpClient httpclient = new DefaultHttpClient();
        FileWriter fw = new FileWriter(CODE_SAVE_PATH);
        for (int i = 1; i <= END_PAGE; ++i) {
            HttpUriRequest request = new HttpGet(PREFIX + i);
            try {
                HttpResponse response = httpclient.execute(request);
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    // Decode the whole body at once so multi-byte characters are never split
                    // across buffer reads. "GBK" is an assumed fallback, used only when the
                    // response does not declare a charset.
                    String page = EntityUtils.toString(entity, "GBK");
                    fw.write(getWubiCode(page, i));
                    downloadImg(m_imgUri, IMG_SAVE_PATH_PREFIX + i + ".gif", i);
                }
            } catch (Exception e) {
                queue.addLast(i);
                e.printStackTrace();
            }
            if (i % 100 == 0) {
                fw.flush();
            }
        }
        System.out.println("\nmissing code");
        while (!queue.isEmpty()) { // pages that failed to download
            System.out.println(queue.removeFirst());
        }
        System.out.print("all done");
        fw.close();
        httpclient.getConnectionManager().shutdown();
    }

    // Extracts the Wubi code and the URL of the root chart image.
    // Returns one line of the form "<character> <code>".
    public static String getWubiCode(String page, int number) {
        StringBuilder save = new StringBuilder();
        page = page.substring(page.indexOf(CODE_MARKER));
        int index = CODE_MARKER.length(); // skip the label (the original hard-coded an offset of 7)
        while (page.charAt(index) != '<') save.append(page.charAt(index++));
        save.append(System.getProperty("line.separator"));
        index = 0;
        StringBuilder imgPath = new StringBuilder();
        page = page.substring(page.indexOf("http://www.wb86.com/gif-82"));
        while (page.charAt(index) != '"') imgPath.append(page.charAt(index++));
        m_imgUri = imgPath.toString();
        // The character right before the ".gif" extension is put at the start of the line.
        save.insert(0, imgPath.charAt(imgPath.length() - 5));
        save.insert(1, ' ');
        return save.toString();
    }

    // Downloads the image at url and saves it to path; records the page number on failure.
    public static void downloadImg(String url, String path, int number) {
        try {
            File out = new File(path);
            BufferedImage buffer = ImageIO.read(new URL(url));
            if (buffer == null) {
                queue.addLast(number);
                System.out.println(number + " " + url);
            } else {
                ImageIO.write(buffer, "gif", out);
            }
        } catch (IOException e) {
            queue.addLast(number);
            System.out.println(url);
            System.out.println(e.getMessage());
        }
    }
}
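The crawler writes one line per character to wubicode.txt: the character, a space, and its code. The local query program mentioned in the Purpose section could start from something like the sketch below. The class name WubiLookup and the sample lines are made-up placeholders, not real entries from the site.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: turn saved "<character> <code>" lines into a lookup table.
final class WubiLookup {
    // Parses each line at its first space; blank or malformed lines are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> codes = new HashMap<String, String>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space <= 0) continue;
            codes.put(line.substring(0, space), line.substring(space + 1).trim());
        }
        return codes;
    }

    public static void main(String[] args) {
        // Placeholder lines standing in for the contents of wubicode.txt.
        Map<String, String> codes = parse(Arrays.asList("X abcd", "", "Y efgh"));
        System.out.println(codes.get("X"));
    }
}
```

In real use the lines would be read from wubicode.txt with the same character encoding the crawler used to write it, and the root chart for an entry would be the image saved under the matching page number.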
5. References
A. HttpClient 4.1 Getting Started Tutorial (Chinese version)
http://wenku.baidu.com/view/0a027c5e804d2b160b4ec029.html
B. Implementation of a Forum Image Crawler
http://www.iteye.com/topic/1044289
C. The Simplest Search Engine Concept
http://www.iteye.com/topic/1055424