Solutions for garbled Chinese characters in the storage path of Heritrix mirrors

Source: Internet
Author: User

Heritrix is a web crawler. When you store documents using Heritrix's mirror mode, if the URL contains Chinese characters or the requested file name is Chinese, the downloaded file appears with a garbled name in the mirror directory.


Before solving this problem, let's see why the garbled text occurs.

Take the Chinese Cultural Relics network (wenwuchina.com) as an example. The following path points to an image:

http://www.wenwuchina.com/uploads/conew_ ____ conew1.jpg

When the address is entered in a browser, the browser encodes the Chinese characters in the path as percent-escaped bytes (%XX sequences). The ASCII parts of the URL are left as they are.
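To make the encoding step concrete, here is a minimal Java sketch of percent-encoding a file name. The file name (the two characters U+4E2D U+6587 followed by .jpg) is hypothetical and is not the name from the example URL, the class and method names are made up for illustration, and the choice of charset is an assumption: UTF-8 and GB2312 are shown side by side because the same characters produce different bytes in each. Only non-ASCII characters are escaped here, which is a simplification of what a real browser does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PercentEncodeDemo {
    /** Percent-encodes every non-ASCII character of name using the given charset. */
    public static String percentEncode(String name, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80) {
                // ASCII is copied through unchanged (simplification).
                out.append((char) cp);
            } else {
                // Each byte of the encoded character becomes one %XX escape.
                for (byte b : name.substring(i, i + len).getBytes(charset)) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical file name, written with Unicode escapes so the
        // source file's own encoding cannot affect the result.
        String name = "\u4e2d\u6587.jpg";
        System.out.println(percentEncode(name, StandardCharsets.UTF_8));
        // prints %E4%B8%AD%E6%96%87.jpg
        if (Charset.isSupported("GB2312")) {
            System.out.println(percentEncode(name, Charset.forName("GB2312")));
            // prints %D6%D0%CE%C4.jpg
        }
    }
}
```

The two outputs differ for the same characters, so whoever later turns the escapes back into a file name has to know which charset was used.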

When Heritrix stores the resource, the file name it uses for this path comes out garbled.
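As a general illustration of how this kind of garbling arises (not a trace of Heritrix's own code path), the sketch below encodes a string with one charset and decodes the resulting bytes with another. The class name, method name, and sample text are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    public static String reinterpret(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String name = "\u4e2d\u6587"; // hypothetical two-character Chinese name

        // Same charset on both sides: the text survives the round trip.
        String same = reinterpret(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(name)); // prints true

        // Mismatched charsets: six UTF-8 bytes become six unrelated characters.
        String garbled = reinterpret(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // prints 6
    }
}
```

Any step that writes bytes under one charset and reads them back under another can produce a garbled name of this kind.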

The solution is to encode the path name when creating the path. The main code is the LumpyString constructor in the org.archive.crawler.writer.MirrorWriterProcessor class (LumpyString is a nested class of MirrorWriterProcessor).

To leave the original source untouched, I did not change the original method. Instead, I created a new class, org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor, to extend Heritrix, copied all the code from org.archive.crawler.writer.MirrorWriterProcessor into it, and made the necessary changes to LumpyString. The modified part is the try/catch block that re-decodes each character as GB2312.

The code is as follows:

    // Constructor of the nested LumpyString class.
    // The modified part requires: import java.io.UnsupportedEncodingException;
    LumpyString(String str, int beginIndex, int endIndex, int padding,
            int maxLen, Map characterMap, String dotBegin) {
        // Validate the arguments.
        if (beginIndex < 0) {
            throw new IllegalArgumentException("beginIndex < 0: "
                + beginIndex);
        }
        if (endIndex < beginIndex) {
            throw new IllegalArgumentException("endIndex < beginIndex "
                + "beginIndex: " + beginIndex + " endIndex: " + endIndex);
        }
        if (padding < 0) {
            throw new IllegalArgumentException("padding < 0: " + padding);
        }
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen < 1: " + maxLen);
        }
        if (null == characterMap) {
            throw new IllegalArgumentException("characterMap null");
        }
        if ((null != dotBegin) && (0 == dotBegin.length())) {
            throw new IllegalArgumentException("dotBegin empty");
        }

        // Initial capacity. Leave some room for %XX lumps.
        // Guaranteed positive.
        int cap = Math.min(2 * (endIndex - beginIndex) + padding + 1,
                maxLen);
        string = new StringBuffer(cap);
        aux = new byte[cap];
        for (int i = beginIndex; i != endIndex; ++i) {
            String s = str.substring(i, i + 1);
            // Modified part: re-decode the character as GB2312.
            // Note: getBytes() with no argument uses the platform default charset.
            try {
                s = new String(s.getBytes(), "GB2312");
            } catch (UnsupportedEncodingException e) {
                // GB2312 is not available on this JVM; keep s unchanged.
                e.printStackTrace();
            }
            String lump; // Next lump.
            if (".".equals(s) && (i == beginIndex) && (null != dotBegin)) {
                lump = dotBegin;
            } else {
                lump = (String) characterMap.get(s);
            }
            if (null == lump) {
                if ("%".equals(s) && ((endIndex - i) > 2)
                        && (-1 != Character.digit(str.charAt(i + 1), 16))
                        && (-1 != Character.digit(str.charAt(i + 2), 16))) {

                    // %XX escape; treat as one lump.
                    lump = str.substring(i, i + 3);
                    i += 2;
                } else {
                    lump = s;
                }
            }
            // Stop once adding the next lump would exceed maxLen.
            if ((string.length() + lump.length()) > maxLen) {
                assert checkInvariants();
                return;
            }
            append(lump);
        }
        assert checkInvariants();
    }
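The escape-handling branch of the constructor can be tried in isolation. The following is a simplified standalone sketch, not the Heritrix class itself: the class name LumpSplitDemo and the method splitIntoLumps are hypothetical, and the character map, the dotBegin substitution, and the maxLen truncation are all left out. It shows only how a well-formed %XX escape is kept together as one lump while every other character becomes its own lump.

```java
import java.util.ArrayList;
import java.util.List;

public class LumpSplitDemo {
    /** Splits segment into lumps; a well-formed %XX escape stays one lump. */
    public static List<String> splitIntoLumps(String segment) {
        List<String> lumps = new ArrayList<>();
        for (int i = 0; i < segment.length(); ++i) {
            char c = segment.charAt(i);
            if (c == '%' && (segment.length() - i) > 2
                    && Character.digit(segment.charAt(i + 1), 16) != -1
                    && Character.digit(segment.charAt(i + 2), 16) != -1) {
                // '%' followed by two hex digits: keep the escape whole.
                lumps.add(segment.substring(i, i + 3));
                i += 2;
            } else {
                // Anything else, including a stray '%', is a one-character lump.
                lumps.add(String.valueOf(c));
            }
        }
        return lumps;
    }

    public static void main(String[] args) {
        System.out.println(splitIntoLumps("a%E4%B8b%zz"));
        // prints [a, %E4, %B8, b, %, z, z]
    }
}
```

Keeping each escape whole means that a length limit can never cut a %XX sequence in half, which is why the constructor checks the length per lump instead of per character.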

Then add org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor to the list of available processors (in Heritrix 1.x this is typically the Processor.options module file), and add that processor to the job. The garbled characters will then disappear.

PS: Garbled characters in Chinese resources are a common problem, and many other solutions are available online that you can look up yourself. The fix is also very simple: only one line of code needs to change.
