Fixing garbled Chinese paths in Heritrix mirror storage


Heritrix is a web crawler. When you store downloaded documents in mirror mode, any URL that contains Chinese characters, or any fetched file whose name is Chinese, produces garbled characters in the mirror directory path of the downloaded file (as shown in the figure below).

 

Before solving this problem, let's see why garbled text occurs.
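The text does not spell out the cause at this point. A common way such garbling arises, in general, is that the bytes of a name are decoded with a different charset than the one used to encode them. The sketch below only illustrates that mechanism; the class and method names (MojibakeDemo, decodeWrongly, repair) are made up for this example, and UTF-8 and ISO-8859-1 are used only because every Java runtime is guaranteed to support them, not because Heritrix is known to use this pair.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with the wrong charset. */
    static String decodeWrongly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Undoes decodeWrongly: recover the raw bytes, then decode them correctly. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u4e2d\u6587.html"; // two Chinese characters plus ".html"
        String garbled = decodeWrongly(name);
        System.out.println(garbled.equals(name));         // false
        System.out.println(repair(garbled).equals(name)); // true
    }
}
```

The repair works here because ISO-8859-1 maps every byte to exactly one character, so no information is lost in the wrong decode. With other charset pairs the damage may not be reversible.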

The solution is to encode the path name when the path is created. The relevant code is the constructor of the LumpyString inner class in org.archive.crawler.writer.MirrorWriterProcessor.

To respect the source code, I did not change the original class. Instead, I created the class org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor to extend Heritrix, copied all the code from org.archive.crawler.writer.MirrorWriterProcessor into it, and made the necessary changes to LumpyString. The change is the try/catch block that re-decodes each character as GB2312 (highlighted in red in the original post).

The code is as follows:

    // Requires: import java.io.UnsupportedEncodingException;
    LumpyString(String str, int beginIndex, int endIndex, int padding,
                int maxLen, Map characterMap, String dotBegin) {
        if (beginIndex < 0) {
            throw new IllegalArgumentException("beginIndex < 0: "
                                               + beginIndex);
        }
        if (endIndex < beginIndex) {
            throw new IllegalArgumentException("endIndex < beginIndex "
                + "beginIndex: " + beginIndex + " endIndex: " + endIndex);
        }
        if (padding < 0) {
            throw new IllegalArgumentException("padding < 0: " + padding);
        }
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen < 1: " + maxLen);
        }
        if (null == characterMap) {
            throw new IllegalArgumentException("characterMap null");
        }
        if ((null != dotBegin) && (0 == dotBegin.length())) {
            throw new IllegalArgumentException("dotBegin empty");
        }

        // Initial capacity. Leave some room for %XX lumps.
        // Guaranteed positive.
        int cap = Math.min(2 * (endIndex - beginIndex) + padding + 1,
                           maxLen);
        string = new StringBuffer(cap);
        aux = new byte[cap];
        for (int i = beginIndex; i != endIndex; ++i) {
            String s = str.substring(i, i + 1);
            // Modified part: re-decode the character's bytes as GB2312.
            // Note that getBytes() uses the platform default charset.
            try {
                s = new String(s.getBytes(), "GB2312");
            } catch (UnsupportedEncodingException e) {
                // GB2312 is not available on this JVM; keep s unchanged.
                e.printStackTrace();
            }
            String lump; // Next lump.
            if (".".equals(s) && (i == beginIndex) && (null != dotBegin)) {
                lump = dotBegin;
            } else {
                lump = (String) characterMap.get(s);
            }
            if (null == lump) {
                if ("%".equals(s) && ((endIndex - i) > 2)
                        && (-1 != Character.digit(str.charAt(i + 1), 16))
                        && (-1 != Character.digit(str.charAt(i + 2), 16))) {

                    // %XX escape; treat as one lump.
                    lump = str.substring(i, i + 3);
                    i += 2;
                } else {
                    lump = s;
                }
            }
            if ((string.length() + lump.length()) > maxLen) {
                // The next lump would exceed maxLen; stop here.
                assert checkInvariants();
                return;
            }
            append(lump);
        }
        assert checkInvariants();
    }
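To make the loop in the constructor easier to follow, here is a minimal standalone sketch of just the "lumping" step: each character becomes one lump, unless it has a replacement in the character map or starts a %XX escape, which is kept whole. LumpSketch and toLumps are hypothetical names for this illustration and are not part of Heritrix. The sketch leaves out the dotBegin handling, the maxLen cut-off, and the GB2312 re-decoding, and the space-to-%20 mapping is only an example entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LumpSketch {
    /**
     * Splits str into lumps: a mapped replacement, a %XX escape kept whole,
     * or the single character itself. Hypothetical helper, not Heritrix API.
     */
    static List<String> toLumps(String str, Map<String, String> characterMap) {
        List<String> lumps = new ArrayList<>();
        for (int i = 0; i < str.length(); ++i) {
            String s = str.substring(i, i + 1);
            String lump = characterMap.get(s);
            if (lump == null) {
                if ("%".equals(s) && (str.length() - i) > 2
                        && Character.digit(str.charAt(i + 1), 16) != -1
                        && Character.digit(str.charAt(i + 2), 16) != -1) {
                    // %XX escape; treat the three characters as one lump.
                    lump = str.substring(i, i + 3);
                    i += 2;
                } else {
                    lump = s;
                }
            }
            lumps.add(lump);
        }
        return lumps;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put(" ", "%20"); // example mapping, chosen for illustration
        System.out.println(toLumps("a b%2Fc%", map)); // [a, %20, b, %2F, c, %]
    }
}
```

The trailing "%" stays a lump of its own because fewer than two characters follow it, so it cannot be a complete %XX escape.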
