Heritrix is a web crawler. When documents are stored in Heritrix's image (mirror) mode and a URL contains Chinese characters, or the file being fetched has a Chinese name, the corresponding names in the image directory of downloaded files appear garbled (as shown in the figure below).
Before solving this problem, let's see why garbled text occurs.
The solution is to re-encode the path name when the path is created. The relevant code is the LumpyString method in the org.archive.crawler.writer.Javaswriterprocessor class.
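As a rough illustration of the kind of charset mismatch that typically produces garbled names, and of what re-decoding the bytes does, here is a minimal sketch. The class name `CharsetMismatchDemo` and the helper `redecode` are hypothetical and not part of Heritrix. UTF-8 and ISO-8859-1 are used only as stand-ins because every JVM is guaranteed to support them; the article's own fix uses the platform default charset and GB2312 instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Heritrix.
final class CharsetMismatchDemo {

    /** Encodes s to bytes with `from`, then decodes those same bytes with `to`. */
    static String redecode(String s, Charset from, Charset to) {
        return new String(s.getBytes(from), to);
    }

    public static void main(String[] args) {
        // Two CJK characters, written as Unicode escapes so the source
        // file compiles the same way under any source encoding.
        String name = "\u4e2d\u6587";

        // Bytes written as UTF-8 but read back as ISO-8859-1: garbled text.
        String garbled = redecode(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 6 (one char per UTF-8 byte)

        // Reversing the mismatch restores the original name.
        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(name)); // true
    }
}
```

The round trip is lossless here only because ISO-8859-1 maps every byte value to a character. With other charset pairs, bytes that cannot be decoded are replaced and the original text may not be recoverable.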
To leave the original source untouched, I did not modify that method in place. Instead, I created the org.archive.crawler.writer.Javaswriterforwenwuchinaprocessor class as a Heritrix extension, copied all the code from the org.archive.crawler.writer.Javaswriterprocessor class into it, and made the necessary changes to LumpyString, as follows (the modified part, shown in red in the original, is the try/catch block that re-encodes each character as GB2312).
The code is as follows:
    // Requires: import java.io.UnsupportedEncodingException;
    LumpyString(String str, int beginIndex, int endIndex, int padding,
                int maxLen, Map characterMap, String dotBegin) {
        if (beginIndex < 0) {
            throw new IllegalArgumentException("beginIndex < 0: "
                                               + beginIndex);
        }
        if (endIndex < beginIndex) {
            throw new IllegalArgumentException("endIndex < beginIndex "
                + "beginIndex: " + beginIndex + "endIndex: " + endIndex);
        }
        if (padding < 0) {
            throw new IllegalArgumentException("padding < 0: " + padding);
        }
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen < 1: " + maxLen);
        }
        if (null == characterMap) {
            throw new IllegalArgumentException("characterMap null");
        }
        if ((null != dotBegin) && (0 == dotBegin.length())) {
            throw new IllegalArgumentException("dotBegin empty");
        }

        // Initial capacity. Leave some room for %XX lumps.
        // Guaranteed positive.
        int cap = Math.min(2 * (endIndex - beginIndex) + padding + 1,
                           maxLen);
        string = new StringBuffer(cap);
        aux = new byte[cap];
        for (int i = beginIndex; i != endIndex; ++i) {
            String s = str.substring(i, i + 1);
            // Modified part: re-encode the character as GB2312.
            try {
                s = new String(s.getBytes(), "GB2312");
            } catch (UnsupportedEncodingException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            String lump; // Next lump.
            if (".".equals(s) && (i == beginIndex) && (null != dotBegin)) {
                lump = dotBegin;
            } else {
                lump = (String) characterMap.get(s);
            }
            if (null == lump) {
                if ("%".equals(s) && ((endIndex - i) > 2)
                        && (-1 != Character.digit(str.charAt(i + 1), 16))
                        && (-1 != Character.digit(str.charAt(i + 2), 16))) {
                    // %XX escape; treat as one lump.
                    lump = str.substring(i, i + 3);
                    i += 2;
                } else {
                    lump = s;
                }
            }
            if ((string.length() + lump.length()) > maxLen) {
                assert checkInvariants();
                return;
            }
            append(lump);
        }
        assert checkInvariants();
    }
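
To make the loop above easier to follow, the sketch below reproduces only its "lumping" logic as a standalone program. It is a simplified illustration, not the Heritrix implementation. The class `LumpSketch` and the method `lumpify` are hypothetical names, the whole string is processed instead of a `beginIndex`/`endIndex` range, and the `aux` array, `padding`, the invariant checks, and the GB2312 re-encoding step are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone sketch; not part of Heritrix.
final class LumpSketch {

    /**
     * Walks str one character at a time and appends a "lump" for each:
     * dotBegin for a leading dot, the mapped replacement if characterMap
     * has one, an existing %XX escape kept as a single lump, or the
     * character itself. Stops before the result would exceed maxLen.
     */
    static String lumpify(String str, Map<String, String> characterMap,
                          String dotBegin, int maxLen) {
        StringBuilder out = new StringBuilder();
        int end = str.length();
        for (int i = 0; i != end; ++i) {
            String s = str.substring(i, i + 1);
            String lump;
            if (".".equals(s) && i == 0 && dotBegin != null) {
                lump = dotBegin;
            } else {
                lump = characterMap.get(s);
            }
            if (lump == null) {
                if ("%".equals(s) && (end - i) > 2
                        && Character.digit(str.charAt(i + 1), 16) != -1
                        && Character.digit(str.charAt(i + 2), 16) != -1) {
                    // Existing %XX escape: keep it whole and skip its two digits.
                    lump = str.substring(i, i + 3);
                    i += 2;
                } else {
                    lump = s;
                }
            }
            if (out.length() + lump.length() > maxLen) {
                return out.toString(); // A lump is never split to fit.
            }
            out.append(lump);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put(" ", "%20");
        System.out.println(lumpify("a b%20c", map, null, 100));    // a%20b%20c
        System.out.println(lumpify(".htaccess", map, "%2E", 100)); // %2Ehtaccess
        System.out.println(lumpify("abcdef", map, null, 3));       // abc
    }
}
```

Because each character is looked up individually, any per-character transformation, such as the GB2312 re-encoding in the article's version, takes effect before the map lookup and therefore changes which lump is chosen.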