Heritrix is used as a web crawler. When you choose to store documents in mirror mode, if the URL contains Chinese characters or the requested file name is Chinese, the downloaded file shows up with a garbled name in the mirror directory.
Before solving this problem, let's see why the garbled text occurs.
Take the Chinese Cultural Relics network as an example. The following URL points to an image whose file name contains Chinese characters:
http://www.wenwuchina.com/uploads/conew_ ____ conew1.jpg
When this address is entered in a browser, the browser encodes the Chinese characters in the path as percent-escaped bytes.
Heritrix then uses that encoded path as the file name when it stores the resource, which is why the name looks garbled in the mirror directory.
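To make the browser's encoding step concrete, here is a minimal sketch of how the same Chinese text turns into different percent-escaped bytes depending on the charset. The file name used here ("中文", written as Unicode escapes) and the class name PercentEncodingDemo are hypothetical, since the real file name is not readable in the URL above. Note that URLEncoder applies form encoding (a space becomes "+"), which behaves the same as path encoding for Chinese characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodingDemo {

    // Percent-encode a path segment using the named charset.
    // The checked exception is rethrown unchecked to keep callers simple.
    static String encode(String segment, String charsetName) {
        try {
            return URLEncoder.encode(segment, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical file name: U+4E2D U+6587 ("Chinese text").
        String name = "\u4E2D\u6587";

        // Three bytes per character in UTF-8.
        System.out.println(encode(name, "UTF-8"));   // %E4%B8%AD%E6%96%87

        // Two bytes per character in GB2312.
        System.out.println(encode(name, "GB2312"));  // %D6%D0%CE%C4
    }
}
```

Which of the two forms a site uses depends on the encoding of its pages, so a crawler that stores the escaped form sees only the bytes, not the characters.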
The solution is to fix the encoding of the path name when the path is created. The relevant code is the LumpyString constructor in the org.archive.crawler.writer.MirrorWriterProcessor class.
To respect the source code, I did not change the original class. Instead, I created a new class, org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor, to extend Heritrix: I copied all the code from org.archive.crawler.writer.MirrorWriterProcessor and made the necessary changes to LumpyString. The modified part is the try/catch block that re-decodes each character as GB2312.
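As a rough illustration of what fixing the path name means, the sketch below turns a percent-escaped segment back into a readable file name. This is not the Heritrix code and not the change described in this post; the class name ReadableNameDemo, the sample segment, and the assumption that the site escapes its URLs as GB2312 are all hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ReadableNameDemo {

    // Decode %XX escapes in a path segment using the named charset.
    // Caveat: URLDecoder also turns "+" into a space (form decoding).
    static String decode(String escapedSegment, String charsetName) {
        try {
            return URLDecoder.decode(escapedSegment, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical escaped segment: GB2312 bytes D6 D0 CE C4 plus an extension.
        String escaped = "%D6%D0%CE%C4.jpg";

        // Assumes the escapes are GB2312; a wrong charset gives wrong characters.
        String readable = decode(escaped, "GB2312");

        // The decoded name is U+4E2D U+6587 followed by ".jpg".
        System.out.println(readable.equals("\u4E2D\u6587.jpg"));  // true
    }
}
```

The charset has to match the one the site used when it escaped the URL, which is why a fix like this tends to be site-specific.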
The code is as follows:
    // Constructor of the inner class LumpyString.
    // The enclosing file needs: import java.io.UnsupportedEncodingException;
    LumpyString(String str, int beginIndex, int endIndex, int padding,
            int maxLen, Map characterMap, String dotBegin) {
        if (beginIndex < 0) {
            throw new IllegalArgumentException("beginIndex < 0: "
                    + beginIndex);
        }
        if (endIndex < beginIndex) {
            throw new IllegalArgumentException("endIndex < beginIndex "
                    + "beginIndex: " + beginIndex + " endIndex: " + endIndex);
        }
        if (padding < 0) {
            throw new IllegalArgumentException("padding < 0: " + padding);
        }
        if (maxLen < 1) {
            throw new IllegalArgumentException("maxLen < 1: " + maxLen);
        }
        if (null == characterMap) {
            throw new IllegalArgumentException("characterMap null");
        }
        if ((null != dotBegin) && (0 == dotBegin.length())) {
            throw new IllegalArgumentException("dotBegin empty");
        }

        // Initial capacity. Leave some room for %XX lumps.
        // Guaranteed positive.
        int cap = Math.min(2 * (endIndex - beginIndex) + padding + 1,
                maxLen);
        string = new StringBuffer(cap);
        aux = new byte[cap];
        for (int i = beginIndex; i != endIndex; ++i) {
            String s = str.substring(i, i + 1);

            // Modified part: re-decode the character as GB2312.
            // Note that getBytes() uses the platform default charset.
            try {
                s = new String(s.getBytes(), "GB2312");
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }

            String lump; // Next lump.
            if (".".equals(s) && (i == beginIndex) && (null != dotBegin)) {
                lump = dotBegin;
            } else {
                lump = (String) characterMap.get(s);
            }
            if (null == lump) {
                if ("%".equals(s) && ((endIndex - i) > 2)
                        && (-1 != Character.digit(str.charAt(i + 1), 16))
                        && (-1 != Character.digit(str.charAt(i + 2), 16))) {
                    // %XX escape; treat as one lump.
                    lump = str.substring(i, i + 3);
                    i += 2;
                } else {
                    lump = s;
                }
            }
            if ((string.length() + lump.length()) > maxLen) {
                assert checkInvariants();
                return;
            }
            append(lump);
        }
        assert checkInvariants();
    }
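The re-decoding step in the modified constructor relies on a general trick: text whose bytes were decoded with the wrong charset can often be recovered by turning it back into bytes and decoding again with the right one. The sketch below shows that round trip in isolation, assuming the wrong charset was ISO-8859-1 and the right one is GB2312. The class and method names are hypothetical. Unlike the constructor above, it names both charsets explicitly, because a bare getBytes() depends on the platform default and may behave differently from one machine to another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // GB2312 ships with standard JDK builds; forName throws if it is missing.
    static final Charset GB2312 = Charset.forName("GB2312");

    // Simulate the damage: GB2312 bytes wrongly decoded as ISO-8859-1.
    static String garble(String original) {
        return new String(original.getBytes(GB2312), StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the raw bytes, then decode them as GB2312.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GB2312);
    }

    public static void main(String[] args) {
        // Hypothetical file name: U+4E2D U+6587.
        String original = "\u4E2D\u6587";

        String garbled = garble(original);
        // Two Chinese characters became four Latin-1 characters.
        System.out.println(garbled.length());                   // 4

        // The round trip restores the original text.
        System.out.println(repair(garbled).equals(original));   // true
    }
}
```

The round trip is lossless here because ISO-8859-1 maps every byte value to exactly one character; with other intermediate charsets some bytes may be replaced and cannot be recovered.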
Finally, add org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor to the list of available processors (Processor.options in Heritrix 1.x), then add the processor to the job. The garbled file names will disappear.
PS: Garbled characters in Chinese resources are a common problem, and there are many solutions online that you can look up yourself. The fix is also very simple: you only need to change one line of code.