Solutions for garbled Chinese characters in the storage path of Heritrix mirrors

Last Update: 2013-12-31 Source: Internet

Author: User

Heritrix is a web crawler. When you store downloaded documents in mirror mode, and the URL contains Chinese characters or the requested file has a Chinese name, the directory and file names in the mirror directory come out garbled.

Before solving this problem, let's see why the garbled text occurs.

Take the Chinese Cultural Relics website (wenwuchina.com) as an example. The following URL points to an image:

http://www.wenwuchina.com/uploads/conew_ ____ conew1.jpg

When the address is entered in a browser, the browser encodes the Chinese characters in the path as percent-escaped bytes before sending the request.

Heritrix then uses the resulting file name to store the resource, which is where the garbled characters come from.
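As a rough illustration of why this matters, the sketch below percent-encodes the same Chinese text under two charsets and then decodes one result with the wrong charset. The sample string "中文", the class name PathEncodingDemo, and its helper methods are assumptions made for this example; they are not taken from the article or from Heritrix.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Heritrix.
class PathEncodingDemo {

    /** Percent-encodes text using the named charset. */
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    /** Decodes percent-escapes, interpreting the bytes in the named charset. */
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String name = "\u4e2d\u6587"; // "中文", an assumed sample file-name fragment

        String utf8 = encode(name, "UTF-8");
        String gb = encode(name, "GB2312");
        System.out.println(utf8); // %E4%B8%AD%E6%96%87
        System.out.println(gb);   // %D6%D0%CE%C4

        // Decoding with the charset that produced the bytes restores the text.
        System.out.println(decode(utf8, "UTF-8").equals(name)); // true

        // Decoding GB2312 bytes as UTF-8 does not: the result is garbled.
        System.out.println(decode(gb, "UTF-8").equals(name));   // false
    }
}
```

The same characters yield different byte sequences under different charsets, so whichever code turns the escapes back into a file name has to assume the charset the site actually used.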

The solution is to encode the path name when the path is created. The relevant code is the LumpyString constructor in the org.archive.crawler.writer.MirrorWriterProcessor class.

To leave the original source code untouched, I did not change the original class. Instead, I created a new class, org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor, to extend Heritrix, copied all the code from org.archive.crawler.writer.MirrorWriterProcessor into it, and made the necessary change to LumpyString. The modified part is the try/catch block that re-decodes each character as GB2312.

The code is as follows:

// Requires "import java.io.UnsupportedEncodingException;" at the top of the class file.
LumpyString(String str, int beginIndex, int endIndex, int padding,
        int maxLen, Map characterMap, String dotBegin) {
    if (beginIndex < 0) {
        throw new IllegalArgumentException("beginIndex < 0: "
                + beginIndex);
    }
    if (endIndex < beginIndex) {
        throw new IllegalArgumentException("endIndex < beginIndex "
                + "beginIndex: " + beginIndex + " endIndex: " + endIndex);
    }
    if (padding < 0) {
        throw new IllegalArgumentException("padding < 0: " + padding);
    }
    if (maxLen < 1) {
        throw new IllegalArgumentException("maxLen < 1: " + maxLen);
    }
    if (null == characterMap) {
        throw new IllegalArgumentException("characterMap null");
    }
    if ((null != dotBegin) && (0 == dotBegin.length())) {
        throw new IllegalArgumentException("dotBegin empty");
    }

    // Initial capacity. Leave some room for %XX lumps.
    // Guaranteed positive.
    int cap = Math.min(2 * (endIndex - beginIndex) + padding + 1,
            maxLen);
    string = new StringBuffer(cap);
    aux = new byte[cap];
    for (int i = beginIndex; i != endIndex; ++i) {
        String s = str.substring(i, i + 1);
        try {
            // Modified part: re-decode the character as GB2312.
            // Note that getBytes() with no argument uses the platform
            // default charset, so the result depends on the JVM settings.
            s = new String(s.getBytes(), "GB2312");
        } catch (UnsupportedEncodingException e) {
            // GB2312 is not available in this JVM; keep s unchanged.
            e.printStackTrace();
        }
        String lump; // Next lump.
        if (".".equals(s) && (i == beginIndex) && (null != dotBegin)) {
            lump = dotBegin;
        } else {
            lump = (String) characterMap.get(s);
        }
        if (null == lump) {
            if ("%".equals(s) && ((endIndex - i) > 2)
                    && (-1 != Character.digit(str.charAt(i + 1), 16))
                    && (-1 != Character.digit(str.charAt(i + 2), 16))) {

                // %XX escape; treat as one lump.
                lump = str.substring(i, i + 3);
                i += 2;
            } else {
                lump = s;
            }
        }
        if ((string.length() + lump.length()) > maxLen) {
            assert checkInvariants();
            return;
        }
        append(lump);
    }
    assert checkInvariants();
}
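The GB2312 conversion in the listing resembles a common Java idiom for repairing text whose bytes were decoded with the wrong charset. The sketch below shows that idiom in isolation, with both charsets named explicitly rather than relying on the platform default. It is an illustration under assumptions, not the article's exact method: the class RedecodeSketch, the method redecode, the sample string, and the choice of ISO-8859-1 as the wrongly applied charset are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Heritrix.
class RedecodeSketch {

    /**
     * Turns garbled text back into bytes using the charset that was wrongly
     * applied, then decodes those bytes with the charset that should have
     * been used. This only works when the wrong charset maps every byte to
     * a character without loss, as ISO-8859-1 does.
     */
    static String redecode(String garbled, Charset wrong, Charset right) {
        return new String(garbled.getBytes(wrong), right);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        String original = "\u4e2d\u6587"; // "中文", an assumed sample

        // Simulate the damage: GB2312 bytes read as ISO-8859-1 characters.
        String garbled = new String(original.getBytes(gb2312),
                StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 4, one character per byte

        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, gb2312);
        System.out.println(repaired.equals(original)); // true
    }
}
```

Naming both charsets makes the conversion behave the same on every machine, whereas a no-argument getBytes() call gives results that vary with the JVM's default charset.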

Then register org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor as a Processor in Heritrix and add that Processor to the job. The garbled characters will disappear.

PS: Garbled characters in Chinese resources are a common problem, and many solutions are available online that you can look up yourself. The fix is also simple: you only need to change one line of code.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
