Garbled Chinese web pages in Nutch 2.0 + Cassandra

Source: Internet
Author: User

When crawling and storing web pages with Nutch 2.0 + Cassandra 1.0, we found that all Chinese characters in GBK-encoded pages came out garbled when the pages were parsed and their text extracted. This was strange, because we had never seen garbled Chinese in Nutch 1.x, and the crawler code in Nutch 1.x and Nutch 2.x is almost the same, so I guessed the problem was in how pages are saved to Cassandra. Reading the code that saves a web page shows that every value to be saved is converted to binary, wrapped in a ByteBuffer object, and handed to Gora for persistence. The next step was to look at how the gora-cassandra source code operates on Cassandra.

In CassandraClient.java, the addColumn method adds the data. If the value is a ByteBuffer, it is first converted to a string:

    public void addColumn(String key, String fieldName, Object value) {
      if (value == null) {
        return;
      }
      if (value instanceof ByteBuffer) {
        value = toString((ByteBuffer) value);
      }

      String columnFamily = this.cassandraMapping.getFamily(fieldName);
      String columnName = this.cassandraMapping.getColumn(fieldName);

      this.mutator.insert(key, columnFamily, HFactory.createStringColumn(columnName, value.toString()));
    }

The code in ByteUtils.java that converts the bytes into a string:

    public static String toString(final byte [] b, int off, int len) {
      if(b == null) {
        return null;
      }
      if(len == 0) {
        return "";
      }
      String result = null;
      try {
        result = new String(b, off, len, "UTF-8");
      } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
      }
      return result;
    }

What a trap: the bytes are decoded directly as UTF-8 before being saved. In other words, when a GBK-encoded page is crawled, its GBK bytes are turned into a string as if they were UTF-8 and saved to Cassandra in that form. Storing the content as UTF-8 might have been acceptable on its own, but when Nutch later parses the page, its encoding detection favors the charset declared in the HTTP header (if the header does not declare one, it examines the page content instead), and at this point the declared charset is still GBK. As a result, data that has passed through UTF-8 on its way into Cassandra is decoded as GBK, so it is no surprise that the characters are garbled. Once the cause is known, the problem is easy to solve.
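The effect can be reproduced in isolation with a minimal Java sketch. It assumes the JDK provides the GBK charset (standard JDK distributions do). The class GbkRoundTrip and the helper storeAsUtf8String are hypothetical names that only imitate the decode-as-UTF-8 storage step; they are not Nutch or Gora code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GbkRoundTrip {
    // Assumption: the running JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    /**
     * Hypothetical stand-in for the storage path: the raw page bytes are
     * decoded as UTF-8, kept as a String, and later read back as UTF-8 bytes.
     */
    static byte[] storeAsUtf8String(byte[] rawPageBytes) {
        String stored = new String(rawPageBytes, StandardCharsets.UTF_8);
        return stored.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] gbkBytes = original.getBytes(GBK);

        // Bytes as they come back after being stored as a UTF-8 string.
        byte[] readBack = storeAsUtf8String(gbkBytes);

        // The parser still believes the page is GBK.
        String parsed = new String(readBack, GBK);

        System.out.println("bytes preserved: " + Arrays.equals(gbkBytes, readBack));
        System.out.println("text preserved:  " + original.equals(parsed));
        System.out.println("direct GBK decode works: "
                + original.equals(new String(gbkBytes, GBK)));
    }
}
```

The GBK byte sequences here are not valid UTF-8, so the decoder substitutes replacement characters for them. The original bytes cannot be recovered after that step, whichever charset the parser later chooses.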

I wondered why the data was not simply stored in binary form, which should also be more efficient. Then I looked at CassandraClient.java again: the toString method carries a TODO comment saying that a binary field should not be converted from bytes to a string for storage, so this part was evidently not finished yet.

    /**
     * TODO do no convert bytes to string to store a binary field
     * @param value
     * @return
     */
    private static String toString(ByteBuffer value) {
      ByteBuffer byteBuffer = (ByteBuffer) value;
      return ByteUtils.toString(byteBuffer.array(), 0, byteBuffer.limit());
    }

So I went to git and checked the Gora 0.3 code. It had been changed: values are no longer converted directly into strings for storage. I had been planning to fix this myself, but it seems that effort has been saved. The simplest solution is to change the Gora dependency of Nutch 2.0 from 0.2 to 0.3.
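The idea behind the TODO, keeping a binary value as bytes instead of passing it through a string, can be sketched in plain Java. This is not the actual Gora 0.3 code; the class ByteBufferBytes and the method toBytes are hypothetical names used only for illustration. The sketch also copies from the buffer's current position to its limit, whereas calling array() with offset 0 and length limit(), as in the snippet above, ignores the buffer's position and array offset.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ByteBufferBytes {
    /**
     * Copies the readable bytes (position to limit) of the buffer.
     * Working on a duplicate leaves the caller's position untouched.
     */
    static byte[] toBytes(ByteBuffer value) {
        ByteBuffer view = value.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        // The GBK encoding of two Chinese characters, as raw bytes.
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        ByteBuffer buffer = ByteBuffer.wrap(gbkBytes);

        byte[] stored = toBytes(buffer);

        // No charset is involved, so the bytes survive unchanged.
        System.out.println("bytes preserved: " + Arrays.equals(gbkBytes, stored));
        System.out.println("buffer still readable: " + buffer.remaining());
    }
}
```

Because no decoding happens on the way in, whatever charset the parser later detects is applied to the bytes the page originally contained.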
