Some views on the Nutch 2.1 abstract storage layer

Nutch 2.1 extends its storage layer through Gora, so data can be stored in any of HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, or AvroStore, although some of these backends are immature. In my repeated tests, Nutch 2.1 performed much worse overall than Nutch 1.6 and, most importantly, could not run stably over long periods. Nutch 1.6 uses the Hadoop Distributed File System (HDFS) for storage, which is stable and reliable. Here is a brief assessment of each storage option:

HBase (column store) supports input splitting, with the region as the smallest partition unit. As the data grows, the advantage of parallel processing becomes apparent, so it is well suited to large-data applications. However, maintaining an HBase cluster is a major burden, more complex than maintaining HDFS, and its memory consumption is also very high.
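To make the link between input splits and parallelism concrete, here is a minimal toy model in Python. It is not the Gora or Hadoop API: the function names `region_splits` and `plan_map_tasks`, the row keys, and the slot count are all hypothetical. It relies only on the fact that MapReduce runs one map task per input split.

```python
def region_splits(region_start_keys):
    """Turn sorted region start keys into (start, end) key ranges, one per region."""
    ends = region_start_keys[1:] + [None]  # None marks the open-ended last region
    return list(zip(region_start_keys, ends))


def plan_map_tasks(splits, map_slots):
    """Return (number of map tasks, number that can run at the same time).

    One map task is created per split, so concurrency is bounded by both
    the split count and the available map slots.
    """
    tasks = len(splits)
    return tasks, min(tasks, map_slots)


# Hypothetical table with four regions starting at these row keys.
splits = region_splits(["", "com.", "net.", "org."])
print(plan_map_tasks(splits, map_slots=8))          # (4, 4)

# A store that exposes the whole data set as a single split.
print(plan_map_tasks([("", None)], map_slots=8))    # (1, 1)
```

In this model, adding regions raises the number of map tasks until the cluster's slots are exhausted, while a store that reports a single split stays at one map task no matter how much data it holds.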

Accumulo (key/value store) exited abnormally after 3 rounds of crawling, reporting an UnsupportedOperationException.

Cassandra (column store): note that localhost in /etc/hosts cannot be mapped to 127.0.0.1. The biggest problem with Cassandra is that it does not support input splitting: even with a large data set there is only one map task, so parallelism is lost completely.
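Regarding the /etc/hosts caveat, a small Python sketch like the following can show which names a hosts file maps to 127.0.0.1 before Cassandra is started. The helper name `names_mapped_to` and the sample entries are hypothetical, and the sketch only reports the mapping; it does not decide what the entry should be.

```python
def names_mapped_to(hosts_text, address="127.0.0.1"):
    """Return the host names that a hosts-file text maps to the given address."""
    names = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if fields[0] == address:
            names.extend(fields[1:])
    return names


# Hypothetical hosts-file content for illustration.
sample = """
127.0.0.1    localhost
192.168.1.10 crawler01   # hypothetical crawl node
"""
print("localhost" in names_mapped_to(sample))  # True
```

To inspect the real file, pass the contents of /etc/hosts instead of `sample`, for example `names_mapped_to(open("/etc/hosts").read())`.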

MySQL (RDBMS) uses a single server as the data source, so how will it cope as the data grows? MySQL is therefore better suited to small-scale applications such as simple vertical search.

DataFileAvroStore (data serialization system): the inject job throws a NullPointerException; see https://issues.apache.org/jira/browse/nutch-1477.

AvroStore (data serialization system) has the same problem as DataFileAvroStore.

From the above analysis, Gora still needs improvement. For those pursuing maximum performance, Nutch 2.1 is not stable enough; I recommend Nutch 1.6, which takes advantage of the data locality and natural parallelism of HDFS and MapReduce and can be tuned to run very fast.

Test record __nutch2.1.zip (475.3 KB)

Author: Iteye Yangshangchuan
