Some views on the Nutch 2.1 abstract storage layer

Source: Internet
Author: User

Nutch 2.1 abstracts its storage layer through Gora, which supports HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, AvroStore, and other backends. In my repeated tests, Nutch 2.1 performed much worse than Nutch 1.6 and, more importantly, could not run stably over long periods. Here are my notes on each storage backend:
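For readers who want to try the backends themselves: in Nutch 2.x the backend is normally chosen with the storage.data.store.class property in conf/nutch-site.xml (conf/gora.properties has a matching gora.datastore.default entry). The sketch below only renders that property for a given backend. The short backend keys and the helper name storage_property are my own inventions for illustration, and the class names are quoted from memory of the Nutch 2.x default configuration, so check them against your own nutch-default.xml.

```python
import xml.etree.ElementTree as ET

# Gora store classes as recalled from the Nutch 2.x default configuration.
# Assumption: verify each name against the nutch-default.xml of your release.
GORA_STORES = {
    "hbase": "org.apache.gora.hbase.store.HBaseStore",
    "accumulo": "org.apache.gora.accumulo.store.AccumuloStore",
    "cassandra": "org.apache.gora.cassandra.store.CassandraStore",
    "mysql": "org.apache.gora.sql.store.SqlStore",
    "datafileavro": "org.apache.gora.avro.store.DataFileAvroStore",
    "avro": "org.apache.gora.avro.store.AvroStore",
}


def storage_property(backend: str) -> str:
    """Render the <property> element that selects a Gora backend.

    Hypothetical helper: it only builds the XML text and does not touch
    any Nutch installation.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "storage.data.store.class"
    ET.SubElement(prop, "value").text = GORA_STORES[backend]
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Paste the printed element inside <configuration> in conf/nutch-site.xml.
    print(storage_property("hbase"))
```

Switching backends usually also requires enabling the matching gora-* dependency in the build, which this sketch does not cover.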

HBase: input splitting is supported, with the region as the smallest split unit. As the data grows, the benefit of parallel processing becomes apparent, so HBase suits large-data applications. However, maintaining an HBase cluster is a big problem: it is more complex than HDFS, and its memory consumption is alarming.

Accumulo: the crawl exited abnormally after 3 rounds with an UnsupportedOperationException.

Cassandra: note that in the hosts file, localhost must not be mapped to 127.0.0.1. The biggest problem with the Cassandra backend is that it does not support input splitting: however large the data, there is only one map task, so parallelism is lost completely.
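To make the effect of input splitting concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes MapReduce launches one map task per input split, that rows are spread evenly across splits, and that tasks run in waves limited by the available map slots. The function names and all numbers (row count, throughput, 40 regions, 20 slots) are invented for illustration.

```python
import math


def map_tasks(num_splits: int) -> int:
    """One map task per input split; a store without splitting yields one."""
    return max(1, num_splits)


def estimated_map_phase_seconds(total_rows: int,
                                rows_per_second_per_task: float,
                                num_splits: int,
                                map_slots: int) -> float:
    """Crude estimate: evenly sized splits, tasks executed in full waves.

    Ignores task startup cost, data skew, and network effects.
    """
    tasks = map_tasks(num_splits)
    rows_per_task = math.ceil(total_rows / tasks)
    waves = math.ceil(tasks / map_slots)
    return waves * rows_per_task / rows_per_second_per_task


if __name__ == "__main__":
    rows = 10_000_000   # hypothetical table size
    rate = 5_000.0      # hypothetical rows per second for one map task
    slots = 20          # hypothetical number of map slots in the cluster

    # HBase-like store: one split per region, here 40 regions.
    print(estimated_map_phase_seconds(rows, rate, num_splits=40, map_slots=slots))
    # Store without input splitting: a single split, so a single map task.
    print(estimated_map_phase_seconds(rows, rate, num_splits=1, map_slots=slots))
```

Under these made-up numbers the split-aware store finishes the map phase in about 100 seconds, while the single-split store needs about 2000 seconds no matter how many slots the cluster has.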

MySQL: a single server is the only data source, so how will MySQL cope as the data grows? MySQL is therefore better suited to small-scale applications such as simple vertical search.

DataFileAvroStore: the inject job throws a NullPointerException; see https://issues.apache.org/jira/browse/NUTCH-1477.

AvroStore: the same problem as DataFileAvroStore.

From the above, Gora still needs improvement. For those pursuing maximum performance, Nutch 2.1 is not stable enough; I suggest using Nutch 1.6 instead, which, by taking advantage of the data locality and natural parallelism of HDFS and MapReduce, can be tuned to run very fast.
