Nutch 2.1 abstracts its storage layer through Gora, so data can be stored in any of HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, or AvroStore, although some of these backends are still immature. In my repeated tests I found that, overall, Nutch 2.1 performs much worse than Nutch 1.6 and, most importantly, cannot run stably over long periods. Nutch 1.6 uses the Hadoop Distributed File System (HDFS) for storage, which is stable and reliable. Here is what I observed with each storage option:
HBase (column store) supports input splits, with the region as the smallest partition unit. As the data grows, the advantage of parallel processing becomes apparent, so it suits large-data applications. However, maintaining an HBase cluster is a big problem: it is more complex than HDFS, and its memory consumption is also alarming.
Accumulo (key/value store) exited abnormally after 3 crawl rounds, reporting an UnsupportedOperationException.
Cassandra (column store): note that localhost must not be mapped to 127.0.0.1 in /etc/hosts. The biggest problem with Cassandra is that it does not support input splits. Even when the data set is large there is only one map task, so parallelism is lost completely.
MySQL (RDBMS) uses a single server as the data source, so how will it cope as the data grows? MySQL is therefore better suited to small-scale applications such as simple vertical search.
DataFileAvroStore (data serialization system): the inject job throws a NullPointerException, see https://issues.apache.org/jira/browse/NUTCH-1477.
AvroStore (data serialization system) has the same problem as DataFileAvroStore.
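To make the point about input splits concrete (HBase gets one split per region, while Cassandra ends up with a single map task), here is a rough back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions: splits are evenly sized, every map task processes rows at the same rate, and tasks run in full "waves" limited by the number of map slots. The function names and all the numbers are hypothetical and are not measurements from my tests.

```python
import math


def map_phase_waves(num_splits: int, map_slots: int) -> int:
    """Number of sequential waves needed to run one map task per split."""
    if num_splits < 1 or map_slots < 1:
        raise ValueError("num_splits and map_slots must be at least 1")
    return math.ceil(num_splits / map_slots)


def estimated_map_time(total_rows: int, num_splits: int,
                       map_slots: int, rows_per_second: float) -> float:
    """Rough wall-clock estimate (seconds) for the map phase.

    Assumes evenly sized splits and identical per-task throughput,
    which real clusters only approximate.
    """
    rows_per_split = total_rows / num_splits
    seconds_per_task = rows_per_split / rows_per_second
    return map_phase_waves(num_splits, map_slots) * seconds_per_task


# Hypothetical cluster: 10 million rows, 20 map slots, 1000 rows/s per task.
many_splits = estimated_map_time(10_000_000, num_splits=40,
                                 map_slots=20, rows_per_second=1000)
single_split = estimated_map_time(10_000_000, num_splits=1,
                                  map_slots=20, rows_per_second=1000)

print(f"40 splits: {many_splits:.0f} s")
print(f"1 split:   {single_split:.0f} s")
```

With a single split the other 19 slots sit idle, so in this model adding machines does not shorten the map phase at all. That is the loss of parallelism described above.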
From the analysis above, Gora still needs improvement. For those who want maximum performance, Nutch 2.1 is not stable enough. I recommend Nutch 1.6, which benefits from the data locality and natural parallelism of HDFS and MapReduce and can be tuned to run very fast.
Test record __nutch2.1.zip (475.3 KB)
Author: Iteye Yangshangchuan