Debugging Nutch 2.0 + Cassandra in Eclipse

Last Update:2018-12-03 Source: Internet

Author: User

Tags cassandra solr


The Nutch project started developing Nutch 2.0 quite early, and two lines have been developed in parallel since then: the regular version and the Gora-based version, which is Nutch 2.0. This article describes how to import the project into Eclipse. The storage layer used here is the NoSQL database Cassandra. I first tried MySQL, but the crawler would not start; debugging showed that Gora's SQL storage backend is not fully implemented, so Cassandra is the more practical choice for testing.

Knowledge required: basic knowledge of Nutch and Cassandra, using Maven to manage projects, and using Git to manage and download projects.

Tools required: Eclipse with the Maven plug-in installed (the plug-in is available from the Eclipse Marketplace).

1. Download and import the project

Download the Nutch 2.0 project from https://github.com/apache/nutch/tree/release-2.0 (the ZIP button on that page packages the project as an archive for download).

Import the project into Eclipse (File > Import > Maven > Existing Maven Projects).

2. Add dependencies

After the import, src/java and src/test are already configured as source folders. The following directories also need to be added to the classpath:

/conf

/src/plugin/protocol-httpclient/src/java

/src/plugin/urlfilter-domain/src/java

/src/plugin/lib-http/src/java

/src/plugin/protocol-http/src/java

/src/plugin/urlfilter-suffix/src/java

/src/plugin/urlfilter-regex/src/java

/src/plugin/lib-regex-filter/src/java

/src/plugin/urlnormalizer-basic/src/java

/src/plugin/urlnormalizer-pass/src/java

/src/plugin/urlnormalizer-regex/src/java

/src/plugin/scoring-opic/src/java

/src/plugin/parse-html/src/java

These basic plug-ins are now on the classpath. The parse-html plug-in also needs two additional JAR packages, nekohtml and tagsoup, which are added to the POM in the next step.

3. Add the additional JAR packages to the POM file

Add the following dependencies to pom.xml and remove the original gora-core and gora-sql dependencies.

    <dependency>
        <groupId>org.apache.gora</groupId>
        <artifactId>gora-core</artifactId>
        <version>0.2</version>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.apache.gora</groupId>
        <artifactId>gora-cassandra</artifactId>
        <version>0.2</version>
        <optional>true</optional>
    </dependency>
    <!-- html parser dependency -->
    <dependency>
        <groupId>net.sourceforge.nekohtml</groupId>
        <artifactId>nekohtml</artifactId>
        <version>1.9.15</version>
    </dependency>
    <dependency>
        <groupId>org.ccil.cowan.tagsoup</groupId>
        <artifactId>tagsoup</artifactId>
        <version>1.2</version>
    </dependency>

If Maven cannot download the Gora packages, get them from http://gora.apache.org/releases.html#download and install them into the local Maven repository.

4. Modify the configuration files

Rename every file in the conf directory whose name ends in .template by dropping that suffix, for example nutch-site.xml.template to nutch-site.xml. Otherwise some plug-ins cannot find their configuration files at runtime.

Edit nutch-site.xml under conf and add the following settings (replace the values with your own):

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>http.agent.name</name>
            <value>your agent name</value>
        </property>
        <property>
            <name>http.agent.description</name>
            <value>EIP</value>
        </property>
        <property>
            <name>http.agent.url</name>
            <value>your HTTP agent URL</value>
        </property>
        <property>
            <name>http.agent.email</name>
            <value>your HTTP agent email</value>
        </property>
        <property>
            <name>storage.data.store.class</name>
            <value>org.apache.gora.cassandra.store.CassandraStore</value>
            <description>Sets the Gora storage layer implementation class. Possible values:
                Relational database: org.apache.gora.sql.store.SqlStore
                Cassandra: org.apache.gora.cassandra.store.CassandraStore
                HBase: org.apache.gora.hbase.store.HBaseStore
                Accumulo: org.apache.gora.accumulo.store.AccumuloStore
                Avro: org.apache.gora.avro.store.AvroStore
                File format: org.apache.gora.avro.store.DataFileAvroStore
                In memory: org.apache.gora.memory.store.MemStore
            </description>
        </property>
    </configuration>
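A quick way to confirm that the edited file is well-formed and that the storage class is really set is to parse it outside of Nutch. The following is only a minimal sketch using the JDK's built-in DOM parser; the class name NutchSiteCheck and the relative path conf/nutch-site.xml are assumptions made for illustration and are not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads <property> entries from a
// Hadoop-style configuration file such as nutch-site.xml.
final class NutchSiteCheck {

    /** Returns name -> value for every <property> element in the given XML text. */
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < props.getLength(); i++) {
                Element property = (Element) props.item(i);
                String name = childText(property, "name");
                String value = childText(property, "value");
                if (name != null) {
                    result.put(name, value == null ? "" : value);
                }
            }
            return result;
        } catch (Exception e) {
            // Malformed XML (for example a missing <property> tag) ends up here.
            throw new IllegalArgumentException("Not a well-formed configuration file", e);
        }
    }

    /** Trimmed text of the first child element with the given tag, or null if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws IOException {
        // Assumed location relative to the project root.
        byte[] bytes = Files.readAllBytes(Paths.get("conf", "nutch-site.xml"));
        Map<String, String> props = readProperties(new String(bytes, StandardCharsets.UTF_8));
        System.out.println("storage.data.store.class = " + props.get("storage.data.store.class"));
        System.out.println("http.agent.name = " + props.get("http.agent.name"));
    }
}
```

Run from the project root, it prints the two values the crawler depends on most; a parse error usually points to an unbalanced tag in the file.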

Edit /nutch/conf/gora.properties under conf.

Comment out the configuration of the relational database:

    #gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    #gora.sqlstore.jdbc.url=jdbc:hsqldb://localhost/test
    #gora.sqlstore.jdbc.user=sa
    #gora.sqlstore.jdbc.password=

Uncomment the line gora.cassandrastore.servers=localhost:9160 so that Cassandra is used as the storage layer.
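The result of this step can be checked mechanically: no gora.sqlstore.* key should remain active, and gora.cassandrastore.servers should have a value. The sketch below does this with java.util.Properties, which treats lines starting with # as comments. The class GoraPropertiesCheck and the relative path conf/gora.properties are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of Nutch or Gora.
final class GoraPropertiesCheck {

    /** Parses properties text; lines starting with '#' are ignored as comments. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** True if the Cassandra servers are set and no SQL store key is still active. */
    static boolean usesCassandraOnly(Properties props) {
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("gora.sqlstore.")) {
                return false; // a relational database setting is still uncommented
            }
        }
        String servers = props.getProperty("gora.cassandrastore.servers", "").trim();
        return !servers.isEmpty();
    }

    public static void main(String[] args) throws IOException {
        // Assumed location relative to the project root.
        byte[] bytes = Files.readAllBytes(Paths.get("conf", "gora.properties"));
        Properties props = parse(new String(bytes, StandardCharsets.UTF_8));
        System.out.println("Cassandra servers: " + props.getProperty("gora.cassandrastore.servers"));
        System.out.println("Cassandra-only configuration: " + usesCassandraOnly(props));
    }
}
```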

5. Start Cassandra

Start Cassandra before running the crawler. It must be reachable at the address configured in gora.properties (localhost:9160 in this setup).
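Before launching the crawler it can save time to confirm that something is actually listening on the configured address. The sketch below only tests that a TCP connection to localhost:9160 can be opened; it does not speak Cassandra's protocol, so it cannot prove that the listener is Cassandra. The class name CassandraPortCheck and the 2-second timeout are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Nutch or Cassandra.
final class CassandraPortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unknown host.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port taken from gora.cassandrastore.servers=localhost:9160.
        boolean up = isListening("localhost", 9160, 2000);
        System.out.println(up
                ? "localhost:9160 accepts connections"
                : "Nothing is listening on localhost:9160; start Cassandra first");
    }
}
```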

6. Run the crawler

Create a urls folder under the project root, create a text file in that folder, and enter a few site URLs in it, such as http://www.163.com/.
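The seed file is plain text with one URL per line. If you prefer to generate it rather than create it by hand, a minimal sketch could look like the following. The class name SeedFileWriter and the file name seed.txt are assumptions; any text file inside the urls folder should serve the same purpose.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class SeedFileWriter {

    /** Renders one URL per line, trimming whitespace and skipping blank entries. */
    static String render(List<String> urls) {
        StringBuilder out = new StringBuilder();
        for (String url : urls) {
            String trimmed = url.trim();
            if (!trimmed.isEmpty()) {
                out.append(trimmed).append('\n');
            }
        }
        return out.toString();
    }

    /** Creates <projectRoot>/urls if needed and writes the seed list into it. */
    static Path write(Path projectRoot, List<String> urls) throws IOException {
        Path dir = Files.createDirectories(projectRoot.resolve("urls"));
        Path seed = dir.resolve("seed.txt"); // assumed file name
        Files.write(seed, render(urls).getBytes(StandardCharsets.UTF_8));
        return seed;
    }

    public static void main(String[] args) throws IOException {
        // Run from the project root so that the folder lands next to conf and src.
        Path seed = write(Paths.get("."), Arrays.asList("http://www.163.com/"));
        System.out.println("Wrote " + seed.toAbsolutePath());
    }
}
```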

Finally, run /nutch/src/java/org/apache/nutch/crawl/Crawler.java as a Java application with the program arguments: urls -depth 2. The crawler then runs and saves the crawled web pages to Cassandra.

For indexing, Nutch 2.0 uses Solr as the search backend, just like Nutch 1.3+, so the usage is the same as in Nutch 1.3+ and is not covered here. I am used to Elasticsearch and now find Solr too cumbersome to configure just for debugging. As an alternative, you can change the source code directly and add a method that indexes Nutch documents into Elasticsearch.

References: http://www.searchtech.pro/articles/2013/02/18/1361191389790.html

This article is an English version of an article originally published in Chinese on aliyun.com.
