Debugging Nutch 2.0 + Cassandra in Eclipse

Source: Internet
Author: User
Tags: cassandra, solr

The Nutch project began developing Nutch 2.0 quite early, and two lines have been developed in parallel ever since: the regular version and the Gora-based version, which is Nutch 2.0. This article shows how to import the project into Eclipse. The storage layer used here is the NoSQL database Cassandra. I tried MySQL first, but the crawler would not start; debugging showed that Gora's SQL storage backend is not fully implemented, so Cassandra is the more practical choice for testing.

Knowledge required: basic knowledge of Nutch and Cassandra, plus experience using Maven to manage projects and Git to manage and download them.

Tools required: Eclipse with the Maven plug-in installed (the plug-in is available from the Eclipse Marketplace).

1. Download and import the project

Download the Nutch 2.0 project from https://github.com/apache/nutch/tree/release-2.0 (on Windows, the ZIP button downloads the project as a packaged archive).

Import the project into Eclipse (File > Import > Maven > Existing Maven Projects).

2. Add dependencies

After the import, src/java and src/test are already configured as source folders. Add the following directories to the build path as source folders as well:

/conf
/src/plugin/protocol-httpclient/src/java
/src/plugin/urlfilter-domain/src/java
/src/plugin/lib-http/src/java
/src/plugin/protocol-http/src/java
/src/plugin/urlfilter-suffix/src/java
/src/plugin/urlfilter-regex/src/java
/src/plugin/lib-regex-filter/src/java
/src/plugin/urlnormalizer-basic/src/java
/src/plugin/urlnormalizer-pass/src/java
/src/plugin/urlnormalizer-regex/src/java
/src/plugin/scoring-opic/src/java
/src/plugin/parse-html/src/java

With these basic plug-ins on the classpath, parse-html still needs two additional JAR packages, nekohtml and tagsoup. They are added to the POM in the next step.
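The source folders listed above can be added one by one through the Eclipse build path dialog. As an alternative, the following minimal sketch prints them as source entries in the format Eclipse uses in its .classpath file, so they can be pasted in. The class name ClasspathEntries is hypothetical, and the folder list simply mirrors the list above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: prints Eclipse .classpath source entries for the folders above. */
final class ClasspathEntries {

    // Source folders from the list above, relative to the project root.
    static final String[] SOURCE_FOLDERS = {
        "conf",
        "src/plugin/protocol-httpclient/src/java",
        "src/plugin/urlfilter-domain/src/java",
        "src/plugin/lib-http/src/java",
        "src/plugin/protocol-http/src/java",
        "src/plugin/urlfilter-suffix/src/java",
        "src/plugin/urlfilter-regex/src/java",
        "src/plugin/lib-regex-filter/src/java",
        "src/plugin/urlnormalizer-basic/src/java",
        "src/plugin/urlnormalizer-pass/src/java",
        "src/plugin/urlnormalizer-regex/src/java",
        "src/plugin/scoring-opic/src/java",
        "src/plugin/parse-html/src/java"
    };

    /** Formats one folder as an Eclipse .classpath source entry. */
    static String entry(String folder) {
        return "<classpathentry kind=\"src\" path=\"" + folder + "\"/>";
    }

    /** Builds one entry per source folder, in list order. */
    static List<String> entries() {
        List<String> out = new ArrayList<>();
        for (String folder : SOURCE_FOLDERS) {
            out.add(entry(folder));
        }
        return out;
    }

    public static void main(String[] args) {
        entries().forEach(System.out::println);
    }
}
```

After pasting the entries into .classpath, refresh the project in Eclipse so the new source folders are picked up.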

3. Add the additional JAR packages to the POM file

Add the following dependencies to pom.xml and remove the original gora-core and gora-sql dependencies.

    <dependency>
        <groupId>org.apache.gora</groupId>
        <artifactId>gora-core</artifactId>
        <version>0.2</version>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.apache.gora</groupId>
        <artifactId>gora-cassandra</artifactId>
        <version>0.2</version>
        <optional>true</optional>
    </dependency>
    <!-- html parser dependency -->
    <dependency>
        <groupId>net.sourceforge.nekohtml</groupId>
        <artifactId>nekohtml</artifactId>
        <version>1.9.15</version>
    </dependency>
    <dependency>
        <groupId>org.ccil.cowan.tagsoup</groupId>
        <artifactId>tagsoup</artifactId>
        <version>1.2</version>
    </dependency>
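To confirm that the dependencies above actually resolved, one option is to probe the classpath for a class from each artifact. The sketch below is a hypothetical helper, not part of Nutch; the class names it probes are the ones I would expect these artifacts to provide (CassandraStore is the store class configured later in this article), so treat them as assumptions and adjust them if your versions differ.

```java
/** Hypothetical helper: checks whether classes from the new dependencies can be loaded. */
final class DependencyProbe {

    /** Returns true if the named class is visible on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, DependencyProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per artifact added to the POM.
        String[] expected = {
            "org.apache.gora.cassandra.store.CassandraStore", // gora-cassandra
            "org.cyberneko.html.parsers.DOMParser",           // nekohtml
            "org.ccil.cowan.tagsoup.Parser"                   // tagsoup
        };
        for (String name : expected) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it as a Java application inside the Eclipse project so that it sees the same classpath as the crawler.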

If Maven cannot download the Gora packages, get them from http://gora.apache.org/releases.html#download and install them into your local Maven repository.

4. Modify the configuration files

Remove the .template suffix from every file in the conf directory that has one, for example rename nutch-site.xml.template to nutch-site.xml, so that plug-ins can find their configuration files at runtime.
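Renaming the template files by hand works, but the step can also be scripted. The following is a minimal sketch using java.nio.file; the class name TemplateRenamer is hypothetical, and the choice to skip a template whose target file already exists is my own assumption to avoid overwriting an edited config.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: strips the ".template" suffix from files in a conf directory. */
final class TemplateRenamer {

    static final String SUFFIX = ".template";

    /** Renames every *.template file in confDir and returns the new paths. */
    static List<Path> stripTemplateSuffix(Path confDir) throws IOException {
        // Collect first, then rename, so the directory is not modified while it is being listed.
        List<Path> templates = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(confDir, "*" + SUFFIX)) {
            for (Path p : stream) {
                templates.add(p);
            }
        }
        List<Path> renamed = new ArrayList<>();
        for (Path template : templates) {
            String name = template.getFileName().toString();
            Path target = template.resolveSibling(name.substring(0, name.length() - SUFFIX.length()));
            if (Files.exists(target)) {
                continue; // assumption: never overwrite an existing config file
            }
            Files.move(template, target);
            renamed.add(target);
        }
        return renamed;
    }

    public static void main(String[] args) throws IOException {
        Path conf = Paths.get(args.length > 0 ? args[0] : "conf");
        for (Path p : stripTemplateSuffix(conf)) {
            System.out.println("created " + p);
        }
    }
}
```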

Edit nutch-site.xml under conf and add the following settings (replace the values with your own):

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>your agent name</value>
      </property>
      <property>
        <name>http.agent.description</name>
        <value>EIP</value>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>your HTTP agent URL</value>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>your HTTP agent email</value>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.cassandra.store.CassandraStore</value>
        <description>Sets the Gora storage layer implementation class. Possible values:
          Relational database: org.apache.gora.sql.store.SqlStore
          Cassandra: org.apache.gora.cassandra.store.CassandraStore
          HBase: org.apache.gora.hbase.store.HBaseStore
          Accumulo: org.apache.gora.accumulo.store.AccumuloStore
          Avro: org.apache.gora.avro.store.AvroStore
          Avro data file: org.apache.gora.avro.store.DataFileAvroStore
          In memory: org.apache.gora.memory.store.MemStore
        </description>
      </property>
    </configuration>
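A malformed nutch-site.xml (for example, a name/value pair without its enclosing property element) is easy to miss by eye. The sketch below parses the file with the JDK's DOM parser and reports which of two keys are missing or empty. The class name SiteConfigCheck and the choice of "required" keys are assumptions for illustration; Nutch itself reads the file through its own configuration classes, not through this code.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads name/value pairs from nutch-site.xml and checks two keys. */
final class SiteConfigCheck {

    // Assumption: the two settings this setup cannot work without.
    static final String[] REQUIRED = { "http.agent.name", "storage.data.store.class" };

    /** Reads every <property> element that has a <name> child into an ordered map. */
    static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        NodeList props = doc.getElementsByTagName("property");
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < props.getLength(); i++) {
            Element property = (Element) props.item(i);
            String name = childText(property, "name");
            String value = childText(property, "value");
            if (name != null) {
                result.put(name, value == null ? "" : value);
            }
        }
        return result;
    }

    /** Returns the trimmed text of the first descendant with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Returns the required keys that are absent or have an empty value. */
    static List<String> missingKeys(Map<String, String> props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.get(key);
            if (value == null || value.isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            Map<String, String> props = readProperties(in);
            props.forEach((k, v) -> System.out.println(k + " = " + v));
            System.out.println("Missing or empty: " + missingKeys(props));
        }
    }
}
```

A parse exception from this check usually points at unbalanced tags; an unexpectedly short property list points at name/value pairs that sit outside a property element.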

Edit gora.properties in the conf directory (/nutch/conf/gora.properties).

Comment out the relational database settings:

# gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
# gora.sqlstore.jdbc.url=jdbc:hsqldb://localhost/test
# gora.sqlstore.jdbc.user=sa
# gora.sqlstore.jdbc.password=

Uncomment the line gora.cassandrastore.servers=localhost:9160 so that Cassandra is used as the storage layer.
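The gora.properties edits can be sanity-checked with java.util.Properties, which ignores lines that start with #. The following sketch is a hypothetical helper (the class name GoraPropertiesCheck and the rule it applies are my own): it reports whether the Cassandra server list is set and no gora.sqlstore.* setting is still active.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper: checks that gora.properties points at Cassandra only. */
final class GoraPropertiesCheck {

    /** Loads a properties file; lines starting with '#' are treated as comments. */
    static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    /** True when the Cassandra server list is set and no SQL store setting remains active. */
    static boolean usesCassandraOnly(Properties props) {
        boolean cassandraSet =
            !props.getProperty("gora.cassandrastore.servers", "").trim().isEmpty();
        boolean sqlActive = props.stringPropertyNames().stream()
            .anyMatch(key -> key.startsWith("gora.sqlstore."));
        return cassandraSet && !sqlActive;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "conf/gora.properties";
        try (Reader reader = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
            Properties props = load(reader);
            System.out.println("servers = " + props.getProperty("gora.cassandrastore.servers"));
            System.out.println("Cassandra only: " + usesCassandraOnly(props));
        }
    }
}
```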

5. Start Cassandra

Start Cassandra before running the crawler. For details on how to start it, refer to the Cassandra documentation.
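Before launching the crawler, it can help to confirm that something is listening on the address configured in gora.properties (localhost:9160 in this article). The sketch below only attempts a TCP connection; the class name CassandraPortCheck and the 2-second timeout are assumptions, and an open port does not by itself prove that Cassandra is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections. */
final class CassandraPortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unknown host
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port taken from the gora.cassandrastore.servers setting in this article.
        boolean up = isListening("localhost", 9160, 2000);
        System.out.println(up
            ? "localhost:9160 is accepting connections"
            : "Nothing is listening on localhost:9160");
    }
}
```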

6. Run the crawler

Create a urls folder under the project root, create a text file in it, and enter a few website URLs, one per line, such as http://www.163.com/.
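The seed folder can be created by hand or with a few lines of code. This is a minimal sketch under stated assumptions: the class name SeedFileWriter and the file name seed.txt are hypothetical (the article only asks for a text file inside the urls folder), and the example URL is the one used above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: writes seed URLs to urls/seed.txt under the project root. */
final class SeedFileWriter {

    /** Creates the urls folder if needed and writes one URL per line. */
    static Path writeSeeds(Path projectRoot, List<String> urls) throws IOException {
        Path dir = Files.createDirectories(projectRoot.resolve("urls"));
        Path seedFile = dir.resolve("seed.txt"); // file name is an assumption
        return Files.write(seedFile, urls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Path written = writeSeeds(root, Arrays.asList("http://www.163.com/"));
        System.out.println("Seed file written to " + written.toAbsolutePath());
    }
}
```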

Finally, run /nutch/src/java/org/apache/nutch/crawl/Crawler.java as a Java application with the program arguments: urls -depth 2. The crawler then runs, and the crawled web pages are saved to Cassandra.
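The Eclipse run configuration described above amounts to calling the crawler's main method with three arguments. The sketch below shows the same call made from code. It is an illustration, not the way Nutch is normally started: the class name CrawlLauncher is hypothetical, and reflection is used only so that the snippet compiles without Nutch on the classpath. Inside the imported project you could call the crawler's main method directly.

```java
import java.lang.reflect.Method;

/** Hypothetical launcher: starts the Nutch crawler with the arguments from this article. */
final class CrawlLauncher {

    // Class named in the run step above; it must be on the classpath for launch() to succeed.
    static final String CRAWLER_CLASS = "org.apache.nutch.crawl.Crawler";

    /** Builds the program arguments: <seedDir> -depth <depth>. */
    static String[] crawlArgs(String seedDir, int depth) {
        return new String[] { seedDir, "-depth", Integer.toString(depth) };
    }

    /** Invokes the static main method of the named class; returns false if the class is missing. */
    static boolean launch(String className, String[] args) throws Exception {
        Class<?> cls;
        try {
            cls = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return false;
        }
        Method main = cls.getMethod("main", String[].class);
        main.invoke(null, (Object) args); // cast so the array is passed as a single argument
        return true;
    }

    public static void main(String[] args) throws Exception {
        if (!launch(CRAWLER_CLASS, crawlArgs("urls", 2))) {
            System.err.println(CRAWLER_CLASS + " is not on the classpath");
        }
    }
}
```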

Indexing in Nutch 2.0 works the same way as in Nutch 1.3+: Solr is used as the search backend, so the usage is identical and I will not cover it here. I am used to Elasticsearch and now find Solr too cumbersome to configure just for debugging. You can change the source code directly and add a method that indexes Nutch documents into Elasticsearch.


References: http://www.searchtech.pro/articles/2013/02/18/1361191389790.html
