Apache started developing Nutch 2.0 quite early, and two lines have been developed in parallel ever since: the regular version and the Gora-based version, which is Nutch 2.0. This article shows how to import the Nutch 2.0 project into Eclipse. The storage layer used here is the NoSQL database Cassandra. I tried MySQL first, but the crawler would not start; debugging showed that Gora's SQL storage backend is not fully implemented, so Cassandra is the easier choice for testing.
Knowledge required: basic knowledge of Nutch and Cassandra, managing projects with Maven, and downloading and managing projects with git.
Tools required: Eclipse with the Maven plug-in installed (the plug-in is available from the Eclipse Marketplace).
1. Download and import the project
Download the Nutch 2.0 project from https://github.com/apache/nutch/tree/release-2.0 (on Windows, the ZIP button packages the project and downloads it).
Import the project in Eclipse (File > Import > Maven > Existing Maven Projects).
2. Add dependencies
After the import, src/java and src/test are already source folders. Add the following directories to the build path as source folders as well:
/conf
/src/plugin/protocol-httpclient/src/java
/src/plugin/urlfilter-domain/src/java
/src/plugin/lib-http/src/java
/src/plugin/protocol-http/src/java
/src/plugin/urlfilter-suffix/src/java
/src/plugin/urlfilter-regex/src/java
/src/plugin/lib-regex-filter/src/java
/src/plugin/urlnormalizer-basic/src/java
/src/plugin/urlnormalizer-pass/src/java
/src/plugin/urlnormalizer-regex/src/java
/src/plugin/scoring-opic/src/java
/src/plugin/parse-html/src/java
This puts the basic plug-ins on the classpath. parse-html also needs two additional JARs, nekohtml and tagsoup, which we will add to the POM in the next step.
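Before moving on, it can help to confirm that every directory in the list really exists in the checkout. The following is a minimal sketch, not part of Nutch: the class name SourceFolderCheck is hypothetical, and the lowercase directory names are an assumption about how the folders are spelled on disk.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: reports which of the listed source folders are missing. */
class SourceFolderCheck {

    // Folder names taken from the list above, relative to the project root (assumed lowercase).
    static final List<String> SOURCE_FOLDERS = Arrays.asList(
            "conf",
            "src/plugin/protocol-httpclient/src/java",
            "src/plugin/urlfilter-domain/src/java",
            "src/plugin/lib-http/src/java",
            "src/plugin/protocol-http/src/java",
            "src/plugin/urlfilter-suffix/src/java",
            "src/plugin/urlfilter-regex/src/java",
            "src/plugin/lib-regex-filter/src/java",
            "src/plugin/urlnormalizer-basic/src/java",
            "src/plugin/urlnormalizer-pass/src/java",
            "src/plugin/urlnormalizer-regex/src/java",
            "src/plugin/scoring-opic/src/java",
            "src/plugin/parse-html/src/java");

    /** Returns the folders from SOURCE_FOLDERS that are not directories under projectRoot. */
    static List<String> missing(Path projectRoot) {
        List<String> result = new ArrayList<>();
        for (String folder : SOURCE_FOLDERS) {
            if (!Files.isDirectory(projectRoot.resolve(folder))) {
                result.add(folder);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The project root defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String folder : missing(root)) {
            System.out.println("missing: " + folder);
        }
    }
}
```

Run it from the project root; no output means all folders were found.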
3. Add the additional JARs to the POM file
Add the following dependencies to pom.xml, and remove the original gora-core and gora-sql dependencies.
<dependency>
  <groupId>org.apache.gora</groupId>
  <artifactId>gora-core</artifactId>
  <version>0.2</version>
  <optional>true</optional>
</dependency>
<dependency>
  <groupId>org.apache.gora</groupId>
  <artifactId>gora-cassandra</artifactId>
  <version>0.2</version>
  <optional>true</optional>
</dependency>
<!-- html parser dependency -->
<dependency>
  <groupId>net.sourceforge.nekohtml</groupId>
  <artifactId>nekohtml</artifactId>
  <version>1.9.15</version>
</dependency>
<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId>
  <version>1.2</version>
</dependency>
If Maven cannot download the Gora packages, get them from http://gora.apache.org/releases.html#download and install them into your local Maven repository.
4. Modify the configuration files
Rename every file in the conf directory that ends with .template by removing that suffix, for example nutch-site.xml.template to nutch-site.xml. Otherwise some plug-ins will fail to find their configuration files at runtime.
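If there are many template files, the renaming can be scripted. The sketch below is only an illustration with a hypothetical class name (TemplateRenamer); note that it copies each template to the suffix-free name rather than renaming it, and skips targets that already exist, so nothing is overwritten.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: creates foo.xml next to every foo.xml.template in a directory. */
class TemplateRenamer {

    private static final String SUFFIX = ".template";

    /** Copies each *.template file to the same name without the suffix; returns the files created. */
    static List<Path> stripTemplateSuffix(Path confDir) throws IOException {
        List<Path> created = new ArrayList<>();
        try (DirectoryStream<Path> templates = Files.newDirectoryStream(confDir, "*" + SUFFIX)) {
            for (Path template : templates) {
                String name = template.getFileName().toString();
                Path target = confDir.resolve(name.substring(0, name.length() - SUFFIX.length()));
                // Leave existing configuration files untouched.
                if (Files.notExists(target)) {
                    Files.copy(template, target);
                    created.add(target);
                }
            }
        }
        return created;
    }

    public static void main(String[] args) throws IOException {
        // The conf directory defaults to ./conf.
        Path confDir = Paths.get(args.length > 0 ? args[0] : "conf");
        for (Path created : stripTemplateSuffix(confDir)) {
            System.out.println("created " + created);
        }
    }
}
```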
Edit conf/nutch-site.xml and add the following settings (replace the values with your own):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>your agent name</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>EIP</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>your HTTP agent URL</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>your HTTP agent email</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
    <description>Sets the Gora storage layer implementation class. Possible values:
      Relational database: org.apache.gora.sql.store.SqlStore
      Cassandra: org.apache.gora.cassandra.store.CassandraStore
      HBase: org.apache.gora.hbase.store.HBaseStore
      Accumulo: org.apache.gora.accumulo.store.AccumuloStore
      Avro: org.apache.gora.avro.store.AvroStore
      File format: org.apache.gora.avro.store.DataFileAvroStore
      In memory: org.apache.gora.memory.store.MemStore
    </description>
  </property>
</configuration>
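A missing or mismatched property tag in this file is easy to overlook, so it may be worth parsing the file once before starting the crawler. The following sketch uses only the JDK's built-in XML parser; the class name SiteConfigReader is hypothetical, and it is not how Nutch itself loads its configuration.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads the name/value pairs out of a Hadoop-style configuration file. */
class SiteConfigReader {

    /** Parses the XML (failing on malformed input) and returns property name to value, in file order. */
    static Map<String, String> readProperties(InputStream xml) throws Exception {
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xml)
                .getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = text(property, "name");
            if (name != null) {
                props.put(name, text(property, "value"));
            }
        }
        return props;
    }

    /** Returns the trimmed text of the first child element with the given tag, or null if absent. */
    private static String text(Element parent, String tag) {
        NodeList found = parent.getElementsByTagName(tag);
        return found.getLength() == 0 ? null : found.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            Map<String, String> props = readProperties(in);
            System.out.println("storage.data.store.class = " + props.get("storage.data.store.class"));
            System.out.println("http.agent.name = " + props.get("http.agent.name"));
        }
    }
}
```

If the file is not well-formed, the parser throws an exception that names the offending line.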
Edit conf/gora.properties.
Comment out the relational database settings:
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb://localhost/test
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=
Uncomment the line gora.cassandrastore.servers=localhost:9160 so that Cassandra is used as the storage layer.
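To double-check the edit, the properties file can be loaded with java.util.Properties, which ignores lines starting with #. This is a minimal sketch with a hypothetical class name (GoraPropertiesCheck); the rule it applies (a Cassandra server entry present, no active gora.sqlstore.* key) simply restates the two steps above and is not a check that Gora itself performs.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Hypothetical helper: verifies that gora.properties points at Cassandra only. */
class GoraPropertiesCheck {

    /** Loads the properties; commented-out lines do not appear in the result. */
    static Properties load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        return props;
    }

    /** True if a Cassandra server is configured and no gora.sqlstore.* key is still active. */
    static boolean usesCassandraOnly(Properties props) {
        if (props.getProperty("gora.cassandrastore.servers") == null) {
            return false;
        }
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("gora.sqlstore.")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "conf/gora.properties";
        try (Reader reader = Files.newBufferedReader(Paths.get(file), StandardCharsets.ISO_8859_1)) {
            Properties props = load(reader);
            System.out.println("servers: " + props.getProperty("gora.cassandrastore.servers"));
            System.out.println("Cassandra only: " + usesCassandraOnly(props));
        }
    }
}
```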
5. Start Cassandra
Start Cassandra before running the crawler. It must be reachable at the address configured in gora.properties (localhost:9160).
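A quick way to see whether Cassandra is up is to try a TCP connection to the configured port. The sketch below assumes the localhost:9160 address from gora.properties; the class name CassandraPortCheck is hypothetical, and a successful connection only shows that something is listening on the port, not that it is a healthy Cassandra node.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections. */
class CassandraPortCheck {

    /** Returns true if a connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unknown.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9160;
        System.out.println(host + ":" + port
                + (isListening(host, port, 2000) ? " is accepting connections" : " is not reachable"));
    }
}
```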
6. Run the crawler
Create a urls folder in the project root, create a text file in that folder, and enter a few seed URLs, one per line, such as http://www.163.com/.
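The seed file can also be created from code, which is handy when the list of sites comes from somewhere else. This is a small sketch with hypothetical names: the class SeedFileWriter and the file name seed.txt are assumptions, since any text file name inside the urls folder will do.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: writes seed URLs, one per line, to urls/seed.txt. */
class SeedFileWriter {

    /** Creates the urls folder under projectRoot if needed and (over)writes seed.txt. */
    static Path writeSeeds(Path projectRoot, List<String> urls) throws IOException {
        Path dir = Files.createDirectories(projectRoot.resolve("urls"));
        return Files.write(dir.resolve("seed.txt"), urls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path seedFile = writeSeeds(Paths.get("."), Arrays.asList("http://www.163.com/"));
        System.out.println("wrote " + seedFile);
    }
}
```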
Finally, run /nutch/src/java/org/apache/nutch/crawl/Crawler.java as a Java application with the program arguments: urls -depth 2. The crawler runs, and the crawled pages are saved to Cassandra.
Indexing in Nutch 2.0 works the same way as in Nutch 1.3+: Solr is used as the search backend, so the usage is identical and I will not cover it here. I am used to Elasticsearch and now find Solr too much trouble to configure and debug. If you feel the same, you can change the source code directly and add a method that indexes Nutch documents into Elasticsearch.
References: http://www.searchtech.pro/articles/2013/02/18/1361191389790.html