[Nutch] Building a Windows Development Environment with Nutch + Eclipse + Tomcat + Solr + Cygwin

Source: Internet
Author: User
Tags solr xsl

1. Environment Preparation

1.1 Software

Operating system: Windows 10 Pro
Ant version: apache-ant-1.9.7-bin.zip
JDK version: jdk-8u65-windows-x64.exe
Solr version: solr-4.9.1.zip
Nutch version: apache-nutch-1.6-bin.tar.gz
Tomcat version: apache-tomcat-9.0.0.M8-windows-x64.zip
Eclipse version: eclipse-jee-mars-1-win32-x86_64.zip
The following Eclipse plugins are installed:

IvyDE plugin files:
Ivy:
Plugins:
org.apache.ivy.eclipse.ant_2.4.0.final_20141213170938.jar
org.apache.ivy_2.4.0.final_20141213170938.jar
Features:
org.apache.ivy.feature_2.4.0.final_20141213170938.jar

IvyDE:
Plugins: org.apache.ivyde.eclipse_2.2.0.final-201311091524-RELEASE.jar
Features: org.apache.ivyde.feature_2.2.0.final-201311091524-RELEASE.jar

1.2 JDK Installation and Configuration

Double-click "jdk-8u65-windows-x64.exe" and click "Next" through every step of the installer. By default the JDK is installed on the C drive; the installation directory is the JAVA_HOME value shown in step 1.2.1.

Configure the Java environment variables as follows. To open the settings, right-click the computer icon and choose Properties -> Advanced system settings -> Environment Variables.

1.2.1 First step: Click "New", enter the variable name "JAVA_HOME", and fill in the following value.
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_65

Note: Do not add a semicolon after the value of the JAVA_HOME variable.

1.2.2 Second step: Click "New", enter the variable name "CLASSPATH", and fill in the following value.
CLASSPATH=.;%JAVA_HOME%\lib;%JAVA_HOME%\jre\lib

Note: The leading dot (.) is required; it represents the current directory.

1.2.3 Third step: Click "New", enter the variable name "NUTCH_JAVA_HOME", and fill in the following value.
NUTCH_JAVA_HOME=%JAVA_HOME%

1.2.4 Fourth step: Find "Path" in the system variables, click "Edit", and append the following content.
PATH=……;%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin

Note: When appending, use ";" to separate the new entries from the previous value.
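The variables from steps 1.2.1 to 1.2.4 can be sanity-checked with a small script. The following is only a sketch in Python, which is not part of the original setup. The helper name missing_java_settings is hypothetical, and the function only inspects a dictionary of variables (by default the environment of the current process).

```python
import os

# Variable names taken from steps 1.2.1 to 1.2.3.
REQUIRED_VARS = ("JAVA_HOME", "CLASSPATH", "NUTCH_JAVA_HOME")


def missing_java_settings(env=None):
    """Return a list of problems found in `env` (defaults to os.environ)."""
    env = dict(os.environ) if env is None else env
    problems = [name for name in REQUIRED_VARS if not env.get(name)]

    java_home = env.get("JAVA_HOME", "")
    if java_home.endswith(";"):
        problems.append("JAVA_HOME ends with a semicolon")

    # CLASSPATH should contain "." so that the current directory is searched.
    if env.get("CLASSPATH") and "." not in env["CLASSPATH"].split(";"):
        problems.append("CLASSPATH lacks the '.' entry")

    # PATH should contain the JDK bin directory (compared case-insensitively).
    path_entries = [p.strip().lower() for p in env.get("PATH", "").split(";")]
    if java_home and (java_home.rstrip(";\\") + "\\bin").lower() not in path_entries:
        problems.append("PATH lacks JAVA_HOME\\bin")
    return problems
```

Calling missing_java_settings() in a newly opened command prompt returns an empty list when all of the checks pass; each returned string names one setting to revisit.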

1.3 Ant Installation and Configuration

Unzip "apache-ant-1.9.7-bin.zip" into "C:\NutchWorkPlat" and rename the extracted folder to "ant".

Configure the Ant environment variables as follows. To open the settings, right-click the computer icon and choose Properties -> Advanced system settings -> Environment Variables.

1.3.1 First step: Click "New", enter the variable name "ANT_HOME", and fill in the following value.
ANT_HOME=C:\NutchWorkPlat\ant

Note: Do not add a semicolon after the value of the ANT_HOME variable.

1.3.2 Second step: Find "Path" in the system variables, click "Edit", and append the following content.
PATH=……;%ANT_HOME%\bin;%ANT_HOME%\lib

Note: When appending, use ";" to separate the new entries from the previous value.
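The rule for appending to Path (separate entries with ";" and keep the previous value) can be illustrated with a few lines of Python. This is a sketch for illustration only; the function name append_to_path is hypothetical, and it works on plain strings rather than changing any system setting.

```python
def append_to_path(path_value, new_entries):
    """Append entries to a Windows PATH string, separated by ';', skipping duplicates."""
    entries = [e for e in path_value.split(";") if e]
    seen = {e.lower() for e in entries}  # Windows paths compare case-insensitively
    for entry in new_entries:
        if entry.lower() not in seen:
            entries.append(entry)
            seen.add(entry.lower())
    return ";".join(entries)
```

For example, append_to_path(r"C:\Windows", [r"%ANT_HOME%\bin", r"%ANT_HOME%\lib"]) keeps the existing value and adds the two Ant entries after it.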

1.4 IvyDE Installation and Configuration

1.4.1 IvyDE plugins
Copy "org.apache.ivyde.eclipse_2.2.0.final-201311091524-RELEASE.jar" into the "plugins" folder of the Eclipse installation directory.
1.4.2 IvyDE features
Extract "org.apache.ivyde.feature_2.2.0.final-201311091524-RELEASE.jar" into the "features" folder of the Eclipse installation directory. Note: the jar must be extracted and the extracted contents placed in "features"; do not put the jar file itself there. Otherwise, after Eclipse starts, Window -> Show View -> Error Log reports "Unable to find feature.xml in directory".
1.4.3 Ivy plugins

Copy "org.apache.ivy.eclipse.ant_2.4.0.final_20141213170938.jar" and "org.apache.ivy_2.4.0.final_20141213170938.jar" into the "plugins" folder of the Eclipse installation directory.

1.4.4 Ivy features

Extract "org.apache.ivy.feature_2.4.0.final_20141213170938.jar" into the "features" folder of the Eclipse installation directory.

After completing the steps above, restart Eclipse. Open Window -> Preferences and you should see an "Ivy" entry. Under Help -> About Eclipse -> Installation Details, the plug-in list should also show two Ivy entries and one IvyDE entry.

1.5 Tomcat Installation and Configuration

1.5.1 First, install Tomcat

Extract "apache-tomcat-9.0.0.M8-windows-x64.zip" into the "C:\NutchWorkPlat" directory and rename the extracted folder to "tomcat".

Go to "C:\NutchWorkPlat\tomcat\bin" and double-click "startup.bat" to start Tomcat.

Then open "http://localhost:8080/" in a browser. If the Tomcat home page appears, the installation was successful.

1.5.2 Then install the Tomcat plugin for Eclipse so that Tomcat can be controlled from Eclipse

In Eclipse, open Help -> Install New Software, click "Add", and fill in the following:

Name: Tomcat
Location: http://tomcatplugin.sf.net/update

Click "OK", select the Tomcat plugin in the list, and click "Next" to install it.

Restart Eclipse after the installation is complete.

Open Window -> Preferences, where a "Tomcat" entry now appears. Click it and associate the plugin with the extracted Tomcat directory.

Click "Start Tomcat" on the toolbar to start Tomcat.

The Eclipse console then prints the Tomcat startup log.

At this point, you can open "http://localhost:8080/" in the browser again to verify that Tomcat started successfully.
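The browser check can also be scripted. The sketch below is written in Python, which is not part of the original toolchain. It assumes Tomcat listens on the default port 8080 as in this article; the function name tomcat_is_up and the opener parameter (there only so the check can be exercised without a running server) are hypothetical.

```python
import urllib.error
import urllib.request


def tomcat_is_up(url="http://localhost:8080/", timeout=5, opener=urllib.request.urlopen):
    """Return True if an HTTP GET on `url` answers with status 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Running tomcat_is_up() after clicking "Start Tomcat" returns True once the server answers; a False result usually means Tomcat is still starting or is not running.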

1.6 Cygwin Installation and Configuration

Nutch is built on Hadoop, and Hadoop is designed for Linux and calls many Linux programs, so a Windows deployment needs Cygwin to emulate a Linux environment.

Run the Cygwin installer and click "Next". The installer offers three installation modes:

Install from Internet: installs directly from the Internet; suitable when the network connection is fast.
Download Without Installing: only downloads the Cygwin packages from the Internet, without installing them.
Install from Local Directory: the counterpart of the second mode; once the Cygwin packages have been downloaded, use this mode to install Cygwin from the local copy.

Here we choose the first mode, "Install from Internet", keep the default values, and click "Next" until the "Select Packages" dialog box appears.

In the "Select Packages" dialog box you could click "Next" right away for a default installation, but because we will later build a Hadoop environment under Cygwin, we also select the following packages:
-OpenSSL
-sed
-Vim

Make sure that "OpenSSL" under the "Net" category is selected.

If you also want to compile Hadoop in Eclipse, you must also install "sed" under the "Base" category.

Installing "Vim" under the "Editors" category is also recommended, so that configuration files can be edited directly in Cygwin.

Installing "subversion" under the "Devel" category is recommended as well.

When you are done, click "Next" in the "Select Packages" dialog box to start downloading and installing the Cygwin packages.

After installing Cygwin, we also need to set its environment variables.

1.6.1 First step: Click "New", enter the variable name "CYGWIN_HOME", and fill in the following value.
CYGWIN_HOME=C:\cygwin64

1.6.2 Second step: Find "Path" in the system variables, click "Edit", and append the following content.
PATH=……;%CYGWIN_HOME%\bin

2. Development in Eclipse

2.1 Solr Deployment

2.1.1 First step: Extract "solr-4.9.1.zip" into the "C:\NutchWorkPlat" directory and rename the extracted folder to "solr".

2.1.2 Second step: Rename "apache-solr-4.9.1.war" under "C:\NutchWorkPlat\solr\dist" to "solr.war" and put it in the "C:\NutchWorkPlat\tomcat\webapps" directory.

2.1.3 Third step: Modify the Tomcat configuration file "C:\NutchWorkPlat\tomcat\conf\server.xml" to add Chinese encoding support.

2.1.4 Fourth step: Copy the "solr" folder under "C:\NutchWorkPlat\solr\example", together with its contents, to the "C:\NutchWorkPlat\tomcat" directory.

2.1.5 Fifth step: Create a "solr.xml" file under "C:\NutchWorkPlat\tomcat\conf\Catalina\localhost" with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<Context docBase="C:\NutchWorkPlat\tomcat\webapps\solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String" value="C:\NutchWorkPlat\tomcat\solr" override="true"/>
</Context>
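A typo in this descriptor keeps the Solr web application from finding its home directory, so it can help to parse the file before restarting Tomcat. The following Python sketch (Python is not part of the original setup) parses a descriptor string and returns the docBase and solr/home values. The function name read_solr_home is hypothetical, and the sample string uses the paths from this article with crossContext="true" as an assumed value.

```python
import xml.etree.ElementTree as ET

# Sample descriptor mirroring step 2.1.5; crossContext="true" is an assumed value.
SAMPLE_CONTEXT = r"""<Context docBase="C:\NutchWorkPlat\tomcat\webapps\solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String"
                 value="C:\NutchWorkPlat\tomcat\solr" override="true"/>
</Context>"""


def read_solr_home(context_xml):
    """Return (docBase, solr/home value) from a Tomcat context descriptor string."""
    root = ET.fromstring(context_xml)  # raises ParseError if the XML is malformed
    solr_home = None
    for env in root.findall("Environment"):
        if env.get("name") == "solr/home":
            solr_home = env.get("value")
    return root.get("docBase"), solr_home
```

If the second element of the returned tuple is None, the Environment entry is missing or misnamed.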
2.1.6 Sixth step: Open "C:\NutchWorkPlat\tomcat\solr\conf\solrconfig.xml" and find the following line.
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" enable="${solr.velocity.enabled:true}"/>

Change "true" in enable="${solr.velocity.enabled:true}" to "false".
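In this line, ${solr.velocity.enabled:true} is a property substitution whose part after the colon is the default value, so the edit changes the default from true to false. As a sketch only, the same change expressed in Python (the function name disable_velocity is hypothetical, and it works on the file content as a string):

```python
def disable_velocity(solrconfig_text):
    """Switch the default of solr.velocity.enabled from true to false."""
    old = "${solr.velocity.enabled:true}"
    new = "${solr.velocity.enabled:false}"
    return solrconfig_text.replace(old, new)
```

Text that does not contain the placeholder is returned unchanged, so applying the function twice is harmless.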

2.1.7 Seventh step: Restart Tomcat and open "http://localhost:8080/solr/" in a browser; the Solr page should appear.

2.2 Nutch Import

2.2.1 First step: Extract "apache-nutch-1.6-bin.tar.gz" into the "C:\NutchWorkPlat" directory and rename the extracted folder to "nutch".

2.2.2 Second step: Create a new Java Project in Eclipse named "Nutch1.6", clear the "Use default location" checkbox, and select "C:\NutchWorkPlat\nutch" as the location. Leave the other settings at their defaults and click "Next".

2.2.3 Third step: Select Libraries -> Add Class Folder..., and select "conf" from the list to add it to the classpath.

2.2.4 Fourth step: Before moving on, switch to "Order and Export", select "conf", and click "Top" to move it to the top of the list. This step is critical. Then click "Finish".

2.2.5 Fifth step: Under the "Nutch1.6" project root directory, create a "urls" folder (a sibling of "src" and "conf"), and in it create a file named "urls.txt" with the following content:

http://www.cnbeta.com

2.2.6 Sixth step: In the "conf" folder under the "Nutch1.6" project root, edit "nutch-site.xml" so that its content is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>
    <property>
        <name>plugin.folders</name>
        <value>./src/plugin</value>
    </property>
</configuration>

Note: Both "http.agent.name" and "plugin.folders" must be set; otherwise the crawl fails with "Job Failure".
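Because a missing property only shows up later as a failed job, it can be useful to check the file before the first run. The following Python sketch (not part of the original setup; the function name missing_properties is hypothetical) parses the configuration text and lists required properties that are absent or empty.

```python
import xml.etree.ElementTree as ET

# The two properties that the note above says must be set.
REQUIRED = ("http.agent.name", "plugin.folders")


def missing_properties(nutch_site_xml, required=REQUIRED):
    """Return the required property names that are absent or have an empty value."""
    root = ET.fromstring(nutch_site_xml)
    found = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        found[name] = value
    return [name for name in required if not found.get(name)]
```

An empty list means both properties are present with non-empty values. Property names are compared case-sensitively, so "Http.agent.name" would be reported as missing.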

2.2.7 Seventh step: In the "conf" folder under the "Nutch1.6" project root, edit "regex-urlfilter.txt": under the line "# accept anything else", enter "+^http://(. )" and save the file.
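Each rule in "regex-urlfilter.txt" is a regular expression prefixed with "+" (accept) or "-" (reject). The first rule that matches a URL decides, and a URL that matches no rule is ignored. The Python sketch below only models that behaviour for illustration. It is not Nutch's implementation, the two rules in it are hypothetical examples rather than the rule from this step, and Python's regular expression syntax differs from Java's in some details.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical rules for illustration only.
RULES = parse_rules(r"""
# skip common image suffixes
-\.(gif|jpg|png)$
# accept anything else served over http
+^http://
""")
```

With these rules, accepts(RULES, "http://www.cnbeta.com") is accepted by the second rule, while a URL ending in ".png" is rejected by the first rule before the second is ever tried. This is why the order of lines in the file matters.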

2.2.8 Eighth step: With the configuration above, the crawler is ready to run. Right-click the "Nutch1.6" project and select Run As -> Run Configurations, find "Java Application", right-click it and select "New", choose "org.apache.nutch.crawl.Crawl" as the main class, and name the configuration "Crawl".

2.2.9 Ninth step: Fill in the "Arguments" tab as follows, then click "Apply" and "Run".
Program arguments: -dir data -depth 3 -threads 5 -topN 100
VM arguments: -Dhadoop.log.dir= -Dhadoop.log.file=hadoop.log
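To make the meaning of each flag explicit, the program arguments can be assembled from named values. This is a sketch in Python, which the original setup does not use, and the function name build_crawl_args is hypothetical. It also places the seed folder as the first positional argument, which is an assumption based on the Crawl tool's usage string (Crawl <urlDir> ...); the value "urls" is the folder created in step 2.2.5.

```python
def build_crawl_args(seed_dir, crawl_dir, depth, threads, top_n):
    """Return the program arguments for org.apache.nutch.crawl.Crawl as a list."""
    return [
        seed_dir,                  # folder holding the seed list (assumed positional argument)
        "-dir", crawl_dir,         # where crawldb, linkdb and segments are written
        "-depth", str(depth),      # number of generate/fetch/update rounds
        "-threads", str(threads),  # number of fetcher threads
        "-topN", str(top_n),       # maximum number of URLs selected per round
    ]
```

Joining the result of build_crawl_args("urls", "data", 3, 5, 100) with spaces gives a string that can be pasted into the "Program arguments" box.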

2.3 Combining Solr with Nutch

After the steps above, the specified pages have been crawled to the local disk. Next, we index the downloaded pages.

2.3.1 First step: Copy "schema.xml" from "C:\NutchWorkPlat\nutch\conf" to "C:\NutchWorkPlat\tomcat\solr\conf" under the Tomcat installation directory, overwriting the original file. "schema.xml" defines the indexed fields; if you change stored="false" to stored="true" on the "content" field, search results will include the actual content that contains the keyword.

2.3.2 Second step: Click "Start Tomcat" on the Eclipse toolbar to start Tomcat.
Note: If Tomcat is already running, restart it after completing the first step so that the change takes effect. If Tomcat is not running, indexing will fail.

2.3.3 Third step: With the configuration above, the index can be built. Right-click the "Nutch1.6" project and select Run As -> Run Configurations, find "Java Application", right-click it and select "New", choose "org.apache.nutch.indexer.solr.SolrIndexer" as the main class, and name the configuration "SolrIndexer".

2.3.4 Fourth step: Fill in the "Arguments" tab as follows, then click "Apply" and "Run".
Program arguments: http://localhost:8080/solr/ data/crawldb -linkdb data/linkdb data/segments/*
VM arguments: -Dhadoop.log.dir= -Dhadoop.log.file=hadoop.log

The Eclipse console prints the following output:

SolrIndexer: starting at 2016-06-18 14:45:41

Adding 352 documents

SolrIndexer: finished at 2016-06-18 14:45:56, elapsed: 00:00:14

2.3.5 Fifth step: Open "http://localhost:8080/solr/admin/" in a browser, enter a keyword in the query field, and click "Search" to run the query.

The query results are displayed as XML.
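The same keyword search can be expressed as a URL against Solr's select handler. The following Python sketch (not part of the original setup; the function name solr_select_url is hypothetical) only builds the URL. The base address is an assumption taken from this article's Tomcat deployment; depending on how the Solr home is laid out, a core name may need to be added to the path.

```python
from urllib.parse import urlencode


def solr_select_url(keyword, base_url="http://localhost:8080/solr", rows=10):
    """Build a URL for Solr's /select handler that searches for `keyword`."""
    params = {"q": keyword, "wt": "xml", "rows": rows}
    return base_url.rstrip("/") + "/select?" + urlencode(params)
```

Opening the returned URL in a browser should show the same kind of XML response as the admin page's search form. The q parameter carries the keyword, wt selects the XML response format, and rows limits the number of results.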

At this point, the preparation for secondary development on Nutch is complete, and we have run a simple crawl. Next, we will analyze the Nutch source code together with its working principles to understand Nutch further.
