Hotel Review Sentiment Analysis System (II): Nutch Installation
First, the requirements
1. Nutch is written in Java, so the Java JDK must be downloaded.
Download: http://java.sun.com/javase/downloads/index.jsp
2. Nutch's demo search page is a JSP application, so Tomcat is needed as the server.
Download: http://jakarta.apache.org/tomcat/
3. Nutch's scripts are Linux shell scripts, so a shell interpreter is required on Windows. Cygwin is a program that provides a Linux-like environment on Windows. (It is not needed when running under Linux.)
Download: http://www.cygwin.com/
4. Nutch: http://lucene.apache.org/nutch/
Second, the environment
- Operating system: Windows 7, x86, 32-bit
- Java JDK 1.6
- Tomcat 7.0
- Cygwin 2.850
- Nutch 1.7
Third, installation steps
1. Java JDK Installation
Note: The installation path must not contain Chinese characters, and a path without spaces is recommended. The first time, I chose a path with a space, C:\Program Files, and the crawl command failed with this error:
The C:\Program directory is not found. The cause is the space in the middle of C:\Program Files\: the path is cut off at the space, so the script looks for C:\Program instead of C:\Program Files, and there is no Program folder on the C drive.
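The failure described above is ordinary shell word splitting: a path expanded without quotes is cut at the space, so only the part before it is used. A minimal sketch in POSIX shell (as run under Cygwin) that reproduces the effect and checks a candidate path before installing to it. The function names first_word and path_is_safe are illustrative and are not part of Nutch or the JDK.

```shell
# Show what a script sees when it expands a path WITHOUT quotes.
# The body runs in a subshell so "set -f" (no globbing) does not leak out.
first_word() (
  set -f
  set -- $1            # deliberately unquoted: the shell splits at spaces
  printf '%s\n' "$1"
)

# Succeed only if the path has no spaces and no non-ASCII characters
# (for example Chinese characters).
path_is_safe() {
  case "$1" in
    *" "*) return 1 ;;
  esac
  # Delete every printable ASCII character; anything left over is a
  # non-ASCII (or control) byte.
  rest=$(printf '%s' "$1" | LC_ALL=C tr -d '[:print:]')
  [ -z "$rest" ]
}

first_word 'C:\Program Files\Java'   # prints C:\Program
```

A path such as c:\jdk1.6 passes path_is_safe, while C:\Program Files does not.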
After installation, set the environment variables. On Windows 7 this is done the same way as on XP, and either system variables or user variables will work. Assuming your JDK is installed in c:\jdk1.6, configure them as follows:
JAVA_HOME=c:\jdk1.6
CLASSPATH=.;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar; (the leading "." must not be omitted, because it represents the current path)
Path=%JAVA_HOME%\bin (add this to the existing Path value)
After the variables are set, type "cmd" in the Run dialog to open the command line, then enter "java" and "java -version" separately. If the version information is displayed without errors, the installation succeeded.
If the version information is not printed, carefully check your configuration.
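When checking the configuration, it helps to know that java -version writes its report to standard error, with the version in double quotes on the first line. A small sketch in POSIX shell for pulling that string out and reducing it to a single feature number. The function names are illustrative, and 1.6.0_45 is only a sample value.

```shell
# Read "java -version" output on stdin and print the quoted version string
# from the first line, e.g. 1.6.0_45 for: java version "1.6.0_45"
extract_java_version() {
  sed -n '1s/.*"\(.*\)".*/\1/p'
}

# Reduce a version string to its feature number:
# 1.6.0_45 -> 6 (old 1.x scheme), 11.0.2 -> 11 (scheme used from Java 9 on)
java_feature_version() {
  case "$1" in
    1.*) v=${1#1.}; printf '%s\n' "${v%%.*}" ;;
    *)   printf '%s\n' "${1%%[!0-9]*}" ;;
  esac
}

# Typical use; java -version prints to stderr, hence the 2>&1:
#   java_feature_version "$(java -version 2>&1 | extract_java_version)"
```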
2. Tomcat Installation (no-install zip version)
One thing to note here:
You need to download a Tomcat version that matches your JDK. For example:
My JDK version is 1.6. I first installed Tomcat 8.0 and configured the paths, but when I clicked startup.bat the window flashed and closed immediately.
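That behaviour is consistent with Tomcat's documented requirements: Tomcat 8.0 needs Java 7 or later, while Tomcat 7.0 runs on Java 6. A minimal lookup sketch in POSIX shell. The table covers only a few Tomcat release lines, the function names are illustrative, and the Java argument is the feature number (6 for JDK 1.6).

```shell
# Print the minimum Java feature version for a Tomcat release line.
# Returns non-zero for release lines this small table does not cover.
tomcat_min_java() {
  case "$1" in
    6.0)     echo 5 ;;
    7.0)     echo 6 ;;
    8.0|8.5) echo 7 ;;
    9.0)     echo 8 ;;
    *)       return 1 ;;
  esac
}

# Succeed if the given Tomcat line can run on the given Java feature version.
# $1: Tomcat line such as 7.0   $2: Java feature version such as 6
tomcat_runs_on() {
  min=$(tomcat_min_java "$1") || return 2
  [ "$2" -ge "$min" ]
}

tomcat_runs_on 7.0 6 && echo "Tomcat 7.0 can run on JDK 1.6"
```

With these values, tomcat_runs_on 8.0 6 fails, which matches the pairing that did not start here.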
Extract Tomcat to a directory whose path contains no Chinese characters, then set the environment variables:
(1) Variable name: TOMCAT_HOME  Variable value: H:\tomcat7.0 (the directory Tomcat was extracted to)
(2) Variable name: CATALINA_HOME  Variable value: H:\tomcat7.0
(3) Modify variable: Path  Variable value: add ;%CATALINA_HOME%\bin;%CATALINA_HOME%\lib at the end
To run Tomcat 7.0, choose Start > Run, enter cmd, and change to the bin directory under the Tomcat directory.
Enter startup.bat at the command prompt. A Tomcat console window pops up and prints the startup log.
Then open http://localhost:8080/ in a browser. If the Tomcat welcome page appears, the configuration is successful.
Tomcat is started and stopped with startup.bat and shutdown.bat, respectively.
3. Cygwin Installation
Run the setup program. When asked to choose a download site, any of the listed addresses will do.
In the next step we choose the packages to download and install. For the installed Cygwin to be able to compile programs, the GCC compiler is needed. GCC is not installed by default, so it must be selected. Click the "Devel" branch in the package list. Of the many packages there, the following must be selected:
binutils, gcc, gcc-mingw, gdb
When you are finished, click Next.
The installation time depends on the packages you selected and on network conditions.
4. Nutch Installation
Nutch is a web crawler implemented in Java. The crawl results are stored in its database (a set of files and directories under a specified path), which Solr or Lucene can then index and search.
Basic features of common search-related frameworks:
| | Crawl | Index | Retrieval |
|---|---|---|---|
| Nutch | √ | | |
| Solr | | √ | √ |
| Lucene | | √ | √ |
Download apache-nutch-1.7-bin.zip from http://archive.apache.org/dist/nutch/
After the download is complete, unzip the Nutch binary package (I unzipped it to H:\nutch\nutch1.7). The directory layout is as follows:
- bin directory: contains only one executable file, nutch
- conf directory: configuration parameters for nutch command execution
- docs directory: Javadoc help
- lib directory: related JAR libraries
- plugins directory: related plugin libraries
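The layout listed above can be checked from the Cygwin shell before going further. A minimal sketch in POSIX shell, assuming the five entries in the list are what a complete unzip should contain. check_nutch_layout is an illustrative name, not a Nutch command.

```shell
# Verify that an unzipped Nutch directory has the expected entries.
# $1: path to the Nutch directory, in the form the shell understands.
check_nutch_layout() {
  root=$1
  # bin must contain the nutch launcher script
  [ -f "$root/bin/nutch" ] || { echo "missing file: bin/nutch"; return 1; }
  # the remaining entries are directories
  for d in conf docs lib plugins; do
    [ -d "$root/$d" ] || { echo "missing directory: $d"; return 1; }
  done
  echo "layout looks complete"
}
```

Under Cygwin with its default /cygdrive prefix, the directory used in this article would be passed as check_nutch_layout /cygdrive/h/nutch/nutch1.7.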
Set the environment variable:
Variable name: NUTCH_JAVA_HOME
Variable value: %JAVA_HOME% (that is, the JDK installation directory)
Run Cygwin, change to the directory where nutch1.7 was unzipped, and enter bin/nutch. If the list of nutch commands is printed, the Nutch installation was successful.