Distributed cluster configuration of Nutch1.8 + Hadoop1.2 + Solr4.3

Nutch is an open-source search engine implemented in Java. It provides all the tools we need to run our own search engine, including full-text search and a web crawler. Strictly speaking, that description (the one found in Baidu Baike) no longer fits Nutch after version 1.2: from 1.2 onward the project focuses on crawling data, and the full-text retrieval part is handed over entirely to Lucene, Solr, and Elasticsearch. Because these projects are close relatives, a full-text index can be generated easily from the data Nutch crawls.

Now to the topic. The latest version of Nutch is 2.2.1; the 2.x series uses Gora and supports multiple storage back ends, while the 1.x series, whose latest release is 1.8, stores data only on HDFS. Here we use Nutch 1.8. Why choose the 1.x series? It comes down to your Hadoop environment: Nutch 2.x is built against Hadoop 2.x. If you don't mind the extra trouble, you can swap jars and adjust the configuration to make it run on a Hadoop 1.x cluster, but Nutch 1.x runs on Hadoop 1.x without any of that. The cluster used in this test is configured as follows:

Serial number / Name / Responsibilities
1  Nutch 1.8      Responsible for crawling data; supports distributed operation
2  Hadoop 1.2.0   MapReduce for parallel crawling, HDFS for data storage; Nutch jobs are submitted to the Hadoop cluster; supports distributed operation
3  Solr 4.3.1     Responsible for indexing and querying the crawled data; can be distributed for massive data volumes
4  IK 4.3         Responsible for word segmentation of page content and titles to support full-text search
5  CentOS 6.5     The Linux system that runs Nutch, Hadoop, and the other applications
6  Tomcat 7.0     Application server; provides the servlet container that runs Solr
7  JDK 1.7        Provides the Java runtime environment
8  Ant 1.9        Compiles source code such as Nutch
9  One software engineer   The main character



Let's get started.
1. First, make sure your Ant environment is set up and working; it is best to do this on Linux, since building on Windows is much more likely to run into problems. Download the Nutch source code, go to the Nutch root directory, run ant, and wait for the build to finish. After compilation there is a runtime directory containing the Nutch launch scripts: local for local mode and deploy for distributed cluster mode.
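As a rough sketch, the build steps look like this (the archive name and paths are only examples):

tar -zxvf apache-nutch-1.8-src.tar.gz    # unpack the downloaded Nutch 1.8 source
cd apache-nutch-1.8
ant                                      # compile; this produces the runtime directory
ls runtime                               # local (local mode) and deploy (distributed mode)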





2. Configure nutch-site.xml and add the following content:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mynutch</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>Mynutch,*</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should
    put the value of http.agent.name as the first agent name, and keep the
    default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>./src/plugin,plugins</value>
    <description>Directories where nutch plugins are located. Each
    element may be a relative or absolute path. If absolute, it is used
    as is. If relative, it is searched for on the classpath.</description>
  </property>
</configuration>

3. Create the urls folder and the mydir folder on the Hadoop cluster. The former holds the seed (start URL) file, and the latter stores the crawled data.

hadoop fs -mkdir urls                      # create a folder
hadoop fs -put <local path> <HDFS path>    # upload the seed file to HDFS
hadoop fs -ls /                            # view the contents of a path
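For example, a minimal seed file and the corresponding HDFS commands could look like this (the file name seed.txt and the URL are illustrative assumptions):

echo "http://www.example.com/" > seed.txt    # one start URL per line
hadoop fs -mkdir urls
hadoop fs -mkdir mydir
hadoop fs -put seed.txt urls/
hadoop fs -ls urls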


4. It is important to configure the Hadoop cluster and its HADOOP_HOME environment variable correctly, because while Nutch runs it submits its jobs based on the Hadoop environment variables.

export HADOOP_HOME=/root/hadoop1.2
export PATH=$HADOOP_HOME/bin:$PATH
ANT_HOME=/root/apache-ant-1.9.2
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL
export JAVA_HOME=/root/jdk1.7
export PATH=$JAVA_HOME/bin:$ANT_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
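These exports typically live in /etc/profile or ~/.bashrc (where you keep them is up to you); reload the file so they take effect in the current shell:

source /etc/profile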

After the configuration is complete, you can use the which hadoop command to check whether the configuration is correct:
[root@master bin]# which hadoop
/root/hadoop1.2/bin/hadoop
[root@master bin]#

5. To configure the Solr service, copy the schema.xml file from Nutch's conf directory into Solr, overwriting Solr's original schema.xml, and add the IK analyzer for word segmentation. The content is as follows:

[schema.xml listing: Nutch's field definitions with IK-analyzed text field types; id is the unique key and content is the default search field.]
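As an illustration only, an IK-based field type is commonly declared in schema.xml along the following lines (the field type name text_ik is an assumption, and the analyzer class should be verified against the IK Analyzer build you deploy):

<fieldType name="text_ik" class="solr.TextField">
  <!-- the IK analyzer performs Chinese word segmentation at index and query time -->
  <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
<!-- apply it to the fields that should be segmented, e.g. the page content -->
<field name="content" type="text_ik" indexed="true" stored="true"/>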
      
  
 

6. After configuration is complete, go to the /root/apache-nutch-1.8/runtime/deploy/bin directory and run the following command to start the cluster crawl task:

./crawl urls mydir http://192.168.211.36:9001/solr/ 2
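Here the arguments are interpreted as follows: the Nutch 1.8 crawl script expects <seedDir> <crawlDir> <solrURL> <numberOfRounds>, so urls is the seed directory on HDFS, mydir is the crawl directory, and the trailing 2 is the number of crawl rounds (check bin/crawl's usage message for your build).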
The MapReduce jobs running during the crawl are shown below:


After the crawl finishes, we can open Solr and view the captured content, as shown below:
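The indexed data can also be queried over HTTP; a minimal sketch, assuming the default collection1 core and an illustrative search term:

curl "http://192.168.211.36:9001/solr/collection1/select?q=content:hadoop&wt=json&indent=true"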


So far, a simple crawling and search system has been completed. It is very easy to use, since all of these components belong to the family of open-source projects around Lucene.

Summary: several typical errors encountered during the configuration process are recorded as follows:
java.lang.Exception: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 11 more
Caused by: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        ... 16 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 19 more
Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:123)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:74)
        ... 24 more
2013-09-05 20:40:49,329 INFO  mapred.JobClient (JobClient.java:monitorAndPrintJob(1393)) - map 0% reduce 0%
2013-09-05 20:40:49,332 INFO  mapred.JobClient (JobClient.java:monitorAndPrintJob(1448)) - Job complete: job_local1315110785_0001
2013-09-05 20:40:49,332 INFO  mapred.JobClient (Counters.java:log(585)) - Counters: 0
2013-09-05 20:40:49,333 INFO  mapred.JobClient (JobClient.java:runJob(1356)) - Job Failed: NA
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Solution: add the following configuration to nutch-site.xml:

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
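Note that in deploy mode the configuration is packaged into the Nutch job file, so after editing conf/nutch-site.xml you generally need to rebuild so that runtime/deploy picks up the change (a sketch, assuming the source tree sits at /root/apache-nutch-1.8):

cd /root/apache-nutch-1.8
ant    # re-run the build; the rebuilt job file under runtime/deploy includes the updated nutch-site.xml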
   
  
 


When executing the shell command as bin/crawl urls mydir http://192.168.211.36:9001/solr/ 2, some HDFS directories are sometimes not accessed correctly. It is therefore recommended to run the command from inside the deploy/bin directory:

./crawl urls mydir http://192.168.211.36:9001/solr/ 2


Http://itindex.net/detail/49582-nutch1.8-hadoop1.2-solr4.3
