1. System Preparation
Install Ubuntu 13.10, configure the package sources, then run sudo apt-get update and sudo apt-get upgrade.
2. Related Software Preparation
(1) Install ant
sudo apt-get install ant1.7, then check the installation with ant -version; output such as
Apache Ant version 1.7.1 compiled on September 3 2011
indicates that the installation was successful.
(2) JDK installation and configuration
Download the JDK from the official website and decompress it into the /opt/jdk directory.
Environment variable configuration: run sudo gedit /etc/profile and append the following at the end:
export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Save and exit, then run source /etc/profile to make the configuration take effect.
Test: both java -version and java produce output (the output is not pasted here).
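As a quick sanity check (a sketch, not part of the original steps), confirm that the shell now resolves Java from the new location:
source /etc/profile
echo $JAVA_HOME      # should print /opt/jdk
which java           # should print /opt/jdk/bin/java
java -version        # prints the JDK version string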
(3) Nutch
Download Nutch 1.7 and decompress it to /opt/nutch.
cd /opt/nutch
bin/nutch
If the usage help is displayed, the installation works. Then perform the following configuration.
Step 1: Modify conf/nutch-site.xml and set the agent name used in HTTP requests:
<property>
  <name>http.agent.name</name>
  <value>Friendly Crawler</value>
</property>
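For orientation, this property sits inside the <configuration> element of conf/nutch-site.xml. A minimal sketch of the whole file at this point (the agent name is just an example value) could be written from the shell like this; the heredoc overwrites the file, so edit it by hand instead if it already contains other properties:
cd /opt/nutch
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>
EOF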
Step 2: Create a seed folder:
mkdir -p urls
Step 3: Write the seed URL into urls/seed.txt (sudo gedit urls/seed.txt):
http://www.linuxidc.com
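Equivalently (a shell sketch, not in the original steps), the seed directory and file can be created in one go:
mkdir -p urls
echo "http://www.linuxidc.com" > urls/seed.txt
cat urls/seed.txt    # confirm the seed URL was written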
Step 4: Configure conf/regex-urlfilter.txt (comment out the catch-all "+." rule and add a rule for the target site):
# Accept anything else
# +.
# Added by yoyo
+36kr.com
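The lines in regex-urlfilter.txt are regular expressions matched against each URL, not plain strings. A tighter pattern in the style of the stock Nutch example (an illustrative assumption, not taken from the original article) would anchor the rule to the domain:
# Accept only 36kr.com and its subdomains
+^http://([a-z0-9]*\.)*36kr\.com/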
Step 5: Modify conf/nutch-site.xml and add a parser.skip.truncated property:
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
</property>
This is needed because packet captures with tcpdump or wireshark show that the site returns page content in truncated (chunked) segments; by default Nutch skips parsing truncated content, so parsing it has to be enabled here.
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
Step 6: Run a test crawl:
bin/nutch crawl urls -dir crawl
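The crawl command accepts a few options worth knowing; a hedged sketch of a slightly larger run (the same form used later in this article) is:
# urls   : directory containing the seed list
# -dir   : where crawl data (crawldb, linkdb, segments) is written
# -depth : number of link-following rounds starting from the seeds
# -topN  : maximum number of pages fetched in each round
bin/nutch crawl urls -dir crawl -depth 2 -topN 5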
(4) Solr Installation
Download Solr 4.6 and decompress it to /opt/solr.
cd /opt/solr/example
java -jar start.jar
If the page http://localhost:8983/solr/ can be opened normally, Solr is running.
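Without a browser, the same check can be done from the command line (a sketch):
# Expect "200" if the Solr admin page is reachable
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8983/solr/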
3. Integration of Nutch and Solr
(1) Environment variable settings:
Run sudo gedit /etc/profile and append:
export NUTCH_RUNTIME_HOME=/opt/nutch
export APACHE_SOLR_HOME=/opt/solr
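After saving, reload the profile and confirm the variables resolve (a quick check, not part of the original steps):
source /etc/profile
echo $NUTCH_RUNTIME_HOME   # expect /opt/nutch
echo $APACHE_SOLR_HOME     # expect /opt/solr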
(2) Integration
mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
Restart Solr:
java -jar start.jar
Index creation:
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
Error:
Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
The solution is described at http://stackoverflow.com/questions/13429481/error-while-indexing-in-solr-data-crawled-by-nutch
Similarly, some other fields need to be added by editing ~/solr-4.4.0/example/solr/collection1/conf/schema.xml (adjust the path to the local install, /opt/solr here): inside the <fields>...</fields> section, add the fields given in the referenced answer.
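The exact field list comes from the linked answer and is not reproduced here. Purely to illustrate the format (a hypothetical example, not the article's own list), a field entry of the kind Solr 4.x schemas expect looks like this inside the <fields> section:
<!-- hypothetical example of a field definition added to schema.xml -->
<field name="_version_" type="long" indexed="true" stored="true"/>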
(3) Verification
rm -rf crawl/
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
............
............
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 08:55:30, elapsed: 00:00:01
LinkDb: starting at 08:55:30
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085430
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085441
LinkDb: finished at 08:55:31, elapsed: 00:00:01
Indexer: starting at 08:55:31
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
Indexer: finished at 08:55:35, elapsed: 00:00:03
SolrDeleteDuplicates: starting at 08:55:35
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
SolrDeleteDuplicates: finished at 08:55:36, elapsed: 00:00:01
Crawl finished: crawl
To query the crawled content, open http://localhost:8983/solr/#/collection1/query in the browser and click Execute Query.
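The same query can also be issued from the command line (a sketch; assumes the default collection1 core):
# Return every indexed document as indented JSON
curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true"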