Configure the integration of Nutch 1.7 and Solr 4.6 on Ubuntu 13.10


1. System Preparation
Install Ubuntu 13.10, configure the package sources, then run sudo apt-get update and sudo apt-get upgrade.
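For reference, the two commands as run from a terminal:

sudo apt-get update     # refresh the package index from the configured sources
sudo apt-get upgrade    # upgrade all installed packages to their latest versions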

2. Related Software preparation
(1) Install ant
sudo apt-get install ant1.7
Check the installation with ant -version. If it prints

Apache Ant version 1.7.1 compiled on September 3 2011

the installation succeeded.

(2) JDK installation and configuration
Download the JDK from the official website and decompress it to the /opt/jdk directory.

Environment variable configuration: open /etc/profile with sudo gedit /etc/profile and append the following at the end:

export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Save and exit; run source /etc/profile to make the configuration take effect.

Test: both java -version and java produce output (output not pasted here).

(3) Nutch
Download Nutch 1.7 and decompress it to /opt/nutch.

cd /opt/nutch

bin/nutch

The usage help should be displayed, indicating that the installation works. Then perform the following configuration.

Step 1: Modify the file conf/nutch-site.xml and set the name of the agent in the HTTP request:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>

Step 2: Create a seed folder
mkdir -p urls

Step 3: Write the seed URL to the file urls/seed.txt (sudo gedit urls/seed.txt):
http://www.linuxidc.com
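The seed file can also be created entirely from the shell, equivalent to the gedit step above:

mkdir -p urls                                     # the seed folder from Step 2
echo "http://www.linuxidc.com" > urls/seed.txt    # write the seed URL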

Step 4: Configure conf/regex-urlfilter.txt
# accept anything else
# +.

# added by yoyo
+36kr.com
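Each non-comment line in regex-urlfilter.txt is a regular expression prefixed with + (fetch URLs that match) or - (skip them). A stricter pattern that anchors the rule to the 36kr.com domain instead of matching the substring anywhere in a URL might look like this (an illustrative variant in the style of Nutch's default filter file, not part of the original configuration):

# restrict the crawl to pages under 36kr.com and its subdomains
+^http://([a-z0-9]*\.)*36kr.com/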

Step 5: Modify conf/nutch-site.xml and add a parser.skip.truncated property to it:

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
</property>
This is needed because packet captures with tcpdump or wireshark show that the site returns its page content in segments, which Nutch treats as truncated; by default Nutch skips parsing truncated pages, so that behavior has to be disabled.
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
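A quick way to see whether a site actually delivers its content in segments is to inspect the response headers, for example with curl (a diagnostic sketch; assumes curl is installed):

curl -sI http://www.linuxidc.com | grep -i transfer-encoding
# a "Transfer-Encoding: chunked" header indicates segmented delivery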

Step 6: Crawl experiment

bin/nutch crawl urls -dir crawl
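The crawl command also accepts flags that bound the crawl, used later in the integration step: -depth limits how many link levels are followed from the seeds and -topN caps the number of pages fetched per level. A small bounded test run might be:

bin/nutch crawl urls -dir crawl -depth 1 -topN 3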

(4) Solr Installation
Download Solr 4.6 and decompress it to /opt/solr.

cd /opt/solr/example

java -jar start.jar

If the page http://localhost:8983/solr/ opens normally, Solr is running.
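Availability can also be checked from the command line; a minimal check, assuming the example's default collection1 core and its standard ping handler:

curl 'http://localhost:8983/solr/collection1/admin/ping?wt=json'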

3. Integration of Nutch and Solr
(1) Environment variable settings:
Add the following to /etc/profile (sudo gedit /etc/profile):

export NUTCH_RUNTIME_HOME=/opt/nutch

export APACHE_SOLR_HOME=/opt/solr
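As with the JDK variables, reload the profile so the new variables are visible in the current shell:

source /etc/profile
echo $NUTCH_RUNTIME_HOME $APACHE_SOLR_HOME    # should print /opt/nutch /opt/solr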

(2) Integration
mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
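Note that Nutch 1.7 also ships conf/schema-solr4.xml, adapted to Solr 4.x, and the Solr 4.6 example actually reads its schema from example/solr/collection1/conf/. If the plain schema.xml copied above causes errors, a variant worth trying (an assumption based on the Nutch 1.7 distribution layout, not a step from the original text) is:

cp ${NUTCH_RUNTIME_HOME}/conf/schema-solr4.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml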

Restart Solr:

java -jar start.jar

Index creation:

bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/

Error:

Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

The solution is described here: http://stackoverflow.com/questions/13429481/error-while-indexing-in-solr-data-crawled-by-nutch

Similarly, some other fields need to be added by editing ~/solr-4.4.0/example/solr/collection1/conf/schema.xml (the path as cited in that answer; here it corresponds to ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml), adding the following fields inside the <fields>...</fields> element:
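Judging from the referenced Stack Overflow thread, the central addition for Nutch 1.7 against Solr 4.x is the _version_ field that Solr 4 requires; a reconstruction of the missing field definition (based on that thread, not verbatim from the original post):

<field name="_version_" type="long" indexed="true" stored="true"/>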
(3) Verification
rm -rf crawl/

bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/

............

............

CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 08:55:30, elapsed: 00:00:01
LinkDb: starting at 08:55:30
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085430
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085441
LinkDb: finished at 08:55:31, elapsed: 00:00:01
Indexer: starting at 08:55:31
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexer: finished at 08:55:35, elapsed: 00:00:03
SolrDeleteDuplicates: starting at 08:55:35
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
SolrDeleteDuplicates: finished at 08:55:36, elapsed: 00:00:01
Crawl finished: crawl
To retrieve the crawled content, open http://localhost:8983/solr/#/collection1/query in the browser and click Execute Query.
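The same match-all query can be issued directly against Solr's HTTP API (the standard Solr 4 select handler; the query *:* matches every indexed document):

curl 'http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true'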


