1. System Preparation
Install Ubuntu 13.10, configure the package sources, then run sudo apt-get update and sudo apt-get upgrade.
2. Related Software Preparation
(1) Install ant
sudo apt-get install ant1.7, then check the installation with ant -version; output such as
Apache Ant version 1.7.1 compiled on September 3 2011
indicates that the installation was successful.
(2) JDK installation and configuration
Download the JDK from the official website and decompress it into the /opt/jdk directory.
Environment variable configuration: run sudo gedit /etc/profile and append the following at the end:
export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Save and exit, then run source /etc/profile to make the configuration take effect.
Test: both java -version and java produce output (the output is not pasted here).
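As a quick sanity check (a sketch, not part of the original steps), confirm that the shell now resolves Java from the new location:
source /etc/profile
echo $JAVA_HOME      # should print /opt/jdk
which java           # should print /opt/jdk/bin/java
java -version        # prints the JDK version string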
(3) Nutch
Download Nutch 1.7 and decompress it to /opt/nutch.
cd /opt/nutch
bin/nutch
If the usage help is displayed, the installation works. Then perform the following configuration.
Step 1: Modify conf/nutch-site.xml and set the agent name used in HTTP requests:
<property>
  <name>http.agent.name</name>
  <value>Friendly Crawler</value>
</property>
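For orientation, this property sits inside the <configuration> element of conf/nutch-site.xml. A minimal sketch of the whole file at this point (the agent name is just an example value) could be written from the shell like this; the heredoc overwrites the file, so edit it by hand instead if it already contains other properties:
cd /opt/nutch
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>
EOF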
Step 2: Create a seed folder:
mkdir -p urls
Step 3: Write the seed URL into urls/seed.txt (sudo gedit urls/seed.txt):
http://www.linuxidc.com
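Equivalently (a shell sketch, not in the original steps), the seed directory and file can be created in one go:
mkdir -p urls
echo "http://www.linuxidc.com" > urls/seed.txt
cat urls/seed.txt    # confirm the seed URL was written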
Step 4: Configure conf/regex-urlfilter.txt (comment out the catch-all "+." rule and add a rule for the target site):
# Accept anything else
# +.
# Added by yoyo
+36kr.com
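The lines in regex-urlfilter.txt are regular expressions matched against each URL, not plain strings. A tighter pattern in the style of the stock Nutch example (an illustrative assumption, not taken from the original article) would anchor the rule to the domain:
# Accept only 36kr.com and its subdomains
+^http://([a-z0-9]*\.)*36kr\.com/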
Step 5: Modify conf/nutch-site.xml and add a parser.skip.truncated property:
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
</property>
This is needed because packet captures with tcpdump or wireshark show that the site returns page content in truncated (chunked) segments; by default Nutch skips parsing truncated content, so parsing it has to be enabled here.
Reference: http://lucene.472066.n3.nabble.com/Content-Truncation-in-Nutch-2-1-MySQL-td4038888.html
Step 6: Run a test crawl:
bin/nutch crawl urls -dir crawl
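The crawl command accepts a few options worth knowing; a hedged sketch of a slightly larger run (the same form used later in this article) is:
# urls   : directory containing the seed list
# -dir   : where crawl data (crawldb, linkdb, segments) is written
# -depth : number of link-following rounds starting from the seeds
# -topN  : maximum number of pages fetched in each round
bin/nutch crawl urls -dir crawl -depth 2 -topN 5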
(4) Solr Installation
Download Solr 4.6 and decompress it to /opt/solr.
cd /opt/solr/example
java -jar start.jar
If the page http://localhost:8983/solr/ can be opened normally, Solr is running.
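Without a browser, the same check can be done from the command line (a sketch):
# Expect "200" if the Solr admin page is reachable
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8983/solr/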
3. Integration of Nutch and Solr
(1) Environment variable settings:
Run sudo gedit /etc/profile and append:
export NUTCH_RUNTIME_HOME=/opt/nutch
export APACHE_SOLR_HOME=/opt/solr
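After saving, reload the profile and confirm the variables resolve (a quick check, not part of the original steps):
source /etc/profile
echo $NUTCH_RUNTIME_HOME   # expect /opt/nutch
echo $APACHE_SOLR_HOME     # expect /opt/solr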
(2) Integration
mkdir ${APACHE_SOLR_HOME}/example/solr/conf
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
Restart Solr:
java -jar start.jar
Index creation:
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
Error:
Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
The solution is described at http://stackoverflow.com/questions/13429481/error-while-indexing-in-solr-data-crawled-by-nutch
Similarly, some other fields need to be added by editing ~/solr-4.4.0/example/solr/collection1/conf/schema.xml (adjust the path to the local install, /opt/solr here): inside the <fields>...</fields> section, add the fields given in the referenced answer.
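The exact field list comes from the linked answer and is not reproduced here. Purely to illustrate the format (a hypothetical example, not the article's own list), a field entry of the kind Solr 4.x schemas expect looks like this inside the <fields> section:
<!-- hypothetical example of a field definition added to schema.xml -->
<field name="_version_" type="long" indexed="true" stored="true"/>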
(3) Verification
rm -rf crawl/
bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://localhost:8983/solr/
............
............
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 08:55:30, elapsed: 00:00:01
LinkDb: starting at 08:55:30
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085430
LinkDb: adding segment: file:/opt/nutch/crawl/segments/20140303085441
LinkDb: finished at 08:55:31, elapsed: 00:00:01
Indexer: starting at 08:55:31
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters:
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
Indexer: finished at 08:55:35, elapsed: 00:00:03
SolrDeleteDuplicates: starting at 08:55:35
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
SolrDeleteDuplicates: finished at 08:55:36, elapsed: 00:00:01
Crawl finished: crawl
To query the crawled content, open http://localhost:8983/solr/#/collection1/query in the browser and click Execute Query.
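The same query can also be issued from the command line (a sketch; assumes the default collection1 core):
# Return every indexed document as indented JSON
curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true"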