Environment
Linux version: CentOS 6.5
JDK version: JDK 1.7
Nutch version: Nutch 1.7
Solr version: Solr 4.7
IK version: IK-Analyzer 2012
Table of contents
1. Installing the JDK
2. Installing Solr
3. Configuring the IK tokenizer for Solr
4. Installing Nutch
Content
1. Installing the JDK
1.1 Create the java directory under /usr/, download the JDK package, and extract it
# mkdir /usr/java
# cd /usr/java
# curl -O http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jdk-7u75-linux-x64.tar.gz
# tar -zxvf jdk-7u75-linux-x64.tar.gz
1.2 Setting environment variables
# vi /etc/profile
Add the following content:
# set JDK environment
JAVA_HOME=/usr/java/jdk1.7.0_75
JRE_HOME=$JAVA_HOME/jre
CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export JAVA_HOME JRE_HOME CLASS_PATH PATH
Make the changes effective:
# source /etc/profile
1.3 Verification
# java -version
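The verification step can also be scripted. The following is only a sketch: check_jdk_home is a hypothetical helper name, not part of the JDK, and it assumes the JDK was extracted to the directory used for JAVA_HOME in step 1.2.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given JDK directory contains
# the executables that the PATH entries from step 1.2 rely on.
check_jdk_home() {
  jdk_home="$1"
  for tool in java javac; do
    if [ ! -x "$jdk_home/bin/$tool" ]; then
      echo "missing: $jdk_home/bin/$tool" >&2
      return 1
    fi
  done
  echo "ok: $jdk_home"
}

# Example call (path taken from step 1.2):
#   check_jdk_home /usr/java/jdk1.7.0_75 && java -version
```

If the helper reports a missing file, the JAVA_HOME value in /etc/profile most likely does not match the directory that tar created.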
2. Installing Solr
2.1 Create the solr directory under /usr/, download the Solr installation package, and extract it
# mkdir /usr/solr
# cd /usr/solr
# curl -O http://archive.apache.org/dist/lucene/solr/4.7.0/solr-4.7.0.tgz
# tar -zxvf solr-4.7.0.tgz
2.2 Start Jetty
This uses the Jetty server bundled with Solr.
# cd solr-4.7.0/example
# java -jar start.jar
2.3 Verification
Open http://10.192.87.198:8983/solr/#/collection1/query in a browser.
3. Configuring the IK tokenizer for Solr
3.1 Download IK-Analyzer 2012
After extracting it, upload the three files IKAnalyzer.cfg.xml, IKAnalyzer2012_FF.jar, and stopword.dic to the /usr/solr/solr-4.7.0/example/solr-webapp/webapp/WEB-INF/lib/ directory.
3.2 Modify the /usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml configuration file
# cd /usr/solr/solr-4.7.0/example/solr/collection1/conf/
# vi schema.xml
Add the following inside <types></types>:
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
3.3 Verification
Restart Solr, open http://10.192.87.198:8983/solr/#/collection1/analysis, and test the tokenizer:
[Screenshot: test input on the Solr analysis page]
Tokenization result:
[Screenshot: tokenization result]
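The same check can be sketched from the command line. This assumes that the /analysis/field request handler is enabled in the core's solrconfig.xml (treat that as an assumption to verify) and that curl is installed. The function names analysis_url and analyze_text are made up for this example.

```shell
#!/bin/sh
# Build the field-analysis URL for a core.
# $1 = host, $2 = port, $3 = core name
analysis_url() {
  echo "http://$1:$2/solr/$3/analysis/field"
}

# Ask Solr to analyze a piece of text with a given field type.
# $1 = host, $2 = port, $3 = core, $4 = field type, $5 = text
# curl -G turns the --data-urlencode pairs into a URL-encoded query string.
analyze_text() {
  curl -s -G "$(analysis_url "$1" "$2" "$3")" \
    --data-urlencode "analysis.fieldtype=$4" \
    --data-urlencode "analysis.fieldvalue=$5" \
    --data-urlencode 'wt=json'
}

# Example call (address from this article, field type from step 3.2):
#   analyze_text 10.192.87.198 8983 collection1 text_ik "some Chinese text"
```

If the field type is not registered, Solr should answer with an error instead of a token list, which makes this a quick way to confirm that the schema.xml change was picked up after the restart.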
4. Installing Nutch
4.1 Create the nutch directory under /usr/, download the Nutch installation package, and extract it
# mkdir /usr/nutch
# cd /usr/nutch
# curl -O http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz
# tar -zxvf apache-nutch-1.7-bin.tar.gz
4.2 Modify the nutch-site.xml configuration file
# cd apache-nutch-1.7/conf
# vi nutch-site.xml
Add the following properties between <configuration> and </configuration>:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>friendly crawler</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>
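Nutch refuses to fetch when http.agent.name is empty, so it can be worth confirming that the value was saved. The sketch below reads a property back out of the file with awk. get_property is a hypothetical helper, and it assumes each <name> and <value> element sits on a single line, as in the listing above. It is not a general XML parser.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows a given <name>.
# $1 = path to nutch-site.xml, $2 = property name
get_property() {
  awk -v name="$2" '
    # Remember that we have seen the requested <name> element.
    index($0, "<name>" name "</name>") { found = 1 }
    # The first <value> element after it holds the setting.
    found && match($0, /<value>[^<]*<\/value>/) {
      # Strip the 7-character opening tag and the 8-character closing tag.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Example call:
#   get_property /usr/nutch/apache-nutch-1.7/conf/nutch-site.xml http.agent.name
```

An empty result means the property is missing or the file is laid out differently from what the helper expects.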
4.3 Modify the regex-urlfilter.txt file to set the filter rules
# vi regex-urlfilter.txt
Use a regular expression to specify the sites you want to crawl.
For example, the following change restricts the crawler to the sohu.com domain.
Before modification:
+.
After modification:
+^http://([a-z0-9]*\.)*sohu.com
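To get a feel for which URLs the rule accepts, the pattern (without the leading +, which marks an "accept" rule) can be tried against a few sample URLs. This is an approximation only: it uses grep -E, whereas Nutch evaluates the pattern with Java regular expressions. The name url_allowed and the sample URLs are made up for the example.

```shell
#!/bin/sh
# The pattern from step 4.3, without the leading "+".
SOHU_PATTERN='^http://([a-z0-9]*\.)*sohu.com'

# Succeed if the URL matches the pattern.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$SOHU_PATTERN"
}

# Print "+" for accepted and "-" for rejected sample URLs.
for url in http://www.sohu.com/ http://news.sohu.com/a.html http://www.example.com/; do
  if url_allowed "$url"; then
    echo "+ $url"
  else
    echo "- $url"
  fi
done
```

Under grep -E the two sohu.com URLs are accepted and the example.com URL is rejected. Note that the pattern is not anchored at the end and the dot in sohu.com is unescaped, so a URL such as http://sohu.com.example.org/ also matches here. Escaping the dot and appending a slash would be a stricter variant.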
4.4 Set the site you want to crawl
# cd /usr/nutch/apache-nutch-1.7
# mkdir urls
# echo "http://www.sohu.com" > urls/seed.txt
4.5 Run the crawl command
# bin/nutch crawl urls -dir crawl -depth 2 -topN 5
Use tree to view the /usr/nutch/apache-nutch-1.7/crawl directory:
# tree crawl/
crawl/
├── crawldb
│   ├── current
│   │   └── part-00000
│   │       ├── data
│   │       └── index
│   └── old
│       └── part-00000
│           ├── data
│           └── index
├── linkdb
│   └── current
│       └── part-00000
│           ├── data
│           └── index
└── segments
    ├── 20150326234924
    │   ├── content
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   ├── crawl_fetch
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   ├── crawl_generate
    │   │   └── part-00000
    │   ├── crawl_parse
    │   │   └── part-00000
    │   ├── parse_data
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   └── parse_text
    │       └── part-00000
    │           ├── data
    │           └── index
    └── 20150326234933
        ├── content
        │   └── part-00000
        │       ├── data
        │       └── index
        ├── crawl_fetch
        │   └── part-00000
        │       ├── data
        │       └── index
        ├── crawl_generate
        │   └── part-00000
        ├── crawl_parse
        │   └── part-00000
        ├── parse_data
        │   └── part-00000
        │       ├── data
        │       └── index
        └── parse_text
            └── part-00000
                ├── data
                └── index
The data has been crawled.
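If tree is not installed, the segments can also be listed with plain shell. The sketch below relies on the layout shown above, where each directory under crawl/segments is named with a timestamp, so sorting by name also sorts by time. list_segments and latest_segment are hypothetical helper names.

```shell
#!/bin/sh
# List the segment directories under <crawl_dir>/segments, oldest first.
# $1 = crawl directory (the value passed to -dir)
list_segments() {
  for seg in "$1"/segments/*/; do
    # Skip the unexpanded glob when there are no segments yet.
    [ -d "$seg" ] && basename "$seg"
  done | sort
}

# Print the name of the most recent segment.
latest_segment() {
  list_segments "$1" | tail -n 1
}

# Example calls, run from /usr/nutch/apache-nutch-1.7:
#   list_segments crawl
#   latest_segment crawl
```

With -depth 2 the crawl runs two rounds, which is why the listing above contains two segments.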
4.6 Integrating with Solr
Edit the /usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml file and add the following fields inside <fields>...</fields>:
<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
Restart Solr and crawl again:
# bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://10.192.86.156:8983/solr
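Before restarting Solr, a quick grep can confirm that none of the fields was left out. This is a sketch under a simplifying assumption: each field is written as <field name="..." with name as the first attribute, as in the list above. missing_fields is a hypothetical helper name.

```shell
#!/bin/sh
# Print every requested field name that has no <field name="..."> entry.
# $1 = path to schema.xml, remaining arguments = field names to look for
missing_fields() {
  schema="$1"
  shift
  for field in "$@"; do
    grep -q "<field name=\"$field\"" "$schema" || echo "$field"
  done
}

# Example call with the fields from this step:
#   missing_fields /usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml \
#     host digest segment boost tstamp anchor cache
```

No output means every listed field was found. Any name that is printed still needs to be added.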
4.7 Viewing results
Open http://10.192.86.156:8983/solr/#/collection1/query in a browser and run a query:
[Screenshot: query results in the Solr admin page]
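The result can also be checked without a browser by asking Solr how many documents the index holds. The sketch below assumes curl is installed and that the core answers on its standard /select handler with a JSON body containing a numFound value. The helper names are made up for this example.

```shell
#!/bin/sh
# Build the select URL for a core.
# $1 = host, $2 = port, $3 = core name
solr_select_url() {
  echo "http://$1:$2/solr/$3/select"
}

# Run a match-all query that returns no documents, only the header and count.
count_docs_json() {
  curl -s -G "$(solr_select_url "$1" "$2" "$3")" \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'wt=json'
}

# Extract the numFound value from a JSON response read on standard input.
num_found() {
  sed -n 's/.*"numFound":[ ]*\([0-9][0-9]*\).*/\1/p'
}

# Example call (address from this article):
#   count_docs_json 10.192.86.156 8983 collection1 | num_found
```

A count greater than zero after the second crawl suggests that Nutch pushed the crawled pages into Solr.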
CentOS 6.5 + Nutch 1.7 + Solr 4.7 + IK 2012