CentOS 6.5 + Nutch 1.7 + SOLR 4.7 + IK 2012


Environment

Linux version: CentOS 6.5

JDK Version: JDK 1.7

Nutch Version: Nutch 1.7

SOLR Version: SOLR 4.7

IK Version: IK Analyzer 2012


Directory


1. Installing the JDK

2. Installing SOLR

3. Configure the IK tokenizer for SOLR

4. Installing Nutch


Content


1. Installing the JDK

1.1 Create the /usr/java directory, download the JDK package, and unzip it

[root@localhost ~]# mkdir /usr/java
[root@localhost ~]# cd /usr/java
[root@localhost java]# curl -O http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jdk-7u75-linux-x64.tar.gz
[root@localhost java]# tar -zxvf jdk-7u75-linux-x64.tar.gz

1.2 Setting environment variables

[root@localhost java]# vi /etc/profile

Add the following content:

# set JDK environment
JAVA_HOME=/usr/java/jdk1.7.0_75
JRE_HOME=$JAVA_HOME/jre
CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export JAVA_HOME JRE_HOME CLASS_PATH PATH

Make the changes effective:

[root@localhost java]# source /etc/profile
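Before relying on the profile, the variable expansion can be sanity-checked in any shell; a quick sketch using the same assignments as above:

```shell
# Reproduce the profile assignments and confirm they expand as intended
JAVA_HOME=/usr/java/jdk1.7.0_75
JRE_HOME=$JAVA_HOME/jre
CLASS_PATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
echo "$JRE_HOME"      # /usr/java/jdk1.7.0_75/jre
echo "$CLASS_PATH"
```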

1.3 Verification

[root@localhost java]# java -version

2. Installing SOLR

2.1 Create the /usr/solr directory, download the SOLR package, and unzip it

[root@localhost ~]# mkdir /usr/solr
[root@localhost ~]# cd /usr/solr
[root@localhost solr]# curl -O http://archive.apache.org/dist/lucene/solr/4.7.0/solr-4.7.0.tgz
[root@localhost solr]# tar -zxvf solr-4.7.0.tgz

2.2 Start Jetty

This uses the Jetty server bundled with SOLR.

[root@localhost solr]# cd solr-4.7.0/example
[root@localhost example]# java -jar start.jar

2.3 Verification

Open http://10.192.87.198:8983/solr/#/collection1/query in a browser.


3. Configure the IK tokenizer for SOLR

3.1 Download IK Analyzer 2012

After decompression, upload the three files IKAnalyzer.cfg.xml, IKAnalyzer2012_ff.jar, and stopword.dic to the /usr/solr/solr-4.7.0/example/solr-webapp/webapp/WEB-INF/lib/ directory.

3.2 Modifying the/usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml configuration file

[root@localhost ~]# cd /usr/solr/solr-4.7.0/example/solr/collection1/conf/
[root@localhost conf]# vi schema.xml

Add the following inside the <types>...</types> section:

<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
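With the type defined, any field can opt into IK tokenization by referencing it. A hypothetical example (the field name below is for illustration only, not part of the original schema):

```xml
<field name="content_ik" type="text_ik" indexed="true" stored="true"/>
```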

3.3 Verification

Restart SOLR, open http://10.192.87.198:8983/solr/#/collection1/analysis, and test:

(Screenshots omitted: the Analysis page and the resulting IK tokenization.)

4. Installing Nutch

4.1 Create the /usr/nutch directory, download the Nutch package, and unzip it

[root@localhost ~]# mkdir /usr/nutch
[root@localhost ~]# cd /usr/nutch
[root@localhost nutch]# curl -O http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz
[root@localhost nutch]# tar -zxvf apache-nutch-1.7-bin.tar.gz

4.2 Modifying the Nutch-site.xml configuration file

[root@localhost nutch]# cd apache-nutch-1.7/conf
[root@localhost conf]# vi nutch-site.xml

Add the following properties between <configuration> and </configuration>:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>friendly crawler</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>
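As a quick sanity check that the snippet was pasted intact, this sketch writes the same XML to a scratch copy and counts the property entries (the scratch file name is illustrative, not the real conf path):

```shell
# Write the snippet to a scratch file and count its <property> entries
scratch=nutch-site-check.xml
cat > "$scratch" <<'EOF'
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>friendly crawler</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>
EOF
grep -c '<property>' "$scratch"    # prints 2
```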

4.3 Modify the Regex-urlfilter.txt file to set filter rules

[root@localhost conf]# vi regex-urlfilter.txt

Here, use a regular expression to specify which site addresses the crawler may visit.

In the following example, the regular expression restricts the crawl to the sohu.com domain.

Before modification:

+.

After modification:

+^http://([a-z0-9]*\.) *sohu.com
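The leading + is Nutch's accept marker; the pattern after it is an ordinary regular expression, so it can be tried with grep before crawling (the sample URLs below are illustrative):

```shell
# Nutch's '+' prefix means "accept"; test the bare pattern with grep -E
pattern='^http://([a-z0-9]*\.)*sohu.com'
echo "http://news.sohu.com/index.html" | grep -qE "$pattern" && echo accepted
echo "http://www.example.com/"         | grep -qE "$pattern" || echo rejected
```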

4.4 Setting the site you want to crawl

[root@localhost conf]# cd /usr/nutch/apache-nutch-1.7
[root@localhost apache-nutch-1.7]# mkdir urls
[root@localhost apache-nutch-1.7]# echo "http://www.sohu.com" > urls/seed.txt

4.5 Execute the command to crawl

[root@localhost apache-nutch-1.7]# bin/nutch crawl urls -dir crawl -depth 2 -topN 5

Use the tree command to view the /usr/nutch/apache-nutch-1.7/crawl directory:

[root@localhost apache-nutch-1.7]# tree crawl/
crawl/
├── crawldb
│   ├── current
│   │   └── part-00000
│   │       ├── data
│   │       └── index
│   └── old
│       └── part-00000
│           ├── data
│           └── index
├── linkdb
│   └── current
│       └── part-00000
│           ├── data
│           └── index
└── segments
    ├── 20150326234924
    │   ├── content
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   ├── crawl_fetch
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   ├── crawl_generate
    │   │   └── part-00000
    │   ├── crawl_parse
    │   │   └── part-00000
    │   ├── parse_data
    │   │   └── part-00000
    │   │       ├── data
    │   │       └── index
    │   └── parse_text
    │       └── part-00000
    │           ├── data
    │           └── index
    └── 20150326234933
        ├── content
        │   └── part-00000
        │       ├── data
        │       └── index
        ├── crawl_fetch
        │   └── part-00000
        │       ├── data
        │       └── index
        ├── crawl_generate
        │   └── part-00000
        ├── crawl_parse
        │   └── part-00000
        ├── parse_data
        │   └── part-00000
        │       ├── data
        │       └── index
        └── parse_text
            └── part-00000
                ├── data
                └── index

The data has been crawled.

4.6 Integrate with SOLR

To edit the/usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml file, add the following fields to <field>...</fields>:

<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>

Restart SOLR, then re-crawl:

[root@localhost apache-nutch-1.7]# bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://10.192.86.156:8983/solr

4.7 Viewing results

Enter http://10.192.86.156:8983/solr/#/collection1/query in the browser to query.
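The same query can also be issued from the command line through SOLR's standard select handler; the sketch below only assembles the URL (host taken from above, query parameters chosen here for illustration):

```shell
# Build a select-handler query URL (collection1 is SOLR's default core)
base="http://10.192.86.156:8983/solr/collection1/select"
params="q=*:*&wt=json&rows=10"
url="$base?$params"
echo "$url"
# To actually run it against the server: curl -s "$url"
```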

(Screenshot of the query results omitted.)

