The previous blog post described setting up a development environment on Windows 10 with Cygwin to build Nutch; this article covers installing Nutch 2.3 on Ubuntu.
1. Required software and its version
- Ubuntu 15.04
- Hadoop 1.2.1
- HBase 0.94.27
- Nutch 2.3
- SOLR 4.9.1
2. System Environment Preparation
2.1 Install the Ubuntu operating system
This is a basic prerequisite and there are many guides online, so install it yourself; leave a comment if you run into problems.
2.2 Create a separate kandy user
useradd kandy
2.3 Setting a password
passwd kandy
2.4 Turn on Administrator privileges
vi /etc/sudoers
Add a line:
kandy ALL=(ALL:ALL) ALL
2.5 Ensure that localhost is mapped to 127.0.0.1
vi /etc/hosts
Add the following content:
127.0.0.1 localhost
2.6 Set up passwordless SSH login, so that ssh localhost works without a password
2.6.1 Generate a key
Use the following command:
ssh-keygen -t rsa
Accept all the default options (press Enter at each prompt).
2.6.2 Copying keys
Use the following command:
cp .ssh/id_rsa.pub .ssh/authorized_keys
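If ssh still asks for a password after this, the usual cause is file permissions: sshd silently ignores key files that are group- or world-readable. A hedged fix:

```shell
# sshd requires restrictive permissions on the .ssh directory and key
# files; without them the authorized_keys file is silently ignored
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys ~/.ssh/id_rsa
```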
2.6.3 Verification
Enter the following command:
ssh localhost
2.7 Ensure that the JDK is installed and the JAVA_HOME environment variable is configured, for example:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
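To make this setting survive new shell sessions, it can be persisted in the user's profile; a minimal sketch, assuming the Oracle JDK 8 path shown above (adjust it to your actual JDK location):

```shell
# Append JAVA_HOME to ~/.bashrc so it is set in every new shell
# (the JDK path is an example; adjust it to your installation)
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> ~/.bashrc
echo 'export PATH=${JAVA_HOME}/bin:${PATH}' >> ~/.bashrc
source ~/.bashrc   # reload the profile in the current shell
```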
3. Install Hadoop 1.2.1
3.1 Download hadoop-1.2.1.tar.gz and unzip it to /usr/local/hadoop, then modify the directory permissions:
sudo chmod -R 777 hadoop
3.2 Create the /data/hadoop-data directory
3.3 Modify ./conf/core-site.xml as follows:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-data/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
3.4 Modify the JAVA_HOME configuration in conf/hadoop-env.sh as follows:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
3.5 Modify the./conf/hdfs-site.xml configuration as follows:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
3.6 Modify the./conf/mapred-site.xml configuration as follows:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
3.7 Configuring environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export PATH=${HADOOP_PREFIX}/bin:${PATH}
3.8 Formatting Namenode
hadoop namenode -format
3.9 Starting Hadoop
start-all.sh
3.10 Check if Hadoop starts correctly
hadoop fs -ls /
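Besides listing HDFS, you can check that all the daemons are running with jps; on a healthy pseudo-distributed Hadoop 1.x node you should see all five processes (PIDs will differ):

```shell
jps
# Typical output for Hadoop 1.x in pseudo-distributed mode:
# 12345 NameNode
# 12346 DataNode
# 12347 SecondaryNameNode
# 12348 JobTracker
# 12349 TaskTracker
```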
3.11 Tasktracker Status Page
http://localhost:50060/tasktracker.jsp
3.12 Jobtracker Status Page
http://localhost:50030/jobtracker.jsp
3.13 Datanode Status Page
http://localhost:50075
4. Install HBase 0.94.27
4.1 Download hbase-0.94.27.tar.gz and unzip it to /usr/local/hbase, then modify the directory permissions:
sudo chmod -R 777 hbase
4.2 Create the directory /data/hbase/zookeeper/
4.3 Modify the JAVA_HOME configuration in ./conf/hbase-env.sh as follows:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
and uncomment:
export HBASE_MANAGES_ZK=true
4.4 Modify ./conf/hbase-site.xml as follows:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/data/hbase/zookeeper</value>
  </property>
</configuration>
4.5 Delete ./lib/hadoop-core-1.0.4.jar and copy the matching jar from Hadoop:
cp /usr/local/hadoop/hadoop-core-1.2.1.jar ./lib/
4.6 Starting HBase
./bin/start-hbase.sh
4.7 Verifying that HBase starts correctly
Run ./bin/hbase shell to start the HBase shell and execute list; the result should look like this:
hbase(main):002:0> list
TABLE
0 row(s) in 0.0170 seconds
hbase(main):003:0>
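An empty list only shows that the shell can reach the cluster; a slightly stronger smoke test is to create and drop a throwaway table inside the HBase shell (the table name smoke_test is arbitrary):

```shell
# Run inside ./bin/hbase shell, not in bash:
create 'smoke_test', 'cf'    # new table with one column family
list                         # smoke_test should now appear
disable 'smoke_test'         # tables must be disabled before dropping
drop 'smoke_test'
```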
4.8 HBase Master Status page
http://localhost:60010/master-status
5. Install Nutch 2.3
5.1 Download apache-nutch-2.3-src.tar.gz and unzip it to /usr/local/nutch, then modify the directory permissions:
sudo chmod -R 777 nutch
5.2 Modify ./conf/gora.properties by adding the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
5.3 Modify ./conf/nutch-site.xml as follows:
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
</configuration>
5.4 Modify ./ivy/ivy.xml
Change the version of the hadoop-core and hadoop-test dependencies from 1.2.0 to 1.2.1.
Uncomment the gora-hbase dependency as follows:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
5.5 Compile: run ant
6. Install Solr 4.9.1
6.1 Download solr-4.9.1.tar and unzip it to /usr/local/solr, then modify the directory permissions:
sudo chmod -R 777 solr
6.2 Enter /usr/local/solr/example and overwrite Solr's default schema.xml with Nutch's:
cp /usr/local/nutch/runtime/local/conf/schema.xml solr/collection1/conf/schema.xml
6.3 Start Solr:
java -jar start.jar
6.4 Visit http://localhost:8983/solr/#/collection1/query to view the Solr admin page
7. Start a crawl and test the search
7.1 Add crawl URLs
Go to the /usr/local/nutch/runtime/local directory, create a urls directory, and create a url.txt file in it whose content is the seed URL, for example:
http://www.cnbeta.com
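The two steps above can be done from the shell; a minimal sketch, assuming the current directory is /usr/local/nutch/runtime/local:

```shell
# Create the seed directory and a url.txt file with one seed URL
mkdir -p urls
echo "http://www.cnbeta.com" > urls/url.txt
cat urls/url.txt   # prints the seed URL back
```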
7.2 Run the crawl:
./bin/crawl urls TestCrawl http://localhost:8983/solr 2
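For reference, the crawl script in Nutch 2.3 takes four positional arguments; here is the same command annotated (the values are the ones used above):

```shell
# bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
#   seedDir        - directory holding the seed URL files (urls)
#   crawlID        - crawl identifier; Nutch 2.x uses it as the HBase
#                    table prefix, e.g. TestCrawl_webpage
#   solrURL        - Solr instance the pages are indexed into
#   numberOfRounds - generate/fetch/parse/index iterations (2)
./bin/crawl urls TestCrawl http://localhost:8983/solr 2
```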
7.3 Query
You can now search and see the results at http://localhost:8983/solr/#/collection1/query