[Nutch] NUTCH2.3+HADOOP+HBASE+SOLR in Ubuntu Environment


The previous blog post described how to build a Nutch development environment on Windows 10 using Cygwin; this article covers installing Nutch 2.3 on Ubuntu.

1. Required software and its version
    • Ubuntu 15.04
    • Hadoop 1.2.1
    • HBase 0.94.27
    • Nutch 2.3
    • SOLR 4.9.1
2. System Environment Preparation

2.1 Install the Ubuntu operating system

This is a basic requirement with many guides available online; install it yourself, and leave a comment if you run into questions.

2.2 Create a separate kandy user
useradd kandy
2.3 Set a password
passwd kandy
2.4 Grant administrator privileges
vi /etc/sudoers

Add a line granting the kandy user full sudo rights:

kandy ALL=(ALL:ALL) ALL
2.5 Ensure that localhost is mapped to 127.0.0.1
vi /etc/hosts

Add the following content:

127.0.0.1 localhost
2.6 Set up passwordless login so that SSH to localhost works without a password

2.6.1 Generate a key

Use the following command:

ssh-keygen -t rsa

Accept all the default options.

2.6.2 Copy the key

Use the following command:

cp  .ssh/id_rsa.pub  .ssh/authorized_keys
2.6.3 Verification

Enter the following command:

ssh localhost
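The three steps above can also be run non-interactively; a sketch, assuming OpenSSH and that you don't mind (re)writing ~/.ssh/id_rsa:

```shell
# Passwordless-SSH setup in one go; -N "" gives the empty passphrase
# that "accept all defaults" above implies
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys   # sshd ignores group/world-writable files
```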
2.7 Ensure that the JDK is installed and the JAVA_HOME environment variable is configured, for example:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
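To make the variable persist across logins, one option (a sketch assuming a Bash shell; the JDK path is the example value above) is to append it to ~/.bashrc:

```shell
# Persist JAVA_HOME for future shells (substitute your real JDK path)
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> ~/.bashrc
echo 'export PATH=${JAVA_HOME}/bin:${PATH}' >> ~/.bashrc
```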
3 Install Hadoop 1.2.1

3.1 Download hadoop-1.2.1.tar.gz, unzip it to /usr/local/hadoop, and modify the directory permissions:
sudo chmod -R 777 /usr/local/hadoop
3.2 Create the /data/hadoop-data directory

3.3 Modify ./conf/core-site.xml as follows:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop-data/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
3.4 Modify the JAVA_HOME setting in conf/hadoop-env.sh as follows:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
3.5 Modify the ./conf/hdfs-site.xml configuration as follows:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
3.6 Modify the ./conf/mapred-site.xml configuration as follows:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
3.7 Configure environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export PATH=${HADOOP_PREFIX}/bin:${PATH}
3.8 Formatting Namenode
hadoop namenode -format
3.9 Starting Hadoop
start-all.sh
3.10 Check if Hadoop starts correctly
hadoop fs -ls /
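Besides listing HDFS, the JDK's jps tool gives a quick view of which daemons came up; a sketch, assuming jps is on the PATH:

```shell
# After start-all.sh, a healthy pseudo-distributed Hadoop 1.x node
# shows five daemons: NameNode, DataNode, SecondaryNameNode,
# JobTracker and TaskTracker (plus Jps itself)
jps
```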
3.11 Tasktracker Status Page

http://localhost:50060/tasktracker.jsp

3.12 Jobtracker Status Page

http://localhost:50030/jobtracker.jsp

3.13 Datanode Status Page

http://localhost:50075

4 Install HBase 0.94.27

4.1 Download hbase-0.94.27.tar.gz, unzip it to /usr/local/hbase, and modify the directory permissions:
sudo chmod -R 777 /usr/local/hbase
4.2 Create the directory /data/hbase/zookeeper/

4.3 Modify the JAVA_HOME setting in ./conf/hbase-env.sh as follows:
JAVA_HOME=/usr/lib/jvm/java-8-oracle

and uncomment the line:

export HBASE_MANAGES_ZK=true
4.4 Modify ./conf/hbase-site.xml as follows:
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/data/hbase/zookeeper</value>
    </property>
</configuration>
4.5 Delete ./lib/hadoop-core-1.0.4.jar and copy the matching jar from Hadoop:
cp /usr/local/hadoop/hadoop-core-1.2.1.jar ./lib/
4.6 Starting HBase
./bin/start-hbase.sh
4.7 Verifying that HBase starts correctly

Run ./bin/hbase shell to start the HBase terminal and execute list; the result should look like this:

hbase(main):002:0> list
TABLE
0 row(s) in 0.0170 seconds
hbase(main):003:0>
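A slightly stronger smoke test (a sketch, run from /usr/local/hbase with HBase up; the table name smoke_test is arbitrary) is to pipe commands into the shell:

```shell
# Create a throwaway table, confirm it appears in list, then remove it
echo "create 'smoke_test', 'cf'" | ./bin/hbase shell
echo "list"                      | ./bin/hbase shell
echo "disable 'smoke_test'"      | ./bin/hbase shell
echo "drop 'smoke_test'"         | ./bin/hbase shell
```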
4.8 HBase Master Status page

http://localhost:60010/master-status

5 Install Nutch 2.3

5.1 Download apache-nutch-2.3-src.tar.gz, unzip it to /usr/local/nutch, and modify the directory permissions:
sudo chmod -R 777 /usr/local/nutch
5.2 Modify ./conf/gora.properties by adding the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
5.3 Modify ./conf/nutch-site.xml as follows:
<configuration>
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
    </property>
    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
    </property>
</configuration>
5.4 Modify ./ivy/ivy.xml

    • Change the version of the hadoop-core and hadoop-test dependencies from 1.2.0 to 1.2.1
    • Uncomment the gora-hbase dependency as follows:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
5.5 Compile with ant

6 Install Solr 4.9.1

6.1 Download solr-4.9.1.tgz, unzip it to /usr/local/solr, and modify the directory permissions:
sudo chmod -R 777 /usr/local/solr
6.2 Enter /usr/local/solr/example and overwrite Solr's schema.xml with Nutch's by executing:
cp /usr/local/nutch/runtime/local/conf/schema.xml solr/collection1/conf/schema.xml
6.3 Start Solr
java -jar start.jar
6.4 Visit http://localhost:8983/solr/#/collection1/query to view the Solr page

7 Start the crawl and test the search

7.1 Add crawl URLs

Go to the /usr/local/nutch/runtime/local directory, create a urls directory, and inside it create a url.txt file whose content is the seed URL, for example:

http://www.cnbeta.com
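The step above can be scripted; a sketch, run from /usr/local/nutch/runtime/local (the seed URL is the example one from the text):

```shell
# Create the seed directory and file that bin/crawl reads
mkdir -p urls
echo "http://www.cnbeta.com" > urls/url.txt
cat urls/url.txt   # sanity-check the seed list
```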
7.2 Execution
./bin/crawl urls TestCrawl http://localhost:8983/solr 2
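For reference, the positional arguments of the Nutch 2.x crawl script above are annotated below (the table-naming note reflects how Gora's HBase store names things, stated here as an aside):

```shell
# bin/crawl <seedDir> <crawlId> <solrUrl> <numberOfRounds>
#   urls                       -> directory holding the seed list
#   TestCrawl                  -> crawl ID; Gora stores fetched pages in the
#                                 HBase table TestCrawl_webpage
#   http://localhost:8983/solr -> Solr instance to index into
#   2                          -> generate/fetch/parse/index rounds to run
./bin/crawl urls TestCrawl http://localhost:8983/solr 2
```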
7.3 Queries

You can see the results by searching at http://localhost:8983/solr/#/collection1/query
