The previous blog post described setting up a development environment on Windows 10 with Cygwin to build Nutch; this article covers installing Nutch 2.3 on Ubuntu.
1. Required software and its version
- Ubuntu 15.04
- Hadoop 1.2.1
- HBase 0.94.27
- Nutch 2.3
- SOLR 4.9.1
2. System Environment Preparation
2.1 Install the Ubuntu operating system
This is a basic prerequisite and there are many guides online, so install it yourself; leave a comment if you run into problems.
2.2 Create a separate kandy user
useradd kandy
2.3 Setting a password
passwd kandy
2.4 Turn on Administrator privileges
vi /etc/sudoers
Add a line:
kandy ALL=(ALL:ALL) ALL
2.5 Ensure that localhost is mapped to 127.0.0.1
vi /etc/hosts
Add the following content:
127.0.0.1 localhost
2.6 Set up passwordless SSH login, so that ssh localhost works without a password
2.6.1 Generate a key
Use the following command:
ssh-keygen -t rsa
Accept all the default options (press Enter at each prompt).
2.6.2 Copying keys
Use the following command:
cp .ssh/id_rsa.pub .ssh/authorized_keys
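If ssh still asks for a password after this, the usual cause is file permissions: sshd silently ignores key files that are group- or world-readable. A hedged fix:

```shell
# sshd requires restrictive permissions on the .ssh directory and key
# files; without them the authorized_keys file is silently ignored
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys ~/.ssh/id_rsa
```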
2.6.3 Verification
Enter the following command:
ssh localhost
2.7 Ensure that the JDK is installed and the JAVA_HOME environment variable is configured, for example:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
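To make this setting survive new shell sessions, it can be persisted in the user's profile; a minimal sketch, assuming the Oracle JDK 8 path shown above (adjust it to your actual JDK location):

```shell
# Append JAVA_HOME to ~/.bashrc so it is set in every new shell
# (the JDK path is an example; adjust it to your installation)
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> ~/.bashrc
echo 'export PATH=${JAVA_HOME}/bin:${PATH}' >> ~/.bashrc
source ~/.bashrc   # reload the profile in the current shell
```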
3. Install Hadoop 1.2.1
3.1 Download hadoop-1.2.1.tar.gz and unzip it to /usr/local/hadoop, then modify the directory permissions:
sudo chmod -R 777 hadoop
3.2 Create the /data/hadoop-data directory
3.3 Modify ./conf/core-site.xml as follows:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-data/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
3.4 Modify the JAVA_HOME configuration in conf/hadoop-env.sh as follows:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
3.5 Modify the./conf/hdfs-site.xml configuration as follows:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
3.6 Modify the./conf/mapred-site.xml configuration as follows:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
3.7 Configuring environment variables
export HADOOP_PREFIX=/usr/local/hadoop
export PATH=${HADOOP_PREFIX}/bin:${PATH}
3.8 Formatting Namenode
hadoop namenode -format
3.9 Starting Hadoop
start-all.sh
3.10 Check if Hadoop starts correctly
hadoop fs -ls /
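Besides listing HDFS, you can check that all the daemons are running with jps; on a healthy pseudo-distributed Hadoop 1.x node you should see all five processes (PIDs will differ):

```shell
jps
# Typical output for Hadoop 1.x in pseudo-distributed mode:
# 12345 NameNode
# 12346 DataNode
# 12347 SecondaryNameNode
# 12348 JobTracker
# 12349 TaskTracker
```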
3.11 Tasktracker Status Page
http://localhost:50060/tasktracker.jsp
3.12 Jobtracker Status Page
http://localhost:50030/jobtracker.jsp
3.13 Datanode Status Page
http://localhost:50075
4. Install HBase 0.94.27
4.1 Download hbase-0.94.27.tar.gz and unzip it to /usr/local/hbase, then modify the directory permissions:
sudo chmod -R 777 hbase
4.2 Create the directory /data/hbase/zookeeper/
4.3 Modify the JAVA_HOME configuration in ./conf/hbase-env.sh as follows:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
and uncomment:
export HBASE_MANAGES_ZK=true
4.4 Modify ./conf/hbase-site.xml as follows:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/data/hbase/zookeeper</value>
  </property>
</configuration>
4.5 Delete ./lib/hadoop-core-1.0.4.jar and copy the matching jar from Hadoop:
cp /usr/local/hadoop/hadoop-core-1.2.1.jar ./lib/
4.6 Starting HBase
./bin/start-hbase.sh
4.7 Verifying that HBase starts correctly
Run ./bin/hbase shell to start the HBase shell and execute list; the result should look like this:
hbase(main):002:0> list
TABLE
0 row(s) in 0.0170 seconds
hbase(main):003:0>
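An empty list only shows that the shell can reach the cluster; a slightly stronger smoke test is to create and drop a throwaway table inside the HBase shell (the table name smoke_test is arbitrary):

```shell
# Run inside ./bin/hbase shell, not in bash:
create 'smoke_test', 'cf'    # new table with one column family
list                         # smoke_test should now appear
disable 'smoke_test'         # tables must be disabled before dropping
drop 'smoke_test'
```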
4.8 HBase Master Status page
http://localhost:60010/master-status
5. Install Nutch 2.3
5.1 Download apache-nutch-2.3-src.tar.gz and unzip it to /usr/local/nutch, then modify the directory permissions:
sudo chmod -R 777 nutch
5.2 Modify ./conf/gora.properties by adding the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
5.3 Modify ./conf/nutch-site.xml as follows:
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
</configuration>
5.4 Modify ./ivy/ivy.xml
Change the version of the hadoop-core and hadoop-test dependencies from 1.2.0 to 1.2.1.
Uncomment the gora-hbase dependency as follows:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
5.5 Compile: run ant
6. Install Solr 4.9.1
6.1 Download solr-4.9.1.tar and unzip it to /usr/local/solr, then modify the directory permissions:
sudo chmod -R 777 solr
6.2 Enter /usr/local/solr/example and overwrite Solr's default schema.xml with Nutch's:
cp /usr/local/nutch/runtime/local/conf/schema.xml solr/collection1/conf/schema.xml
6.3 Start Solr:
java -jar start.jar
6.4 Visit http://localhost:8983/solr/#/collection1/query to view the Solr admin page
7. Start a crawl and test the search
7.1 Add crawl URLs
Go to the /usr/local/nutch/runtime/local directory, create a urls directory, and create a url.txt file in it whose content is the seed URL, for example:
http://www.cnbeta.com
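The two steps above can be done from the shell; a minimal sketch, assuming the current directory is /usr/local/nutch/runtime/local:

```shell
# Create the seed directory and a url.txt file with one seed URL
mkdir -p urls
echo "http://www.cnbeta.com" > urls/url.txt
cat urls/url.txt   # prints the seed URL back
```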
7.2 Run the crawl:
./bin/crawl urls TestCrawl http://localhost:8983/solr 2
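For reference, the crawl script in Nutch 2.3 takes four positional arguments; here is the same command annotated (the values are the ones used above):

```shell
# bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
#   seedDir        - directory holding the seed URL files (urls)
#   crawlID        - crawl identifier; Nutch 2.x uses it as the HBase
#                    table prefix, e.g. TestCrawl_webpage
#   solrURL        - Solr instance the pages are indexed into
#   numberOfRounds - generate/fetch/parse/index iterations (2)
./bin/crawl urls TestCrawl http://localhost:8983/solr 2
```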
7.3 Query
You can now search and see the results at http://localhost:8983/solr/#/collection1/query