NUTCH2.3 hadoop2.7.1 hbase1.0.1.1 solr5.2.1 deployment (3): hadoop2.7 installation
Precondition:
hadoop 2.7.1
hbase 0.98.13
solr 5.2.1 / Apache Solr 4.8.1
http://archive.apache.org/dist/lucene/solr/4.8.1/
gora 0.6.1
Building Gora, then building and deploying Nutch
1. Download Gora
The latest Gora release is 0.6.1. Download it, or fetch the source directly with git:
git clone https://github.com/apache/gora.git
2. Edit the Gora pom.xml
The settings below are probably the key to getting Nutch 2.3 to finally start and run. Note that there is no 1.0.1.1-hadoop2 version :)
<hadoop-1.version>1.2.1</hadoop-1.version>
<hadoop-2.version>2.7.1</hadoop-2.version>
<hadoop-1.test.version>1.2.1</hadoop-1.test.version>
<hadoop-2.test.version>2.7.1</hadoop-2.test.version>
<hbase.version>0.98.13-hadoop2</hbase.version>
<hbase.test.version>0.98.13-hadoop2</hbase.test.version>
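Before building, it can help to confirm that the edited values are really what Maven will see. The helper below is a minimal sketch: `pom_property` is a hypothetical name, the pom.xml path in the usage comment is an assumption, and it only handles properties whose opening and closing tags are on the same line.

```shell
#!/usr/bin/env bash
# Print the value of one <name>value</name> property from a pom.xml.
# Assumption: the opening and closing tag are on the same line.
pom_property() {
  local pom="$1" name="$2"
  sed -n "s:.*<${name}>\([^<]*\)</${name}>.*:\1:p" "$pom" | head -n 1
}

# Usage (the path is an assumption, adjust to your checkout):
#   pom_property /disk/gora/pom.xml hbase.version
#   pom_property /disk/gora/pom.xml hadoop-2.version
```

If either call prints something other than the versions listed above, the edit did not land in the file Maven is reading.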
3. Build Gora
mvn clean install -DskipTests
mvn install -DskipTests
4. Edit $NUTCH_HOME/conf/nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
5. Edit $NUTCH_HOME/ivy/ivy.xml
Change rev to 0.6 on every "org.apache.gora" dependency, for example:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
=>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />
Remove the "org.apache.hadoop" dependencies and add:
<dependency org="org.apache.hadoop" name="hadoop-client" rev="2.7.1" conf="*->default"/>
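The rev change can also be done mechanically instead of by hand. This is only a sketch: `bump_gora_rev` is a hypothetical helper, it writes the result to stdout rather than editing in place, and it assumes each dependency element sits on a single line as in the stock ivy.xml.

```shell
#!/usr/bin/env bash
# Rewrite rev="..." on every line that declares an org.apache.gora dependency.
# Other dependencies are passed through untouched. Output goes to stdout.
bump_gora_rev() {
  local ivy="$1" rev="$2"
  sed "/org=\"org\.apache\.gora\"/ s/rev=\"[^\"]*\"/rev=\"${rev}\"/" "$ivy"
}

# Usage: review the result before replacing the original file.
#   bump_gora_rev "$NUTCH_HOME/ivy/ivy.xml" 0.6 > /tmp/ivy.xml.new
#   diff "$NUTCH_HOME/ivy/ivy.xml" /tmp/ivy.xml.new
```

Commented-out gora dependencies in ivy.xml are rewritten too, which is harmless but worth knowing when reading the diff.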
6. Edit $NUTCH_HOME/ivy/ivysettings.xml
<ivysettings>
  <settings defaultResolver="default"/>
  <property name="m2-pattern" value="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]" override="false" />
  <resolvers>
    <chain name="default">
      <filesystem name="local-maven2" m2compatible="true">
        <artifact pattern="${m2-pattern}"/>
        <ivy pattern="${m2-pattern}"/>
      </filesystem>
      <ibiblio name="central" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>
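With this resolver chain, Ivy looks in the local Maven repository first, so the Gora jars installed by the mvn step must be where the m2-pattern points. The sketch below computes that location for a jar without a classifier; `m2_artifact_path` is a hypothetical helper and the default repository path is the usual ~/.m2/repository.

```shell
#!/usr/bin/env bash
# Build the path that the m2-pattern resolves to for a plain jar artifact.
# With m2compatible="true", dots in the organisation become directory separators.
m2_artifact_path() {
  local org="$1" module="$2" rev="$3" repo="${4:-$HOME/.m2/repository}"
  local org_dir
  org_dir="$(printf '%s' "$org" | tr . /)"
  printf '%s/%s/%s/%s/%s-%s.jar\n' "$repo" "$org_dir" "$module" "$rev" "$module" "$rev"
}

# Usage: check that the locally built gora-hbase jar is visible to Ivy.
#   ls -l "$(m2_artifact_path org.apache.gora gora-hbase 0.6)"
```

If the file is missing, Ivy falls through to the central resolver and may pull a Gora build that was not compiled against the Hadoop and HBase versions set in the pom.xml.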
7. Add to $NUTCH_HOME/conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
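Since this step tends to be repeated while debugging, an append that does nothing when the line is already present avoids duplicate entries. A minimal sketch, with `set_gora_datastore` as a hypothetical helper name; it assumes the file, if it exists, ends with a newline.

```shell
#!/usr/bin/env bash
# Append the HBase datastore setting to gora.properties unless the exact line
# is already there (grep -x matches whole lines, -F treats it as a fixed string).
set_gora_datastore() {
  local props="$1"
  local line='gora.datastore.default=org.apache.gora.hbase.store.HBaseStore'
  grep -qxF "$line" "$props" 2>/dev/null || printf '%s\n' "$line" >> "$props"
}

# Usage:
#   set_gora_datastore "$NUTCH_HOME/conf/gora.properties"
```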
8. Edit $NUTCH_HOME/conf/regex-urlfilter.txt and $NUTCH_HOME/conf/nutch-default.xml as needed
These can be left unchanged.
9. Build; this takes a long time
ant runtime
10. Copy the hadoop*.jar files from the Gora tree into runtime/local/lib/
cp /disk/gora/gora-core/lib/hadoop* /disk2/nutch/nutch-2.3/runtime/local/lib/
11. Create the seed URL
mkdir urls
echo http://nutch.apache.org/ >> urls/seek.txt
12. Test run
cd runtime/local/
bin/nutch inject urls/seek.txt
solr5.2.1 deployment and startup
1. Download and extract
2. example/example-DIH contains a complete solr home configuration; copy it to server/solr
cp -rf /disk2/solr/solr-5.2.1/example/example-DIH/solr/* /disk2/solr/solr-5.2.1/server/solr/
3. Fix an error Nutch may hit while running: Error 404: Prob accessing /solr/solr/update. Reason: Not Found
cd /disk2/solr/solr-5.2.1/server/solr
cp /disk2/solr/solr-5.2.1/example/exampledocs/monitor.xml .
curl http://127.0.0.1:8983/solr/solr/update --data-binary @monitor.xml -H 'Content-type:application/xml'
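The 404 is about the path /solr/<core>/update, so it is worth probing that exact URL before starting a crawl. The sketch below uses hypothetical helper names and only standard curl options; it reports the HTTP status and treats 404 as "core path not found". What other status codes mean for a bare GET depends on the Solr setup, so they are only printed, not interpreted.

```shell
#!/usr/bin/env bash
# Build the update URL for a core from the Solr base URL and the core name.
solr_update_url() {
  local base="${1%/}" core="$2"
  printf '%s/%s/update\n' "$base" "$core"
}

# Probe the update URL and report the HTTP status.
# curl prints 000 as the status when it cannot connect at all.
check_solr_update() {
  local url code
  url="$(solr_update_url "$1" "$2")"
  code="$(curl -s -o /dev/null -w '%{http_code}' "$url")"
  echo "$url -> HTTP $code"
  [ "$code" != "404" ] && [ "$code" != "000" ]
}

# Usage (the core name "solr" matches the URL used in this post):
#   check_solr_update http://127.0.0.1:8983/solr solr
```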
4. For the nutch crawl to run, also edit /disk2/solr/solr-5.2.1/server/solr/solr/conf/schema.xml and add:
<field name="host" type="string" stored="false" indexed="true"/>
<field name="site" type="string" stored="false" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="stamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
5. bin/solr start
6. Open http://192.168.1.106:8983/solr
7. bin/crawl urls/seek.txt TestCrawl http://192.168.1.106:8983/solr/solr 2
FAQ
Below are the infuriating problems hit along the way...
1. Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob:
ant runtime has not been run.
2. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
nutch2.3 needs several jars from hbase 0.98.13: hbase-common*.jar / hbase-client*.jar / hbase-protocol*.jar. Do not use the ones from hbase1.0.1.1 under any circumstances.
cd /disk2/hbase/hbase-0.98.13-hadoop2/lib
cp hbase-common* /disk2/nutch/nutch-2.3/runtime/local/lib/
cp hbase-client-0.98.13-hadoop2.jar /disk2/nutch/nutch-2.3/runtime/local/lib/
cp hbase-protocol* /disk2/nutch/nutch-2.3/runtime/local/lib/
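After copying, leftover jars from another HBase version in the same lib directory can still cause the errors in this FAQ. The sketch below lists any hbase-*.jar whose file name does not carry the expected version string; `find_mismatched_hbase_jars` is a hypothetical helper and it only looks at file names, not jar contents.

```shell
#!/usr/bin/env bash
# Print the names of hbase-*.jar files in a directory whose file name does not
# contain "-<expected version>". Prints nothing when all jars match.
find_mismatched_hbase_jars() {
  local lib="$1" want="$2" jar
  for jar in "$lib"/hbase-*.jar; do
    [ -e "$jar" ] || continue          # glob matched nothing
    case "$(basename "$jar")" in
      *"-${want}"*) ;;                 # expected version, ignore
      *) basename "$jar" ;;            # some other version, report it
    esac
  done
}

# Usage:
#   find_mismatched_hbase_jars /disk2/nutch/nutch-2.3/runtime/local/lib 0.98.13-hadoop2
```

Any jar this prints is a candidate for removal from runtime/local/lib.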
3. Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT
Same cause as above: the hbase and nutch versions do not match.
4. 2015-07-21 13:53:53,238 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Just give it the native libraries:
mkdir -p /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
cd /disk2/hadoop/hadoop-2.7.1/lib/native/
cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/
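To confirm the copy put the native library where expected, a quick file check is enough. This is a sketch with a hypothetical helper name; it only checks that some libhadoop.so* file exists in the directory, not that the JVM can actually load it.

```shell
#!/usr/bin/env bash
# Succeed if the directory contains at least one libhadoop.so* file.
has_native_hadoop() {
  local dir="$1" f
  for f in "$dir"/libhadoop.so*; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Usage:
#   has_native_hadoop /disk2/nutch/nutch-2.3/runtime/local/lib/native \
#     && echo "native library present" || echo "native library missing"
```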