NUTCH2.3 hadoop2.7.1 hbase1.0.1.1 solr5.2.1 deployment (3): installing hadoop2.7


Prerequisites:

hadoop 2.7.1
hbase 0.98.13
solr 5.2.1 / Apache Solr 4.8.1
http://archive.apache.org/dist/lucene/solr/4.8.1/
gora 0.6.1


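Building Gora, then building and deploying Nutch

1. Download Gora

The latest Gora release is 0.6.1. Download it, or fetch the source directly with git:

git clone https://github.com/apache/gora.git

2. Edit Gora's pom.xml

These settings are probably the key to getting Nutch 2.3 to run in the end. There is no 1.0.1.1-hadoop2 HBase artifact, so the HBase version is set to 0.98.13-hadoop2 :)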

<hadoop-1.version>1.2.1</hadoop-1.version>
<hadoop-2.version>2.7.1</hadoop-2.version>
<hadoop-1.test.version>1.2.1</hadoop-1.test.version>
<hadoop-2.test.version>2.7.1</hadoop-2.test.version>
<hbase.version>0.98.13-hadoop2</hbase.version>
<hbase.test.version>0.98.13-hadoop2</hbase.test.version>
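
The same edit can be scripted instead of done by hand. The following is only a sketch: `set_pom_property` is a hypothetical helper (not part of Gora or Maven), and it assumes each property sits on a single line of pom.xml and that the value contains no `|` or `&` characters.

```shell
# set_pom_property FILE NAME VALUE
# Rewrites <NAME>...</NAME> in FILE so that it holds VALUE.
# Hypothetical helper; assumes the whole element is on one line.
set_pom_property() {
  local file="$1"
  local name="$2"
  local value="$3"
  local pattern
  local tmp
  # Escape the dots in the property name so sed matches them literally.
  pattern="$(printf '%s' "$name" | sed 's/\./\\./g')"
  tmp="$(mktemp)" || return 1
  if sed "s|<${pattern}>[^<]*</${pattern}>|<${name}>${value}</${name}>|" "$file" > "$tmp"; then
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Usage (run from the Gora checkout; versions are the ones listed above):
#   set_pom_property pom.xml hadoop-2.version 2.7.1
#   set_pom_property pom.xml hbase.version 0.98.13-hadoop2
```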

3. Build Gora

mvn clean install -DskipTests
mvn install -DskipTests

4. Edit $NUTCH_HOME/conf/nutch-site.xml

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

5. Edit $NUTCH_HOME/ivy/ivy.xml

Change the rev of every "org.apache.gora" dependency to 0.6, for example:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
=>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />

Remove the "org.apache.hadoop" dependencies and add:

<dependency org="org.apache.hadoop" name="hadoop-client" rev="2.7.1" conf="*->default"/> 
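
Because several Gora modules are listed in ivy.xml, the rev change can be applied in one pass. This is a sketch under the assumption that each opening `<dependency ...>` tag is on a single line; `bump_gora_rev` is a hypothetical helper name, and the result should be reviewed by hand before building.

```shell
# bump_gora_rev FILE REV
# Sets rev="REV" on every line of FILE that names org="org.apache.gora".
# Hypothetical helper; other dependencies are left untouched.
bump_gora_rev() {
  local file="$1"
  local rev="$2"
  local tmp
  tmp="$(mktemp)" || return 1
  # The leading /.../ address restricts the substitution to Gora lines.
  if sed "/org=\"org\.apache\.gora\"/ s/rev=\"[^\"]*\"/rev=\"${rev}\"/" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Usage:
#   bump_gora_rev "$NUTCH_HOME/ivy/ivy.xml" 0.6
```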

6. Edit $NUTCH_HOME/ivy/ivysettings.xml

<ivysettings>
    <settings defaultResolver="default"/>
    <property name="m2-pattern" value="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]" override="false" />
    <resolvers>
        <chain name="default">
            <filesystem name="local-maven2" m2compatible="true" >
                <artifact pattern="${m2-pattern}"/>
                <ivy pattern="${m2-pattern}"/>
            </filesystem>
            <ibiblio name="central" m2compatible="true"/>
        </chain>
    </resolvers>
</ivysettings>

7. Add to $NUTCH_HOME/conf/gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
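
If this step is rerun, appending blindly would duplicate the line. A minimal sketch of an idempotent append follows; `ensure_line` is a hypothetical helper, not something Nutch or Gora ships.

```shell
# ensure_line FILE LINE
# Appends LINE to FILE unless an identical line is already present.
ensure_line() {
  local file="$1"
  local line="$2"
  # -x matches whole lines only, -F treats LINE as a fixed string (no regex).
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage:
#   ensure_line "$NUTCH_HOME/conf/gora.properties" \
#     'gora.datastore.default=org.apache.gora.hbase.store.HBaseStore'
```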

8. Edit $NUTCH_HOME/conf/regex-urlfilter.txt and $NUTCH_HOME/conf/nutch-default.xml as needed

Both can be left unchanged.

9. Build; this takes a long time

ant runtime

10. Copy the hadoop*.jar files from Gora into runtime/local/lib/

cp /disk/gora/gora-core/lib/hadoop* /disk2/nutch/nutch-2.3/runtime/local/lib/

11. Create the seed URL list

mkdir urls
echo http://nutch.apache.org/ >> urls/seek.txt

12. Test run

cd runtime/local/

bin/nutch inject urls/seek.txt


Deploying and running solr5.2.1

1. Download and unpack

2. example/example-DIH contains a complete Solr home configuration; copy it to server/solr

cp -rf /disk2/solr/solr-5.2.1/example/example-DIH/solr/* /disk2/solr/solr-5.2.1/server/solr/

3. Fix an error Nutch may hit at run time: Error 404: Prob accessing /solr/solr/update. Reason: Not Found

cd /disk2/solr/solr-5.2.1/server/solr

cp /disk2/solr/solr-5.2.1/example/exampledocs/monitor.xml .

curl http://127.0.0.1:8983/solr/solr/update --data-binary @monitor.xml -H 'Content-type:application/xml'
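
Before starting a crawl it can help to confirm that the update handler path answers at all. The sketch below is an assumption-laden convenience, not part of Solr or Nutch: `solr_update_url` and `solr_update_status` are hypothetical helper names, and the base URL and core name are the ones used in this post.

```shell
# solr_update_url BASE CORE
# Prints the update handler URL for CORE under the Solr base URL BASE.
solr_update_url() {
  local base="${1%/}"   # drop one trailing slash, if any
  printf '%s/%s/update\n' "$base" "$2"
}

# solr_update_status BASE CORE
# Prints the HTTP status code returned for the update handler URL.
# curl prints 000 when it cannot connect at all.
solr_update_status() {
  curl -s -o /dev/null -w '%{http_code}' "$(solr_update_url "$1" "$2")"
}

# Usage (assumes Solr is already running on this host):
#   solr_update_status http://127.0.0.1:8983/solr solr
```

A 404 here means the path still cannot be found, so the crawl's indexing step would be expected to fail the same way.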

4. For nutch crawl to run, also edit /disk2/solr/solr-5.2.1/server/solr/solr/conf/schema.xml and add:

<field name="host" type="string" stored="false" indexed="true"/>
<field name="site" type="string" stored="false" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="stamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>

5. bin/solr start

6. Open http://192.168.1.106:8983/solr

7. bin/crawl urls/seek.txt TestCrawl http://192.168.1.106:8983/solr/solr 2


FAQ

Below are the infuriating problems I ran into along the way...

1. Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob:
Cause: ant runtime has not been run.

2. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

nutch2.3 needs several jars from hbase 0.98.13: hbase-common*.jar / hbase-client*.jar / hbase-protocol*.jar. Never use the ones from hbase1.0.1.1.
cd /disk2/hbase/hbase-0.98.13-hadoop2/lib
cp hbase-common* /disk2/nutch/nutch-2.3/runtime/local/lib/

cp hbase-client-0.98.13-hadoop2.jar /disk2/nutch/nutch-2.3/runtime/local/lib/
cp hbase-protocol* /disk2/nutch/nutch-2.3/runtime/local/lib/
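
After copying, it is worth checking that no jars from another HBase version are left in the lib directory, since a mix of versions produces errors like this one. A minimal sketch, assuming HBase jars are named hbase-*.jar and carry the version in the file name; `list_foreign_hbase_jars` is a hypothetical helper.

```shell
# list_foreign_hbase_jars LIB_DIR VERSION
# Prints every hbase-*.jar in LIB_DIR whose file name does not contain VERSION.
list_foreign_hbase_jars() {
  local lib_dir="$1"
  local version="$2"
  local jar
  for jar in "$lib_dir"/hbase-*.jar; do
    [ -e "$jar" ] || continue          # the glob matched nothing
    case "$(basename "$jar")" in
      *"$version"*) ;;                 # expected version, stay quiet
      *) printf '%s\n' "$jar" ;;       # some other HBase version
    esac
  done
}

# Usage:
#   list_foreign_hbase_jars /disk2/nutch/nutch-2.3/runtime/local/lib 0.98.13-hadoop2
```

Empty output means every hbase-*.jar in the directory matches the expected version.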

3. Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT
Same cause as above: the hbase and nutch versions do not match.

4. 2015-07-21 13:53:53,238 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Just give it the native libraries:
mkdir -p /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
cd /disk2/hadoop/hadoop-2.7.1/lib/native/
cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/




Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
