solrUrl is not set, indexing will be skipped...
crawl started in: crwal
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 2
Injector: starting at 2012-04-20 14:39:30
Injector: crawlDb: crwal/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 10 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 13 more
Caused by: java.lang.IllegalArgumentException: plugin.folders is not defined
at org.apache.nutch.plugin.PluginManifestParser.parsePluginFolder(PluginManifestParser.java:78)
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:72)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:117)
at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
... 18 more
12/04/20 10:14:44 INFO mapred.JobClient: map 0% reduce 0%
12/04/20 10:14:44 INFO mapred.JobClient: Job complete: job_local_0001
12/04/20 10:14:44 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
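First of all, don't blame me for pasting so much error output; it is only there so that people searching for it can find this page more easily.
The fix is to change the following property in nutch-default.xml: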
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
Just change the part marked in red, the <value>, and it works.
Good luck, everyone.
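To confirm which value is actually in the file, a small stand-alone check can help. The following is only a sketch, not part of Nutch: the class name PluginFoldersCheck and the default path conf/nutch-default.xml are assumptions for illustration, and it only tests whether each listed folder exists relative to the current working directory, which is a rough approximation of how Nutch resolves relative paths.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Nutch class): reads a Hadoop-style config file
// and reports the value of plugin.folders.
public class PluginFoldersCheck {

    /** Returns the <value> of the named <property>, or null if it is absent. */
    public static String findProperty(InputStream xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; pass another path as the first argument.
        Path conf = Paths.get(args.length > 0 ? args[0] : "conf/nutch-default.xml");
        String value;
        try (InputStream in = Files.newInputStream(conf)) {
            value = findProperty(in, "plugin.folders");
        }
        if (value == null || value.isEmpty()) {
            System.out.println("plugin.folders is not defined in " + conf);
            return;
        }
        // The value may list several folders separated by commas.
        for (String folder : value.split(",")) {
            Path dir = Paths.get(folder.trim());
            String status = Files.isDirectory(dir)
                    ? "found"
                    : "not found from " + Paths.get("").toAbsolutePath();
            System.out.println(folder.trim() + " -> " + status);
        }
    }
}
```

Run it from the project root, since a relative value such as ./src/plugin is checked against the directory the program is started in.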
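As a supplement, here are the steps for running Nutch in Eclipse. It took me a whole day to get it working, and I owe thanks to my classmate Beibei. Haha.
http://wiki.apache.org/nutch/RunNutchInEclipse (the authoritative English guide)
Preparation
1. Install the Subclipse plugin, the IvyDE plugin, and the Maven plugin.
2. Check out the code from https://svn.apache.org/repos/asf/nutch/trunk
3. Remove src as a source folder, then add src/bin, src/java, src/test, src/testsource, src/plugin/xx/src/java, and src/plugin/xx/src/test as source folders (xx stands for each plugin's name).
4. Add the two JAR files; the English guide says which ones, and it is easy enough to follow.
5. On the Libraries tab, click Add Class Folder on the right and select Nutch's conf folder.
6. Still on the Libraries tab, click Add Library > IvyDE Managed Dependencies and select ivy/ivy.xml.
7. Run ant on build.xml.
8. Refresh the Nutch project; with nutch-site.xml and regex-urlfilter.xml added under conf, configure their contents.
9. In nutch-default.xml, set plugin.folders as follows: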
<property>
<name>plugin.folders</name>
<value>./src/plugin</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
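This step is crucial.
10. Create a folder named urls in the project root, put a file named seed.txt in it, and list the URLs of the pages to crawl in seed.txt.
11. Build again with build.xml (ant).
12. Run.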
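Step 10 can also be done programmatically. The sketch below is only an illustration: the class name SeedFileSetup is made up, and the seed URL is a placeholder to be replaced with the pages you want to crawl. It creates urls/seed.txt under a given project root with one URL per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Nutch class): prepares the seed list from step 10.
public class SeedFileSetup {

    /** Creates <projectRoot>/urls/seed.txt containing one URL per line. */
    public static Path writeSeeds(Path projectRoot, List<String> urls) throws IOException {
        Path dir = projectRoot.resolve("urls");
        Files.createDirectories(dir); // no error if the folder already exists
        Path seed = dir.resolve("seed.txt");
        Files.write(seed, urls, StandardCharsets.UTF_8); // overwrites an existing file
        return seed;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace it with the pages you want to crawl.
        Path seed = writeSeeds(Paths.get("."), Arrays.asList("http://nutch.apache.org/"));
        System.out.println("Wrote " + seed.toAbsolutePath());
    }
}
```

For step 12, judging from the stack trace and the parameters printed at the top of the log, the run configuration presumably uses org.apache.nutch.crawl.Crawl as the main class with program arguments along the lines of `urls -dir crawl -depth 2 -topN 2`; treat these as an assumption and adjust them to your own setup.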