Installing Nutch 0.9 on Windows [repost]

Last updated: 2018-12-05. Source: Internet


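I. Environment:
      1. Operating system: Windows XP, or Windows 2000 and later
      2. Java 1.6, with JAVA_HOME set as an environment variable
      3. Cygwin. Strictly speaking it is not required, but the scripts shipped with Nutch only run in a shell environment, so Cygwin is used to provide the shell commands.
      4. Nutch version: 0.9
      5. Tomcat: 6.0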

II. Installing and configuring Nutch:

1. Install Cygwin 1.5.5 (I installed it to F:\cygSys). Unpack Nutch and place it in a directory under cygSys\home\<username> (I put it in F:\cygSys\home\dyk\nutch).

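2. In the Cygwin environment, change into the nutch-0.9 directory and run bin/nutch as a test. If everything is set up correctly, it prints its usage message listing the available commands.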

3. Test a crawl, using http://www.163.com/ as the example:

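1) Create a file named myurl, enter http://www.163.com/ in it, and save it. The file can be placed anywhere (mine is at F:\cygSys\home\dyk\nutch\myurl). Also create a directory named logs for the crawler logs (mine is at F:\cygSys\home\dyk\nutch\logs).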
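The same two items can be created from the Cygwin prompt. This is a minimal sketch: NUTCH_BASE is a hypothetical variable introduced here for the directory that holds nutch-0.9, and its default of $HOME/nutch assumes the layout described above.

```shell
# NUTCH_BASE is a hypothetical variable: the directory that contains nutch-0.9.
# The default assumes Nutch was unpacked under the Cygwin home directory.
NUTCH_BASE="${NUTCH_BASE:-$HOME/nutch}"

# Directory for the crawler logs (also creates NUTCH_BASE if it is missing).
mkdir -p "$NUTCH_BASE/logs"

# Seed file: one URL per line.
printf '%s\n' 'http://www.163.com/' > "$NUTCH_BASE/myurl"

# Show what was written.
cat "$NUTCH_BASE/myurl"
```

Seen from inside nutch-0.9, these are then ../myurl and ../logs, which are the relative paths used by the crawl command later on.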

2) Open nutch-0.9\conf\nutch-site.xml and insert the following between <configuration> and </configuration>:

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

     http.robots.agents
     http.agent.description
     http.agent.url
     http.agent.email
     http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

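Fill in the empty <value></value> elements with your own text. These settings exist because Nutch follows the robots protocol: when it sends requests, it submits this information about itself to the site being crawled so that the crawler can be identified. Note that http.agent.name must not be left empty (see its description above), otherwise the crawl jobs may fail.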

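3) Open nutch-0.9\conf\crawl-urlfilter.txt and replace the string MY.DOMAIN.NAME with the domain used in myurl (for example, I changed the line to "+^http://([a-z0-9]*\.)*163.com/"). A simpler option is to delete MY.DOMAIN.NAME altogether and keep only +^http://([a-z0-9]*\.)*, which allows every http site to be crawled.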
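To get a feel for what the edited rule accepts, the pattern can be tried against a few URLs with grep -E. This is only an approximation: Nutch evaluates the rule with Java regular expressions, and the leading + (which means "accept") is not part of the pattern. The helper name matches_filter and the sample URLs are made up for illustration.

```shell
# The pattern from crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*163.com/'

# Hypothetical helper: succeeds when the URL matches the pattern.
matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Sample URLs chosen for illustration only.
for url in 'http://www.163.com/' \
           'http://news.163.com/index.html' \
           'http://www.example.com/'; do
  if matches_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The dot in 163.com is not escaped, so it matches any single character; writing 163\.com makes the rule stricter.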

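4) Run the crawler by entering the following command in Cygwin:

bin/nutch crawl ../myurl -dir ../mydir -depth 2 >& ../logs/crawl1.log

Here -dir is the directory in which the crawl results are stored, -depth is the link depth to crawl, and the final redirection names the log file. Note that the options start with a plain hyphen (-), not a typographic dash.

When the run finishes, open the log file to review the details of the crawl.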
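Besides reading the log, a quick look at the output directory shows whether the crawl ran to completion. The sketch below rests on an assumption about the Nutch 0.9 layout, namely that a finished crawl leaves crawldb, linkdb, segments, indexes and index under the -dir directory; check_crawl_dir is a hypothetical helper, not part of Nutch.

```shell
# Report whether a Nutch crawl output directory looks complete.
# The list of subdirectories is an assumption about Nutch 0.9's layout.
check_crawl_dir() {
  local dir="$1" missing=0 sub
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, from inside nutch-0.9 after the crawl has finished:
#   check_crawl_dir ../mydir && echo "crawl output looks complete"
#   tail -n 20 ../logs/crawl1.log
```

If any of the directories is reported missing, the end of the log file is usually the first place to look for the failing job.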

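4. Running Nutch on Tomcat

Copy nutch-0.9.war into Tomcat\webapps\.

Enter http://localhost:8080/nutch-0.9/ in the browser; this step makes Tomcat unpack nutch-0.9.war. Then edit webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml as follows, so that searcher.dir points to the crawl output directory: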

<configuration>
<property>
<name>searcher.dir</name>
<value>F:\\cygSys\\home\\dyk\\nutch\\mydir4</value>
</property>
</configuration>

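To support searching in Chinese, edit Tomcat\conf\server.xml. The change usually made for this is to add the attributes URIEncoding="UTF-8" and useBodyEncodingForURI="true" to the HTTP <Connector> element, then restart Tomcat.

Open http://localhost:8080/nutch-0.9 in the browser and search for "nba" to see the results.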

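PS: This article on installing Nutch is fairly well written; following it step by step should basically get Nutch up and running.
A few points need attention, though:
1. Before running the crawl command, first run export NUTCH_JAVA_HOME=/cygdrive/c/"Program Files"/Java/jdk-xxxx
2. Cygwin itself may be a hurdle, since many people are unfamiliar with it, but ordinary directory navigation works essentially the same as in DOS.
3. The first nutch-site.xml file above needs the agent values filled in, otherwise jobs may fail during the crawl.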
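The export in point 1 needs the JDK location in Cygwin's /cygdrive form rather than as a Windows path. The sketch below shows the conversion with a hypothetical helper, win_to_cygdrive, written in plain shell so the rule is visible; inside Cygwin the bundled cygpath -u command does the same job. The jdk-xxxx directory name is the placeholder from the note above, not a real version.

```shell
# Hypothetical helper: convert a Windows path such as C:\dir\sub into
# the /cygdrive/c/dir/sub form that Cygwin uses.
win_to_cygdrive() {
  local win_path="$1" drive rest
  # Drive letter before the colon, lower-cased.
  drive=$(printf '%s' "${win_path%%:*}" | tr 'A-Z' 'a-z')
  # Everything after the colon, with backslashes turned into slashes.
  rest=$(printf '%s' "${win_path#*:}" | tr '\\' '/')
  printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

# Usage (jdk-xxxx is a placeholder for the installed JDK directory):
#   export NUTCH_JAVA_HOME="$(win_to_cygdrive 'C:\Program Files\Java\jdk-xxxx')"
win_to_cygdrive 'F:\cygSys\home\dyk\nutch'
```

Quoting the whole value, as in the usage line, handles the space in "Program Files" without quoting that path component separately.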
