Crawling Chinese websites with Nutch 2.1


This note shows how to configure Nutch to crawl Chinese websites.

1. Crawling Chinese web pages

    A. Adjust the MySQL connection settings so that Chinese text stored in MySQL is not garbled. Edit ${APACHE_NUTCH_HOME}/runtime/local/conf/gora.properties:

     

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://10.10.11.252:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=devuser
gora.sqlstore.jdbc.password=devuser
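The two parameters that matter for Chinese text are useUnicode=true and characterEncoding=utf8, and the URL must not contain any whitespace. A minimal Java sketch for sanity-checking the property before starting a crawl might look like the following. The class GoraJdbcUrlCheck and its helper methods are hypothetical names introduced here for illustration; they are not part of Nutch or Gora, and only JDK classes are used.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class GoraJdbcUrlCheck {

    /** Splits the query part of a JDBC URL (after '?') into key/value pairs. */
    static Map<String, String> queryParams(String jdbcUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = jdbcUrl.indexOf('?');
        if (q < 0) {
            return params; // no query string at all
        }
        for (String pair : jdbcUrl.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Returns true if the URL has no whitespace and asks for Unicode with UTF-8. */
    static boolean isUtf8Url(String jdbcUrl) {
        if (jdbcUrl == null || jdbcUrl.chars().anyMatch(Character::isWhitespace)) {
            return false; // a space inside the URL would corrupt a parameter name
        }
        Map<String, String> params = queryParams(jdbcUrl);
        String encoding = params.get("characterEncoding");
        boolean utf8 = "utf8".equalsIgnoreCase(encoding) || "utf-8".equalsIgnoreCase(encoding);
        return "true".equalsIgnoreCase(params.get("useUnicode")) && utf8;
    }

    public static void main(String[] args) throws IOException {
        // In a real check this text would be read from conf/gora.properties.
        String sample = "gora.sqlstore.jdbc.url=jdbc:mysql://10.10.11.252:3306/nutch"
                + "?useUnicode=true&characterEncoding=utf8"
                + "&autoReconnect=true&zeroDateTimeBehavior=convertToNull\n";
        Properties props = new Properties();
        props.load(new StringReader(sample));
        String url = props.getProperty("gora.sqlstore.jdbc.url");
        System.out.println("UTF-8 settings present: " + isUtf8Url(url));
    }
}
```

This only inspects the connection string. It assumes the nutch database and its tables were themselves created with a UTF-8 character set, which the URL alone cannot guarantee.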

    B. Edit ${APACHE_NUTCH_HOME}/runtime/local/conf/nutch-site.xml and add the following property:

    <property>
        <name>http.accept.language</name>
        <value>ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3</value>
        <description>Value of the "Accept-Language" request header field.
        This allows selecting a non-English language as the default one to retrieve.
        It is a useful setting for search engines built for a certain national group.
        </description>
    </property>
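To see how a server may read this header value, the sketch below parses it into language tags and their q weights. A tag without a q parameter has weight 1, so in the value above ja-jp, en-us, zh-cn and en-gb are all weighted equally, and only en and * are ranked lower. The class AcceptLanguageWeights is a hypothetical helper written for this illustration; it is not part of Nutch and is a simplified parser, not a full implementation of the HTTP grammar.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AcceptLanguageWeights {

    /** Parses an Accept-Language value into tag -> q weight, keeping list order. */
    static Map<String, Double> parse(String headerValue) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String item : headerValue.split(",")) {
            String[] parts = item.trim().split(";");
            String tag = parts[0].trim();
            if (tag.isEmpty()) {
                continue; // skip empty list items
            }
            double q = 1.0; // a missing q parameter means weight 1
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            weights.put(tag, q);
        }
        return weights;
    }

    public static void main(String[] args) {
        String value = "ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3";
        // Print each language tag with the weight a server would see.
        parse(value).forEach((tag, q) -> System.out.println(tag + " -> " + q));
    }
}
```

If a site serves several languages and chooses by list order, moving zh-cn to the front of the value is one option to try. That is an assumption about server behavior rather than something the configuration above guarantees.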
