Setting up Nutch 2.1 with MySQL to handle UTF-8

Last Update:2014-06-28 Source: Internet

Author: User

Tags solr


Original address: http://nlp.solutions.asia/?p=180

These instructions assume Ubuntu 12.04 with Java 6 or 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu Software Center, or at the command line:

sudo apt-get install mysql-server mysql-client

As MySQL defaults to Latin-1 (are we still in the 1990s?), we need to edit my.cnf with sudo vi /etc/mysql/my.cnf and, under [mysqld], add:

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The InnoDB options deal with the small primary key size restriction of MySQL. Restart your machine (restarting just the MySQL service should also work) for the changes to take effect. The max_allowed_packet option is there so you don't run into issues as your database, and the pages you store in it, get larger.
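To make the key size restriction concrete, here is a rough sketch of the arithmetic in Python. It assumes the commonly documented InnoDB limits for this MySQL generation (767 bytes per index key by default, 3072 bytes with innodb_large_prefix and a Barracuda row format) and the 4-byte worst case of utf8mb4; the function names are made up for illustration.

```python
# Rough byte budget for an indexed VARCHAR column in InnoDB.
# The limits below are assumptions taken from MySQL 5.5/5.6 era documentation.
UTF8MB4_MAX_BYTES_PER_CHAR = 4    # utf8mb4 needs up to 4 bytes per character
DEFAULT_KEY_LIMIT_BYTES = 767     # index key limit without innodb_large_prefix
LARGE_PREFIX_LIMIT_BYTES = 3072   # limit with innodb_large_prefix + Barracuda


def max_key_bytes(chars, bytes_per_char=UTF8MB4_MAX_BYTES_PER_CHAR):
    """Worst-case number of bytes an index on VARCHAR(chars) may need."""
    return chars * bytes_per_char


def fits(chars, limit_bytes):
    """True if a utf8mb4 VARCHAR(chars) key fits within limit_bytes."""
    return max_key_bytes(chars) <= limit_bytes


# The webpage table in this guide uses a VARCHAR(767) primary key.
print(max_key_bytes(767))                   # 3068
print(fits(767, DEFAULT_KEY_LIMIT_BYTES))   # False
print(fits(767, LARGE_PREFIX_LIMIT_BYTES))  # True
# Longest utf8mb4 key that would fit without the large prefix option:
print(DEFAULT_KEY_LIMIT_BYTES // UTF8MB4_MAX_BYTES_PER_CHAR)  # 191
```

Under these assumptions, a 767-character utf8mb4 key only fits once the large prefix option is enabled, which is why the three InnoDB settings go together.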

Check that MySQL is running by typing sudo netstat -tap | grep mysql and you should see something like

tcp 0 0 localhost:mysql *:* LISTEN

We need to set up the Nutch database manually, as the schema currently generated by Nutch/Gora for MySQL defaults to Latin-1. Log into MySQL on the command line, using your previously set up MySQL ID and password, by typing

mysql -u xxxxx -p

Then in the MySQL editor type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

and press Enter, followed by

use nutch;

and press Enter, then copy and paste the following in one go:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Then press Enter. You are done setting up the MySQL database for Nutch.

Set up Nutch 2.1 by downloading the latest version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the file you just downloaded; going forward we'll refer to this folder as ${APACHE_NUTCH_HOME}.

From inside the Nutch folder, ensure the MySQL dependency for Nutch is available by uncommenting the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file, either deleting the default SqlStore properties or commenting them out using #. Then add the MySQL properties below, replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL Properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx
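Since a missing or misspelled key here tends to surface only later as a connection error, a quick sanity check can help. The following is a minimal sketch in Python, not part of Nutch or Gora: it parses a simplified subset of the Java properties format (key=value lines, # comments) and reports which of the four MySQL keys are absent. The helper names and the sample text are hypothetical.

```python
# Minimal sanity check for the MySQL block of gora.properties.
# Only a simplified subset of the .properties format is handled here.
REQUIRED_KEYS = (
    "gora.sqlstore.jdbc.driver",
    "gora.sqlstore.jdbc.url",
    "gora.sqlstore.jdbc.user",
    "gora.sqlstore.jdbc.password",
)


def parse_properties(text):
    """Return a dict of key/value pairs, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so "=" inside the JDBC URL survives.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """List required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not props.get(k)]


sample = """
# MySQL Properties
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
"""
print(missing_keys(parse_properties(sample)))
```

With the sample above, only the password key is reported as missing. To check a real file, pass its contents (read from conf/gora.properties) to parse_properties instead of the sample string.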

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primarykey to 767 in both places.
<primarykey column="id" length="767"/>

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml to put a name in the value field under http.agent.name. It can be anything but cannot be left blank. Add additional languages if you want (I have added Japanese, ja-jp, below) and utf-8 as the default as well. You must specify SqlStore.

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting a non-English language as the default one to retrieve.
It is a useful setting for search engines built for certain national groups.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ....
</description>
</property>

Install ant using the Ubuntu Software Center, or at the command line with sudo apt-get install ant

From the command line, cd to your Nutch folder and type ant runtime
This may take a few minutes to compile.

Start your first crawl by typing the lines below in the terminal (replace 'http://nutch.apache.org/' with whatever site you want to crawl):
cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

You can easily add more URLs to crawl by hand in seed.txt if you want. For the crawl, depth is the number of rounds of generate/fetch/parse/update you want to do (not depth of links, as you might think at first) and topN is the max number of links you want to actually parse each time. Note, however, that Nutch keeps track of all links it encounters in the webpage table (it just limits the amount it actually parses to topN), so don't be surprised by seeing many more rows in the webpage table than you expect by limiting with topN.
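The interaction between depth and topN is easier to see in a toy model. The Python sketch below is not how Nutch is implemented (real Nutch scores and schedules URLs, for one); it only mimics the behaviour described above: each round parses at most topN known URLs, while every discovered link is recorded. The link graph and the function name are invented for illustration.

```python
def toy_crawl(seeds, links, depth, top_n):
    """Toy model of `depth` rounds of generate/fetch/parse/update.

    links maps a URL to its outlinks (a hypothetical link graph).
    Returns {url: parsed?}, standing in for rows of the webpage table.
    """
    rows = {url: False for url in seeds}
    for _ in range(depth):
        # generate: choose up to top_n known URLs that are not parsed yet
        batch = [url for url, parsed in rows.items() if not parsed][:top_n]
        if not batch:
            break
        for url in batch:
            rows[url] = True                   # fetch + parse
            for out in links.get(url, []):
                rows.setdefault(out, False)    # update: record every link
    return rows


graph = {
    "seed": ["a", "b", "c"],
    "a": ["d", "e"],
    "b": ["f"],
}
rows = toy_crawl(["seed"], graph, depth=2, top_n=2)
print(len(rows), "rows,", sum(rows.values()), "parsed")  # 7 rows, 3 parsed
```

Even in this tiny graph, two rounds with topN set to 2 leave seven known URLs but only three parsed ones, which is the same effect you see in the webpage table.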

Check your crawl results by looking at the webpage table in the nutch database.
mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 159 rows). It will be hard to read the columns, so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that instead for viewing the data. You may also want to run the following SQL command to limit the rows in the webpage table to only URLs that were actually parsed:
select * from webpage where status = 2;

Set up and index with Solr. If you are using Nutch 2.1 then you are at the bleeding edge and probably want the latest version of Solr, 4.0, as well. Untar it to $HOME/apache-solr-4.0.0-xxxx. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start SOLR:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check that it is running by opening http://localhost:8983/solr in your web browser.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

You can now run queries using Solr against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have crawled nutch.apache.org, do a search by entering text:nutch in the input box titled "q". You should see a list of matching documents.
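The same query can also be expressed as a plain HTTP request, which is handy for scripting. The sketch below only builds the URL; it assumes the default example setup from this guide (core named collection1, the standard /select handler, JSON output via wt=json), and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def solr_select_url(query, base="http://localhost:8983/solr",
                    core="collection1", rows=10):
    """Build a /select URL for a Solr core (assumes the example config)."""
    # urlencode percent-escapes the ":" in a field query such as text:nutch
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return "{}/{}/select?{}".format(base, core, params)


print(solr_select_url("text:nutch"))
# http://localhost:8983/solr/collection1/select?q=text%3Anutch&wt=json&rows=10
```

Pasting the printed URL into a browser, or fetching it with a tool of your choice while Solr is running, should return the matches as JSON.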

There remains a lot to configure to get a good web search going, but you have at least made a start.

