Nutch Quick Start (Nutch 2.2.1+HBASE+SOLR)

Last Update:2015-01-08 Source: Internet

Author: User

Tags solr


http://www.tuicool.com/articles/VfEFjm

Compared to Nutch 1.x, Nutch 2.x separates out the storage layer and delegates it to Gora, so data can be stored in a variety of databases such as HBase, Cassandra, or MySQL. Nutch 1.7 stores data directly on HDFS.

1. Install and run HBase

For simplicity, use standalone mode; refer to the HBase Quick Start.

1.1 Download and unzip

http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
tar zxf hbase-0.90.4.tar.gz

1.2 Modify conf/hbase-site.xml

The contents are as follows:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///directory/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/directory/zookeeper</value>
  </property>
</configuration>

hbase.rootdir is the directory where HBase stores its data; the default value is /tmp/hbase-${user.name}/hbase. hbase.zookeeper.property.dataDir is the directory where ZooKeeper stores its data (HBase has a built-in ZooKeeper); the default value is /tmp/hbase-${user.name}/zookeeper.
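As a rough illustration of the two settings above, the following Python sketch generates the same hbase-site.xml content from two directory paths. The function name build_hbase_site is made up for this example, and /directory is only the placeholder path used in this article; substitute real directories.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(root_dir, zk_data_dir):
    """Return hbase-site.xml text holding the two standalone-mode properties.

    build_hbase_site is a hypothetical helper written for this article,
    not part of HBase.
    """
    config = ET.Element("configuration")
    settings = {
        # hbase.rootdir takes a file:// URL in standalone mode
        "hbase.rootdir": "file://" + root_dir,
        # the built-in ZooKeeper takes a plain local path
        "hbase.zookeeper.property.dataDir": zk_data_dir,
    }
    for name, value in settings.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    # /directory is the placeholder path from the article, not a real location.
    print(build_hbase_site("/directory/hbase", "/directory/zookeeper"))
```

The output is a single unindented line of XML; it can be pasted into conf/hbase-site.xml as is.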

1.3 Start

$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out

1.4 Try the shell

$ ./bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 15:53:29 PDT 2011

hbase(main):001:0>

Create a table named test with a single column family named cf. To verify that the table was created, use the list command to list the tables, then insert some values with the put command.

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
test
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds

Scan the table with the scan command to verify that the inserts succeeded.

hbase(main):007:0> scan 'test'
ROW        COLUMN+CELL
row1       column=cf:a, timestamp=1288380727188, value=value1
row2       column=cf:b, timestamp=1288380738440, value=value2
row3       column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds

Now disable and drop the table, which cleans up everything done above. A table must be disabled before it can be dropped.

disable 'test'
drop 'test'
0 row(s) in 0.0770 seconds

Exit the shell:

hbase(main):014:0> exit

1.5 Stop

$ ./bin/stop-hbase.sh
stopping hbase...............

1.6 Start again

Nutch stores its data in HBase when it runs, so HBase has to be running again before the crawl.

$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out

2 Compiling Nutch 2.2.1

2.1 Download and unzip

http://www.apache.org/dyn/closer.cgi/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar zxf apache-nutch-2.2.1-src.tar.gz

2.2 Modify the configuration files

Refer to the Nutch 2.0 Tutorial.

Modify conf/nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

Modify ivy/ivy.xml so that the gora-hbase dependency is uncommented:

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Modify conf/gora.properties to ensure that HBaseStore is set as the default data store:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

2.3 Compiling

ant runtime

Many jars are downloaded at the start of the build, so expect to wait for a while.

You may see the following error:

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

This error is harmless and can be ignored.

Compilation takes a while. When it finishes, two new directories, build and runtime, appear.

Steps 3, 4, 5, and 6 are exactly the same as steps 3, 4, 5, and 6 in another blog post, Nutch QuickStart (Nutch 1.7).

3 Adding a Seed URL

mkdir ~/urls
vim ~/urls/seed.txt
http://movie.douban.com/subject/5323968/
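For anyone who prefers to script this step instead of using an editor, here is a minimal Python sketch that creates the seed directory and writes seed.txt with one URL per line. write_seed_file is a hypothetical helper name, not part of Nutch.

```python
from pathlib import Path


def write_seed_file(seed_dir, urls):
    """Create seed_dir if needed and write seed.txt, one URL per line.

    write_seed_file is a hypothetical helper for this article,
    not a Nutch API. Returns the path of the written file.
    """
    directory = Path(seed_dir).expanduser()  # lets "~/urls" work as an argument
    directory.mkdir(parents=True, exist_ok=True)
    seed_file = directory / "seed.txt"
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file
```

Calling write_seed_file("~/urls", ["http://movie.douban.com/subject/5323968/"]) should leave the same file as the manual steps above.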

4 Setting URL filtering rules

If you only want to crawl certain kinds of URLs, put regular expressions in conf/regex-urlfilter.txt; only URLs that match them will be crawled.

For example, to crawl only Douban movie pages, set:

# Comment out this line
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# accept anything else
# Comment out this line
#+.
+^http:\/\/movie\.douban\.com\/subject\/[0-9]+\/(\?.+)?$
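To check which URLs the rule above lets through before starting a crawl, the filter can be approximated in Python. This sketch is not Nutch code: it assumes that rules are tried in order, that the first rule whose pattern is found in the URL decides ('+' accepts, '-' rejects), and that a URL matching no rule is dropped. url_passes and RULES are names made up for this example.

```python
import re

# Active rules from the example above as (sign, pattern) pairs.
# '+' accepts a URL, '-' rejects it.
RULES = [
    ("+", r"^http:\/\/movie\.douban\.com\/subject\/[0-9]+\/(\?.+)?$"),
]


def url_passes(url, rules=RULES):
    """Approximate the regex URL filter (an assumption, not Nutch's own code).

    The first rule whose pattern is found in the URL decides the outcome;
    a URL that matches no rule is dropped.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for candidate in [
        "http://movie.douban.com/subject/5323968/",
        "http://movie.douban.com/subject/5323968/?from=showing",
        "http://music.douban.com/subject/25811077/",
    ]:
        print(candidate, url_passes(candidate))
```

With only this one rule active, movie subject pages pass (with or without a query string) and everything else is filtered out.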

5 Setting the agent name

conf/nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

This step comes from page 14 of the book Web Crawling and Data Mining with Apache Nutch.

6 Installing Solr

Solr is needed to build the index, so we need to install and start a Solr server.

Refer to steps 4, 5, and 6 of the Nutch Tutorial, and to the Solr Tutorial.

6.1 Download and unzip

wget http://mirrors.cnnic.cn/apache/lucene/solr/4.6.1/solr-4.6.1.tgz
tar -zxf solr-4.6.1.tgz

6.2 Running Solr

cd solr-4.6.1/example
java -jar start.jar

Verify that startup succeeded:

Open http://localhost:8983/solr/admin/ in a browser; if the page loads, Solr started successfully.
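The same check can be done without a browser. The Python sketch below requests the admin URL and reports whether it answered with HTTP 200. solr_is_up is a hypothetical helper name, and treating a plain 200 response as "started" is a simplifying assumption.

```python
import http.client
import urllib.request


def solr_is_up(admin_url="http://localhost:8983/solr/admin/", timeout=5.0):
    """Return True if admin_url answers with HTTP 200, False otherwise.

    solr_is_up is a hypothetical helper for this article. Connection
    failures and HTTP error statuses both count as "not up".
    """
    try:
        with urllib.request.urlopen(admin_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError.
        return False
```

Call solr_is_up() a few seconds after launching start.jar; a False result usually means Solr is still starting or is listening on a different port.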

6.3 Integrating Nutch with Solr

Copy NUTCH_DIR/conf/schema-solr4.xml to SOLR_DIR/solr/collection1/conf/, rename it to schema.xml, and add the following line at the end of the <fields>...</fields> block (see "Solr 4.2 - what is _version_ field" for details):

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
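The edit can also be scripted. The following Python sketch appends the _version_ field to the end of the <fields> element unless one is already present. add_version_field is a made-up helper name, and the sample schema in the usage lines is a tiny illustrative stand-in, not the real schema-solr4.xml.

```python
import xml.etree.ElementTree as ET

# Attributes of the field shown above.
VERSION_FIELD = {
    "name": "_version_",
    "type": "long",
    "indexed": "true",
    "stored": "true",
    "multiValued": "false",
}


def add_version_field(schema_xml):
    """Return schema XML with a _version_ field at the end of <fields>.

    add_version_field is a hypothetical helper for this article.
    The schema is left unchanged if the field already exists.
    """
    root = ET.fromstring(schema_xml)
    fields = root.find("fields")
    if fields is None:
        raise ValueError("schema has no <fields> element")
    present = any(f.get("name") == "_version_" for f in fields.findall("field"))
    if not present:
        ET.SubElement(fields, "field", VERSION_FIELD)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative stand-in for schema.xml, not the real file.
    sample = '<schema name="nutch"><fields><field name="id" type="string"/></fields></schema>'
    print(add_version_field(sample))
```

Note that ElementTree's default parser discards XML comments, so for the real schema.xml a manual edit keeps the file closer to the original.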

Restart Solr:

java -jar start.jar

Steps 7 and 8 are similar to steps 7 and 8 in the Nutch 1.7 blog post. The main difference is that in Nutch 2.x the data is no longer stored as files and directories on disk, but in HBase.

7 Crawling web pages step by step with individual commands

TODO

8 Using the crawl script for a one-step crawl

In the previous section the crawl was done by typing several commands by hand, one step at a time. Nutch also ships with a script, ./bin/crawl, that combines the steps of a crawl into a single command. Its usage is:

./bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

Note that the second argument is a crawlId, no longer a crawlDir.

Delete the data from section 7 by using the HBase shell to disable and drop the table.

8.1 Crawling Web pages

$ ./bin/crawl ~/urls/ TestCrawl http://localhost:8983/solr/ 2

~/urls is the directory where the seed URLs are stored.
TestCrawl is the crawlId; a table prefixed with the crawlId is created in HBase, for example TestCrawl_webpage.
http://localhost:8983/solr/ is the Solr server.
2 is numberOfRounds, the number of iterations.
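The four arguments and the resulting table name can be summarized in a small Python sketch. crawl_command and webpage_table are hypothetical helper names; the sketch only builds the argument list and does not run Nutch.

```python
import os


def crawl_command(seed_dir, crawl_id, solr_url, rounds):
    """Build the ./bin/crawl argument list described above.

    crawl_command is a hypothetical helper for this article; pass the
    result to subprocess.run from the Nutch runtime directory to use it.
    """
    if rounds < 1:
        raise ValueError("numberOfRounds must be at least 1")
    # Expand "~" here because no shell is involved to do it.
    return ["./bin/crawl", os.path.expanduser(seed_dir), crawl_id, solr_url, str(rounds)]


def webpage_table(crawl_id):
    """Name of the HBase table that holds the pages for this crawlId."""
    return crawl_id + "_webpage"


if __name__ == "__main__":
    print(crawl_command("~/urls/", "TestCrawl", "http://localhost:8983/solr/", 2))
    print(webpage_table("TestCrawl"))
```

Using a different crawlId for each experiment keeps the crawls in separate HBase tables.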

After a while, many URLs scroll by on the screen, which shows that the crawler is working:

fetching http://music.douban.com/subject/25811077/ (queue crawl delay=5000ms)
fetching http://read.douban.com/ebook/1919781 (queue crawl delay=5000ms)
fetching http://www.douban.com/online/11670861/ (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/ (queue crawl delay=5000ms)
fetching http://movie.douban.com/tag/Sci Fi (queue crawl delay=5000ms)
49/… spinwaiting/active, … pages, 0 errors, 0.91 pages/s, 332 245 kb/s, 131 URLs in 5 queues
fetching http://music.douban.com/subject/25762454/ (queue crawl delay=5000ms)
fetching http://read.douban.com/reader/ebook/1951242/ (queue crawl delay=5000ms)
fetching http://www.douban.com/mobile/read-notes (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/poetry (queue crawl delay=5000ms)
50/50 spinwaiting/active, 61 pages, 0 errors, 0.9 1 pages/s, 334 366 kb/s, 127 URLs in 5 queues

8.2 Viewing results

./bin/nutch readdb -crawlId TestCrawl -stats

The results can also be viewed in the HBase shell:

cd ~/hbase-0.90.4
./bin/hbase shell
hbase(main):001:0> scan 'TestCrawl_webpage'

The screen starts printing content; press Ctrl+C to stop.

When you scan a table and the meaning of a column is unclear, look at conf/gora-hbase-mapping.xml, which defines the column families and the meaning of each column.
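To turn the mapping file into a quick lookup table, something like the Python sketch below may help. It assumes the file has a <class> element containing <field name=... family=... qualifier=...> entries; SAMPLE_MAPPING is a short illustrative excerpt written for this example, not the full file, and column_meanings is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative excerpt in the assumed shape of conf/gora-hbase-mapping.xml.
SAMPLE_MAPPING = """
<gora-orm>
  <table name="webpage">
    <family name="f" maxVersions="1"/>
    <family name="p" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="title" family="p" qualifier="t"/>
  </class>
</gora-orm>
"""


def column_meanings(mapping_xml):
    """Map each 'family:qualifier' column to the field name it stores.

    column_meanings is a hypothetical helper for this article. Fields
    without a qualifier are keyed by the family name alone.
    """
    root = ET.fromstring(mapping_xml)
    meanings = {}
    for field in root.iter("field"):
        family = field.get("family")
        qualifier = field.get("qualifier")
        column = family if qualifier is None else family + ":" + qualifier
        meanings[column] = field.get("name")
    return meanings


if __name__ == "__main__":
    for column, name in sorted(column_meanings(SAMPLE_MAPPING).items()):
        print(column, "->", name)
```

To use it on the real file, read conf/gora-hbase-mapping.xml into a string and pass that string to column_meanings.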


