http://www.tuicool.com/articles/VfEFjm
Compared with Nutch 1.x, Nutch 2.x moves the storage layer out into Apache Gora, so data can be stored in a variety of databases such as HBase, Cassandra, or MySQL. Nutch 1.7 stores its data directly on HDFS.
1. Install and run HBase
For simplicity, use standalone mode; see the HBase Quick Start.
1.1 Download and unzip
http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
tar zxf hbase-0.90.4.tar.gz
1.2 Edit conf/hbase-site.xml
The contents are as follows:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///directory/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/directory/zookeeper</value>
  </property>
</configuration>
hbase.rootdir is the directory where HBase stores its data; the default value is /tmp/hbase-${user.name}/hbase.
hbase.zookeeper.property.dataDir is the directory where ZooKeeper stores its data (HBase has a built-in ZooKeeper); the default value is /tmp/hbase-${user.name}/zookeeper.
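If you would rather generate this file than edit it by hand, something like the following should work. It is only a sketch: the helper name write_hbase_site and the scratch directory are my own assumptions, not part of HBase; only the two property names come from the configuration above.

```shell
# Sketch: generate a standalone-mode hbase-site.xml.
# write_hbase_site is a hypothetical helper, not something HBase ships.
write_hbase_site() {
    conf_dir=$1   # directory that should receive hbase-site.xml
    data_dir=$2   # absolute path under which HBase and ZooKeeper keep data
    mkdir -p "$conf_dir"
    cat > "$conf_dir/hbase-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$data_dir/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$data_dir/zookeeper</value>
  </property>
</configuration>
EOF
}

# Demo: write into a scratch directory instead of a real HBase install.
demo_dir=$(mktemp -d)
write_hbase_site "$demo_dir/conf" "$demo_dir/data"
cat "$demo_dir/conf/hbase-site.xml"
```

For a real install you would pass the conf directory of the unpacked hbase-0.90.4 and a data directory that survives reboots, since the /tmp defaults may be cleaned up by the system.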
1.3 Start
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
1.4 Try the shell
$ ./bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 15:53:29 PDT 2011
hbase(main):001:0>
Create a table named test with only one column family, named cf. To verify that the creation succeeded, use the list command to view all tables, then insert some values with the put command.
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
.
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
Scan the table with the scan command to verify that the insertion succeeded.
hbase(main):007:0> scan 'test'
ROW        COLUMN+CELL
row1       column=cf:a, timestamp=1288380727188, value=value1
row2       column=cf:b, timestamp=1288380738440, value=value2
row3       column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
Now disable and drop the table, which clears everything done above.
hbase(main):012:0> disable 'test'
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds
Exit the shell:
hbase(main):014:0> exit
1.5 Stop
$ ./bin/stop-hbase.sh
stopping hbase...............
1.6 Start again
Nutch stores its data in HBase when it runs, so HBase must be running again before you continue.
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
2 Compile Nutch 2.2.1
2.1 Download and unzip
http://www.apache.org/dyn/closer.cgi/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar zxf apache-nutch-2.2.1-src.tar.gz
2.2 Edit the configuration files
See the Nutch 2.0 Tutorial.
Edit conf/nutch-site.xml:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
Edit ivy/ivy.xml:
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Edit conf/gora.properties to make sure that HBaseStore is set as the default store:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
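To make that setting repeatable, here is one possible way to do it from the command line. Treat it as a sketch: set_gora_default is a helper name I made up, and the demo file contents are placeholders rather than the real gora.properties shipped with Nutch.

```shell
# Sketch: make HBaseStore the only active gora.datastore.default entry.
# set_gora_default is a hypothetical helper, not part of Nutch or Gora.
set_gora_default() {
    props=$1
    store=org.apache.gora.hbase.store.HBaseStore
    touch "$props"
    # Keep every line except an existing active default-store setting.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v '^gora\.datastore\.default=' "$props" > "$props.tmp" || true
    printf 'gora.datastore.default=%s\n' "$store" >> "$props.tmp"
    mv "$props.tmp" "$props"
}

# Demo on a scratch file; both lines below are placeholders.
demo=$(mktemp -d)
printf '%s\n' 'gora.datastore.default=example.placeholder.Store' \
              'example.other.key=1' > "$demo/gora.properties"
set_gora_default "$demo/gora.properties"
cat "$demo/gora.properties"
```

Running it twice leaves a single entry, so it is safe to call from a setup script. Commented-out lines (starting with #) are left untouched.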
2.3 Compiling
ant runtime
Many JARs are downloaded at the start, so expect to wait a while.
You may see the following error:
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
It is harmless and can be ignored.
Compilation takes a while. When it finishes, two more folders appear: build and runtime.
Steps 3, 4, 5, and 6 are exactly the same as steps 3, 4, 5, and 6 in my other blog post, Nutch QuickStart (Nutch 1.7).
3 Add a seed URL
mkdir ~/urls
vim ~/urls/seed.txt
http://movie.douban.com/subject/5323968/
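If you prefer not to open an editor, the seed file can also be written non-interactively. A small sketch, where write_seeds is a hypothetical helper and the scratch directory stands in for ~/urls:

```shell
# Sketch: create the seed directory and write one URL per line to seed.txt.
# write_seeds is a hypothetical helper, not a Nutch command.
write_seeds() {
    dir=$1
    shift
    mkdir -p "$dir"
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Demo in a scratch directory; for the tutorial you would pass ~/urls.
demo=$(mktemp -d)
write_seeds "$demo/urls" 'http://movie.douban.com/subject/5323968/'
cat "$demo/urls/seed.txt"
```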
4 Set URL filtering rules
If you only want to crawl certain kinds of URL, set regular expressions in conf/regex-urlfilter.txt; only URLs that match them will be crawled.
For example, to crawl only Douban movie pages, set:
# Comment out this line
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# accept anything else
# Comment out this line
#+.
+^http:\/\/movie\.douban\.com\/subject\/[0-9]+\/(\?.+)?$
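Before starting a long crawl it can help to try the accept rule on a few URLs. The sketch below uses grep -E, whose regex dialect is not the same as the Java engine Nutch uses, so treat it as an approximation. The rule is written without the \/ escapes (grep does not need them), url_accepted is a made-up helper, and the query string in the second sample is invented for illustration.

```shell
# Sketch: approximate the accept rule with grep -E.
# Same pattern as in regex-urlfilter.txt, minus the leading "+" and the \/ escapes.
rule='^http://movie\.douban\.com/subject/[0-9]+/(\?.+)?$'

# url_accepted is a hypothetical helper: succeeds when the URL matches the rule.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$rule"
}

for url in \
    'http://movie.douban.com/subject/5323968/' \
    'http://movie.douban.com/subject/5323968/?from=showing' \
    'http://movie.douban.com/tag/' \
    'http://music.douban.com/subject/25811077/'
do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The first two URLs should be accepted and the last two rejected, because only /subject/<digits>/ pages on movie.douban.com, optionally followed by a query string, match the pattern.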
5 Set the agent name
In conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
This step comes from the book Web Crawling and Data Mining with Apache Nutch, page 14.
6 Install Solr
Solr is needed to build the index, so we need to install and start a Solr server.
See steps 4, 5, and 6 of the Nutch Tutorial, and the Solr Tutorial.
6.1 Download and unzip
wget http://mirrors.cnnic.cn/apache/lucene/solr/4.6.1/solr-4.6.1.tgz
tar -zxf solr-4.6.1.tgz
6.2 Run Solr
cd solr-4.6.1/example
java -jar start.jar
To verify that Solr started successfully, open http://localhost:8983/solr/admin/ in a browser. If the page loads, the startup succeeded.
6.3 Integrate Nutch with Solr
Copy NUTCH_DIR/conf/schema-solr4.xml to SOLR_DIR/solr/collection1/conf/, rename it to schema.xml, and add the following line at the end of the <fields>...</fields> section (see "Solr 4.2 - what is _version_ field?" for details):
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
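The manual edit can also be scripted. The following is a sketch under a few assumptions: add_version_field is a helper name I invented, the closing </fields> tag sits on its own line (as it normally does in a schema file), and the tiny schema used in the demo is a stand-in, not the real schema-solr4.xml.

```shell
# Sketch: insert the _version_ field just before the closing </fields> tag.
# add_version_field is a hypothetical helper; it does nothing if the field exists.
add_version_field() {
    schema=$1
    if grep -q 'name="_version_"' "$schema"; then
        return 0
    fi
    awk '
        /<\/fields>/ && !done {
            print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\" multiValued=\"false\"/>"
            done = 1
        }
        { print }
    ' "$schema" > "$schema.tmp" && mv "$schema.tmp" "$schema"
}

# Demo on a stand-in schema file.
demo=$(mktemp -d)
cat > "$demo/schema.xml" <<'EOF'
<schema name="demo">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
  </fields>
</schema>
EOF
add_version_field "$demo/schema.xml"
add_version_field "$demo/schema.xml"   # second call changes nothing
cat "$demo/schema.xml"
```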
Restart Solr:
java -jar start.jar
Steps 7 and 8 are similar to steps 7 and 8 in the Nutch 1.7 blog post. The main difference is that in Nutch 2.x the data is no longer stored as files and directories on disk, but in HBase.
7 Crawl web pages step by step with individual commands
TODO
8 Crawl in one step with the crawl script
So far we typed several commands by hand, one step at a time, to complete a crawl. Nutch actually ships with a script, ./bin/crawl, that combines the steps of a crawl into a single command. Here is its usage:
<seedDir> <crawlID> <solrURL> <numberOfRounds>
Note that the second argument is crawlId, no longer crawlDir.
First delete the data left over from section 7: in the HBase shell, disable and then drop the table.
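As a sketch, the two shell commands can be generated for any table name. drop_table_commands is a made-up helper, and TestCrawl_webpage is only the example table name from section 8.1; use whatever table your own section 7 run created.

```shell
# Sketch: print the HBase shell commands that disable and then drop a table.
# drop_table_commands is a hypothetical helper; it only prints text.
drop_table_commands() {
    table=$1
    printf "disable '%s'\ndrop '%s'\n" "$table" "$table"
}

drop_table_commands TestCrawl_webpage

# To run them for real, pipe the output into the shell from the HBase directory.
# The HBase shell generally reads commands from standard input, but I have not
# checked this on 0.90.4:
#   drop_table_commands TestCrawl_webpage | ./bin/hbase shell
```

The order matters: HBase refuses to drop a table that is still enabled.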
8.1 Crawling Web pages
$ ./bin/crawl ~/urls/ TestCrawl http://localhost:8983/solr/ 2
- ~/urls is the directory where the seed URLs are stored
- TestCrawl is the crawlId; a table prefixed with the crawlId is created in HBase, for example TestCrawl_webpage
- http://localhost:8983/solr/ is the Solr server
- 2 is numberOfRounds, the number of iterations
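To make the four arguments concrete, here is a dry-run sketch that checks them and prints the command it would run, plus the HBase table name derived from the crawlId. crawl_plan is a hypothetical wrapper of my own; it never calls ./bin/crawl itself.

```shell
# Sketch: validate the four crawl arguments and show what would be run.
# crawl_plan is a hypothetical helper; it only prints, it does not crawl.
crawl_plan() {
    if [ "$#" -ne 4 ]; then
        echo "usage: crawl_plan <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
        return 1
    fi
    seed_dir=$1
    crawl_id=$2
    solr_url=$3
    rounds=$4
    # numberOfRounds must consist of digits only.
    case $rounds in
        ''|*[!0-9]*)
            echo "numberOfRounds must be a non-negative integer: $rounds" >&2
            return 1
            ;;
    esac
    echo "command: ./bin/crawl $seed_dir $crawl_id $solr_url $rounds"
    echo "hbase table: ${crawl_id}_webpage"
}

crawl_plan urls/ TestCrawl http://localhost:8983/solr/ 2
```

Knowing the table name up front is handy later, when you want to scan or drop the crawl data in the HBase shell.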
After a while, many URLs appear on the screen, which shows that the crawler is fetching pages:
fetching http://music.douban.com/subject/25811077/ (queue crawl delay=5000ms)
fetching http://read.douban.com/ebook/1919781 (queue crawl delay=5000ms)
fetching http://www.douban.com/online/11670861/ (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/ (queue crawl delay=5000ms)
fetching http://movie.douban.com/tag/Sci Fi (queue crawl delay=5000ms)
49/… spinwaiting/active, … pages, 0 errors, 0.91 pages/s, 332 245 kb/s, 131 URLs in 5 queues
fetching http://music.douban.com/subject/25762454/ (queue crawl delay=5000ms)
fetching http://read.douban.com/reader/ebook/1951242/ (queue crawl delay=5000ms)
fetching http://www.douban.com/mobile/read-notes (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/poetry (queue crawl delay=5000ms)
50/50 spinwaiting/active, 61 pages, 0 errors, 0.9 1 pages/s, 334 366 kb/s, 127 URLs in 5 queues
8.2 Viewing results
./bin/nutch readdb -crawlId TestCrawl -stats
You can also view the results in the HBase shell:
cd ~/hbase-0.90.4
./bin/hbase shell
hbase(main):001:0> scan 'TestCrawl_webpage'
Output starts scrolling on the screen; press Ctrl+C to stop it.
If the meaning of a column in the scan output is unclear, look at conf/gora-hbase-mapping.xml, which defines the column families and the meaning of each column.
Nutch Quick Start (Nutch 2.2.1+HBASE+SOLR)