The previous article described how to install Nutch.
This article walks through a simple crawl of the site http://www.6vhao.com
1. Change into the directory Nutch-2.3/runtime/local
2. mkdir urls
nano urls/url and add the link:
http://www.6vhao.com
Then save and exit.
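If you would rather not open an editor, the same seed file can be created non-interactively. This is only a minimal sketch of step 2; it assumes you are already inside the runtime/local directory and that the seed file should contain exactly one URL per line.

```shell
#!/bin/sh
# Create the seed directory (no error if it already exists).
mkdir -p urls

# Write the start URL into urls/url, one URL per line.
printf '%s\n' 'http://www.6vhao.com' > urls/url

# Show what Nutch will read as its seed list.
cat urls/url
```

More sites can be added by appending further lines to urls/url.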
3. In the local directory, run
./bin/nutch with no arguments to list all the commands that can be used:
  inject         inject new urls into the database
  hostinject     creates or updates an existing host table from a text file
  generate       generate new batches to fetch from crawl db
  fetch          fetch URLs marked during generate
  parse          parse URLs marked during fetch
  updatedb       update web table after parsing
  updatehostdb   update host table after parsing
  readdb         read/dump records from page database
  readhostdb     display entries from the hostDB
  index          run the plugin-based indexer on parsed batches
  elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
  solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
  solrdedup      remove duplicates from solr
  solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
  clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
  parsechecker   check the parser for a given url
  indexchecker   check the indexing filters for a given url
  plugin         load a plugin and run one of its classes main()
  nutchserver    run a (local) Nutch server on a user defined port
  webapp         run a local Nutch web application
  junit          runs the given JUnit test
 or
  CLASSNAME      run the class named CLASSNAME
3. To start with, we use the ./bin/crawl command, which runs the whole crawl of the web pages in one go
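The arguments passed to ./bin/crawl are not shown here. As a rough sketch, assuming the Nutch 2.3 script accepts a seed directory, a crawl ID and a number of rounds (running ./bin/crawl with no arguments prints the exact usage for your version), the call might look like the following. The crawl ID data is only inferred from the data_webpage table name that shows up later, and the round count of 2 is an arbitrary choice; the helper crawl_cmd is hypothetical and exists only to print the command line.

```shell
#!/bin/sh
# Assumed values: none of these arguments appear in the original article.
SEED_DIR=urls      # directory holding the seed file from step 2
CRAWL_ID=data      # inferred from the data_webpage table name
ROUNDS=2           # arbitrary number of crawl rounds

# Hypothetical helper: build the command line so it can be printed.
crawl_cmd() {
    printf './bin/crawl %s %s %s\n' "$SEED_DIR" "$CRAWL_ID" "$ROUNDS"
}

if [ -x ./bin/crawl ]; then
    # Only runs when started from Nutch-2.3/runtime/local.
    ./bin/crawl "$SEED_DIR" "$CRAWL_ID" "$ROUNDS"
else
    echo "bin/crawl not found here; would run: $(crawl_cmd)"
fi
```

Keeping the values in variables makes it easy to rerun the crawl with more rounds once the first pass looks sane.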
4. After the crawl completes, go to the HBase directory
Run ./bin/hbase shell to enter the HBase shell, then use list to see the current tables. The table here is data_webpage; Nutch appends the _webpage suffix to the name
5. In the HBase shell, run scan 'data_webpage' to view its contents. Some sample data copied from the output:
 tv.66ys.www:http/zy/ column=f:ts, timestamp=1446050113914, value=\x00\x00\x01P\xAFM\xA9s
 tv.66ys.www:http/zy/ column=il:http://www.66ys.tv/, timestamp=1446050113914, value=\xe7\xbb\xbc\xe8\x89\xba
 tv.66ys.www:http/zy/ column=mk:dist, timestamp=1446050113914, value=2
 tv.66ys.www:http/zy/ column=mtdt:_csh_, timestamp=1446050113914, value=\x00\x00\x00\x00
 tv.66ys.www:http/zy/ column=s:s, timestamp=1446050113914, value=\x00\x00\x00\x00
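The row keys in this output are not plain URLs. Judging from the sample, the host is stored with its dot-separated labels reversed, followed by a colon, the scheme, and then the path. Here is a small sketch that turns such a key back into a readable URL. The function name unreverse_key is made up for this example, and it assumes the key looks like the one above (no port number and at least one slash); keys in other shapes are not handled.

```shell
#!/bin/sh
# Hypothetical helper: convert a row key such as "tv.66ys.www:http/zy/"
# back into a URL. Assumes the form <reversed host>:<scheme>/<path>.
unreverse_key() {
    key=$1
    rev_host=${key%%:*}     # part before the first ':'  -> tv.66ys.www
    rest=${key#*:}          # part after the first ':'   -> http/zy/
    scheme=${rest%%/*}      # part before the first '/'  -> http
    path=/${rest#*/}        # everything after it        -> /zy/

    # Reverse the dot-separated host labels: tv.66ys.www -> www.66ys.tv
    host=$(printf '%s\n' "$rev_host" |
        awk -F. '{ for (i = NF; i > 1; i--) printf "%s.", $i; print $1 }')

    printf '%s://%s%s\n' "$scheme" "$host" "$path"
}

unreverse_key 'tv.66ys.www:http/zy/'
```

For the sample key this prints http://www.66ys.tv/zy/, which matches the host seen in the il: column of the same row.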
More on this next time.
Nutch 2.3 crawler crawls a movie website