Crawling a movie website with Nutch 2.3

Source: Internet
Author: User
Tags: deprecated, solr

The previous article described how to install Nutch.

This article walks through a simple crawl of the site http://www.6vhao.com

1. Change to the directory Nutch-2.3/runtime/local

2. Create the seed directory with mkdir urls, then open a seed file with nano urls/url, add the link http://www.6vhao.com, save, and exit.
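For readers who prefer not to open an editor, the seed file from step 2 can also be written directly from the shell. This is only a sketch: the file name urls/url follows the step above, and any file inside the urls directory should serve as a seed list.

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls

# Write the single seed URL, one URL per line.
printf '%s\n' 'http://www.6vhao.com' > urls/url

# Show the seed list to confirm what will be injected.
cat urls/url
```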

3. In the local directory, run ./bin/nutch with no arguments. It lists all the commands that can be used:

  inject         inject new urls into the database
  hostinject     creates or updates an existing host table from a text file
  generate       generate new batches to fetch from crawl db
  fetch          fetch URLs marked during generate
  parse          parse URLs marked during fetch
  updatedb       update web table after parsing
  updatehostdb   update host table after parsing
  readdb         read/dump records from page database
  readhostdb     display entries from the hostDB
  index          run the plugin-based indexer on parsed batches
  elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
  solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
  solrdedup      remove duplicates from solr
  solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
  clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
  parsechecker   check the parser for a given url
  indexchecker   check the indexing filters for a given url
  plugin         load a plugin and run one of its classes main()
  nutchserver    run a (local) Nutch server on a user defined port
  webapp         run a local Nutch web application
  junit          runs the given JUnit test
   or
  CLASSNAME      run the class named CLASSNAME
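To give a feel for how several of these commands fit together, here is a rough sketch of one crawl round run by hand. It is not taken from the original article: the crawl ID data, the -topN value of 50, and the use of -all are assumptions, and the nutch_step helper is a hypothetical wrapper that only prints each command when ./bin/nutch is not present.

```shell
NUTCH=./bin/nutch
CRAWL_ID=data   # assumption: any crawl ID can be used here

# Hypothetical helper: run a nutch sub-command, or print it when the
# launcher is missing so the sequence can be inspected safely.
nutch_step() {
  if [ -x "$NUTCH" ]; then
    "$NUTCH" "$@"
  else
    echo "would run: $NUTCH $*"
  fi
}

nutch_step inject urls -crawlId "$CRAWL_ID"        # load seed URLs into the database
nutch_step generate -topN 50 -crawlId "$CRAWL_ID"  # select a batch of URLs to fetch
nutch_step fetch -all -crawlId "$CRAWL_ID"         # fetch the generated batch
nutch_step parse -all -crawlId "$CRAWL_ID"         # parse the fetched pages
nutch_step updatedb -all -crawlId "$CRAWL_ID"      # write new links back to the web table
```

Repeating the generate, fetch, parse, and updatedb steps takes the crawl one link level deeper each time.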

We first use the ./bin/crawl script, which runs the whole crawl in one step.
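The article does not show the exact invocation, so the following is a sketch under stated assumptions. In Nutch 2.3 the script's usage line is crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>. The crawl ID data is inferred from the table name data_webpage seen later, and the two rounds are an arbitrary choice.

```shell
SEED_DIR=urls    # seed directory created in step 2
CRAWL_ID=data    # assumption: inferred from the table name data_webpage
ROUNDS=2         # assumption: number of generate/fetch/parse/update rounds

CRAWL_CMD="./bin/crawl $SEED_DIR $CRAWL_ID $ROUNDS"

# Run the crawl only when the script is available; otherwise show the command.
if [ -x ./bin/crawl ]; then
  $CRAWL_CMD
else
  echo "would run: $CRAWL_CMD"
fi
```

No Solr URL is passed here, so the crawl stores pages in HBase without indexing them.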

4. After the crawl completes, change to the HBase directory and run ./bin/hbase shell to enter the HBase shell. Use list to see the current tables. The table here is data_webpage: Nutch appends the suffix _webpage to the crawl ID.
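The same check can be scripted by piping the list command into the HBase shell instead of typing it interactively. This is a minimal sketch, assuming it is run from the HBase directory; the STATUS variable is introduced here purely for illustration.

```shell
HBASE=./bin/hbase
EXPECTED_TABLE=data_webpage

if [ -x "$HBASE" ]; then
  # Feed "list" to the shell on stdin and look for the Nutch table name.
  if echo "list" | "$HBASE" shell 2>/dev/null | grep -q "$EXPECTED_TABLE"; then
    STATUS=found
  else
    STATUS=missing
  fi
else
  STATUS=no-hbase
fi

echo "table $EXPECTED_TABLE: $STATUS"
```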

5. In the HBase shell, run scan 'data_webpage' to view the table contents. Some sample rows:

  tv.66ys.www:http/zy/   column=f:ts, timestamp=1446050113914, value=\x00\x00\x01P\xAFM\xA9s
  tv.66ys.www:http/zy/   column=il:http://www.66ys.tv/, timestamp=1446050113914, value=\xe7\xbb\xbc\xe8\x89\xba
  tv.66ys.www:http/zy/   column=mk:dist, timestamp=1446050113914, value=2
  tv.66ys.www:http/zy/   column=mtdt:_csh_, timestamp=1446050113914, value=\x00\x00\x00\x00
  tv.66ys.www:http/zy/   column=s:s, timestamp=1446050113914, value=\x00\x00\x00\x00
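The row keys in this output are URLs with the host name reversed, and the il: value is a run of raw UTF-8 bytes. The sketch below turns a key back into a normal URL and prints those bytes as text. The unreverse_key function is a hypothetical helper written for this example, and it assumes keys of the form reversed-host:protocol/path with no port.

```shell
# Hypothetical helper: convert a row key such as tv.66ys.www:http/zy/
# back into a URL. Keys that carry a port are not handled.
unreverse_key() {
  key=$1
  rhost=${key%%:*}     # reversed host, e.g. tv.66ys.www
  rest=${key#*:}       # protocol and path, e.g. http/zy/
  proto=${rest%%/*}    # e.g. http
  path=/${rest#*/}     # e.g. /zy/
  # Reverse the dot-separated host labels.
  host=$(printf '%s\n' "$rhost" |
    awk -F. '{ for (i = NF; i > 1; i--) printf "%s.", $i; print $1 }')
  printf '%s://%s%s\n' "$proto" "$host" "$path"
}

unreverse_key 'tv.66ys.www:http/zy/'   # prints http://www.66ys.tv/zy/

# The il: value \xe7\xbb\xbc\xe8\x89\xba written as octal escapes,
# which printf handles portably; on a UTF-8 terminal this shows 综艺.
printf '\347\273\274\350\211\272\n'
```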

More on this next time.
