Whole-web crawling with Nutch's low-level commands


Recently I have been studying Nutch and came across some notes on crawling the whole web with its low-level commands.

First, obtain a seed URL set. For testing, use the content.example.txt file from the http://rdf.dmoz.org/rdf/ directory and create a folder named dmoz to hold the extracted URLs:

Command: bin/nutch org.apache.nutch.tools.DmozParser content.example.txt > dmoz/urls
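
The dmoz folder must exist before output is redirected into it, and with the full DMOZ dump the parser can also take a random sample instead of every URL. Both commands below are a sketch: mkdir is standard, but the -subset option of DmozParser is from my memory of the Nutch tutorial and may differ between versions.

Command: mkdir dmoz

Command: bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 3000 > dmoz/urls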

Inject the URLs into the crawl database (crawldb):

Command: bin/nutch inject crawl/crawldb dmoz
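
To confirm that the injection worked, the crawldb statistics can be printed with Nutch's database reader (a quick sanity check, not part of the original walkthrough):

Command: bin/nutch readdb crawl/crawldb -stats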

Generate a fetch list:

Command: bin/nutch generate crawl/crawldb crawl/segments

Save the newest segment directory under crawl/segments in the variable S1 for later commands:

Command: S1=`ls -d crawl/segments/2* | tail -1`

Command: echo $S1

Note: the character wrapping the ls command is not a single quotation mark but a backtick, the key in the upper-left corner of the keyboard that shares a key with ~.
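
An equivalent form that avoids the backtick confusion entirely is the $(...) command substitution syntax, which any POSIX shell understands:

Command: S1=$(ls -d crawl/segments/2* | tail -1)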

Run the fetcher to download the pages in the fetch list:

Command: bin/nutch fetch $S1

Update the crawldb with the information from the fetched pages:

Command: bin/nutch updatedb crawl/crawldb $S1
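
If you want to look inside the segment that was just fetched, the segment reader can list its contents; the -list option shown here is from my memory of Nutch 1.x and may vary between versions:

Command: bin/nutch readseg -list $S1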

This completes the first fetch round.

Next, select the 10 highest-scoring URLs for the second and third fetch rounds:

Command: bin/nutch generate crawl/crawldb crawl/segments -topN 10

Command: S2=`ls -d crawl/segments/2* | tail -1`

Command: echo $S2

Command: bin/nutch fetch $S2

Command: bin/nutch updatedb crawl/crawldb $S2

Command: bin/nutch generate crawl/crawldb crawl/segments -topN 10

Command: S3=`ls -d crawl/segments/2* | tail -1`

Command: echo $S3

Command: bin/nutch fetch $S3

Command: bin/nutch updatedb crawl/crawldb $S3
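
Since the second and third rounds repeat the same generate/fetch/updatedb cycle, the whole thing can also be wrapped in a small shell loop. This is only a sketch built from the commands above; the variable name SEGMENT and the fixed number of rounds are my own choices:

# Repeat the generate -> fetch -> updatedb cycle for three rounds.
# Run from the Nutch installation directory; crawl/crawldb must already exist.
for i in 1 2 3
do
    bin/nutch generate crawl/crawldb crawl/segments -topN 10
    SEGMENT=$(ls -d crawl/segments/2* | tail -1)   # newest segment directory
    bin/nutch fetch $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
done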

Update the linkdb database from the segments (invert the links):

Command: bin/nutch invertlinks crawl/linkdb crawl/segments/*
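
To check the result, the linkdb can be dumped to plain text with the link database reader; readlinkdb is a Nutch 1.x tool, and the output directory name here is just an example:

Command: bin/nutch readlinkdb crawl/linkdb -dump crawl/linkdb_dump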

Create the index:

Command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
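
Older Nutch tutorials follow indexing with a deduplication and merge step. From memory the commands look roughly like the following; treat them as a sketch and check bin/nutch for the exact names in your version:

Command: bin/nutch dedup crawl/indexes

Command: bin/nutch merge crawl/index crawl/indexes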

You can then query the index with this command:

Command: bin/nutch org.apache.nutch.searcher.NutchBean FAQ

Here, FAQ is the keyword to be searched.
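
Note that, if I remember the configuration correctly, NutchBean locates the index through the searcher.dir property, which defaults to a directory named crawl; so run the command from the directory that contains crawl/, or point searcher.dir at it in conf/nutch-site.xml. For example, to search for the keyword nutch:

Command: bin/nutch org.apache.nutch.searcher.NutchBean nutch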
