Recently I have been studying Nutch, and came across material on crawling the whole web using its low-level commands.
First, obtain a URL set. For testing, use the content.example.txt file from the http://rdf.dmoz.org/rdf/ directory, and create a dmoz folder to hold the parsed URLs:
Command: bin/nutch org.apache.nutch.tools.DmozParser content.example.txt > dmoz/urls
Inject the URL list into the crawldb database:
Command: bin/nutch inject crawl/crawldb dmoz
Create a fetch list:
Command: bin/nutch generate crawl/crawldb crawl/segments
Save the name of the newly created segment directory in the variable s1 for later use:
Command: s1=`ls -d crawl/segments/2* | tail -1`
Command: echo $s1
Note: ` is not a single quote; it is the backtick, on the key in the upper-left corner of the keyboard, shared with ~.
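Because the backtick is so easy to misread, here is a small self-contained demonstration of command substitution; the POSIX $(...) form is equivalent and harder to confuse with single quotes. The `echo hello` command stands in for the `ls -d ... | tail -1` pipeline above.

```shell
# Backticks and $(...) both perform command substitution:
# the command runs, and its output replaces the expression.
a=`echo hello`
b=$(echo hello)
echo "$a $b"

# The tutorial uses the same idiom to capture the newest
# segment directory, e.g. s1=$(ls -d crawl/segments/2* | tail -1)
```

In new scripts the $(...) form is generally preferred, not least because it nests cleanly.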
Run the fetcher to fetch the pages for these URLs:
Command: bin/nutch fetch $s1
Update the database so the fetched page information is stored in it:
Command: bin/nutch updatedb crawl/crawldb $s1
This completes the first fetch round.
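At this point it can be useful to inspect what the crawldb now contains, using Nutch's readdb tool. A minimal sketch, dry-run by default since the crawl/ paths above are only assumptions about your layout; replace "echo bin/nutch" with "bin/nutch" in a real Nutch installation:

```shell
# Dry run: NUTCH prints the command instead of executing it.
# Set NUTCH="bin/nutch" in a real Nutch checkout to actually run it.
NUTCH="echo bin/nutch"
$NUTCH readdb crawl/crawldb -stats
```

The -stats report shows, among other things, how many URLs have been fetched versus merely discovered.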
Next, select the top 10 highest-scoring URLs for the second and third rounds:
Command: bin/nutch generate crawl/crawldb crawl/segments -topN 10
Command: s2=`ls -d crawl/segments/2* | tail -1`
Command: echo $s2
Command: bin/nutch fetch $s2
Command: bin/nutch updatedb crawl/crawldb $s2
Command: bin/nutch generate crawl/crawldb crawl/segments -topN 10
Command: s3=`ls -d crawl/segments/2* | tail -1`
Command: echo $s3
Command: bin/nutch fetch $s3
Command: bin/nutch updatedb crawl/crawldb $s3
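The second and third rounds are the same three commands repeated, so they fold naturally into a loop. A minimal sketch, dry-run by default because it assumes the crawl/ layout above; replace "echo bin/nutch" with "bin/nutch" to actually run it:

```shell
# One fetch cycle per iteration: generate a top-10 fetch list,
# fetch it, then fold the results back into the crawldb.
NUTCH="echo bin/nutch"   # dry run; set NUTCH="bin/nutch" for real
for round in 2 3; do
  $NUTCH generate crawl/crawldb crawl/segments -topN 10
  # Pick up the newest segment directory (empty in this dry run,
  # since no segments are actually created).
  segment=$(ls -d crawl/segments/2* 2>/dev/null | tail -1)
  $NUTCH fetch "$segment"
  $NUTCH updatedb crawl/crawldb "$segment"
done
```

More rounds are simply more iterations; deep whole-web crawls repeat this cycle many times.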
Update the linkdb database based on the segments:
Command: bin/nutch invertlinks crawl/linkdb crawl/segments/*
Create the index:
Command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
You can then query the index with this command:
Command: bin/nutch org.apache.nutch.searcher.NutchBean FAQ
Here FAQ is the keyword to search for.