Nutch's data comprises three directory structures:
1. crawldb: stores the URLs known to Nutch together with their fetch status (whether a URL has been fetched, and when it should be fetched)
2. linkdb: stores the hyperlink (inlink) information for each URL, including anchor text
3. segments: each segment is a set of URLs fetched as a unit; segments can also serve as units for distributed search.
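To make the crawldb entry concrete, here is a minimal sketch of the kind of record it holds. Note that Nutch itself is written in Java and stores these records as Hadoop Writables (CrawlDatum objects with more fields); the class below is a simplified, hypothetical Python model for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical, simplified model of a crawldb entry. The real Nutch
# CrawlDatum is a Hadoop Writable with additional fields (score,
# signature, metadata, etc.).
@dataclass
class CrawlDatum:
    url: str
    status: str            # e.g. "db_unfetched" or "db_fetched"
    fetch_time: datetime   # when the URL should next be fetched
    fetch_interval: timedelta = timedelta(days=30)

    def due(self, now: datetime) -> bool:
        # A URL is due for (re)fetching once its fetch_time has passed.
        return now >= self.fetch_time

record = CrawlDatum("http://example.com/", "db_unfetched",
                    datetime(2024, 1, 1))
print(record.due(datetime(2024, 6, 1)))  # → True
```

This captures the two pieces of state the text mentions: whether to fetch (the status) and when to fetch (the fetch time plus interval).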
Each segment directory contains the following subdirectories:
(1) crawl_generate: names the set of URLs to be fetched (file type: SequenceFile)
(2) crawl_fetch: stores the fetch status of each URL (file type: MapFile)
(3) content: stores the raw binary content fetched for each URL (file type: MapFile)
(4) parse_text: stores the text parsed out of each URL (file type: MapFile)
(5) parse_data: stores the metadata parsed out of each URL (file type: MapFile)
(6) crawl_parse: holds outlink and status information used to promptly update crawldb (for example, when a fetched URL no longer exists) (file type: SequenceFile)
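The six subdirectories above give a segment a fixed on-disk layout, which can be sanity-checked with ordinary filesystem calls. The helper below is a sketch of such a check, not part of Nutch; the directory names are the ones listed above.

```python
import tempfile
from pathlib import Path

# The six standard subdirectories of a Nutch segment, as described above.
SEGMENT_SUBDIRS = [
    "crawl_generate",  # SequenceFile: URLs selected for fetching
    "crawl_fetch",     # MapFile: fetch status per URL
    "content",         # MapFile: raw fetched bytes per URL
    "parse_text",      # MapFile: parsed text per URL
    "parse_data",      # MapFile: parsed metadata per URL
    "crawl_parse",     # SequenceFile: outlink/status data for crawldb updates
]

def missing_subdirs(segment: Path) -> list[str]:
    """Return the standard subdirectories absent from a segment directory."""
    return [d for d in SEGMENT_SUBDIRS if not (segment / d).is_dir()]

# Demonstration on a throwaway directory with only one subdirectory present.
with tempfile.TemporaryDirectory() as d:
    seg = Path(d)
    (seg / "crawl_generate").mkdir()
    print(missing_subdirs(seg))
```

A real deployment would point `missing_subdirs` at a path like `crawl/segments/<timestamp>` on HDFS via a suitable filesystem client.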
Note: in terms of Nutch's data and component structure, crawldb corresponds to the WebDB of earlier versions, while segments correspond to fetchlists.
During a distributed crawl, each MapReduce job generates a segment named by the time at which it was created.
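Nutch derives each segment name from a 14-digit timestamp in `yyyyMMddHHmmss` form; the sketch below reproduces that naming scheme in Python for illustration (Nutch itself does this in Java with `SimpleDateFormat`).

```python
from datetime import datetime

def segment_name(when: datetime) -> str:
    """Format a creation time as a Nutch-style segment name, yyyyMMddHHmmss."""
    return when.strftime("%Y%m%d%H%M%S")

print(segment_name(datetime(2024, 5, 17, 9, 30, 0)))  # → 20240517093000
```

Because names sort chronologically, listing `segments/` in lexical order also lists the crawl rounds in the order they were generated.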