Over the past few days I covered the basics of Nutch and how to use it for intranet crawling. What follows is a test of whole-web crawling.
Nutch data comes in two types:
Web database: contains all the pages Nutch knows about and the link information between those pages.
A set of segments: each segment is a collection of pages that are fetched and indexed as a unit. Segment data includes the following types:
Fetchlist: a file naming the set of pages to fetch
Fetcher output: a set of files containing the fetched pages
Index: a Lucene-format index of the fetcher output
Note: If a clearer explanation exists, please refer to it. To be honest, Nutch's documentation is far from complete and is vague in many places.
Create the relevant directory and create an empty Web database:
[root@fc3 nutch]# mkdir db
[root@fc3 nutch]# mkdir segments
[root@fc3 nutch]# bin/nutch admin db -create
run java in /u01/app/oracle/product/10.1.0/db_1/jdk/jre
050104 122933 loading file:/u01/nutch/conf/nutch-default.xml
050104 122934 loading file:/u01/nutch/conf/nutch-site.xml
050104 122934 Created webdb at db
[root@fc3 nutch]# tree db
db
|-- dbreadlock
|-- dbwritelock
`-- webdb
    |-- linksByMD5
    |   |-- data
    |   `-- index
    |-- linksByURL
    |   |-- data
    |   `-- index
    |-- pagesByMD5
    |   |-- data
    |   `-- index
    `-- pagesByURL
        |-- data
        `-- index
5 directories, 10 files
[root@fc3 nutch]#
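As a quick sanity check after creating the database, the layout shown by `tree db` can be verified with a small script. This is only a sketch: the expected paths are copied from the listing above, and the function name `check_webdb_layout` is hypothetical, not something Nutch provides.

```shell
#!/bin/sh
# Sketch: check that a directory looks like a freshly created Nutch web database.
# The expected paths come from the `tree db` listing; check_webdb_layout is a
# hypothetical helper name, not a Nutch command.
check_webdb_layout() {
    db_dir=$1
    status=0
    # Two lock files sit at the top level of the database directory.
    for f in dbreadlock dbwritelock; do
        if [ ! -e "$db_dir/$f" ]; then
            echo "missing: $db_dir/$f" >&2
            status=1
        fi
    done
    # Each of the four tables holds a data file and an index file.
    for table in linksByMD5 linksByURL pagesByMD5 pagesByURL; do
        for f in data index; do
            if [ ! -e "$db_dir/webdb/$table/$f" ]; then
                echo "missing: $db_dir/webdb/$table/$f" >&2
                status=1
            fi
        done
    done
    return $status
}
```

Calling `check_webdb_layout db` from the Nutch directory returns 0 when all ten files are present and prints each missing path to standard error otherwise.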
Next, use the injector to inject URLs into the database. The Nutch documentation fetches the full collection of URLs from DMOZ and then processes a subset of it, but that file is too large. This test instead uses the content.example.txt file from the http://rdf.dmoz.org/rdf/ directory.
[root@fc3 nutch]# bin/nutch inject db -dmozfile content.example.txt
run java in /u01/app/oracle/product/10.1.0/db_1/jdk/jre
050104 123105 loading file:/u01/nutch/conf/nutch-default.xml
050104 123106 loading file:/u01/nutch/conf/nutch-site.xml
050104 123106 skew = 1251308788
050104 123106 Begin parse
050104 123106 Using URL filter: net.nutch.net.RegexURLFilter
050104 123106 found resource regex-urlfilter.txt at file:/u01/nutch/conf/regex-urlfilter.txt
.050104 123106 Completed parse. Added 40 pages.
......
[root@fc3 nutch]#
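To get a rough idea of how many URLs a DMOZ file contains before injecting it, the page entries can be counted directly. This sketch assumes each page in the RDF dump appears as an `<ExternalPage about="...">` element on its own line; the function name `count_dmoz_pages` is hypothetical.

```shell
#!/bin/sh
# Sketch: count the page entries in a DMOZ RDF file.
# Assumption: every page is an <ExternalPage about="..."> element on its own line.
# grep -c counts matching lines, so two entries on one line would count once.
# count_dmoz_pages is a hypothetical helper name, not part of Nutch.
count_dmoz_pages() {
    grep -c '<ExternalPage about=' "$1"
}
```

For example, `count_dmoz_pages content.example.txt` prints a single number. The injector may report a different count, since the log above shows that it applies a URL filter (RegexURLFilter) while parsing.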