Nutch First Experience (II.)

Source: Internet
Author: User
Tags mkdir web database

The previous post introduced the basics of Nutch and how to use it for intranet crawling. This post walks through a test of whole-web crawling.

Nutch data comes in two types:

Web database: contains all the pages that Nutch knows about and the link information between those pages.

A collection of segments: each segment is a set of pages that are fetched and indexed as a unit. Segment data includes the following types:

Fetchlist: a file that specifies the set of pages to fetch

Fetcher output: a set of files containing the fetched pages

Index: an index of the fetcher output in Lucene format

Note: The Nutch documentation may explain this more clearly, but to be honest it is far from complete and is vague in many places.

Create the required directories and an empty web database:

[root@fc3 nutch]# mkdir db
[root@fc3 nutch]# mkdir segments
[root@fc3 nutch]# bin/nutch admin db -create
run java in /u01/app/oracle/product/10.1.0/db_1/jdk/jre
050104 122933 loading file:/u01/nutch/conf/nutch-default.xml
050104 122934 loading file:/u01/nutch/conf/nutch-site.xml
050104 122934 Created webdb at db
[root@fc3 nutch]# tree db
db
|-- dbreadlock
|-- dbwritelock
`-- webdb
    |-- linksByMD5
    |   |-- data
    |   `-- index
    |-- linksByURL
    |   |-- data
    |   `-- index
    |-- pagesByMD5
    |   |-- data
    |   `-- index
    `-- pagesByURL
        |-- data
        `-- index
5 directories, 10 files
[root@fc3 nutch]#
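
The layout printed by tree above can also be checked from a script before moving on. The following is only a minimal sketch: check_webdb is a hypothetical helper (it is not part of Nutch), and it assumes the four tables and their data/index files are exactly the ones shown in the listing above.

```shell
#!/bin/sh
# check_webdb: hypothetical helper, not a Nutch command.
# Verifies that <db_dir>/webdb contains the four tables from the tree
# listing, each with a "data" and an "index" file.
check_webdb() {
    db_dir="$1"
    for table in linksByMD5 linksByURL pagesByMD5 pagesByURL; do
        for part in data index; do
            if [ ! -f "$db_dir/webdb/$table/$part" ]; then
                # Report the first missing file and stop.
                echo "missing: $db_dir/webdb/$table/$part"
                return 1
            fi
        done
    done
    echo "webdb layout ok"
}
```

After sourcing the file, calling check_webdb db from the Nutch directory prints either a confirmation or the first missing file, and the return code can be used to stop a script before the inject step.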

Next, use the injector to inject URLs into the database. The Nutch documentation obtains a collection of URLs from DMOZ and then processes a subset of them, but that file is too big. This test uses the content.example.txt file from the http://rdf.dmoz.org/rdf/ directory instead.

[root@fc3 nutch]# bin/nutch inject db -dmozfile content.example.txt
run java in /u01/app/oracle/product/10.1.0/db_1/jdk/jre
050104 123105 loading file:/u01/nutch/conf/nutch-default.xml
050104 123106 loading file:/u01/nutch/conf/nutch-site.xml
050104 123106 skew = 1251308788
050104 123106 Begin parse
050104 123106 Using URL filter: net.nutch.net.RegexURLFilter
050104 123106 found resource regex-urlfilter.txt at file:/u01/nutch/conf/regex-urlfilter.txt
.050104 123106 Completed parse. Added 40 pages.
......
[root@fc3 nutch]#
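
Before injecting, it can help to see which URLs a DMOZ file actually contains. The sketch below assumes the file follows the usual DMOZ RDF layout, with one `<ExternalPage about="...">` element per page on its own line. Both function names are hypothetical helpers, not Nutch commands. The count they report need not match the number of pages the injector adds, because the injector also applies the URL filter.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a DMOZ RDF content file.
# Assumption: each page is declared as <ExternalPage about="URL">,
# one declaration per line.

# Print the URL of every ExternalPage entry, one per line.
list_dmoz_urls() {
    sed -n 's/.*<ExternalPage about="\([^"]*\)".*/\1/p' "$1"
}

# Count the lines that declare an ExternalPage entry.
count_dmoz_pages() {
    grep -c '<ExternalPage about=' "$1"
}
```

For example, count_dmoz_pages content.example.txt gives a rough idea of how many pages the inject step has to consider.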
