Nutch 1.3 Study Notes 8: LinkDb

Source: Internet
Author: User

This note analyzes org.apache.nutch.crawl.LinkDb, which computes inverted links: for each URL, the pages that link to it.

1. Run the command bin/nutch invertlinks

The help output describes the parameters:

    Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
        linkdb              output LinkDb to create or update
        -dir segmentsDir    parent directory of several segments, OR
        seg1 seg2 ...       list of segment directories
        -force              force update even if LinkDb appears to be locked (CAUTION advised)
        -noNormalize        don't normalize link URLs
        -noFilter           don't apply URLFilters to link URLs

Running it locally gives:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb/ db/segments/20110822105243/
    LinkDb: starting at 2011-08-29 09:21:36
    LinkDb: linkdb: db/linkdb
    LinkDb: URL normalize: true
    LinkDb: URL filter: true
    LinkDb: adding segment: db/segments/20110822105243    // add the new segment
    LinkDb: merging with existing linkdb: db/linkdb    // merge with the existing linkdb
    LinkDb: finished at 09:21:40, elapsed: 00:00:03

2. Analysis of the main LinkDb source code

The work in LinkDb is done by its invert method, which does two things:
+ Analyze the newly supplied segment directories to generate a new inverted link database
+ Merge the new inverted link database with the existing one

2.1 Analyzing the newly supplied segment directories. The main code is as follows:

    // Create a new MapReduce job
    JobConf job = LinkDb.createJob(getConf(), linkDb, normalize, filter);
    // Add the input paths; there may be several, one parse_data directory per segment
    for (int i = 0; i < segments.length; i++) {
      if (LOG.isInfoEnabled()) {
        LOG.info("LinkDb: adding segment: " + segments[i]);
      }
      FileInputFormat.addInputPath(job, new Path(segments[i], ParseData.DIR_NAME));
    }
    // Submit the MapReduce job
    try {
      JobClient.runJob(job);
    } catch (IOException e) {
      LockUtil.removeLockFile(fs, lock);
      throw e;
    }

Let's look at what createJob does:

    private static JobConf createJob(Configuration config, Path linkDb, boolean normalize, boolean filter) {
      // Create a temporary output directory
      Path newLinkDb =
        new Path("linkdb-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      JobConf job = new NutchJob(config);
      job.setJobName("linkdb " + linkDb);

      // Set the input format
      job.setInputFormat(SequenceFileInputFormat.class);

      // Configure the mapper and combiner
      job.setMapperClass(LinkDb.class);
      job.setCombinerClass(LinkDbMerger.class);
      // If filtering or normalization is requested and no existing linkdb is found,
      // no merge job will run, so configure this job to do it.
      // if we don't run the mergeJob, perform normalization/filtering now
      if (normalize || filter) {
        try {
          FileSystem fs = FileSystem.get(config);
          if (!fs.exists(linkDb)) {
            job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
            job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
          }
        } catch (Exception e) {
          LOG.warn("LinkDb createJob: " + e);
        }
      }
      // Configure the reducer
      job.setReducerClass(LinkDbMerger.class);

      // Configure the job's output path
      FileOutputFormat.setOutputPath(job, newLinkDb);
      // Configure the output format
      job.setOutputFormat(MapFileOutputFormat.class);
      // Compress the job output (this property applies to the job's final output files)
      job.setBoolean("mapred.output.compress", true);
      // Configure the output <key, value> types
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Inlinks.class);

      return job;
    }

Next, let's look at what the map method in LinkDb does. Its main job is to build a mapping toUrl => fromUrl, which is a bit like the termId => docId mapping in an inverted index.
The LinkDbMerger class implements the Reducer interface. For each toUrl it collects the fromUrl entries pointing to it, up to a configured maximum; that maximum is set by db.max.inlinks.
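
The map and reduce steps described above can be imitated in memory. What follows is a minimal sketch in plain Java, not Nutch's code: the class name `InvertSketch`, the two-element `{fromUrl, toUrl}` arrays, and the `maxInlinks` parameter (standing in for the db.max.inlinks limit) are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the inversion, not Nutch code.
final class InvertSketch {

    // Each outlink {fromUrl, toUrl} is re-keyed by toUrl (the "map" step), and
    // the fromUrls sharing a toUrl are collected, keeping at most maxInlinks
    // of them per toUrl (the "reduce" step).
    static Map<String, List<String>> invert(List<String[]> outlinks, int maxInlinks) {
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (String[] link : outlinks) {
            String fromUrl = link[0];
            String toUrl = link[1];
            List<String> sources = inlinks.computeIfAbsent(toUrl, k -> new ArrayList<>());
            if (sources.size() < maxInlinks) {
                sources.add(fromUrl);
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        // Hypothetical example URLs.
        List<String[]> outlinks = Arrays.asList(
            new String[] {"http://a.example/", "http://c.example/"},
            new String[] {"http://b.example/", "http://c.example/"},
            new String[] {"http://a.example/", "http://b.example/"});
        System.out.println(invert(outlinks, 10));
    }
}
```

The sketch keeps only the fromUrl for each inlink; the real value type, Inlinks, also carries the anchor text, as the readlinkdb dump in section 3 shows.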

2.2 Merging the new inverted link database with the old one. The main code is as follows:

    if (fs.exists(currentLinkDb)) {
      // An old inverted link database exists, so merge with it
      if (LOG.isInfoEnabled()) {
        LOG.info("LinkDb: merging with existing linkdb: " + linkDb);
      }
      // try to merge
      Path newLinkDb = FileOutputFormat.getOutputPath(job);
      job = LinkDbMerger.createMergeJob(getConf(), linkDb, normalize, filter);
      // Add the input paths: the existing linkdb and the newly generated one
      FileInputFormat.addInputPath(job, currentLinkDb);
      FileInputFormat.addInputPath(job, newLinkDb);
      try {
        JobClient.runJob(job);
      } catch (IOException e) {
        LockUtil.removeLockFile(fs, lock);
        fs.delete(newLinkDb, true);
        throw e;
      }
      fs.delete(newLinkDb, true);
    }
    // Install the new inverted link database
    LinkDb.install(job, linkDb);
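
The error handling in this excerpt follows a simple pattern: the temporary output is deleted whether or not the merge succeeds, while the lock is removed only when the merge fails. The following is a minimal sketch of that pattern on the local file system, not Nutch's code. The names `MergeCleanupSketch`, `Step`, and `runMerge`, and the file names in `main`, are hypothetical, and single files stand in for the lock and the temporary output directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: the cleanup pattern around the merge job, using local files.
final class MergeCleanupSketch {

    // Hypothetical stand-in for "run the merge job".
    interface Step {
        void run() throws IOException;
    }

    // Runs the merge step. On failure, the lock and the temporary output are
    // removed and the exception is rethrown. On success, only the temporary
    // output is removed; the lock is left untouched here, as in the excerpt.
    static void runMerge(Step merge, Path lock, Path tempOutput) throws IOException {
        try {
            merge.run();
        } catch (IOException e) {
            Files.deleteIfExists(lock);
            Files.deleteIfExists(tempOutput);
            throw e;
        }
        Files.deleteIfExists(tempOutput);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("linkdb-sketch");
        Path lock = Files.createFile(dir.resolve("lock-file"));
        Path temp = Files.createFile(dir.resolve("temp-output"));
        try {
            runMerge(() -> { throw new IOException("simulated job failure"); }, lock, temp);
        } catch (IOException e) {
            System.out.println("lock removed: " + !Files.exists(lock)
                + ", temp removed: " + !Files.exists(temp));
        }
    }
}
```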

Let's look at what createMergeJob does:

    public static JobConf createMergeJob(Configuration config, Path linkDb, boolean normalize, boolean filter) {
      // Generate a temporary output directory
      Path newLinkDb =
        new Path("linkdb-merge-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      JobConf job = new NutchJob(config);
      job.setJobName("linkdb merge " + linkDb);

      // Configure the input format
      job.setInputFormat(SequenceFileInputFormat.class);

      // Configure the mapper and reducer. The reducer is the same as above: it
      // aggregates the values that share a key (toUrl) and outputs up to the
      // configured number of them. The LinkDbFilter mapper filters and
      // normalizes the URLs in both the key and the value.
      job.setMapperClass(LinkDbFilter.class);
      job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
      job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
      job.setReducerClass(LinkDbMerger.class);

      // Configure the output path
      FileOutputFormat.setOutputPath(job, newLinkDb);
      job.setOutputFormat(MapFileOutputFormat.class);
      job.setBoolean("mapred.output.compress", true);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Inlinks.class);

      return job;
    }
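
The effect of the merge job can be sketched in memory: entries with the same toUrl from the old and new databases are aggregated, and at most a configured number of sources is kept per key. This is an illustration under stated assumptions, not Nutch's implementation. The names `MergeSketch`, `merge`, and `maxInlinks` are hypothetical, URL filtering and normalization are left out, and dropping duplicate fromUrls with a set is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: merging several toUrl -> fromUrls maps in memory.
final class MergeSketch {

    // Aggregates the sources of each toUrl across all input databases,
    // dropping duplicates and keeping at most maxInlinks sources per toUrl.
    static Map<String, Set<String>> merge(List<Map<String, List<String>>> dbs, int maxInlinks) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map<String, List<String>> db : dbs) {
            for (Map.Entry<String, List<String>> entry : db.entrySet()) {
                Set<String> sources =
                    merged.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>());
                for (String fromUrl : entry.getValue()) {
                    if (sources.size() >= maxInlinks) {
                        break;
                    }
                    sources.add(fromUrl);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical example data.
        Map<String, List<String>> oldDb = new HashMap<>();
        oldDb.put("http://c.example/", Arrays.asList("http://a.example/"));
        Map<String, List<String>> newDb = new HashMap<>();
        newDb.put("http://c.example/", Arrays.asList("http://a.example/", "http://b.example/"));
        newDb.put("http://d.example/", Arrays.asList("http://a.example/"));
        System.out.println(merge(Arrays.asList(oldDb, newDb), 10));
    }
}
```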

3. bin/nutch readlinkdb Analysis

It is mainly used to dump the content of a linkdb to a specified directory. The help output is as follows:

    Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
        -dump <out_dir>    dump whole link db to a text file in <out_dir>
        -url <url>         print information about <url> to System.out

Running it locally gives:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch readlinkdb db/linkdb/ -dump output2
    LinkDb dump: starting at 2011-08-29 09:54:08
    LinkDb dump: db: db/linkdb/
    LinkDb dump: finished at 2011-08-29 09:54:09, elapsed: 00:00:01

The following is part of the output file in the output2 directory. Each record is a <key, value> pair: the key is the toUrl, and the value is its list of inlinks, each with a fromUrl and the anchor text.

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ head output2/part-00000
    [...] Inlinks:
     fromUrl: http://www.baidu.com/ anchor: Encyclopedia

    http://hi.baidu.com/ Inlinks:
     fromUrl: http://www.baidu.com/ anchor: Space

    [...] Inlinks:
     fromUrl: http://www.baidu.com/ anchor:

    http://home.baidu.com/ Inlinks:

The readlinkdb dump also runs a MapReduce job. The input format is SequenceFileInputFormat, the output format is TextOutputFormat, and the default mapper and reducer are used.
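
Because the dump is plain text, it can be read back with a few lines of code. The sketch below assumes the layout seen in the sample output above: one line ending in `Inlinks:` per toUrl, followed by indented `fromUrl: ... anchor: ...` lines. That layout is an assumption drawn from the sample rather than a documented format, and the names `DumpParseSketch` and `parse` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: reads dump lines back into a toUrl -> fromUrls map.
final class DumpParseSketch {

    private static final String KEY_SUFFIX = "Inlinks:";
    private static final String FROM_PREFIX = "fromUrl:";
    private static final String ANCHOR_MARK = " anchor:";

    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> inlinks = new LinkedHashMap<>();
        String currentKey = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.endsWith(KEY_SUFFIX)) {
                // A key line: everything before "Inlinks:" is the toUrl.
                currentKey = line.substring(0, line.length() - KEY_SUFFIX.length()).trim();
                inlinks.put(currentKey, new ArrayList<>());
            } else if (line.startsWith(FROM_PREFIX) && currentKey != null) {
                // A value line: the fromUrl sits between "fromUrl:" and " anchor:".
                int anchorAt = line.indexOf(ANCHOR_MARK);
                int end = anchorAt < 0 ? line.length() : anchorAt;
                inlinks.get(currentKey).add(line.substring(FROM_PREFIX.length(), end).trim());
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "http://hi.baidu.com/\tInlinks:",
            " fromUrl: http://www.baidu.com/ anchor: Space",
            "");
        System.out.println(parse(lines));
    }
}
```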

4. bin/nutch mergelinkdb Analysis

It is mainly used to merge several linkdb databases into one.

    Usage: LinkDbMerger <output_linkdb> <linkdb1> [<linkdb2> <linkdb3> ...] [-normalize] [-filter]
        output_linkdb    output LinkDb
        linkdb1 ...      input LinkDb-s (single input LinkDb is ok)
        -normalize       use URLNormalizer on both fromUrls and toUrls in linkdb(s) (usually not needed)
        -filter          use URLFilters on both fromUrls and toUrls in linkdb(s)

The merge itself simply calls the createMergeJob method of LinkDbMerger analyzed above.

5. Summary
This step computes the inverted links from the outlinks that parsing stored in the parse_data directory. These inverted links are used later, when link scores are calculated.
