Nutch 1.3 Study Notes 8 linkdb

Last Update:2018-12-04 Source: Internet

Author: User

Nutch 1.3 Study Notes 8 linkdb
----------------------------
Here we mainly analyze org.apache.nutch.crawl.LinkDb, which is used to compute inverted links (for each URL, the pages that link to it).

1. Run the command bin/nutch invertlinks

HELP parameter description:

    Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
        linkdb            output LinkDb to create or update
        -dir segmentsDir  parent directory of several segments, OR
        seg1 seg2 ...     list of segment directories
        -force            force update even if LinkDb appears to be locked (CAUTION advised)
        -noNormalize      don't normalize link URLs
        -noFilter         don't apply URLFilters to link URLs

Running it locally produces:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb/ db/segments/20110822105243/
    LinkDb: starting at 2011-08-29 09:21:36
    LinkDb: linkdb: db/linkdb
    LinkDb: URL normalize: true
    LinkDb: URL filter: true
    LinkDb: adding segment: db/segments/20110822105243    // add the new segment
    LinkDb: merging with existing linkdb: db/linkdb       // merge with the existing database
    LinkDb: finished at 09:21:40, elapsed: 00:00:03

2. Analysis of the main LinkDb source code

The work in LinkDb is done by its invert method, which does two things:
+ Analyze the newly supplied segment directories to generate a new inverted-link database
+ Merge the new inverted-link database with the existing one

2.1 Analyze the newly supplied segment directories. The main code is as follows:

    // create a new MapReduce job
    JobConf job = LinkDb.createJob(getConf(), linkDb, normalize, filter);
    // add the input paths; there may be several, one parse_data directory per segment
    for (int i = 0; i < segments.length; i++) {
      if (LOG.isInfoEnabled()) {
        LOG.info("LinkDb: adding segment: " + segments[i]);
      }
      FileInputFormat.addInputPath(job, new Path(segments[i], ParseData.DIR_NAME));
    }
    // submit the MapReduce job
    try {
      JobClient.runJob(job);
    } catch (IOException e) {
      // release the lock before propagating the failure
      LockUtil.removeLockFile(fs, lock);
      throw e;
    }

Let's take a look at what createJob does:

    private static JobConf createJob(Configuration config, Path linkDb, boolean normalize, boolean filter) {
      // create a temporary output directory
      Path newLinkDb = new Path("linkdb-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      JobConf job = new NutchJob(config);
      job.setJobName("linkdb " + linkDb);

      // set the input format
      job.setInputFormat(SequenceFileInputFormat.class);

      // configure the mapper and combiner (the reducer is set further down)
      job.setMapperClass(LinkDb.class);
      job.setCombinerClass(LinkDbMerger.class);

      // If filtering or normalization is enabled and no old linkdb directory
      // exists, no merge job will run, so configure them on this job instead.
      // if we don't run the mergeJob, perform normalization/filtering now
      if (normalize || filter) {
        try {
          FileSystem fs = FileSystem.get(config);
          if (!fs.exists(linkDb)) {
            job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
            job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
          }
        } catch (Exception e) {
          LOG.warn("LinkDb createJob: " + e);
        }
      }
      job.setReducerClass(LinkDbMerger.class);

      // configure the job output path
      FileOutputFormat.setOutputPath(job, newLinkDb);
      // configure the output format
      job.setOutputFormat(MapFileOutputFormat.class);
      // compress the job output (this property applies to the final output,
      // not to the intermediate map output)
      job.setBoolean("mapred.output.compress", true);
      // configure the output <key, value> types
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Inlinks.class);

      return job;
    }

Next, let's look at what the map method in LinkDb does. It builds a mapping from toUrl => fromUrl, much like the termId => docId mapping in an inverted index.
The LinkDbMerger class implements the Reducer interface. For each toUrl it collects the fromUrls that point to it, up to a maximum number that can be set with db.max.inlinks.
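
The map and reduce steps described above can be illustrated without Hadoop. The following is a minimal plain-Java sketch, not Nutch code: the class name LinkInversionSketch, the invert method and the example URLs are all made up for illustration, and anchor text is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of link inversion; none of these names exist in Nutch.
class LinkInversionSketch {

  // "Map" step: every (fromUrl -> outlinks) record yields (toUrl, fromUrl) pairs.
  // "Reduce" step: pairs are grouped by toUrl, keeping at most maxInlinks sources.
  static Map<String, List<String>> invert(Map<String, List<String>> outlinks, int maxInlinks) {
    Map<String, List<String>> inlinks = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> page : outlinks.entrySet()) {
      String fromUrl = page.getKey();
      for (String toUrl : page.getValue()) {
        List<String> sources = inlinks.computeIfAbsent(toUrl, k -> new ArrayList<>());
        if (sources.size() < maxInlinks) {
          sources.add(fromUrl);
        }
      }
    }
    return inlinks;
  }

  public static void main(String[] args) {
    Map<String, List<String>> outlinks = new LinkedHashMap<>();
    outlinks.put("http://a.example/", Arrays.asList("http://c.example/", "http://b.example/"));
    outlinks.put("http://b.example/", Arrays.asList("http://c.example/"));
    // The cap of 2 stands in for the value of db.max.inlinks.
    System.out.println(invert(outlinks, 2));
  }
}
```

In the real job the grouping by toUrl is done by the MapReduce shuffle rather than by an in-memory map, which is what lets it scale beyond a single machine.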

2.2 Merge the new inverted-link database with the old one. The main code is as follows:

    if (fs.exists(currentLinkDb)) {
      // an old inverted-link database exists, so merge with it
      if (LOG.isInfoEnabled()) {
        LOG.info("LinkDb: merging with existing linkdb: " + linkDb);
      }
      // try to merge
      Path newLinkDb = FileOutputFormat.getOutputPath(job);
      job = LinkDbMerger.createMergeJob(getConf(), linkDb, normalize, filter);
      // add both databases as input paths
      FileInputFormat.addInputPath(job, currentLinkDb);
      FileInputFormat.addInputPath(job, newLinkDb);
      try {
        JobClient.runJob(job);
      } catch (IOException e) {
        // release the lock and clean up the temporary output before failing
        LockUtil.removeLockFile(fs, lock);
        fs.delete(newLinkDb, true);
        throw e;
      }
      // the temporary output of the first job is no longer needed
      fs.delete(newLinkDb, true);
    }
    LinkDb.install(job, linkDb);  // install the new inverted-link database
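
The effect of the merge step can be sketched in plain Java as a union of inlink lists per toUrl. This is only an illustration under stated assumptions: LinkDbMergeSketch and merge are hypothetical names, duplicates are assumed to be dropped, and the cap plays the role of db.max.inlinks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of merging inverted-link maps; not Nutch code.
class LinkDbMergeSketch {

  // Union the inlinks of all given maps per toUrl, skipping duplicates
  // (an assumption) and stopping at maxInlinks sources per target.
  @SafeVarargs
  static Map<String, List<String>> merge(int maxInlinks, Map<String, List<String>>... linkDbs) {
    Map<String, List<String>> merged = new LinkedHashMap<>();
    for (Map<String, List<String>> db : linkDbs) {
      for (Map.Entry<String, List<String>> entry : db.entrySet()) {
        List<String> sources = merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
        for (String fromUrl : entry.getValue()) {
          if (sources.size() >= maxInlinks) {
            break; // cap reached for this toUrl
          }
          if (!sources.contains(fromUrl)) {
            sources.add(fromUrl);
          }
        }
      }
    }
    return merged;
  }
}
```

Calling merge(cap, oldDb, newDb) corresponds to the two input paths added above; passing more maps corresponds to merging several databases at once.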

Let's take a look at what createMergeJob does:

    public static JobConf createMergeJob(Configuration config, Path linkDb, boolean normalize, boolean filter) {
      // generate a temporary output directory
      Path newLinkDb = new Path("linkdb-merge-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      JobConf job = new NutchJob(config);
      job.setJobName("linkdb merge " + linkDb);

      // configure the input format
      job.setInputFormat(SequenceFileInputFormat.class);

      // Configure the mapper and reducer. The reducer is the same one used above:
      // it aggregates the values that share a key (toUrl) and emits at most the
      // configured number of them. LinkDbFilter filters and normalizes the URLs
      // in both the key and the value.
      job.setMapperClass(LinkDbFilter.class);
      job.setBoolean(LinkDbFilter.URL_NORMALIZING, normalize);
      job.setBoolean(LinkDbFilter.URL_FILTERING, filter);
      job.setReducerClass(LinkDbMerger.class);

      // configure the output path, format, compression and <key, value> types
      FileOutputFormat.setOutputPath(job, newLinkDb);
      job.setOutputFormat(MapFileOutputFormat.class);
      job.setBoolean("mapred.output.compress", true);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Inlinks.class);

      return job;
    }
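
To make the role of the filtering mapper concrete, here is a minimal plain-Java sketch of normalizing and filtering both the key (toUrl) and the values (fromUrls). It does not use the Nutch URLNormalizers or URLFilters classes: LinkFilterSketch, clean and apply are hypothetical names, and the normalizer and filter are passed in as ordinary functions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical illustration of URL normalization and filtering; not Nutch code.
class LinkFilterSketch {

  // Normalize first, then filter; returns null when the URL is rejected.
  static String clean(String url, UnaryOperator<String> normalizer, Predicate<String> filter) {
    String normalized = normalizer.apply(url);
    return (normalized != null && filter.test(normalized)) ? normalized : null;
  }

  // Clean every toUrl and fromUrl; drop rejected URLs and any target
  // that is left without inlinks.
  static Map<String, List<String>> apply(Map<String, List<String>> linkDb,
      UnaryOperator<String> normalizer, Predicate<String> filter) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : linkDb.entrySet()) {
      String toUrl = clean(entry.getKey(), normalizer, filter);
      if (toUrl == null) {
        continue; // the target itself was filtered out
      }
      List<String> kept = new ArrayList<>();
      for (String fromUrl : entry.getValue()) {
        String cleaned = clean(fromUrl, normalizer, filter);
        if (cleaned != null) {
          kept.add(cleaned);
        }
      }
      if (!kept.isEmpty()) {
        // two keys may normalize to the same toUrl, so append rather than overwrite
        result.computeIfAbsent(toUrl, k -> new ArrayList<>()).addAll(kept);
      }
    }
    return result;
  }
}
```

For example, apply(db, s -> s.toLowerCase(), u -> !u.endsWith(".jpg")) lower-cases every URL and drops image links on either side of a link.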

3. bin/nutch readlinkdb Analysis

It is mainly used to dump the content of the linkdb to a specified directory. The help is as follows:

    Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
        -dump <out_dir>   dump whole link db to a text file in <out_dir>
        -url <url>        print information about <url> to System.out

Running it locally produces:

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch readlinkdb db/linkdb/ -dump output2
    LinkDb dump: starting at 2011-08-29 09:54:08
    LinkDb dump: db: db/linkdb/
    LinkDb dump: finished at 2011-08-29 09:54:09, elapsed: 00:00:01

The following is part of the output file in the output2 directory. Each record is a <key, value> pair: the key is the toUrl, and the value lists the fromUrls (with their anchor text) that link to it.

    lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ head output2/part-00000
    [...]    Inlinks:
     fromUrl: http://www.baidu.com/ anchor: Encyclopedia

    http://hi.baidu.com/    Inlinks:
     fromUrl: http://www.baidu.com/ anchor: Space

    [...]    Inlinks:
     fromUrl: http://www.baidu.com/ anchor:

    http://home.baidu.com/    Inlinks:

readlinkdb also runs a MapReduce job. Its input format is SequenceFileInputFormat, its output format is TextOutputFormat, and it uses the default mapper and reducer.
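
The text layout of one dumped record can be sketched as follows. The exact layout is an assumption modeled on the sample output shown in this section, and LinkDbDumpSketch, its nested Inlink class and format are hypothetical names rather than the Nutch classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the dump text layout; not Nutch code.
class LinkDbDumpSketch {

  // One inlink: the page the link comes from and its anchor text.
  static final class Inlink {
    final String fromUrl;
    final String anchor;

    Inlink(String fromUrl, String anchor) {
      this.fromUrl = fromUrl;
      this.anchor = anchor;
    }
  }

  // Render one <toUrl, inlinks> record as text: the key, a tab, then one
  // indented line per inlink (assumed layout).
  static String format(String toUrl, List<Inlink> inlinks) {
    StringBuilder sb = new StringBuilder();
    sb.append(toUrl).append("\tInlinks:\n");
    for (Inlink inlink : inlinks) {
      sb.append(" fromUrl: ").append(inlink.fromUrl)
        .append(" anchor: ").append(inlink.anchor).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(format("http://hi.baidu.com/",
        Arrays.asList(new Inlink("http://www.baidu.com/", "Space"))));
  }
}
```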

4. bin/nutch mergelinkdb Analysis

It is mainly used to merge different linkdb databases.

    Usage: LinkDbMerger <output_linkdb> <linkdb1> [<linkdb2> <linkdb3> ...] [-normalize] [-filter]
        output_linkdb   output LinkDb
        linkdb1 ...     input LinkDb-s (single input LinkDb is ok)
        -normalize      use URLNormalizer on both fromUrls and toUrls in linkdb(s) (usually not needed)
        -filter         use URLFilters on both fromUrls and toUrls in linkdb(s)

In fact, the merge simply runs the job built by the createMergeJob method of LinkDbMerger analyzed above.

5. Summary
Here we mainly compute the inverted links for the outlinks parsed into the parse_data directories. These inverted links will be used later, when link scores are calculated.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
