Nutch 1.3 Study Notes 11-1: The OPIC Page Scoring Mechanism
--------------------------------------
1. The page scoring mechanism of Nutch 1.3

At present, Nutch 1.3 uses OPIC as its default web page scoring algorithm, but a PageRank-like algorithm has been introduced to make up for OPIC's shortcomings. The OPIC algorithm is implemented as an extension of the ScoringFilter extension point in Nutch, while the new LinkRank algorithm lives in the org.apache.nutch.scoring.webgraph package. LinkRank addresses two problems that OPIC cannot solve: first, repeatedly crawling a page keeps increasing the importance of the pages that have already been crawled; second, pages newly added while crawling increase the total amount of cash in the whole web graph, which dilutes the importance of pages that are not re-crawled.

2. What is the OPIC algorithm and what are its features?

The following description of the algorithm is drawn from an external source (see the references at the end):

The OPIC algorithm operates on a static graph. Its basic idea is that every page holds some initial cash. When a page is crawled, its cash is distributed evenly among the pages it links to, and the total amount of cash in the whole web graph remains constant. As crawling proceeds, cash flows between pages. Intuitively, the importance of a page under OPIC is defined as the proportion of the total cash flow that has passed through that page.

For each page (a node in the graph), the OPIC algorithm maintains two values: cash and history. Cash is the page's current cash value; history is the total cash the page has accumulated from the start of the algorithm up to its most recent crawl. Cash is usually initialized to 1/n (where n is the total number of pages) and history to 0.

The OPIC algorithm uses two vectors, C[1,...,n] and H[1,...,n], to hold the cash and history values of every page. To speed up the computation, a variable G with G = |H| = Σᵢ H[i] is also maintained and updated on every page crawl. The pseudocode of the OPIC algorithm from the original paper is as follows:

OPIC: On-line Page Importance Computation

for each i let C[i] := 1/n ;
for each i let H[i] := 0 ;
let G := 0 ;
do forever
begin
    choose some node i ;            %% each node is selected
                                    %% infinitely often
    H[i] += C[i] ;                  %% single disk access per page
    for each child j of i,
        do C[j] += C[i]/out[i] ;    %% distribution of cash
                                    %% depends on L
    G += C[i] ;
    C[i] := 0 ;
end
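
To make the loop concrete, here is a minimal, self-contained Java translation of the pseudocode above. The three-node example graph, the random seed, and the iteration count are illustrative choices, not taken from the paper:

import java.util.Random;

public class Opic {
    public static void main(String[] args) {
        // Illustrative graph: out[i] lists the children of node i.
        int[][] out = { { 1, 2 }, { 2 }, { 0 } };
        int n = out.length;

        double[] c = new double[n];   // cash, C[i]
        double[] h = new double[n];   // history, H[i]
        double g = 0.0;               // G = |H|
        for (int i = 0; i < n; i++) c[i] = 1.0 / n;

        Random rnd = new Random(42);
        for (int step = 0; step < 100000; step++) {
            int i = rnd.nextInt(n);            // choose some node i (each node
                                               // is selected infinitely often)
            h[i] += c[i];
            for (int j : out[i])               // distribute cash to children
                c[j] += c[i] / out[i].length;
            g += c[i];
            c[i] = 0.0;
        }

        // Importance of page i = its share of the total cash flow, H[i]/G.
        for (int i = 0; i < n; i++)
            System.out.printf("page %d: %.4f%n", i, h[i] / g);
    }
}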



Several questions about the OPIC algorithm:
1. Handling of sink pages (pages with no outlinks):
The OPIC algorithm introduces the concept of a virtual page that has a two-way link with every page in the graph. A page with no outlinks therefore sends its cash to the virtual page, which redistributes it across the whole graph (illustrated in the sketch after this list).
2. Convergence:
The OPIC algorithm folds the computation of page importance into the crawling process and relies on repeated crawling. An important question is whether the importance estimate, i.e. the share H[i]/G of total cash that has flowed through page i, converges as pages are crawled repeatedly; only then is the algorithm correct and meaningful. The original paper gives a rigorous proof of convergence, which is only pointed out here.
3. Crawling policy
As mentioned above, the OPIC algorithm relies on repeated crawling, so the crawling policy is an important issue: it directly affects how quickly the importance estimates converge. In fact, both theory and experiments show that the greedy policy is the best strategy, that is, the pages holding the most cash are crawled first (also shown in the sketch below).
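
The following sketch (illustrative only, not Nutch code) combines points 1 and 3 above: an extra node acts as the virtual page with two-way links to every real page, so sink pages hand their cash to it, and the page currently holding the most cash is always crawled next:

import java.util.Arrays;

public class GreedyOpic {
    public static void main(String[] args) {
        // Illustrative graph: node 1 has no outlinks (a sink).
        int[][] out = { { 1 }, { }, { 0, 1 } };
        int n = out.length;
        int virtualPage = n;              // extra node acting as the virtual page

        double[] c = new double[n + 1];   // cash, including the virtual page
        double[] h = new double[n + 1];   // history
        double g = 0.0;
        Arrays.fill(c, 1.0 / (n + 1));

        for (int step = 0; step < 100000; step++) {
            // Greedy policy: crawl the page currently holding the most cash.
            int i = 0;
            for (int k = 1; k <= n; k++)
                if (c[k] > c[i]) i = k;

            h[i] += c[i];
            if (i == virtualPage) {
                // The virtual page links back to every real page.
                for (int j = 0; j < n; j++) c[j] += c[i] / n;
            } else if (out[i].length == 0) {
                // A sink page sends all of its cash to the virtual page.
                c[virtualPage] += c[i];
            } else {
                // Every real page also implicitly links to the virtual page.
                for (int j : out[i]) c[j] += c[i] / (out[i].length + 1);
                c[virtualPage] += c[i] / (out[i].length + 1);
            }
            g += c[i];
            c[i] = 0.0;
        }

        for (int i = 0; i < n; i++)
            System.out.printf("page %d: %.4f%n", i, h[i] / g);
    }
}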

To address the convergence problem of the OPIC algorithm on a changing web, the Adaptive OPIC algorithm was proposed. It introduces the concept of a time window: instead of accumulating history over the entire run, a page's importance is estimated from the cash it received within the most recent window. Its main point is still to fold the computation of page importance into the crawling process, which keeps the model simple; a rough illustration follows below.
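
As a rough illustration of the time-window idea (an assumption-laden sketch; the class, the WINDOW constant, and the bookkeeping strategy are invented for this example and the paper describes several more refined policies), each page could keep its recent cash deposits and count only those that fall inside the window:

import java.util.ArrayDeque;
import java.util.Deque;

public class WindowedHistory {
    private static final long WINDOW = 10000; // assumed window length (logical time units)

    private static final class Deposit {
        final long time; final double cash;
        Deposit(long time, double cash) { this.time = time; this.cash = cash; }
    }

    private final Deque<Deposit> deposits = new ArrayDeque<>();
    private double sum = 0.0;

    /** Record cash collected from the page at the given logical time. */
    void add(long time, double cash) {
        deposits.addLast(new Deposit(time, cash));
        sum += cash;
    }

    /** Cash gathered inside the window ending at `now`; older deposits are dropped. */
    double historyInWindow(long now) {
        while (!deposits.isEmpty() && deposits.peekFirst().time < now - WINDOW) {
            sum -= deposits.removeFirst().cash;
        }
        return sum;
    }

    public static void main(String[] args) {
        WindowedHistory h = new WindowedHistory();
        h.add(0, 0.5);
        h.add(9000, 0.25);
        System.out.println(h.historyInWindow(12000)); // 0.25: the first deposit expired
    }
}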

3. Application of OPIC in Nutch

According to the comments on the OpicScoringFilter class (in the org.apache.nutch.scoring.opic package of the Nutch 1.3 source), the link analysis algorithm implemented by Nutch is based on Adaptive On-line Page Importance Computation. Nutch wires it in as a ScoringFilter plug-in, which means users can plug in their own scoring algorithms (a sketch of such a hook follows below).
ParseOutputFormat prepares the data for the score calculation: the RecordWriter returned by FetcherOutputFormat wraps the one produced by ParseOutputFormat, so after a fetched page has been parsed it is written out through ParseOutputFormat's RecordWriter, and it is inside this RecordWriter that the OPIC score distribution method is invoked.
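
Here is a hedged sketch of where a custom plug-in would hook in. The method shown mirrors the outlink hook analyzed in the next section; the signature follows the Nutch 1.3 ScoringFilter interface as best I recall it, so check ScoringFilter.java in your checkout before relying on it. A real plug-in would implement the full org.apache.nutch.scoring.ScoringFilter interface and be registered in plugin.includes; only one hook is shown:

import java.util.Collection;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;

public class MyScoringFilter {

    /** Invoked from ParseOutputFormat's RecordWriter for each parsed page:
     *  split the page's cash among its outlinks, as OPIC prescribes. */
    public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
            ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
            CrawlDatum adjust, int allCount) {
        float pageScore = 1.0f; // simplified: the real filter reads it from parseData
        float share = pageScore / Math.max(1, targets.size()); // even split
        for (Entry<Text, CrawlDatum> target : targets) {
            target.getValue().setScore(share); // cash flowing to each outlink
        }
        return adjust; // optional adjustment written back for the source page
    }
}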

4. Source code analysis of the Nutch OPIC implementation

The following describes the distributeScoreToOutlinks method of OpicScoringFilter. The source code is as follows:

float score = scoreInjected; // the injected score; it seems unused here
// Get the initial score produced during parsing. It was stored before the
// FetcherThread parsed the page, via
// scfilters.passScoreBeforeParsing(key, datum, content);
String scoreString = parseData.getContentMeta().get(Nutch.SCORE_KEY);
if (scoreString != null) {
  try {
    score = Float.parseFloat(scoreString);
  } catch (Exception e) {
    e.printStackTrace(LogUtil.getWarnStream(LOG));
  }
}
// Get the number of valid outlinks
int validCount = targets.size();
if (countFiltered) {
  score /= allCount;
} else {
  if (validCount == 0) {
    // No outlinks to distribute score, so just return adjust
    return adjust;
  }
  score /= validCount;
}
// Internal and external score factors
float internalScore = score * internalScoreFactor; // score for internal links; the weight factor defaults to 1.0f
float externalScore = score * externalScoreFactor; // score for external links; the weight factor defaults to 1.0f
for (Entry<Text, CrawlDatum> target : targets) {
  try {
    String toHost = new URL(target.getKey().toString()).getHost();
    String fromHost = new URL(fromUrl.toString()).getHost();
    if (toHost.equalsIgnoreCase(fromHost)) {
      target.getValue().setScore(internalScore); // contribution of an internal link
    } else {
      target.getValue().setScore(externalScore); // contribution of an external link
    }
  } catch (MalformedURLException e) {
    e.printStackTrace(LogUtil.getWarnStream(LOG));
    target.getValue().setScore(externalScore);
  }
}
// XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
// XXX in the paper, where page "loses" its score if it's distributed to
// XXX linked pages...
return adjust;
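
To make the distribution concrete: suppose a fetched page carries a score of 1.0 and has four outlinks, all pointing at other hosts, with countFiltered false and the default factors of 1.0f. Then score /= 4 gives 0.25, and each outlink's CrawlDatum receives externalScore = 0.25. Note that, as the XXX comment observes, the source page keeps its own score rather than "losing" the cash it distributed, which deviates from the paper's description; this is the kind of deviation that reference [1] discusses.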

5. Summary

In web crawling, the quality of the ranking algorithm directly affects a search engine's results. This is especially true for focused crawlers. It is possible that OPIC will no longer be used after Nutch 2.0 and that the new scoring functions, which can be found in org.apache.nutch.scoring.webgraph, will be used instead.

6. References

[1] Fixing the OPIC Algorithm in Nutch, http://wiki.apache.org/nutch/FixingOpicScoring
[2] Abiteboul et al., Adaptive On-Line Page Importance Computation, WWW 2003, http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
[3] Link analysis algorithms in the Nutch source code, http://www.endless-loops.com/2011/03/nutch%E6%BA%90%E7%A0%81%E4%B8%AD%E7%9A%84%E9%93%BE%E6%8E%A5%E5%88%86%E6%9E%90%E7%AE%97%E6%B3%95-497.html
