Running

    bin/hadoop jar apache-nutch-1.7.job org.apache.nutch.crawl.CrawlDbReader crawl/crawldb -stats -sort

shows a lot of unfetched URLs. The reason is that nutch-default.xml sets a minimum score for the generate step: only entries whose score reaches that threshold (0 by default here) are selected for generation, so many low-scoring URLs are never fetched. (If you dump the crawldb, you will find that the unfetched URLs have negative scores, and the negative values are quite large.)

One option is to comment out the check in the generate code:

    // consider only entries with a score superior to the threshold
    if (scoreThreshold != Float.NaN && sort < scoreThreshold) return;

In the end, I think the better fix is to change the configuration instead. The property is:

    <property>
      <name>generate.min.score</name>
      <value>0</value>
      <description>Select only entries with a score larger than generate.min.score.</description>
    </property>

Change the value from 0 to -1.
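To make the effect of the threshold concrete, here is a minimal stand-alone sketch of the same kind of check. It is not the Nutch code: the class name GenerateMinScoreSketch, the select method, and the example URLs and scores are all made up for illustration, and the sketch uses Float.isNaN so that a NaN threshold really disables the filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenerateMinScoreSketch {

    // Hypothetical helper: returns the URLs that survive a minimum-score check
    // shaped like the one quoted above (entries below the threshold are skipped).
    public static List<String> select(Map<String, Float> scores, float scoreThreshold) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Float> entry : scores.entrySet()) {
            float sort = entry.getValue();
            // A NaN threshold means "no filtering"; otherwise skip low scores.
            if (!Float.isNaN(scoreThreshold) && sort < scoreThreshold) {
                continue;
            }
            selected.add(entry.getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up scores, only to show which entries pass each threshold.
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("http://example.com/a", 0.5f);
        scores.put("http://example.com/b", -0.3f);
        scores.put("http://example.com/c", -4.0f);

        System.out.println("threshold 0:   " + select(scores, 0f));
        System.out.println("threshold -1:  " + select(scores, -1f));
        System.out.println("threshold NaN: " + select(scores, Float.NaN));
    }
}
```

With these made-up scores, a threshold of 0 keeps only the first URL, and -1 also lets the second one through. The third URL still falls below -1 and is dropped unless the threshold is lowered further or the filter is disabled, so it is worth checking how negative the dumped scores actually are before settling on a value.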
Reasons for many unfetched URLs in Nutch