Parts of Nutch that need to be rewritten


Introduction

As an open-source search engine, Nutch has done a great deal to lower the barrier to entry for the search engine market. However, because its code was written by many contributors and its main goal is whole-web search, using it for enterprise-level or vertical search still raises many problems, and changing configuration files alone cannot solve them. Parts of Nutch's code therefore need to be modified or rewritten.
Below are some of the areas that I think need to be modified or rewritten for our project:

  1. Throughout the crawl, old and new URLs must be handled separately. In addition, within a single crawl, no depth should re-fetch pages that were already fetched at a previous depth. Nutch currently has no such control; you can only use FetchSchedule to control when an already-fetched page becomes eligible for re-fetching.
  2. The CrawlDb update code is quite chaotic. I can only make out its rough logic, and a small part of it I still do not understand.
  3. Fetcher: there is no need to sort the final fetched data. The whole thing should also be refactored into separate classes to reduce the amount of code in a single file. And although Fetcher's algorithm for limiting connections and enforcing delays per host is correct, I would rather delegate this to HttpClient's connection manager.
  4. Lib-HTTP, Protocol-http: Nutch currently implements the HTTP protocol by manually parsing the text of the socket communication, which often causes errors. It should be changed to use a ready-made HTTP library.
  5. The index package: NutchDocument and its chain of dependent classes such as NutchAnalyzer. This part has word-segmentation problems: it does not use a Lucene analyzer but instead implements its own tokenizer with JavaCC. Even if we do not need Chinese word segmentation, we should rewrite this part to use a Lucene analyzer.
  6. Generator: the ordering of the final fetch list. The current implementation sorts by URL hash, which does not produce a fetch list that is sufficiently scattered across hosts. A better approach: let n_i be the number of URLs belonging to host i and m the total number of URLs; then the j-th URL of host i should be placed at position j·m/n_i. Sorting by this key ensures the fetch list is evenly distributed across hosts, which avoids opening multiple connections to one host at the same time.
  7. NutchBean and many classes under the search package need to be rewritten, because their implementation does not consider file sharing, locking, and unlocking. If Nutch is building an index while a client calls NutchBean, a file-sharing conflict occurs. Of course, in practice user queries and indexing may never happen at the same time. The solution is to maintain an internal pointer, as you said last time: number all index files by date, and before each search check whether the currently open index file has the latest number; if not, open the new index file.
  8. Encoding detection problems. Nutch's cached web-page snapshots are often garbled; its encoding detection module has problems. It calls ICU4J for charset detection, but the detection rate does not seem high. I have changed it to parse the HTML/XML manually and use the Firefox (Mozilla) charset detection engine instead, and the result is much better.
  9. The maximum number of connections, connection delay, URL filters, proxy server, crawl depth, and total number of pages fetched per site can currently only be set globally, not independently per site. This causes problems. For example, website A has a heavy load, so its connection count should be smaller and its delay longer, while connections to other websites can be more numerous and their delays shorter. Per-site settings would balance overall crawl speed against correct, polite network behavior.
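The control proposed in point 1 can be sketched as a single crawl-wide set of already-fetched URLs that every depth consults before fetching. This is a minimal illustration, not Nutch's actual API:

```java
import java.util.*;

/**
 * Sketch of point 1: keep one crawl-wide set of already-fetched URLs
 * so that no depth re-fetches a page captured at a previous depth.
 * (Hypothetical class, not part of Nutch.)
 */
public class FetchedUrlFilter {
    private final Set<String> fetched = new HashSet<>();

    /** Returns only the URLs not fetched at any earlier depth, and records them. */
    public List<String> filterAndRecord(List<String> candidates) {
        List<String> fresh = new ArrayList<>();
        for (String url : candidates) {
            if (fetched.add(url)) {   // add() returns false if already present
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        FetchedUrlFilter filter = new FetchedUrlFilter();
        // depth 1
        List<String> d1 = filter.filterAndRecord(
            Arrays.asList("http://a.example/1", "http://b.example/1"));
        // depth 2: one URL was already fetched at depth 1
        List<String> d2 = filter.filterAndRecord(
            Arrays.asList("http://a.example/1", "http://a.example/2"));
        System.out.println(d1.size() + " " + d2.size()); // 2 1
    }
}
```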
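Point 3 suggests delegating per-host connection limits to HttpClient's connection manager. A minimal configuration sketch using Apache HttpClient 4.x is shown below (an assumption on my part — at the time of writing, HttpClient 3.x with MultiThreadedHttpConnectionManager offered a similar per-host limit):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Configuration sketch: the connection manager, not the Fetcher,
// enforces the per-host connection ceiling.
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(100);            // total connections across all hosts
cm.setDefaultMaxPerRoute(2);    // at most 2 concurrent connections per host
CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(cm)
        .build();
```

The numbers 100 and 2 are illustrative defaults, not values taken from Nutch.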
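The even-spacing order from point 6 can be implemented by computing the key j·m/n_i for each URL and sorting by it. The sketch below assumes a crude host extraction just for illustration:

```java
import java.util.*;

/** Sketch of the host-interleaving order from point 6: the j-th URL of a
 *  host with n_i URLs (m URLs total) gets sort key j*m/n_i. */
public class HostInterleaver {
    static String hostOf(String url) {
        // crude host extraction, sufficient for the sketch
        int start = url.indexOf("://") + 3;
        int end = url.indexOf('/', start);
        return end < 0 ? url.substring(start) : url.substring(start, end);
    }

    /** Sort so each host's URLs are spread evenly across the whole list. */
    public static List<String> interleaveByHost(List<String> urls) {
        Map<String, Integer> perHost = new HashMap<>();   // n_i per host
        for (String u : urls) perHost.merge(hostOf(u), 1, Integer::sum);
        int m = urls.size();
        Map<String, Integer> seen = new HashMap<>();      // running j per host
        List<Map.Entry<Double, String>> keyed = new ArrayList<>();
        for (String u : urls) {
            String h = hostOf(u);
            int j = seen.merge(h, 1, Integer::sum);       // j = 1..n_i
            double key = (double) j * m / perHost.get(h); // position j*m/n_i
            keyed.add(new AbstractMap.SimpleEntry<>(key, u));
        }
        keyed.sort(Map.Entry.comparingByKey());
        List<String> out = new ArrayList<>();
        for (Map.Entry<Double, String> e : keyed) out.add(e.getValue());
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2", "http://a.example/3",
            "http://a.example/4", "http://b.example/1", "http://b.example/2");
        System.out.println(interleaveByHost(urls));
    }
}
```

With 4 URLs on host a and 2 on host b, host b's URLs land at positions 3 and 6 rather than clustering together, so no host monopolizes a stretch of the fetch list.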
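The "internal pointer" from point 7 amounts to comparing the number of the currently open index against the latest number before each search, and reopening only on change. In this sketch, IndexHandle is a hypothetical stand-in for whatever searcher object Nutch keeps open:

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of point 7: index directories are numbered by date; before each
 * search, check whether the open index is the latest and reopen if not.
 */
public class IndexPointer {
    static class IndexHandle {            // stand-in for an open index searcher
        final String version;             // e.g. a date-based number like "20240131"
        IndexHandle(String version) { this.version = version; }
    }

    private final AtomicReference<IndexHandle> current = new AtomicReference<>();

    /** Returns a handle on the newest index, reopening only when the number changed. */
    public IndexHandle acquire(String latestVersion) {
        IndexHandle h = current.get();
        if (h == null || !h.version.equals(latestVersion)) {
            // real code would open the new index here and close the old one
            h = new IndexHandle(latestVersion);
            current.set(h);
        }
        return h;
    }
}
```

Because queries never touch the index directory that is still being written, the file-sharing conflict between indexing and searching disappears.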
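The manual HTML parse mentioned in point 8 typically means reading the charset the page itself declares before falling back to a statistical detector. A minimal sketch of the meta-tag part (the regex and class are my own illustration, not Nutch's code):

```java
import java.util.regex.*;

/** Sketch of point 8: read the declared charset from a <meta> tag first,
 *  falling back to a statistical detector only when none is declared. */
public class CharsetSniffer {
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]+charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the page head, or null if none is found. */
    public static String declaredCharset(String htmlHead) {
        Matcher m = META_CHARSET.matcher(htmlHead);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String head = "<head><meta http-equiv=\"Content-Type\" "
                    + "content=\"text/html; charset=GB2312\"></head>";
        System.out.println(declaredCharset(head)); // GB2312
    }
}
```

A declared charset is authoritative far more often than a statistical guess, which is why consulting it first raises the overall detection rate.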
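The per-site overrides from point 9 reduce to a lookup that prefers a host-specific entry over the global defaults. A minimal sketch (field names and values are illustrative, not Nutch configuration keys):

```java
import java.util.*;

/** Sketch of point 9: global crawl defaults plus optional per-host
 *  overrides for connection count and fetch delay. */
public class SiteConfig {
    static class Settings {
        final int maxConnections;
        final long delayMs;
        Settings(int maxConnections, long delayMs) {
            this.maxConnections = maxConnections;
            this.delayMs = delayMs;
        }
    }

    private final Settings global;
    private final Map<String, Settings> perHost = new HashMap<>();

    SiteConfig(Settings global) { this.global = global; }

    /** Register a host-specific override. */
    void override(String host, Settings s) { perHost.put(host, s); }

    /** Per-host settings win; every other host uses the global defaults. */
    Settings forHost(String host) { return perHost.getOrDefault(host, global); }
}
```

A heavily loaded site can then be throttled (fewer connections, longer delay) without slowing down the rest of the crawl.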

That is all I can think of for the moment, but there may well be other problems.
