Parts of Nutch that need to be rewritten


Introduction

As an open-source search engine, Nutch has done a great deal to lower the barrier to entry for the search engine market. However, because its code was written by many contributors and its main goal is whole-web search, using it for enterprise-level or vertical search still raises many problems, and changing configuration files alone cannot solve them. Parts of Nutch's code therefore need to be modified or rewritten.
Below are some of the areas that I think need to be modified or rewritten for our project:

  1. Throughout the crawl, old and new URLs must be handled separately. In addition, within a single crawl, no depth should re-fetch pages that were already fetched at a previous depth. Nutch currently has no such control; you can only use FetchSchedule to control when an already-fetched page becomes eligible for re-fetching.
  2. The CrawlDb update code is quite chaotic. I can only make out its rough logic, and a small part of it I still do not understand.
  3. Fetcher: there is no need to sort the final fetched data. The whole thing should also be refactored into separate classes to reduce the amount of code in a single file. And although Fetcher's algorithm for limiting connections and enforcing delays per host is correct, I would rather delegate this to HttpClient's connection manager.
  4. Lib-HTTP, Protocol-http: Nutch currently implements the HTTP protocol by manually parsing the text of the socket communication, which often causes errors. It should be changed to use a ready-made HTTP library.
  5. The index package: NutchDocument and its chain of dependent classes such as NutchAnalyzer. This part has word-segmentation problems: it does not use a Lucene analyzer but instead implements its own tokenizer with JavaCC. Even if we do not need Chinese word segmentation, we should rewrite this part to use a Lucene analyzer.
  6. Generator: the ordering of the final fetch list. The current implementation sorts by URL hash, which does not produce a fetch list that is sufficiently scattered across hosts. A better approach: let n_i be the number of URLs belonging to host i and m the total number of URLs; then the j-th URL of host i should be placed at position j·m/n_i. Sorting by this key ensures the fetch list is evenly distributed across hosts, which avoids opening multiple connections to one host at the same time.
  7. NutchBean and many classes under the search package need to be rewritten, because their implementation does not consider file sharing, locking, and unlocking. If Nutch is building an index while a client calls NutchBean, a file-sharing conflict occurs. Of course, in practice user queries and indexing may never happen at the same time. The solution is to maintain an internal pointer, as you said last time: number all index files by date, and before each search check whether the currently open index file has the latest number; if not, open the new index file.
  8. Encoding detection problems. Nutch's cached web-page snapshots are often garbled; its encoding detection module has problems. It calls ICU4J for charset detection, but the detection rate does not seem high. I have changed it to parse the HTML/XML manually and use the Firefox (Mozilla) charset detection engine instead, and the result is much better.
  9. The maximum number of connections, connection delay, URL filters, proxy server, crawl depth, and total number of pages fetched per site can currently only be set globally, not independently per site. This causes problems. For example, website A has a heavy load, so its connection count should be smaller and its delay longer, while connections to other websites can be more numerous and their delays shorter. Per-site settings would balance overall crawl speed against correct, polite network behavior.
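The control proposed in point 1 can be sketched as a single crawl-wide set of already-fetched URLs that every depth consults before fetching. This is a minimal illustration, not Nutch's actual API:

```java
import java.util.*;

/**
 * Sketch of point 1: keep one crawl-wide set of already-fetched URLs
 * so that no depth re-fetches a page captured at a previous depth.
 * (Hypothetical class, not part of Nutch.)
 */
public class FetchedUrlFilter {
    private final Set<String> fetched = new HashSet<>();

    /** Returns only the URLs not fetched at any earlier depth, and records them. */
    public List<String> filterAndRecord(List<String> candidates) {
        List<String> fresh = new ArrayList<>();
        for (String url : candidates) {
            if (fetched.add(url)) {   // add() returns false if already present
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        FetchedUrlFilter filter = new FetchedUrlFilter();
        // depth 1
        List<String> d1 = filter.filterAndRecord(
            Arrays.asList("http://a.example/1", "http://b.example/1"));
        // depth 2: one URL was already fetched at depth 1
        List<String> d2 = filter.filterAndRecord(
            Arrays.asList("http://a.example/1", "http://a.example/2"));
        System.out.println(d1.size() + " " + d2.size()); // 2 1
    }
}
```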
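Point 3 suggests delegating per-host connection limits to HttpClient's connection manager. A minimal configuration sketch using Apache HttpClient 4.x is shown below (an assumption on my part — at the time of writing, HttpClient 3.x with MultiThreadedHttpConnectionManager offered a similar per-host limit):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Configuration sketch: the connection manager, not the Fetcher,
// enforces the per-host connection ceiling.
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(100);            // total connections across all hosts
cm.setDefaultMaxPerRoute(2);    // at most 2 concurrent connections per host
CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(cm)
        .build();
```

The numbers 100 and 2 are illustrative defaults, not values taken from Nutch.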
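The even-spacing order from point 6 can be implemented by computing the key j·m/n_i for each URL and sorting by it. The sketch below assumes a crude host extraction just for illustration:

```java
import java.util.*;

/** Sketch of the host-interleaving order from point 6: the j-th URL of a
 *  host with n_i URLs (m URLs total) gets sort key j*m/n_i. */
public class HostInterleaver {
    static String hostOf(String url) {
        // crude host extraction, sufficient for the sketch
        int start = url.indexOf("://") + 3;
        int end = url.indexOf('/', start);
        return end < 0 ? url.substring(start) : url.substring(start, end);
    }

    /** Sort so each host's URLs are spread evenly across the whole list. */
    public static List<String> interleaveByHost(List<String> urls) {
        Map<String, Integer> perHost = new HashMap<>();   // n_i per host
        for (String u : urls) perHost.merge(hostOf(u), 1, Integer::sum);
        int m = urls.size();
        Map<String, Integer> seen = new HashMap<>();      // running j per host
        List<Map.Entry<Double, String>> keyed = new ArrayList<>();
        for (String u : urls) {
            String h = hostOf(u);
            int j = seen.merge(h, 1, Integer::sum);       // j = 1..n_i
            double key = (double) j * m / perHost.get(h); // position j*m/n_i
            keyed.add(new AbstractMap.SimpleEntry<>(key, u));
        }
        keyed.sort(Map.Entry.comparingByKey());
        List<String> out = new ArrayList<>();
        for (Map.Entry<Double, String> e : keyed) out.add(e.getValue());
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2", "http://a.example/3",
            "http://a.example/4", "http://b.example/1", "http://b.example/2");
        System.out.println(interleaveByHost(urls));
    }
}
```

With 4 URLs on host a and 2 on host b, host b's URLs land at positions 3 and 6 rather than clustering together, so no host monopolizes a stretch of the fetch list.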
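The "internal pointer" from point 7 amounts to comparing the number of the currently open index against the latest number before each search, and reopening only on change. In this sketch, IndexHandle is a hypothetical stand-in for whatever searcher object Nutch keeps open:

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Sketch of point 7: index directories are numbered by date; before each
 * search, check whether the open index is the latest and reopen if not.
 */
public class IndexPointer {
    static class IndexHandle {            // stand-in for an open index searcher
        final String version;             // e.g. a date-based number like "20240131"
        IndexHandle(String version) { this.version = version; }
    }

    private final AtomicReference<IndexHandle> current = new AtomicReference<>();

    /** Returns a handle on the newest index, reopening only when the number changed. */
    public IndexHandle acquire(String latestVersion) {
        IndexHandle h = current.get();
        if (h == null || !h.version.equals(latestVersion)) {
            // real code would open the new index here and close the old one
            h = new IndexHandle(latestVersion);
            current.set(h);
        }
        return h;
    }
}
```

Because queries never touch the index directory that is still being written, the file-sharing conflict between indexing and searching disappears.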
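The manual HTML parse mentioned in point 8 typically means reading the charset the page itself declares before falling back to a statistical detector. A minimal sketch of the meta-tag part (the regex and class are my own illustration, not Nutch's code):

```java
import java.util.regex.*;

/** Sketch of point 8: read the declared charset from a <meta> tag first,
 *  falling back to a statistical detector only when none is declared. */
public class CharsetSniffer {
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]+charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the page head, or null if none is found. */
    public static String declaredCharset(String htmlHead) {
        Matcher m = META_CHARSET.matcher(htmlHead);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String head = "<head><meta http-equiv=\"Content-Type\" "
                    + "content=\"text/html; charset=GB2312\"></head>";
        System.out.println(declaredCharset(head)); // GB2312
    }
}
```

A declared charset is authoritative far more often than a statistical guess, which is why consulting it first raises the overall detection rate.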
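The per-site overrides from point 9 reduce to a lookup that prefers a host-specific entry over the global defaults. A minimal sketch (field names and values are illustrative, not Nutch configuration keys):

```java
import java.util.*;

/** Sketch of point 9: global crawl defaults plus optional per-host
 *  overrides for connection count and fetch delay. */
public class SiteConfig {
    static class Settings {
        final int maxConnections;
        final long delayMs;
        Settings(int maxConnections, long delayMs) {
            this.maxConnections = maxConnections;
            this.delayMs = delayMs;
        }
    }

    private final Settings global;
    private final Map<String, Settings> perHost = new HashMap<>();

    SiteConfig(Settings global) { this.global = global; }

    /** Register a host-specific override. */
    void override(String host, Settings s) { perHost.put(host, s); }

    /** Per-host settings win; every other host uses the global defaults. */
    Settings forHost(String host) { return perHost.getOrDefault(host, global); }
}
```

A heavily loaded site can then be throttled (fewer connections, longer delay) without slowing down the rest of the crawl.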

That is all I can think of for the moment, but there may well be other problems.
