Combining Nutch crawling with Solr search

Source: Internet
Author: User
Tags: solr

I don't know why the search WAR file that shipped with Nutch is gone in Nutch 1.3. Also, after a crawl in Nutch 1.3, the generated directories are only crawldb, linkdb, and segments; there is no indexes or index directory anymore.

Checking the official wiki, it turns out that Nutch now pushes its index into Solr, and Solr provides the search function. See the wiki page: http://wiki.apache.org/nutch/RunningNutchAndSolr

I think the search function bundled with Nutch 1.2 was quite good to use: the search interface was much like Baidu's or Google's, results were retrieved by keyword, the result pages also looked similar to Baidu and Google, and the keywords in the results were highlighted and paginated. The Solr search interface, by contrast, is unfriendly, and the results it returns are unfriendly too: they come back in XML format.
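To illustrate what "results in XML format" means for a front end: a Solr query response looks roughly like the sample below (a simplified, hypothetical response; real responses carry more header and field data), and the client has to parse it itself, for example:

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical Solr XML response body; field names and
# URLs here are made-up examples, not output from a real Solr server.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="url">http://example.com/a</str><str name="title">Page A</str></doc>
    <doc><str name="url">http://example.com/b</str><str name="title">Page B</str></doc>
  </result>
</response>"""

def parse_docs(xml_text):
    """Extract each <doc> element as a dict of field name -> field text."""
    root = ET.fromstring(xml_text)
    return [
        {field.get("name"): field.text for field in doc}
        for doc in root.iter("doc")
    ]

print(parse_docs(SAMPLE))
```

This is the work that Nutch 1.2's bundled search webapp used to do for you: turning raw result records into a readable page.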

Searching both for the same keyword, I found that Nutch's results contain duplicates while Solr's do not. In Solr, field attributes are configured in conf/schema.xml. It seems my configuration changes did not take effect: for example, I wanted the content field stored with its term vector in the index, but the search results show the field was not stored.
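Solr avoids duplicates by declaring a uniqueKey field in schema.xml (typically the document URL or id); a document indexed with an existing key replaces the earlier one. A toy sketch of that idea in plain Python (not Solr code; the documents are made-up examples):

```python
def index_docs(docs, unique_key="url"):
    """Index documents so a repeated unique key replaces the earlier doc,
    mimicking Solr's uniqueKey behaviour; Nutch's old search webapp had
    no such deduplication, hence the repeated results."""
    index = {}
    for doc in docs:
        index[doc[unique_key]] = doc  # same key -> overwrite, no duplicates
    return list(index.values())

docs = [
    {"url": "http://example.com/a", "title": "A v1"},
    {"url": "http://example.com/a", "title": "A v2"},  # duplicate URL
    {"url": "http://example.com/b", "title": "B"},
]
print(index_docs(docs))  # two docs remain; the later "A v2" wins
```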

After you use the nutch solrindex command to push the Nutch index into Solr, which field does Solr's content come from: Nutch's content or parse_text? Reading the source code of org.apache.nutch.indexer.solr.SolrIndexer, I found that it is parse_text that gets mapped.

In Nutch, content is the raw text including HTML tags, which is why Nutch search results could be displayed in the form of a web page; the content returned by a Solr search, on the other hand, is mapped from parse_text, the extracted plain text.

If you want to convert the index into vectors with mahout lucene.vector, the field must be stored with the termVector attribute. After termVector is enabled, Lucene or Solr produces three extra kinds of files in the index directory: .tvd, .tvf, and .tvx. If the command parameters include --norm 2, the vectors are normalized with the Euclidean (L2) norm, so distances are measured in Euclidean space. Of course, this parameter can be set to other values, giving a different vector space model; see the instructions on the official website for details.
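What --norm 2 means can be shown with a small Python sketch (my own illustration, not Mahout code): each vector is divided by its Lp norm, with p = 2 being the Euclidean case, so that distances between the normalized vectors behave as ordinary Euclidean distances.

```python
import math

def normalize(vec, p=2.0):
    """Scale a vector to unit length under the Lp norm (p=2 -> Euclidean),
    analogous to Mahout's --norm option on lucene.vector."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec]

v = normalize([3.0, 4.0])  # the L2 norm of [3, 4] is 5
print(v)                   # -> [0.6, 0.8]

# Distances between normalized vectors are then plain Euclidean distances:
d = math.dist(v, normalize([4.0, 3.0]))
print(d)
```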

If an index is converted directly into Mahout vectors, its key becomes an integer number. After the conversion, the sequence file's key-value types are (LongWritable, VectorWritable); that is, there is no corresponding original URL, and you cannot tell which original file a result corresponds to.

If instead you run mahout seqdirectory first and then mahout seq2sparse, reading the vector file with seqdumper shows that the key is the file name; at that point, however, the value no longer corresponds to a plain number. When vectordump is used to read the data, adding -p to the parameters displays the keys along with the vector values. Even so, reverse-looking-up the URL of an input vector from the final results is very troublesome. The best approach would be to carry the key through into the final output.
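One workaround (a hand-rolled sketch, not a Mahout feature) is to keep a side table mapping the integer vector ids to their URLs when the vectors are written, then join that table back onto the clustering output; the ids and URLs below are made-up examples:

```python
def build_id_table(urls):
    """Assign each URL a stable integer id, mirroring the integer keys
    of the (LongWritable, VectorWritable) sequence file."""
    return {i: url for i, url in enumerate(urls)}

def resolve(cluster_result, id_table):
    """Map a clustering result of {cluster id: [doc ids]} back to URLs."""
    return {c: [id_table[i] for i in ids] for c, ids in cluster_result.items()}

table = build_id_table(["http://example.com/a", "http://example.com/b"])
print(resolve({0: [1, 0]}, table))
```

The point is simply that the id-to-URL mapping has to be saved at vectorization time; once the keys are bare integers, nothing downstream can recover the URLs.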

My guess is that Solr's search is more powerful than Nutch's own, which is why the latter was dropped in Nutch 1.3. The above is just a small comparison; as for the specific reasons, I haven't figured them out yet. When I have time, I will make a proper comparison.


2011-10-27 supplement:

Why combine Nutch with Solr at all: http://apps.hi.baidu.com/share/detail/33659525. That article explains that the index code in the Hadoop contrib builds a Lucene index with MapReduce, but not for serving searches. Placing indexes on HDFS is about using the computing power of the Hadoop platform to merge indexes and perform other operations, which Hadoop handles much better than a single machine. Reference: http://lucene.472066.n3.nabble.com/Lucene-index-file-on-HDFS-td932203.html

Nutch is not well suited to distributed search, because the index blocks on HDFS may not live on the same node, so completing a single search may require requests to n nodes. As a result, searching index files stored on HDFS performs somewhat worse than searching local index files. Reference: http://blog.csdn.net/telnetor/article/details/6143365

SOLR advantages: http://baike.baidu.com/view/943234.htm
1. Solr's cache is more efficient than the front end built into Nutch. Solr is an enterprise-grade full-text search server: it provides a rich query language and more search features, such as spell checking.

2. Solr is configurable and scalable, with optimized query performance. It provides efficient, flexible caching, vertical search functions, and a web-based administration interface.

