Combining Nutch crawling with Solr search

Source: Internet
Author: User
Tags: solr

I don't know why the search WAR file that shipped with Nutch is gone in Nutch 1.3. Also, after a crawl in Nutch 1.3, the generated directories are only crawldb, linkdb, and segments; there is no indexes or index directory anymore.

Checking the official wiki, it turns out that Nutch now pushes its index into Solr, and Solr provides the search function. See the wiki page: http://wiki.apache.org/nutch/RunningNutchAndSolr

I think the search function bundled with Nutch 1.2 was quite good to use: the search interface was much like Baidu's or Google's, results were retrieved by keyword, the result pages also looked similar to Baidu and Google, and the keywords in the results were highlighted and paginated. The Solr search interface, by contrast, is unfriendly, and the results it returns are unfriendly too: they come back in XML format.
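To illustrate what "results in XML format" means for a front end: a Solr query response looks roughly like the sample below (a simplified, hypothetical response; real responses carry more header and field data), and the client has to parse it itself, for example:

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical Solr XML response body; field names and
# URLs here are made-up examples, not output from a real Solr server.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="url">http://example.com/a</str><str name="title">Page A</str></doc>
    <doc><str name="url">http://example.com/b</str><str name="title">Page B</str></doc>
  </result>
</response>"""

def parse_docs(xml_text):
    """Extract each <doc> element as a dict of field name -> field text."""
    root = ET.fromstring(xml_text)
    return [
        {field.get("name"): field.text for field in doc}
        for doc in root.iter("doc")
    ]

print(parse_docs(SAMPLE))
```

This is the work that Nutch 1.2's bundled search webapp used to do for you: turning raw result records into a readable page.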

Searching both for the same keyword, I found that Nutch's results contain duplicates while Solr's do not. In Solr, field attributes are configured in conf/schema.xml. It seems my configuration changes did not take effect: for example, I wanted the content field stored with its term vector in the index, but the search results show the field was not stored.
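Solr avoids duplicates by declaring a uniqueKey field in schema.xml (typically the document URL or id); a document indexed with an existing key replaces the earlier one. A toy sketch of that idea in plain Python (not Solr code; the documents are made-up examples):

```python
def index_docs(docs, unique_key="url"):
    """Index documents so a repeated unique key replaces the earlier doc,
    mimicking Solr's uniqueKey behaviour; Nutch's old search webapp had
    no such deduplication, hence the repeated results."""
    index = {}
    for doc in docs:
        index[doc[unique_key]] = doc  # same key -> overwrite, no duplicates
    return list(index.values())

docs = [
    {"url": "http://example.com/a", "title": "A v1"},
    {"url": "http://example.com/a", "title": "A v2"},  # duplicate URL
    {"url": "http://example.com/b", "title": "B"},
]
print(index_docs(docs))  # two docs remain; the later "A v2" wins
```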

After you use the nutch solrindex command to push the Nutch index into Solr, which field does Solr's content come from: Nutch's content or parse_text? Reading the source code of org.apache.nutch.indexer.solr.SolrIndexer, I found that it is parse_text that gets mapped.

In Nutch, content is the raw text including HTML tags, which is why Nutch search results could be displayed in the form of a web page; the content returned by a Solr search, on the other hand, is mapped from parse_text, the extracted plain text.

If you want to convert the index into vectors with mahout lucene.vector, the field must be stored with the termVector attribute. After termVector is enabled, Lucene or Solr produces three extra kinds of files in the index directory: .tvd, .tvf, and .tvx. If the command parameters include --norm 2, the vectors are normalized with the Euclidean (L2) norm, so distances are measured in Euclidean space. Of course, this parameter can be set to other values, giving a different vector space model; see the instructions on the official website for details.
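What --norm 2 means can be shown with a small Python sketch (my own illustration, not Mahout code): each vector is divided by its Lp norm, with p = 2 being the Euclidean case, so that distances between the normalized vectors behave as ordinary Euclidean distances.

```python
import math

def normalize(vec, p=2.0):
    """Scale a vector to unit length under the Lp norm (p=2 -> Euclidean),
    analogous to Mahout's --norm option on lucene.vector."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec]

v = normalize([3.0, 4.0])  # the L2 norm of [3, 4] is 5
print(v)                   # -> [0.6, 0.8]

# Distances between normalized vectors are then plain Euclidean distances:
d = math.dist(v, normalize([4.0, 3.0]))
print(d)
```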

If an index is converted directly into Mahout vectors, its key becomes an integer number. After the conversion, the sequence file's key-value types are (LongWritable, VectorWritable); that is, there is no corresponding original URL, and you cannot tell which original file a result corresponds to.

If instead you run mahout seqdirectory first and then mahout seq2sparse, reading the vector file with seqdumper shows that the key is the file name; at that point, however, the value no longer corresponds to a plain number. When vectordump is used to read the data, adding -p to the parameters displays the keys along with the vector values. Even so, reverse-looking-up the URL of an input vector from the final results is very troublesome. The best approach would be to carry the key through into the final output.
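One workaround (a hand-rolled sketch, not a Mahout feature) is to keep a side table mapping the integer vector ids to their URLs when the vectors are written, then join that table back onto the clustering output; the ids and URLs below are made-up examples:

```python
def build_id_table(urls):
    """Assign each URL a stable integer id, mirroring the integer keys
    of the (LongWritable, VectorWritable) sequence file."""
    return {i: url for i, url in enumerate(urls)}

def resolve(cluster_result, id_table):
    """Map a clustering result of {cluster id: [doc ids]} back to URLs."""
    return {c: [id_table[i] for i in ids] for c, ids in cluster_result.items()}

table = build_id_table(["http://example.com/a", "http://example.com/b"])
print(resolve({0: [1, 0]}, table))
```

The point is simply that the id-to-URL mapping has to be saved at vectorization time; once the keys are bare integers, nothing downstream can recover the URLs.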

My guess is that Solr's search is more powerful than Nutch's own, which is why the latter was dropped in Nutch 1.3. The above is just a small comparison; as for the specific reasons, I haven't figured them out yet. When I have time, I will make a proper comparison.


2011-10-27 supplement:

Why combine Nutch with Solr at all: http://apps.hi.baidu.com/share/detail/33659525. That article explains that the index code in the Hadoop contrib builds a Lucene index with MapReduce, but not for serving searches. Placing indexes on HDFS is about using the computing power of the Hadoop platform to merge indexes and perform other operations, which Hadoop handles much better than a single machine. Reference: http://lucene.472066.n3.nabble.com/Lucene-index-file-on-HDFS-td932203.html

Nutch is not well suited to distributed search, because the index blocks on HDFS may not live on the same node, so completing a single search may require requests to n nodes. As a result, searching index files stored on HDFS performs somewhat worse than searching local index files. Reference: http://blog.csdn.net/telnetor/article/details/6143365

SOLR advantages: http://baike.baidu.com/view/943234.htm
1. Solr's cache is more efficient than the front end built into Nutch. Solr is an enterprise-grade full-text search server: it provides a rich query language and more search features, such as spell checking.

2. Solr is configurable and scalable, with optimized query performance. It provides efficient, flexible caching, vertical search functions, and a web-based administration interface.

