Reading Word and PDF files with Solr

Last Update:2018-12-03 Source: Internet

Author: User

Tags apache solr solr


Over the past two days, I have been deciding whether to build my search application on Lucene or on Solr. Lucene is only a search library: the advantage of using it directly is that I can pick exactly the functions my application needs. Solr is itself a Lucene-based application that wraps Lucene, so developing on Solr means working one layer further up, and modifying its behavior takes more time. However, Solr provides many features that Lucene does not. In any case, my teacher said to use Solr, so Solr it is.

Comparison between Lucene and SOLR:

1. http://www.blogjava.net/luopeizhong/articles/321732.html

2. Apache SOLR: Lucene-based Scalable Cluster search Server

Updating an index is harder in Lucene than in Solr. In Solr, you only need to call UpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false) to commit the update, whereas in Lucene you must delete the old document before adding the new version; otherwise the index simply grows incrementally.
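To make the difference concrete, here is a minimal sketch of the two behaviors. ToyIndex is a hypothetical in-memory stand-in written only for illustration; it is not the Lucene or Solr API. It only shows why adding a document a second time without deleting it first leaves two copies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of an index; this is NOT the Lucene or Solr API. */
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Append-only add: a second add with the same id leaves two copies. */
    void add(String id, String content) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", id);
        doc.put("content", content);
        docs.add(doc);
    }

    /** Delete-then-add: remove every document with this id, then append. */
    void update(String id, String content) {
        docs.removeIf(d -> id.equals(d.get("id")));
        add(id, content);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex appendOnly = new ToyIndex();
        appendOnly.add("12", "first version");
        appendOnly.add("12", "second version");
        System.out.println("append only: " + appendOnly.size());   // prints 2

        ToyIndex replaced = new ToyIndex();
        replaced.add("12", "first version");
        replaced.update("12", "second version");
        System.out.println("delete then add: " + replaced.size()); // prints 1
    }
}
```

In this model, the delete-then-add path is the step that has to be written by hand when working with Lucene directly.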

Lucene update index: http://langhua9527.iteye.com/blog/582347

We have briefly introduced the installation and use of Solr. Now let's look at how to use SolrJ on the client to create an index and run a query.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrentTest {
        public static void main(String[] args) throws IOException, SolrServerException {
            String urlString = "http://localhost:8080/solr";
            SolrServer server = new CommonsHttpSolrServer(urlString);

            // Build two documents to index.
            SolrInputDocument doc1 = new SolrInputDocument();
            doc1.addField("id", 12);
            doc1.addField("content", "My test is easy, test SOLR");

            SolrInputDocument doc2 = new SolrInputDocument();
            doc2.addField("id", "solrj simple test");
            doc2.addField("content", "doc2");

            Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            docs.add(doc1);
            docs.add(doc2);
            server.add(docs);

            // Send the documents with an explicit commit so they become searchable.
            UpdateRequest req = new UpdateRequest();
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false);
            req.add(docs);
            req.process(server);

            // Query for "test" and highlight matches in the content field.
            SolrQuery query = new SolrQuery();
            query.setQuery("test");
            query.setHighlight(true).setHighlightSnippets(1);
            query.setParam("hl.fl", "content");

            QueryResponse ret = server.query(query);
            System.out.println(ret);
        }
    }

To run SolrJ successfully, you must add the following jars to the classpath.

From /dist:

apache-solr-solrj-3.1.0.jar

From /dist/solrj-lib:
commons-codec-1.4.jar
commons-httpclient-3.1.jar
jcl-over-slf4j-1.5.5.jar
slf4j-api-1.5.5.jar

The following jar has to be downloaded separately from the official site, because I could not find it in the Solr 3.1 distribution:
slf4j-jdk14-1.5.5.jar

Solr has integrated Apache Tika since version 1.4. Tika is a content extraction toolkit that integrates POI and PDFBox and provides a unified interface for text extraction. With it, Solr can easily extract text from rich documents such as PDF and Word files.

My version is 3.1. I took a lot of detours during the implementation before finally solving it myself, so let me share the result.

    package test;

    import java.io.File;
    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.client.solrj.response.QueryResponse;

    /**
     * @author Aidy 2011.6.9
     */
    public class SolrExampleTests {

        public static void main(String[] args) {
            try {
                // Solr Cell can also index MS file (2003 version and 2007 version) types.
                String fileName = "D://test//deleetest//1.20.";
                // This will be the unique ID used by Solr to index the file contents.
                String solrId = "1.20.";
                indexFilesSolrCell(fileName, solrId);
            } catch (Exception ex) {
                System.out.println(ex.toString());
            }
        }

        /**
         * Method to index all types of files into Solr.
         * @param fileName
         * @param solrId
         * @throws IOException
         * @throws SolrServerException
         */
        public static void indexFilesSolrCell(String fileName, String solrId)
                throws IOException, SolrServerException {
            String urlString = "http://localhost:8080/solr";
            SolrServer solr = new CommonsHttpSolrServer(urlString);

            // Post the file to the extracting handler, which runs Tika on it.
            ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
            up.addFile(new File(fileName));
            up.setParam("literal.id", solrId);
            up.setParam("fmap.content", "attr_content");
            up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            solr.request(up);

            // Query everything back to confirm the document was indexed.
            QueryResponse rsp = solr.query(new SolrQuery("*:*"));
            System.out.println(rsp);
        }
    }
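The values passed to setParam above are ordinary request parameters of the /update/extract handler, so the same request can be pictured as a plain HTTP URL. The sketch below only builds that URL with the Java standard library and does not contact Solr. The class name ExtractUrlSketch, the helper buildExtractUrl, and the document ID doc1 are hypothetical names chosen for illustration, and commit=true is used here as a simplified stand-in for the setAction call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractUrlSketch {

    /** Builds the /update/extract URL with the given request parameters. */
    static String buildExtractUrl(String solrBaseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(solrBaseUrl).append("/update/extract");
        char separator = '?';
        try {
            for (Map.Entry<String, String> p : params.entrySet()) {
                url.append(separator)
                   .append(URLEncoder.encode(p.getKey(), "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(p.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc1");            // hypothetical document ID
        params.put("fmap.content", "attr_content");  // map Tika's "content" to attr_content
        params.put("commit", "true");                // make the document searchable at once
        System.out.println(buildExtractUrl("http://localhost:8080/solr", params));
        // prints http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true
    }
}
```

The file itself would travel in the body of a POST to this URL, which is the part SolrJ's addFile handles for you.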

At first, solr.request(up) returned an error, and the Tomcat log said that the ignored_meta type was not available. I did not understand this, because my schema.xml contained no such type. I initially suspected a version problem, so I tried Solr 1.4, which did not report the error. Later I realized the cause: while following the getting-started example I had modified schema.xml, but the /update/extract node in solrconfig.xml still referenced the ignored_ type. After I added the ignored_ type to schema.xml, everything ran normally.

Next I will describe how to use SolrJ to query and display the results on a web page, since the query results are returned in XML format.
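As a rough idea of what handling such a result involves, the sketch below pulls one field out of an XML response using only the Java standard library. It assumes the response was fetched as XML (for example with wt=xml) and that the field is a single-valued string; the sample XML is hand-written in the shape of Solr's XML response, and the names SolrXmlSketch and fieldValues are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SolrXmlSketch {

    /** Returns the value of the named <str> field from every <doc> element. */
    static List<String> fieldValues(String xml, String fieldName) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> values = new ArrayList<>();
            NodeList docs = dom.getElementsByTagName("doc");
            for (int i = 0; i < docs.getLength(); i++) {
                NodeList fields = ((Element) docs.item(i)).getElementsByTagName("str");
                for (int j = 0; j < fields.getLength(); j++) {
                    Element field = (Element) fields.item(j);
                    // Only <str> elements carrying the requested name attribute match.
                    if (fieldName.equals(field.getAttribute("name"))) {
                        values.add(field.getTextContent());
                    }
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample, not captured from a real server.
        String xml = "<response>"
                + "<result name=\"response\" numFound=\"1\" start=\"0\">"
                + "<doc><str name=\"id\">12</str>"
                + "<str name=\"content\">My test is easy, test SOLR</str></doc>"
                + "</result></response>";
        System.out.println(fieldValues(xml, "content")); // prints [My test is easy, test SOLR]
    }
}
```

The returned list can then be rendered by whatever produces the web page.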

If your Solr version is 1.3 or earlier, see: http://wiki.apache.org/solr/UpdateRichDocuments

References:

1. http://wiki.apache.org/solr/ExtractingRequestHandler
2. http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
