Reading Word and PDF Files with Solr

Source: Internet
Author: User
Tags apache solr solr

Over the past two days, I have been debating whether to build my search application directly on Lucene or on Solr. Lucene provides only a search library, so the advantage of using it directly is that I can call exactly the functions the application needs. Solr is itself a Lucene-based application that wraps Lucene, so working with it means developing on a second layer, and modifying it takes time. However, Solr offers many features that Lucene lacks. In any case, my teacher said to use Solr, so Solr it is.

Comparison between Lucene and SOLR:

1. http://www.blogjava.net/luopeizhong/articles/321732.html

2. Apache SOLR: Lucene-based Scalable Cluster search Server

 

Updating an index is harder with Lucene than with Solr. With Solr you only need to call UpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false) and the index is updated, whereas with Lucene you must delete the old document before adding the new one; otherwise the index simply grows incrementally.
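To make the difference concrete, here is a toy model in plain Java. It does not use the Lucene or Solr APIs: the class UpdateSemanticsSketch and its methods are hypothetical names, and a list merely stands in for an index. It shows why adding a changed document without first deleting the old one leaves two copies behind.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a List stands in for an index, and each entry is {id, content}.
final class UpdateSemanticsSketch {

    // Plain add: appends the document and never looks at existing ids.
    static void add(List<String[]> index, String id, String content) {
        index.add(new String[] {id, content});
    }

    // Delete-then-add: drop every entry with the same id, then append the new one.
    static void deleteThenAdd(List<String[]> index, String id, String content) {
        index.removeIf(doc -> doc[0].equals(id));
        add(index, id, content);
    }

    public static void main(String[] args) {
        List<String[]> appended = new ArrayList<>();
        add(appended, "12", "old text");
        add(appended, "12", "new text");
        System.out.println("add only: " + appended.size() + " entries");

        List<String[]> replaced = new ArrayList<>();
        add(replaced, "12", "old text");
        deleteThenAdd(replaced, "12", "new text");
        System.out.println("delete then add: " + replaced.size() + " entry");
    }
}
```

In the add-only case the stale copy stays in the index next to the new one, which is the "incremental" growth described above.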

Lucene update index: http://langhua9527.iteye.com/blog/582347

 

We have briefly introduced the installation and use of Solr. Now let's look at how to use SolrJ on the client side to build an index and run queries.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrentTest {

    public static void main(String[] args) throws IOException,
            SolrServerException {

        String urlString = "http://localhost:8080/solr";
        SolrServer server = new CommonsHttpSolrServer(urlString);

        // Build two documents and add them in one batch.
        SolrInputDocument doc1 = new SolrInputDocument();
        doc1.addField("id", 12);
        doc1.addField("content", "My test is easy, test SOLR");
        SolrInputDocument doc2 = new SolrInputDocument();
        doc2.addField("id", "solrj simple test");
        doc2.addField("content", "doc2");
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        docs.add(doc1);
        docs.add(doc2);
        server.add(docs);

        // Send the documents again with an explicit commit so they become searchable.
        UpdateRequest req = new UpdateRequest();
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false);
        req.add(docs);
        req.process(server);

        // Query for "test" and highlight matches in the content field.
        SolrQuery query = new SolrQuery();

        query.setQuery("test");
        query.setHighlight(true).setHighlightSnippets(1);
        query.setParam("hl.fl", "content");

        QueryResponse ret = server.query(query);

        System.out.println(ret);
    }
}
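The SolrQuery calls in the listing above translate into ordinary HTTP request parameters. As a rough sketch, the following plain-Java snippet builds the roughly equivalent /select URL by hand. The class SelectUrlSketch and its methods are hypothetical names, and SolrJ adds a few parameters of its own (such as the response format) that are left out here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class SelectUrlSketch {

    // Joins the parameters into a URL-encoded query string, keeping insertion order.
    static String buildUrl(String baseUrl, String handler, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append(handler);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8))
               .append('=')
               .append(URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
            separator = '&';
        }
        return url.toString();
    }

    // The same settings the listing applies through SolrQuery.
    static Map<String, String> highlightParams(String queryText) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", queryText);       // setQuery(...)
        params.put("hl", "true");         // setHighlight(true)
        params.put("hl.snippets", "1");   // setHighlightSnippets(1)
        params.put("hl.fl", "content");   // setParam("hl.fl", "content")
        return params;
    }

    public static void main(String[] args) {
        System.out.println(
                buildUrl("http://localhost:8080/solr", "/select", highlightParams("test")));
    }
}
```

Pasting the printed URL into a browser is a quick way to check the highlighting settings without recompiling the client.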

To run SolrJ successfully, you must add the following JARs to the classpath.

From /dist:

apache-solr-solrj-3.1.0.jar

From /dist/solrj-lib:
commons-codec-1.4.jar
commons-httpclient-3.1.jar
jcl-over-slf4j-1.5.5.jar
slf4j-api-1.5.5.jar

The following JAR must be downloaded separately from the official site, because I could not find it in the Solr 3.1 distribution:
slf4j-jdk14-1.5.5.jar

Solr has integrated Apache Tika since version 1.4. Tika is a content extraction toolkit. It integrates POI and PDFBox and provides a unified interface for text extraction. With it, Solr can easily extract text from rich documents such as PDF and Word files.

 

My version is 3.1. I took many detours during the implementation and finally solved the problems myself, so let me share what I learned.

package test;

import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;

import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

/**
 * @author Aidy 2011.6.9
 */
public class SolrExampleTests {

    public static void main(String[] args) {
        try {
            // Solr Cell can also index MS file (2003 version and 2007 version) types.
            String fileName = "D://test//deleetest//1.20.";
            // This will be the unique ID used by Solr to index the file contents.
            String solrId = "1.20.";

            indexFilesSolrCell(fileName, solrId);

        } catch (Exception ex) {
            System.out.println(ex.toString());
        }
    }

    /**
     * Method to index all types of files into Solr.
     * @param fileName
     * @param solrId
     * @throws IOException
     * @throws SolrServerException
     */
    public static void indexFilesSolrCell(String fileName, String solrId)
            throws IOException, SolrServerException {

        String urlString = "http://localhost:8080/solr";
        SolrServer solr = new CommonsHttpSolrServer(urlString);

        ContentStreamUpdateRequest up
                = new ContentStreamUpdateRequest("/update/extract");

        up.addFile(new File(fileName));

        // literal.id sets the document ID; fmap.content maps the extracted text to attr_content.
        up.setParam("literal.id", solrId);
        up.setParam("fmap.content", "attr_content");

        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(up);

        QueryResponse rsp = solr.query(new SolrQuery("*:*"));

        System.out.println(rsp);
    }
}

At first, the call to solr.request(up) returned an error. The Tomcat log said that ignored_meta was not available. I did not understand this at first, because my schema.xml contains nothing by that name. I initially suspected a version problem, so I tried Solr 1.4 specifically, and it reported no error. Later I realized the cause: while following the getting-started example I had modified schema.xml, but the /update/extract node in solrconfig.xml still references ignored_. After I added the ignored_ type to schema.xml, everything ran normally.
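My reading of this error, offered as an assumption rather than something the author states: the extract handler prefixes fields it does not recognize with ignored_, so an extracted metadata field named meta arrives as ignored_meta, and the schema needs a matching entry such as an ignored_* dynamic field. The sketch below models only that name matching in plain Java. It does not read schema.xml, and the class name and both field lists are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class DynamicFieldSketch {

    // A pattern with '*' at the end or the start matches by prefix or suffix;
    // anything else must match the field name exactly.
    static boolean matches(String pattern, String fieldName) {
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        return pattern.equals(fieldName);
    }

    // True when at least one declared field or pattern covers the field name.
    static boolean schemaAccepts(List<String> declared, String fieldName) {
        for (String pattern : declared) {
            if (matches(pattern, fieldName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String incoming = "ignored_" + "meta"; // assumed prefix + extracted field name

        // Hypothetical schema after the edit: the ignored_* entry is gone.
        List<String> edited = Arrays.asList("id", "content", "attr_*");
        // Hypothetical schema after the fix: the ignored_* entry is back.
        List<String> repaired = Arrays.asList("id", "content", "attr_*", "ignored_*");

        System.out.println("edited schema accepts it:   " + schemaAccepts(edited, incoming));
        System.out.println("repaired schema accepts it: " + schemaAccepts(repaired, incoming));
    }
}
```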

 

The next step is to use SolrJ to run queries and display the results on a web page, since the query results are returned in XML format.
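As a starting point for that step, here is a minimal sketch that pulls the fields out of a raw XML response using only the JDK's DOM parser. It assumes the results are fetched as XML, as described above. The sample response is hypothetical and hand-written in the shape Solr's XML output uses, the class XmlResponseSketch is a made-up name, and only single-valued string fields are handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmlResponseSketch {

    // Hypothetical sample; a real response carries whatever fields the schema stores.
    static final String SAMPLE =
            "<response>"
          + "<lst name=\"responseHeader\"><int name=\"status\">0</int></lst>"
          + "<result name=\"response\" numFound=\"1\" start=\"0\">"
          + "<doc><str name=\"id\">12</str>"
          + "<str name=\"content\">my test is easy, test solr</str></doc>"
          + "</result></response>";

    // Returns one map per <doc>, keyed by the name attribute of each <str> element in it.
    static List<Map<String, String>> parseDocs(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<Map<String, String>> docs = new ArrayList<>();
            NodeList docNodes = dom.getElementsByTagName("doc");
            for (int i = 0; i < docNodes.getLength(); i++) {
                Element doc = (Element) docNodes.item(i);
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList strings = doc.getElementsByTagName("str");
                for (int j = 0; j < strings.getLength(); j++) {
                    Element field = (Element) strings.item(j);
                    fields.put(field.getAttribute("name"), field.getTextContent());
                }
                docs.add(fields);
            }
            return docs;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse response XML", e);
        }
    }

    public static void main(String[] args) {
        for (Map<String, String> doc : parseDocs(SAMPLE)) {
            System.out.println(doc.get("id") + ": " + doc.get("content"));
        }
    }
}
```

The resulting list of maps can then be handed to whatever page template renders the results; remember to HTML-escape the values before writing them into a page.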

 

If you are using Solr 1.3 or earlier, see: http://wiki.apache.org/solr/UpdateRichDocuments

References:

1. http://wiki.apache.org/solr/ExtractingRequestHandler
2. http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika
