Solr is a very useful open-source indexing and search tool from Apache. Although there is plenty of material about it on the Internet, that material is scattered and complicated. I spent a day researching Solr and have summarized its basic usage here. The configuration approach in this article is not the only one possible, and the API usage may not be standard. I only hope this article helps you get started with Solr faster.
I. Installation environment and configuration of Solr:
1. Download the required software, then install and configure Tomcat
First download Tomcat and Solr. Tomcat can be downloaded from various software sites. Solr can be downloaded from a mirror such as the following:
http://mirror.bjtu.edu.cn/apache/lucene/solr/1.4.1/apache-solr-1.4.1.zip
Install Tomcat by following its installer. In this article, Tomcat is installed on drive X, and the listening port is changed from 8080 to 8983 during setup. The port can also be changed afterwards in the configuration file X:\Tomcat 6.0\conf\server.xml. In the same file, set the connector's URI encoding to UTF-8 (the URIEncoding attribute of the Connector element) so that Solr can correctly parse query strings passed in the URL.
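To see why the encoding setting matters, here is a minimal Python sketch (not part of the original setup) of how a client percent-encodes a non-ASCII query term as UTF-8 before putting it in a URL. Tomcat has to decode those bytes with the same charset, otherwise Solr receives a garbled term. The sample term is an assumption chosen for illustration.

```python
from urllib.parse import quote, unquote

def encode_query_term(term: str) -> str:
    """Percent-encode a query term as UTF-8 bytes for use in a URL."""
    return quote(term, safe="", encoding="utf-8")

# Illustrative term: two Chinese characters meaning "Chinese (language)".
encoded = encode_query_term("中文")
print(encoded)  # %E4%B8%AD%E6%96%87

# Decoding with the matching charset restores the term;
# decoding with a mismatched charset (ISO-8859-1 here) does not.
print(unquote(encoded, encoding="utf-8") == "中文")       # True
print(unquote(encoded, encoding="iso-8859-1") == "中文")  # False
```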
2. Build the folder structure
Decompress apache-solr-1.4.1.zip to a folder, as shown in the following figure:
Create a solr directory on drive X and copy all the content of the example directory in the extracted files into it. There is now an empty directory named webapps under the solr directory. Copy the apache-solr-1.4.1.war file from the dist directory of the extracted files into webapps and rename it solr.war.
3. Configure the Solr working environment
Register Solr with Tomcat. Under X:\Tomcat 6.0\conf\Catalina\localhost (create the folder manually if it does not exist), create the configuration file solr.xml with the following content:
<Context docBase="X:/solr/webapps/solr.war" reloadable="true">
<Environment name="solr/home" type="java.lang.String" value="X:/solr" override="true"/>
</Context>
docBase is the web application that Tomcat publishes, and the Environment entry sets the Solr home directory (solr/home), where Solr looks for its configuration.
4. Configure the index data format
The schema.xml file in X:\solr\conf configures the index data format. Although schema.xml is long, it is heavily commented, and most of its content consists of data type definitions. We only need to add our own fields between the <fields> and </fields> tags. Of course, we can also define our own types.
Take the first field in the example schema:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
Its type is string, it is indexed and stored, and it is required (it cannot be empty). This is easy to understand.
Other configurations are as follows:
<uniqueKey>id</uniqueKey>
Configures the unique key of the whole index, which distinguishes one index entry from another.
<defaultSearchField>text</defaultSearchField>
The default search field. If a query does not specify a field, this field is searched.
<solrQueryParser defaultOperator="OR"/>
Controls how multiple keywords searched in the same field are combined. Choose "AND" or "OR" according to the needs of the project.
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
............
This setting is for the case where several fields should be searched by default: each of them is copied into the text field. If only one field needs to be searched, it is enough to name it in defaultSearchField.
5. Run Solr
Start Tomcat, then visit http://localhost:8983/solr/admin/ to view the home page. The page consists of three parts: configuration status, query, and help links.
The page http://localhost:8983/solr/admin/analysis.jsp can be used to check how the tokenizer behaves;
The page http://localhost:8983/solr/admin/form.jsp can simulate a search request and build the request URL.
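The kind of request URL that the form page builds can also be assembled by hand. The following Python sketch assumes the standard /select handler under the port and context path used in this article; the helper name build_select_url and the sample query are hypothetical, and q.op is included only as an optional per-request override of the default operator.

```python
from urllib.parse import urlencode

# Assumed base URL: the Tomcat port and context path used in this article.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_select_url(query, start=0, rows=10, operator=None):
    """Build a Solr search URL with paging and an optional AND/OR override."""
    params = [("q", query), ("start", start), ("rows", rows)]
    if operator is not None:
        # q.op overrides the schema's defaultOperator for this request only.
        params.append(("q.op", operator))
    # urlencode percent-encodes values (UTF-8 by default).
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_select_url("name:solr", rows=5)
print(url)  # http://localhost:8983/solr/select?q=name%3Asolr&start=0&rows=5
```

Pasting the printed URL into a browser while Tomcat is running should return the XML response for that query.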
When Tomcat is started for the first time after Solr has been configured, many new files and directories appear under Tomcat. A solr directory appears under the Tomcat directory, which stores the index; a solr directory also appears in Tomcat's webapps directory, which holds the web project; and a log file whose name starts with catalina is generated in Tomcat's logs directory, where we can find Solr loading exceptions and records of indexing and query URLs.
6. Add an index
With Tomcat and Solr running, we can add documents to the index. Go to the X:\solr\exampledocs directory, which contains many XML files and a post.jar file. This is where the files to be indexed are stored, in XML format. Open solr.xml and we can see:
<add>
<doc>
<field name="id">SOLR1000</field>
<field name="name">Solr, the Enterprise Search Server</field>
<field name="manu">Apache Software Foundation</field>
............
<field name="incubationdate_dt">2006-01-17T00:00:00.000Z</field>
</doc>
</add>
The add tag indicates that documents are to be added to the index (other tags are described in the Solr documentation or the Solr wiki). Each field carries its own data. Note that the value of the last field, a date, is written in a format that differs from the usual Java date representation.
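The date format in the sample document (UTC, with a literal T separator and a trailing Z) can be produced with a small helper. This is a minimal Python sketch; the function name to_solr_date is hypothetical, and it assumes that datetimes without a time zone are already in UTC.

```python
from datetime import datetime, timezone

def to_solr_date(dt: datetime) -> str:
    """Format a datetime in UTC as yyyy-MM-ddTHH:mm:ss.SSSZ."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    utc = dt.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"

print(to_solr_date(datetime(2006, 1, 17)))  # 2006-01-17T00:00:00.000Z
```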
So how do we send this XML to Solr for indexing? This is what post.jar is for. Open a command line, go to the directory containing post.jar, and run:
java -Durl=http://localhost:8983/solr/update -Dcommit=yes -jar post.jar *.xml
After the program runs successfully, the documents have been added to the index. Documents can also be added programmatically with SolrJ, the Java client provided by Solr, which will be discussed below.
After the index has been added successfully, you can run test searches from the Solr web interface mentioned above.
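For readers who want to see what the post.jar step amounts to, the following Python sketch builds an add message with the standard library and sends it to the update URL used above. It is only an approximation of post.jar, not its actual implementation; the helper names build_add_xml and post_xml are hypothetical, and the network calls are left commented out because they need a running Solr instance.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update URL, matching the one passed to post.jar in this article.
UPDATE_URL = "http://localhost:8983/solr/update"

def build_add_xml(docs):
    """Build an <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def post_xml(xml_text, url=UPDATE_URL):
    """POST an XML message to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

xml_text = build_add_xml(
    [{"id": "SOLR1000", "name": "Solr, the Enterprise Search Server"}]
)
print(xml_text)

# To index for real (requires a running Solr instance):
#   post_xml(xml_text)
#   post_xml("<commit/>")  # make the added documents searchable
```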
7. Add Chinese word segmentation
The author uses the latest version of IKAnalyzer, IKAnalyzer3.2.5Stable.jar, which supports Solr well and can easily be found on various software sites or on CSDN. Put IKAnalyzer3.2.5Stable.jar in the X:\Tomcat 6.0\webapps\solr\WEB-INF\lib directory to use it.
Because Solr uses Lucene's built-in tokenizer by default, the configuration has to be changed to switch to IKAnalyzer. The file to change is schema.xml under X:\solr\conf. The changed parts are the tokenizer elements:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
............
</analyzer>
<analyzer type="query">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
............
</analyzer>
</fieldType>
The tokenizer is IKTokenizerFactory, which ships with IKAnalyzer and supports Solr. With this configuration it does not use maximum-length matching when indexing, but does use it when searching.
IKAnalyzer also supports custom dictionaries. First, create a classes directory under X:\Tomcat 6.0\webapps\solr\WEB-INF\, then create the file IKAnalyzer.cfg.xml in the classes directory with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties version="1.0">
<comment>IK Analyzer extension configuration</comment>
<entry key="ext_dict">/mydict.dic</entry>
<entry key="ext_stopwords">/mystopword.dic</entry>
</properties>
mydict.dic is your custom segmentation dictionary and mystopword.dic is your custom stop-word dictionary. The .dic files should be saved in UTF-8 format, with the first line left blank and one word per line. The .dic files are also stored in the classes folder.
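A dictionary file in the layout described above (UTF-8, blank first line, one word per line) can be generated with a short script. This is a minimal Python sketch; the helper name write_dic_file and the sample words are hypothetical, and the example writes into a temporary directory rather than the real classes folder.

```python
import tempfile
from pathlib import Path

def write_dic_file(path, words):
    """Write words to a .dic file: UTF-8, blank first line, one word per line."""
    cleaned = [w.strip() for w in words if w.strip()]
    lines = [""] + cleaned  # the leading empty string yields the blank first line
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")

# Illustrative words only; a real dictionary would go in WEB-INF\classes.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "mydict.dic"
    write_dic_file(target, ["全文检索", "分词器"])
    print(target.read_text(encoding="utf-8").splitlines())
```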