Lucene/SOLR development experience


1. Opening remarks
2. Overview
3. Origin
4. First acquaintance with SOLR
5. Installing SOLR
6. SOLR word segmentation order
7. A SOLR Chinese application example
8. SOLR search operators

[Opening remarks] As is the convention here, it is time for another technical article. This time I will share my development experience with Lucene/SOLR.

Lucene is a full-text search development kit (API) written in Java that can be used to build powerful search functions. Details are easy to find on Google; this article focuses on SOLR.

[Overview] At present there are not many people researching SOLR in China, and most of those who do came to it through project requirements. SOLR is a project under the Apache Foundation; specifically, it is a subproject of Lucene. SOLR is actively developed and has its own technical character. It makes up for the fact that Lucene is only a development kit: SOLR is a complete application, in other words a full-text search server that works out of the box. It lets us appreciate the power of Lucene immediately and is a big step toward productizing Lucene.

[Figure: SOLR word-segmentation demo interface]

[Origin] SOLR began at CNET Networks, which used the Lucene API to build several applications and in doing so created the prototype of SOLR. The Apache Software Foundation later took SOLR in under the Lucene top-level project: on January 17, 2006, SOLR entered the Apache Incubator. During incubation SOLR accumulated features and attracted a stable group of users, developers, and committers, and version 1.1.0 was released about a year later. The current stable version is 1.2. SOLR made a strong showing at the Apache annual conference in March 2007 and was scheduled to appear at the Asia open-source software summit in Hong Kong at the end of May. A pity it is not coming to Beijing :-(

[First acquaintance with SOLR] The SOLR server differs from an ordinary relational database not only in its core nature (one targets structured data, the other unstructured) but also in its architecture. The SOLR server is normally deployed in an application server / Java servlet container (the container can be skipped when communication is local and involves no RPC, for example when SOLR is embedded); it cannot run as a standalone process outside a JVM.

[Figure: SOLR architecture]
The SOLR server stores data and retrieves it quickly and efficiently through its indexes. It exposes HTTP/XML and JSON APIs to the outside, which enables integration in multi-language environments through client libraries. SOLR clients currently exist for Java, PHP, Python, C#, JSON, and Ruby; regrettably there is none for C/C++ (which is my current research). Brian Whitman, who studies music search and classification, used JNI on the Apple platform to embed SOLR into C code for retrieval, though only as a Cocoa project. With these clients, users can easily integrate SOLR into specific applications. The most complete client so far, the Java library SolrJ, has been added to SOLR trunk and will officially ship in version 1.3.
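As a quick taste of the HTTP/XML interface, the sketch below sends a query from plain Java and prints the raw XML response. The host, port, and query string are illustrative assumptions matching the Tomcat setup described later, not something prescribed by SOLR itself.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.net.URLEncoder;

  public class QueryDemo {
      public static void main(String[] args) throws Exception {
          // URL-encode the query and hit the standard select handler
          String q = URLEncoder.encode("title:solr", "UTF-8");
          URL url = new URL("http://localhost:8080/solr/select?q=" + q);
          BufferedReader in = new BufferedReader(
                  new InputStreamReader(url.openStream(), "UTF-8"));
          for (String line; (line = in.readLine()) != null; )
              System.out.println(line);   // raw XML result
          in.close();
      }
  }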

If you are not researching or developing SOLR itself but simply using it, you only need to pay attention to the following aspects:
1. Server-wide configuration lives in solrconfig.xml, including the cache and servlet customization; this is the global configuration of the system;
2. The indexing behavior and index fields are defined in schema.xml; this configuration is per SOLR instance;
3. By default the index data files are stored under data/index beneath the SOLR home directory. The path is configurable in solrconfig.xml, and the files in this directory can be copied out and back in, so indexes can be reused;
4. Indexing takes a long time. Using dictionary-based most-words segmentation, indexing a 2 GB set of some 1.1 million Chinese records took me nearly two and a half hours (this time of course depends on many factors; leave a comment if you would like to discuss it; in my tests index building took noticeably longer on Linux than on Windows). Use the commit operation to make newly added index data take effect. Also pay attention to index optimization, which itself consumes considerable resources and time but is an important way to speed up searches, so weigh the trade-off;
5. After installation the SOLR directory contains several folders: bin holds scripts mainly for taking index snapshots and doing remote synchronization; conf holds the configuration files mentioned in points 1 and 2; admin holds the files for the web administration interface;
6. SOLR 1.2 has no security design: there are no users, groups, or permission settings. Pay attention to security in real applications; the most effective approach is to enforce authorization at the application server.

[Installing SOLR] The SOLR release already includes a small example that uses Jetty as the servlet container, and you can try that example to get a feel for SOLR. But what are the steps for deploying on the platform and application server of your choice?

To start using SOLR, install the following software:
1. Java 1.5 or later;
2. Ant 1.6.x or later (for compiling and managing the SOLR project; personally I also recommend Eclipse);
3. A web browser for viewing the administration page (Firefox is officially recommended, but I have found no difference with IE);
4. A servlet container, such as Tomcat 5.5 (version 6 is not recommended). This article assumes Tomcat running on port 8080; if you use another servlet container or another port, you may need to change the URLs used to reach the sample application and SOLR.

Installation and configuration proceed as follows:

1. Compile the project with ant, or download the packaged release, and copy the SOLR war file into the servlet container's webapps directory;
2. Obtain the SOLR home folder, either from the ant build or from the downloaded archive, and copy it to a convenient location; it serves as the template you will modify later;
3. Point SOLR at its home directory in one of the following ways:
Set the Java system property solr.solr.home (yes, it really is solr.solr.home; this route is commonly used for embedded integration) — see the command-line example after this list;
Configure a JNDI lookup of java:comp/env/solr/home pointing to the SOLR home directory by creating the file /tomcat55/conf/Catalina/localhost/solr.xml. Note that the XML file name becomes the SOLR instance name. With the folder from step 2 placed at f:/solrhome, the file contents are as follows:

  <Context docBase="f:/solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String" value="f:/solrhome" override="true"/>
  </Context>

Alternatively, simply start the servlet container from a directory containing a SOLR directory (the default SOLR home is the solr folder under the current working directory);
4. Finally, if a CJK (Chinese, Japanese, Korean) application produces garbled characters, fix it as follows (it is really an application-server configuration issue rather than a SOLR one): in Tomcat's conf/server.xml, set the URI encoding of the connector for your port (8080 in this article) to UTF-8, which the SOLR 1.2 core supports natively:

  <Server ...>
    <Service ...>
      <Connector ... URIEncoding="UTF-8"/>
      ...
    </Service>
  </Server>
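For the system-property route in step 3, the Jetty-based example that ships with the release can be started like this (the path f:/solrhome is illustrative); with Tomcat the same property can be added to JAVA_OPTS instead:

  java -Dsolr.solr.home=f:/solrhome -jar start.jar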

[SOLR word segmentation order] SOLR must tokenize strings both when building the index and when parsing query keywords. When a full-text field is added to the index, SOLR (with the default English text type) first splits on whitespace, then runs the specified filters over the token stream in order; whatever survives goes into the index and is available for querying. The sequences are as follows:
Indexing
1. WhitespaceTokenizer (split on whitespace)
2. StopFilter (drop stop words)
3. WordDelimiterFilter (split on intra-word delimiters)
4. LowerCaseFilter (lower-casing)
5. EnglishPorterFilter (English stemming)
6. RemoveDuplicatesTokenFilter (drop duplicate tokens)
Querying (after the same whitespace tokenization)
1. SynonymFilter (expand synonyms)
2. StopFilter (drop stop words)
3. WordDelimiterFilter (split on intra-word delimiters)
4. LowerCaseFilter (lower-casing)
5. EnglishPorterFilter (English stemming)
6. RemoveDuplicatesTokenFilter (drop duplicate tokens)
Apart from the whitespace tokenizer, this chain is oriented toward English text. A runnable sketch of the index-time chain follows.
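As a concrete illustration, here is a minimal sketch of that index-time chain written against the Lucene 2.2-era analysis API bundled with this SOLR release. It is an approximation: the SOLR-specific WordDelimiterFilter and RemoveDuplicatesTokenFilter are omitted (they live in SOLR, not Lucene), and Lucene's default English stop-word list stands in for stopwords.txt.

  import java.io.StringReader;
  import org.apache.lucene.analysis.*;   // WhitespaceTokenizer, StopFilter, etc.

  public class ChainDemo {
      public static void main(String[] args) throws Exception {
          // 1. split on whitespace
          TokenStream ts = new WhitespaceTokenizer(
                  new StringReader("The Quick Brown FOXES jumped"));
          // 2. drop stop words ("The"), ignoring case
          ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS, true);
          // 4. lower-case, then 5. English (Porter) stemming
          ts = new PorterStemFilter(new LowerCaseFilter(ts));
          // prints: quick, brown, fox, jump
          for (Token t = ts.next(); t != null; t = ts.next())
              System.out.println(t.termText());
      }
  }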

[A SOLR Chinese application example]
1. First configure schema.xml, which plays roughly the role of a data-table definition: it declares the data types of everything added to the index. In version 1.2, schema.xml mainly contains types, fields, and a few other default settings.

A. First define a fieldType child node under the types node, with attributes such as name, class, and positionIncrementGap. name is the name of the field type, and class points to the class in the org.apache.solr.analysis package that defines the type's behavior. The most important part of a fieldType definition is the analyzer used when indexing and querying data of this type, including the tokenizer and the filters. For example, the field type text defined in this example uses solr.WhitespaceTokenizerFactory in its index analyzer for whitespace tokenization, followed by the solr.StopFilterFactory, solr.WordDelimiterFilterFactory, solr.LowerCaseFilterFactory, solr.EnglishPorterFilterFactory, and solr.RemoveDuplicatesTokenFilterFactory filters. When a text value is added to the index, SOLR first splits it on whitespace, then applies the filters in order, and the surviving tokens go into the index and become searchable. SOLR's analysis package contains no Chinese-language support, so here we use the analyzers shipped with Lucene: the SOLR download contains lucene-analyzers-2.2.0.jar under the lib directory, which includes the cn and cjk classes for handling Chinese. We use the CJK class and add the following to schema.xml:

  1. <Fieldtype name = "text_cjk" class = "SOLR. textfield">
  2. <Analyzer class = "org. Apache. Lucene. analysis. CJK. cjkanalyzer"/>
  3. </Fieldtype>

With this, the supported type is defined.
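To see what CJKAnalyzer actually does with Chinese text, the small sketch below (against lucene-analyzers-2.2.0) prints its tokens; the sample string is arbitrary. CJKAnalyzer emits overlapping character bigrams rather than dictionary words:

  import java.io.StringReader;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.cjk.CJKAnalyzer;

  public class CjkDemo {
      public static void main(String[] args) throws Exception {
          TokenStream ts = new CJKAnalyzer()
                  .tokenStream("body", new StringReader("全文检索"));
          // prints the bigrams: 全文, 文检, 检索
          for (Token t = ts.next(); t != null; t = ts.next())
              System.out.println(t.termText());
      }
  }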

B. Next, define the concrete fields (similar to columns in a database) under the fields node. A field definition includes name, type (one of the fieldTypes defined earlier), indexed, stored, multiValued, and so on. For example:

  1. <Field name = "Record Number" type = "slong" indexed = "true" stored = "true" required = "true"/>
  2. <Field name = "file name" type = "string" indexed = "true" stored = "true"/>
  3. <Field name = "date" type = "date" indexed = "true" stored = "true"/>
  4. <Field name = "" type = "string" indexed = "true" stored = "true" multivalued = "true"/>
  5. <Field name = "topic" type = "string" indexed = "true" stored = "true" multivalued = "true"/>
  6. <Field name = "title" type = "text_cjk" indexed = "true" stored = "true" multivalued = "true"/>
  7. <Field name = "author" type = "text_cjk" indexed = "true" stored = "true" multivalued = "true"/>
  8. <Field name = "body" type = "text_cjk" indexed = "true" stored = "true" multivalued = "true"/>
  9. <Field name = "flag" type = "text_cjk" indexed = "true" stored = "true" multivalued = "true"/>

Field definitions are important. Two tips: set multiValued="true" on every field that can carry multiple values, to avoid errors when the index is built; and set stored="false" on fields whose values do not need to be stored.

C. We recommend creating a copy field that aggregates all full-text fields into one field for unified searching:

  1. <Field name = "text_com" type = "text_cjk" indexed = "true" stored = "false" multivalued = "true"/>

Then complete the copy settings with copyField nodes:

  1. <Copyfield source = "title" DEST = "text_com"/>
  2. <Copyfield source = "body" DEST = "text_com"/>

D. You can also define dynamic fields. A dynamic field needs no concrete name, only a naming rule: for example, define a dynamicField named *_i with type text, and any field ending in _i, such as name_i, gender_i, or school_i, is treated as matching that definition. A one-line example follows.
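In schema.xml terms, such a rule looks like the line below; the indexed and stored values are illustrative:

  <dynamicField name="*_i" type="text" indexed="true" stored="true"/>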

2. Configure solrconfig.xml, which holds SOLR's system-level settings. Most importantly, you can change the dataDir attribute to choose where the index files are stored, and with large data volumes you should configure automatic commits. The following setting flushes the in-memory index to disk automatically once 200,000 documents accumulate, which avoids heap overflow; it is also an effective cure when posting a single XML file as large as 30 MB:

  <autoCommit>
    <maxDocs>200000</maxDocs>
  </autoCommit>

3. Once these configurations are complete, restart the SOLR server so that they take effect, and you can start adding data to it.

4. Data is added by POSTing XML to the update servlet. The XML structure is an add element containing one or more doc elements, each holding many field elements. Every record added to the index must carry a unique ID value that identifies it. After creating an XML file (say solr.xml), run java -jar post.jar solr.xml in the exampledocs directory to post the index data. If the application server is configured differently, for example Tomcat on port 8080 with the instance renamed solrx, the post.jar package must be regenerated accordingly before use. A sketch of the equivalent plain-Java POST follows.
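For reference, here is a minimal hand-rolled equivalent of what post.jar does, assuming the Tomcat URL used throughout this article and hypothetical field values:

  import java.io.OutputStreamWriter;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class PostDoc {
      static void post(String xml) throws Exception {
          HttpURLConnection con = (HttpURLConnection)
                  new URL("http://localhost:8080/solr/update").openConnection();
          con.setRequestMethod("POST");
          con.setDoOutput(true);
          con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
          OutputStreamWriter w = new OutputStreamWriter(con.getOutputStream(), "UTF-8");
          w.write(xml);                                  // send the update XML
          w.close();
          System.out.println("HTTP " + con.getResponseCode());
      }
      public static void main(String[] args) throws Exception {
          post("<add><doc><field name=\"id\">1</field>"
             + "<field name=\"title\">SOLR中文应用</field></doc></add>");
          post("<commit/>");                             // make it searchable
      }
  }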

Below is ronghao's example of integrating Chinese word segmentation, for reference:

Chinese word segmentation is crucial for full-text retrieval. Here we use qieqie's Paoding analyzer (a very good one :)). Integration is easy; I downloaded version 2.0.4-alpha2, which supports both most-words and max-word-length segmentation. Create your own Chinese TokenizerFactory inheriting from SOLR's BaseTokenizerFactory:

/**
 * Created by IntelliJ IDEA.
 * User: ronghao
 * Date: 2007-11-3
 * Time: 14:40:59
 * Wraps qieqie's Paoding Chinese tokenizer for SOLR.
 */
import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

// Paoding imports; package paths assume paoding-analysis 2.0.4-alpha2
import net.paoding.analysis.analyzer.PaodingTokenizer;
import net.paoding.analysis.analyzer.TokenCollector;
import net.paoding.analysis.analyzer.impl.MaxWordLengthTokenCollector;
import net.paoding.analysis.analyzer.impl.MostWordsTokenCollector;
import net.paoding.analysis.knife.PaodingMaker;

public class ChineseTokenizerFactory extends BaseTokenizerFactory {

    /** Default segmentation mode: most words. */
    public static final String MOST_WORDS_MODE = "most-words";

    /** Segmentation by maximum word length. */
    public static final String MAX_WORD_LENGTH_MODE = "max-word-length";

    private String mode = null;

    public void setMode(String mode) {
        if (mode == null || MOST_WORDS_MODE.equalsIgnoreCase(mode)
                || "default".equalsIgnoreCase(mode)) {
            this.mode = MOST_WORDS_MODE;
        } else if (MAX_WORD_LENGTH_MODE.equalsIgnoreCase(mode)) {
            this.mode = MAX_WORD_LENGTH_MODE;
        } else {
            throw new IllegalArgumentException(
                    "Invalid analyzer mode parameter setting: " + mode);
        }
    }

    @Override
    public void init(Map args) {
        super.init(args);
        setMode((String) args.get("mode"));   // "most-words" or "max-word-length"
    }

    public TokenStream create(Reader input) {
        // Paoding does the actual segmentation; the collector decides the mode
        return new PaodingTokenizer(input, PaodingMaker.make(), createTokenCollector());
    }

    private TokenCollector createTokenCollector() {
        if (MOST_WORDS_MODE.equals(mode))
            return new MostWordsTokenCollector();
        if (MAX_WORD_LENGTH_MODE.equals(mode))
            return new MaxWordLengthTokenCollector();
        throw new Error("never happens");
    }
}

Then add the tokenizer factory to the text field type configuration in schema.xml:

  1. <Fieldtype name = "text" class = "SOLR. textfield" positionincrementgap = "100">
  2. <Analyzer type = "Index">
  3. <Tokenizer class = "com. ronghao. fulltextsearch. analyzer. chinesetokenizerfactory" mode = "Most-words"/>
  4. <Filter class = "SOLR. stopfilterfactory" ignorecase = "true" words = "stopwords.txt"/>
  5. <Filter class = "SOLR. worddelimiterfilterfactory" generatewordparts = "1" generatenumberparts = "1" catenatewords = "1" catenatenumbers = "1" catenateall = "0"/>
  6. <Filter class = "SOLR. lowercasefilterfactory"/>
  7. <Filter class = "SOLR. removeduplicatestokenfilterfactory"/>
  8. </Analyzer>
  9. <Analyzer type = "query">
  10. <Tokenizer class = "com. ronghao. fulltextsearch. analyzer. chinesetokenizerfactory" mode = "Most-words"/>
  11. <Filter class = "SOLR. synonymfilterfactory" Synonyms = "synonyms.txt" ignorecase = "true" Expand = "true"/>
  12. <Filter class = "SOLR. stopfilterfactory" ignorecase = "true" words = "stopwords.txt"/>
  13. <Filter class = "SOLR. worddelimiterfilterfactory" generatewordparts = "1" generatenumberparts = "1" catenatewords = "0" catenatenumbers = "0" catenateall = "0"/>
  14. <Filter class = "SOLR. lowercasefilterfactory"/>
  15. <Filter class = "SOLR. removeduplicatestokenfilterfactory"/>
  16. </Analyzer>
  17. </Fieldtype>

Restart Tomcat and open http://localhost:8080/solr/admin/analysis.jsp to try Paoding's Chinese word segmentation. Note that paoding-analysis.jar must be copied into SOLR's lib directory, and the dictionary home configured inside the jar must be adjusted (in paoding 2.0.4 the dictionary location is typically set through the paoding-dic-home.properties file packaged in the jar).

[SOLR search operators]
":" queries a field for a given value, e.g. *:* returns all records
"?" wildcard for a single arbitrary character
"*" wildcard for multiple arbitrary characters (a search term may not begin with * or ?)
"~" fuzzy search, e.g. roam~ finds words spelled similarly to "roam", such as foam and roams; roam~0.8 returns only matches with a similarity above 0.8
Proximity search, e.g. "jakarta apache"~10 finds documents where "jakarta" and "apache" occur within 10 words of each other
"^" boosts relevance, e.g. when searching for jakarta apache, to make "jakarta" more relevant append ^ and a boost factor: jakarta^4 apache
Boolean operator AND, &&
Boolean operator OR, ||
Boolean operator NOT, !, - (an exclusion operator cannot form a query on its own)
"+" requires the term after it to be present in the corresponding field of the document
( ) groups clauses into a subquery
[ ] inclusive range search, e.g. records within a period including both endpoints: date:[200707 TO 200710]
{ } exclusive range search, e.g. records within a period excluding the endpoints: date:{200707 TO 200710}
\ escapes special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
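A few combined examples of these operators, using field names from the schema above (the values are illustrative only):

  title:SOLR AND text_com:(lucene OR 检索)
  "jakarta apache"~10 AND date:[200707 TO 200710]
  +title:solr -flag:deleted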
