Build enterprise-level Search Service SOLR

Source: Internet
Author: User
Tags: solr

· What is SOLR?

SOLR is a search server written in Java and built on Lucene. It acts as a central search service that other applications call.

· SOLR composition:
SOLR consists of a server program, a number of search modules called cores, and a Java client library, solrj. A core runs on the server and can be thought of as the search service provider for one website. A single core can be configured to serve several websites, but this is not recommended. All cores are stored under a directory called solrhome, which must be configured when SOLR is installed.

· SOLR features:
SOLR supports all Lucene functions and exposes them through HTTP and Java interfaces. On top of that it adds enterprise-level features such as data import (dataimport), a management UI, online core management, and search clustering. SOLR can index many kinds of data sources, including databases, data files (such as CSV), and even HTTP, RSS, and email sources. This makes SOLR a good fit both for newly developed systems and for adding full-text search to existing ones.

· SOLR Server installation:
SOLR versions later than 4.7.2 require JDK 1.7 or later and a matching servlet container. We use version 4.7.2 here. The dist directory of the official SOLR package contains a war file. Unpack it into any servlet container and add all the jars under example/lib/ext to the project; they are required for logging. Then copy the log4j.properties file from example/resources to the WEB-INF/classes directory to serve as the logging configuration.
SOLR ships several solrhome folders under the example directory of its installation package, and each one contains some cores that can be loaded. We can pick the simple solr folder and use it as solrhome. Solrhome can be set through a JNDI environment entry or a JVM system property. The JNDI entry can be declared in the container's context configuration or in web.xml, and it must be an absolute path. In my project structure the solrhome directory sits under my project directory, so I used a custom filter that calls System.setProperty("solr.solr.home", "...") to set solrhome.
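The system-property approach can be sketched in plain Java as follows. This is only an illustration under assumptions: the class name `SolrHomeSetup` and the folder name `solrhome` are made up for the example, and in a real web application the call would have to run before SOLR itself initialises (for instance from a filter or listener that is registered first, as described above).

```java
import java.io.File;

/** Sketch: point SOLR at a solrhome folder that lives inside the project directory. */
final class SolrHomeSetup {

    /** Name of the system property that SOLR reads to locate solrhome. */
    static final String SOLR_HOME_PROPERTY = "solr.solr.home";

    /**
     * Resolves a directory relative to the project directory, converts it to
     * an absolute path, and publishes it as the solrhome system property.
     */
    static String configure(File projectDir, String relativeSolrHome) {
        File home = new File(projectDir, relativeSolrHome);
        String absolutePath = home.getAbsolutePath(); // an absolute path is required
        System.setProperty(SOLR_HOME_PROPERTY, absolutePath);
        return absolutePath;
    }

    public static void main(String[] args) {
        // "solrhome" is a hypothetical folder name used only for this example.
        String path = configure(new File("."), "solrhome");
        System.out.println(SOLR_HOME_PROPERTY + " = " + path);
    }
}
```

Resolving the path at startup keeps the absolute location out of the configuration files, which is convenient when the project directory differs between machines.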
Start the container and open the web project to see the SOLR management UI. The left side of the UI is the menu, where Core Admin can load and unload cores dynamically. The list of installed cores appears below it. Select a core to see its Analysis, Dataimport, and Query pages.

· Database Core installation:
The example directory contains the example-DIH directory, which we set as solrhome. It includes a core for database data sources, named db. This core scans the database and builds the index according to SQL that you configure. To use it, copy the two jar files in the dist directory whose names contain the word dataimport into the project.
The db core has two main configuration files: db-data-config.xml for the database configuration and schema.xml for the index field configuration.
db-data-config.xml is where you configure the database data source and the scan SQL. It is responsible for connecting to the database, executing the SQL, and mapping the result columns to fields in the index. Note that column names are case sensitive when you configure a field. Oracle uses upper-case column names by default, so with an Oracle database you must write the column names in upper case. In some cases the data contains CLOB fields or rich text fields. SOLR provides transformers that pre-process such fields before they are indexed. For example, ClobTransformer converts a CLOB to a string, and HTMLStripTransformer removes all HTML markup. We can also write our own transformers in Java or in a scripting language.
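As a rough idea of what a custom Java transformer could look like, here is a minimal sketch. It assumes the dataimport convention of a class exposing a `transformRow(Map<String, Object> row)` method that receives one row as column-to-value pairs; check that against the dataimport documentation for your version. The class name `PlainTextTransformer` and the column name `CONTENT` are hypothetical, and the tag removal is deliberately crude.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a row transformer: strips HTML tags from one text column. */
final class PlainTextTransformer {

    /** Hypothetical column name, written in upper case as Oracle would report it. */
    static final String COLUMN = "CONTENT";

    /** Receives one row as column -> value and returns the modified row. */
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get(COLUMN);
        if (value instanceof String) {
            // Crude tag removal for illustration only; this is not an HTML parser.
            String text = ((String) value).replaceAll("<[^>]*>", " ");
            row.put(COLUMN, text.replaceAll("\\s+", " ").trim());
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<String, Object>();
        row.put(COLUMN, "<p>Hello <b>SOLR</b></p>");
        System.out.println(new PlainTextTransformer().transformRow(row));
    }
}
```

For real HTML content the built-in HTMLStripTransformer is the safer choice; a custom class is mainly useful for business-specific clean-up.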
schema.xml configures the field types used in the index and the fields required by the business. We can define custom field types. For example, when a third-party word segmentation package is used, we can define a field type and specify its tokenizer and stop-word filter. For each business field you can configure whether it is indexed and whether it is stored, which is useful for fields that only need to be stored and never queried. A field can also be declared multiValued, in which case it can hold the values of several database columns as an array. Combined with the copyField element, this makes it easy to query several fields at once. For example, a condition such as A like ... or B like ... can be handled by copying A and B into a field C and then querying C like ... instead.
Once the database core is installed, the Dataimport menu appears in the SOLR UI. From there you can clean the index, import (add) documents, delete documents, and optimize the index.

· Word segmentation package installation:
Any discussion of Lucene has to mention its powerful add-on word segmentation packages. I chose among Paoding (Pao Ding Jie Niu), mmseg4j, and IKAnalyzer. Paoding has not been updated for a long time and cannot be used with recent Lucene versions. mmseg4j ships with the Sogou dictionary and has a good reputation online. However, running mmseg4j 1.9.1 under SOLR 4.7.2 hit a bug that forced me to modify its source code. In addition, mmseg4j did not appear to support stop words: I configured a stop filter element on the field type as described online, but it had no effect in query tests. I don't know why the author of mmseg4j dropped stop-word support. After all, ordinary users are not experts at using search engines, and they type in many meaningless words such as "why ..." and "how ..." that interfere with the search. So in the end I adopted IKAnalyzer (IK).
IK is updated frequently and works well with my SOLR version. Installing IK is simple: copy IKAnalyzer.cfg.xml, stopword.dic, and the IK jar to the classpath. Next, define an IK field type in schema.xml and point the relevant fields to it. The effect is then visible in query tests. IK supports a custom stop-word dictionary and an extension dictionary. For specialized websites, such as educational, agricultural, or subject-specific sites, I recommend configuring an extension dictionary, which greatly improves word segmentation accuracy.
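To make the role of the stop-word dictionary concrete, here is a small stand-alone sketch of what a stop-word filter does to a list of tokens. It is not IK's implementation; the class name `StopWordSketch` and the sample words are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch: drop tokens that appear in a stop-word dictionary. */
final class StopWordSketch {

    /** Returns the tokens that are not stop words, compared case-insensitively. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary content; a real one would be loaded from stopword.dic.
        Set<String> stopWords = new HashSet<String>(Arrays.asList("why", "how", "is"));
        List<String> tokens = Arrays.asList("why", "is", "SOLR", "fast");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Without this step, words like "why" and "how" would be searched as if they were meaningful keywords.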
 
· Client solrj installation:
SOLR provides a Java client library, solrj, for client programs. Copy all the jars under dist/solrj-lib to the client project. The client has several core classes: SolrServer, SolrQuery, QueryResponse, and SolrDocumentList.
SolrServer represents the server. Its query(SolrQuery query) method accepts a query request and returns a QueryResponse. QueryResponse provides getElapsedTime for the search time and getResults for the result set. getResults returns a SolrDocumentList, which holds the matching documents, each as a set of key-value pairs, together with the total hit count available through getNumFound. SolrDocumentList implements the List interface, which makes it easy to isolate and wrap the API.
SolrQuery is the class for building query conditions. SOLR has its own query syntax. It is based on field clauses of the form "field name:value", which can be combined and refined with operators such as OR, AND, ~ (fuzzy), and ^ (weight). It also covers sorting, paging, filtering (querying within results), grouping, and other functions. You can pass a complete query string to SolrQuery. You can also set these conditions through separate API methods, much like the Criteria API in Hibernate.
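The syntax is easier to see in a concrete request. The sketch below builds the request parameters by hand using only the JDK, rather than through SolrQuery, so that it runs without solrj on the classpath. The parameter names q, fq, sort, start, and rows are the standard SOLR ones; the field names title, content, and category and the class name `QueryStringSketch` are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: assemble SOLR query parameters by hand to show the query syntax. */
final class QueryStringSketch {

    /** Builds a field clause such as title:solr. */
    static String clause(String field, String value) {
        return field + ":" + value;
    }

    /** Example request; the field names are hypothetical. */
    static Map<String, String> exampleParams(String keyword) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        // Weighted match on title OR fuzzy match on content.
        p.put("q", clause("title", keyword) + "^2 OR " + clause("content", keyword) + "~");
        p.put("fq", clause("category", "search")); // filtering: query within results
        p.put("sort", "score desc");               // sorting
        p.put("start", "0");                       // paging: offset of the first hit
        p.put("rows", "10");                       // paging: page size
        return p;
    }

    /** Encodes parameters as name=value pairs joined by '&', keeping insertion order. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 should always be available", ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(exampleParams("solr")));
    }
}
```

With solrj, the same conditions would normally be set on a SolrQuery object instead of being concatenated by hand.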
In my setup the query parser on the SOLR server did not segment the query keywords, so a keyword query behaved as an exact match, which obviously does not meet the requirements. The client therefore has to call SOLR's word segmentation (analysis) interface before querying. I did not find a word segmentation call among the solrj classes I was using, so I called getHttpClient on the HttpSolrServer instance to obtain an HttpClient object and then called the corresponding HTTP interface to segment the keywords. Adding wt=json to the URL makes the HTTP interface return its data as JSON.
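The two client-side steps, building the analysis request and turning the returned tokens into a query, can be sketched as follows. Several things here are assumptions: that the core registers the field analysis handler at /analysis/field with the analysis.fieldtype and analysis.fieldvalue parameters (as the stock example configuration does), the core URL, the field type name text_ik, and the class name `AnalysisRequestSketch`. Sending the request and parsing the JSON response are left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

/** Sketch: build a field-analysis URL and combine the resulting tokens into a query. */
final class AnalysisRequestSketch {

    /** URL-encodes one parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 should always be available", ex);
        }
    }

    /** Builds the analysis request; wt=json asks for a JSON response. */
    static String analysisUrl(String coreUrl, String fieldType, String text) {
        return coreUrl + "/analysis/field"
                + "?analysis.fieldtype=" + encode(fieldType)
                + "&analysis.fieldvalue=" + encode(text)
                + "&wt=json";
    }

    /** Joins tokens into field:token clauses separated by OR. */
    static String orQuery(String field, List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical core URL and field type name.
        System.out.println(analysisUrl("http://localhost:8080/solr/db", "text_ik", "full text search"));
        // Tokens would normally come from the JSON response of the request above.
        System.out.println(orQuery("content", Arrays.asList("full", "text", "search")));
    }
}
```

Whether the tokens should be joined with OR or AND depends on how strict the search is meant to be; OR is used here only as an example.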

· Supplement:
Different SOLR versions have different APIs and configuration methods. This article uses SOLR 4.7.2, running on Tomcat 6.0 and JDK 1.6.
