Using Solr to Build Full-Text Retrieval for Enterprises (III): Schema Definition

Last Update:2018-12-07 Source: Internet

Author: User

Tags solr

The previous article introduced the Solr admin interface. With it we can conveniently check the current running status of Solr and see how the system is configured, and we can even run some tests and debugging through it. That is as far as it goes, however: the actual system configuration still has to be done through the various configuration files. To have Solr process our own documents, the first step is to configure the schema.

The schema is the core of Solr's business logic. Which fields a document contains, whether each field is indexed, how it is indexed, and how it is queried are all defined in the schema. The schema definition is the schema.xml file in Solr's conf directory. Note that a Solr instance (more precisely, each core) can have only one schema. A schema definition is like a table definition in a database: you define the fields in the table, for example a text field whose data type is nvarchar. The difference is that in a database you can only define fields using the field types preset by the system, whereas in a Solr schema you can define not only fields but also your own field types, and defining the field types is often the most important part.
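To make the overall shape concrete, here is a minimal sketch of a schema.xml in the legacy layout this article describes. The schema name, the version value, and the single id field are illustrative assumptions rather than values taken from the author's file; the real example schema shipped with Solr is much longer.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Illustrative skeleton only. The name and version values are assumptions;
     the correct version value depends on the Solr release you run. -->
<schema name="example" version="1.5">

  <!-- Field types: reusable definitions of how values are indexed and queried. -->
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <!-- further field types go here -->
  </types>

  <!-- Fields: the named parts of a document, each referring to a field type. -->
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <!-- further fields and dynamic fields go here -->
  </fields>

  <!-- The field that uniquely identifies a document. -->
  <uniqueKey>id</uniqueKey>

</schema>
```

The database analogy maps onto this directly: the types section plays the role of the column data types, and the fields section plays the role of the table's columns.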

Let's look at the schema file. Everything inside the <types> node is a field type definition. Some field types are defined on a single line, and others are a block with a detailed definition. The single-line ones are essentially Solr's basic data types, and you generally do not need to modify them. These types have omitNorms set to true, which means no norms (length normalization and index-time boosts) are kept for them; they are also not tokenized by an analyzer, so each value is indexed as a single unit. For faster range queries, consider the field types with the t prefix (tint, tfloat, tlong, tdouble, tdate). Now let's take a look at the following field type definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

An analyzed field type generally looks like this. name specifies the name of the field type, just like the name nvarchar in a database. class is the Java class that implements the type. Inside a field type definition you can define analyzers. There are two kinds, the index analyzer and the query analyzer, and each field type can have at most one of each. An analyzer performs word segmentation (tokenization), filtering, and conversion on the field content. As the analyzer nodes show, each one defines a series of processing steps, and these steps are applied in order. As the type attribute suggests, the index analyzer is used when building the index and the query analyzer is used when querying. If a field type specifies only one analyzer and gives it no type, that analyzer is used for both indexing and querying.
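The single-analyzer case can be sketched as follows. The type name text_simple is hypothetical and does not come from the example schema; the tokenizer and filter classes are standard Solr factories.

```xml
<!-- Hypothetical field type: one analyzer without a type attribute,
     so the same chain runs at index time and at query time. -->
<fieldType name="text_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Step 1: split the text on whitespace. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Step 2: lowercase every token produced by step 1. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the steps run in order, the tokenizer always comes first and each filter sees the output of the step before it. Using one shared chain is the simplest way to guarantee that query terms are normalized the same way as indexed terms.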

With the field types in place, we can define the fields of the documents to be processed. Many fields are already defined in the schema file, all inside the <fields> node. These fields were prepared for the example data documents. If the documents you need to process are simple and in English, you do not even have to modify the schema file; you can use these fields directly. Of course, this is a lazy approach. It is good enough for practice, but for a production environment you should delete the fields you do not need.

Do not delete the dynamicField entries, though. Dynamic fields have a special meaning: their names are wildcard patterns that begin with "*_", for example name="*_i". If you want to store a value in a field without defining that field in the schema, give the field a name ending in the suffix "_i" when you upload the document to Solr. The value is then stored according to the attributes of the dynamic field defined by <dynamicField name="*_i" type="int" indexed="true" stored="true"/>, and the same applies to queries.

When defining a field you can specify several attributes. name is the name of the field. type specifies the field type, which must be one of the types defined earlier; the type determines how the field's content is indexed and queried. indexed is a Boolean indicating whether the field is indexed. stored indicates whether the content of the field is stored; if your queries only need to report whether a document matched, and never return the field's content or highlighted fragments of it, you can set stored to false. multiValued indicates whether the field can hold multiple values.
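As an illustration of these attributes, a small <fields> section might look like the sketch below. The field names title, body, and tags are assumptions chosen for the example, and the types string, int, and text_general are assumed to be defined in the <types> section.

```xml
<fields>
  <!-- Indexed and stored: searchable and returned in results. -->
  <field name="id"    type="string"       indexed="true" stored="true" required="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>

  <!-- Indexed but not stored: searchable, but the content cannot be
       returned or highlighted. -->
  <field name="body"  type="text_general" indexed="true" stored="false"/>

  <!-- multiValued: one document may carry several values for this field. -->
  <field name="tags"  type="string"       indexed="true" stored="true" multiValued="true"/>

  <!-- Dynamic field: any field whose name ends in _i is handled as an int. -->
  <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
</fields>
```

With such a schema, a document in Solr's XML update format could carry a field that was never declared by name. In this hypothetical document, page_count_i matches the *_i pattern:

```xml
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title">Schema definition</field>
    <field name="tags">solr</field>
    <field name="tags">schema</field>
    <field name="page_count_i">12</field>
  </doc>
</add>
```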

defaultSearchField specifies which field is searched when a query does not name a field.

solrQueryParser specifies, through its defaultOperator attribute, which logical operator is applied by default when a query contains two terms and no operator between them. In general, OR is used as the default.
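A rough sketch of these two settings in the legacy schema.xml layout follows. The field name text is an assumption for illustration and must refer to a field that is actually defined in the schema.

```xml
<!-- Assumed field name: queries without an explicit field search "text". -->
<defaultSearchField>text</defaultSearchField>

<!-- defaultOperator may be "OR" or "AND". -->
<solrQueryParser defaultOperator="OR"/>
```

With these settings, a query such as q=enterprise search is interpreted roughly as text:enterprise OR text:search, while q=title:search still targets the title field explicitly. Later Solr releases moved away from these two elements in favor of the df and q.op request parameters, so check the documentation for the version you run.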

That covers the schema definition. I will come back to field types in detail when I write about using Solr to process Chinese documents. If you only process English documents, you do not need to modify the types; you only need to define the fields you need.

This article is an English version of an article originally written in Chinese on aliyun.com.
