schema.xml File Configuration

Last Update:2015-03-30 Source: Internet

Author: User

Tags solr

schema.xml is a Solr configuration file that defines all the fields of your documents and how those fields are processed when a document is indexed or queried. The file is stored in the conf directory under the Solr home folder; the default path is ./solr/conf/schema.xml, or it can be any path that the class loader of the Solr web app can resolve. The downloaded Solr package includes a sample schema file that you can use as a starting point for writing your own schema.xml.

types node

First look at the types node, which defines fieldType child nodes with parameters such as name, class, positionIncrementGap, and so on. Required parameters:

name: The name of the fieldType.
class: The class that defines the behavior of this type. The solr. prefix is shorthand that Solr resolves to its own packages; field type classes such as solr.StrField live in org.apache.solr.schema, and analysis components in org.apache.solr.analysis.

Other optional attributes:

sortMissingLast, sortMissingFirst: These two attributes apply to types that are sorted internally as strings. Both default to false. They apply to the field types string, boolean, sint, slong, sfloat, sdouble, and pdate.
sortMissingLast="true": Documents with no value for the field are placed after documents that have a value, regardless of the sort order requested. In Java terms, null values sort last.
sortMissingFirst="true": The opposite of sortMissingLast; documents with no value for the field are placed first.
positionIncrementGap: Optional attribute that defines the position gap inserted between multiple values of this type in the same document, which prevents phrase queries from matching across value boundaries.

In the configuration, the class of the string type is solr.StrField. A field of this type is not analyzed, that is, its value is not tokenized.
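As a rough sketch of how these attributes fit together, a non-tokenized string type might be declared as follows. The attribute values here are illustrative assumptions, not taken from a particular sample schema:

```xml
<!-- Illustrative only: a non-tokenized string type whose missing values sort last -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
```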

For an article or other long text, we need tokenization so that searches on such fields return the correct results. For this we can use another class, solr.TextField. It lets users customize indexing and querying through an analyzer, which consists of one tokenizer and any number of filters.

A standard analyzer definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>

The analyzer is still defined as a fieldType, so that it can be referenced by the field definitions below. There are two analyzers: the one of type index is applied when documents are indexed, and the one of type query is applied when searching.

The tokenizer node is the starting point of the analysis chain. It is followed by a series of filters, here solr.StopFilterFactory and solr.LowerCaseFilterFactory. The stop word filter removes words such as "the" and "and" from the token stream: they occur very frequently in documents and say little about a document's characteristics, so they contribute little to a query. The lowercase filter converts all tokens to lowercase, so every term in the final index is lowercase.

You can also define your own analyzer, for example using mmseg4j for Chinese word segmentation:

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" />
  </analyzer>
</fieldType>

field node

The field node defines the type and related search settings for a data source field. It has the following attributes:

name: The field name that the search uses.
type: The field type, matching the name of a fieldType; for example, text_ika for Chinese text. For strings that do not need tokenization, use string; if tokenization is required, use one of the tokenized types configured above.
indexed: Whether the field is indexed. Only fields set to true can be searched, sorted, or faceted (searchable, sortable, facetable).
stored: Whether to store the content. If you do not need the field values returned, set this to false to improve efficiency.
multiValued: Whether the field holds multiple values. Solr allows several data source fields to be stored in one search field. A field that receives multiple values must set this to true, otherwise an exception may be thrown.
omitNorms: Whether to omit norms, which saves memory. Only full-text fields and fields that need an index-time boost need norms. (I do not fully understand this one; the comments in the sample schema seem contradictory.)
termVectors: When set to true, the term vector is stored. When using MoreLikeThis, the fields used for similarity should store term vectors.
termPositions: Stores position information in the term vector, at the cost of extra storage.
termOffsets: Stores offset information in the term vector, at the cost of extra storage.
default: The value used for the field when a document does not supply one.
docValues: This attribute was added in Solr 4.2.
docValuesFormat: The possible values are disk or memory.

Example:

<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
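To show a few more of the attributes in combination, here is a minimal sketch. The field names title, tags, and body are hypothetical and chosen only for illustration; the types refer to the string and text_general types discussed earlier:

```xml
<!-- Hypothetical fields for illustration; the names are not from a real schema -->

<!-- Full-text field: tokenized, searchable, and returned in results -->
<field name="title" type="text_general" indexed="true" stored="true" />

<!-- Multi-valued field: one document may carry several tags -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true" />

<!-- Field intended for MoreLikeThis: term vectors are stored -->
<field name="body" type="text_general" indexed="true" stored="false" termVectors="true" />
```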

copyField node

What if a search needs to cover multiple fields? In that case we can use copyField. The code is as follows:

<copyField source="name" dest="all" maxChars="30000" />
<copyField source="address" dest="all" maxChars="30000" />

Purpose:

Combines data from multiple fields so they can be searched together, which improves speed
Copies data from one field to another so that the same data can be indexed in two different ways

We copy all the fields that use the Chinese tokenizer into all, so a full-text search only needs to query the all field.

It has the following attributes:

source: The source field
dest: The target field
maxChars: Maximum number of characters to copy

Note that the target field must support multiple values and is best left unstored, since it is only used for searching: set indexed to true and stored to false.
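A target field that follows this advice could be declared roughly as below. This is a sketch that assumes the all field uses the text_zh type defined earlier; substitute whichever tokenized type your schema actually uses:

```xml
<!-- Sketch of a copyField target: searchable, not stored, accepts several copied values -->
<!-- Assumption: the field uses the text_zh type from the mmseg4j example -->
<field name="all" type="text_zh" indexed="true" stored="false" multiValued="true" />
```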

Both the copyField node and the field node are inside the fields node.

dynamicField node

A dynamic field is a field without a fixed name, declared with a dynamicField node.

For example, if name is *_i and the type is defined as int, then any field whose name ends in _i is treated as matching this definition, such as name_i or school_i.

<dynamicField name="*_i" type="int" indexed="true" stored="true" />
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_l" type="long" indexed="true" stored="true" />
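With dynamic fields like these in place, a document can carry fields that were never declared by name. The following is a sketch of a Solr XML update message; the field names, the values, and the assumption that the unique key field is called id are all hypothetical:

```xml
<!-- Hypothetical document: school_i, city_s, and views_l are matched by the
     *_i, *_s, and *_l dynamic field patterns respectively -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="school_i">42</field>
    <field name="city_s">Hangzhou</field>
    <field name="views_l">1234567890123</field>
  </doc>
</add>
```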

uniqueKey node

Solr requires a unique field, usually the ID, which is specified by the uniqueKey node.

For example:

<uniqueKey>ID</uniqueKey>

defaultSearchField node

This is the default search field. Since we have already copied the fields we need to search into the all field, it is set to all.

<defaultSearchField>all</defaultSearchField>

solrQueryParser node

This sets the default search operator, that is, the logic applied between search terms: AND increases precision, OR increases coverage. AND is recommended; alternatively, state the operator explicitly in the query. For example, with AND as the default, a search for "phone apple" is treated as "phone AND apple".

<solrQueryParser defaultOperator="OR" />
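For comparison, a sketch of the AND variant described in the text, under which every term of a multi-term query must match:

```xml
<!-- Sketch: with AND as the default, "phone apple" is interpreted as "phone AND apple" -->
<solrQueryParser defaultOperator="AND" />
```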

similarity node

Similarity is a Lucene class used to score documents during a search. It can be replaced to support custom ranking. In Solr 4 you can configure a different similarity for each field type, and you can configure a global similarity in schema.xml with a factory class such as DefaultSimilarityFactory.

You can use the default factory class to create an instance, for example:

<similarity class="solr.DefaultSimilarityFactory" />

You can also use other factory classes, and then set some optional initialization parameters:

<similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
</similarity>

In Solr 4, you can also configure a similarity for each field type:

<fieldType name="text_ib">
  <analyzer />
  <similarity class="solr.IBSimilarityFactory">
    <str name="distribution">SPL</str>
    <str name="lambda">DF</str>
    <str name="normalization">H2</str>
  </similarity>
</fieldType>

The examples above use DFRSimilarityFactory and IBSimilarityFactory; there are other implementations as well. SweetSpotSimilarityFactory was added in Solr 4.2. Others include BM25SimilarityFactory, SchemaSimilarityFactory, and so on.
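As a further sketch, BM25SimilarityFactory can be configured per field type in the same way. The type name text_bm25 is hypothetical, and the k1 and b values are simply commonly used settings rather than recommendations. As far as I know, per-field-type similarities in Solr 4 only take effect when the global similarity is SchemaSimilarityFactory, so the sketch declares that as well:

```xml
<!-- Sketch: global similarity that delegates scoring to per-fieldType settings -->
<similarity class="solr.SchemaSimilarityFactory" />

<!-- Hypothetical field type scored with BM25; k1 and b use commonly seen values -->
<fieldType name="text_bm25" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.75</float>
  </similarity>
</fieldType>
```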

This article is an English version of an article originally published in Chinese on aliyun.com.