Use Apache SOLR for Enterprise Search

SOLR is an open-source enterprise search server built on the Lucene search engine and distributed under the Apache Software License. According to the Lucene site, SOLR is "an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface."

It is worth noting that several high-traffic web sites, including Netflix, Digg, CNET News.com, and CNET Reviews, use SOLR to power their search. A longer list of public sites powered by SOLR can be found in the SOLR wiki (see references).

Learn how to use SOLR and PHP to create a small application for searching an auto parts database. Although the sample database contains only a few records, it could just as easily contain millions. All source code used in this article can be obtained from the download section.

Install SOLR

To use SOLR with PHP, you must install SOLR, design the index, prepare the data for SOLR to index, load the index, and write the PHP code that runs queries and displays the results. Most of the work needed to create a searchable index can be done from the command line. Of course, the PHP programming interface to SOLR can also modify the contents of the index.

SOLR is implemented in Java. To run SOLR and its administration tools, you must install Java v1.5 (the Java 5 SDK). Several vendors offer Java v1.5 SDKs (for example, Sun Microsystems, IBM, and BEA Systems), and any of these implementations can run SOLR. Simply choose the Java package for your operating system and follow its installation instructions.

In many cases, installing Java v1.5 is as simple as running a self-extracting archive and accepting the license agreement. The scripts in the archive do most of the heavy lifting in a few seconds. Other operating systems (such as Debian) provide the Java 5 SDK in the APT package repository. For example, on Debian or Ubuntu, you can run sudo apt-get install sun-java5-jdk to install the Java v1.5 software.

APT will also automatically download all the dependencies required to use the Java 5 SDK, which is very convenient.

If Java is already installed and the java executable is in your PATH, run java -version to determine the Java version.

Here, let's use the Mac OS X v10.5 Leopard operating system as the basis for the demonstration. Leopard comes with Java v1.5, and it can also run PHP applications after a minor change to the default Apache configuration. Running java -version in a Leopard terminal window produces the following output.

Listing 1. Running java -version in a Leopard terminal window

                
$ which java
/usr/bin/java

$ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)
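
If you want to script this check, the version number can be pulled out of the first line of the output. The following is only a minimal Python sketch: the function names are hypothetical, and it assumes the output has the same shape as Listing 1 (a quoted version string after the word "version").

```python
import re

def parse_java_version(output):
    """Return (major, minor) parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2) or 0)

def is_java_at_least(output, minimum=(1, 5)):
    """True when the reported version is at least `minimum`."""
    version = parse_java_version(output)
    return version is not None and version >= minimum

# Sample text copied from Listing 1.
sample = 'java version "1.5.0_13"\nJava(TM) 2 Runtime Environment, Standard Edition'
print(parse_java_version(sample), is_java_at_least(sample))
```

Note that java -version normally writes to standard error rather than standard output, so a script that captures the text from a subprocess has to read that stream.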

Note: Leopard lets you switch between Java v1.4 and v1.5 in the /Applications/Utilities/Java Preferences application. If your Leopard installation reports v1.4, open Java Preferences and change the settings as shown in Figure 1.

Figure 1. Java Preferences application in Leopard

To install SOLR, visit apache.org and click Resources > Download. Select a project mirror for easy access, and browse to the folder that contains the tarball (.tgz file) of SOLR v1.2. The downloaded file will have a name similar to apache-solr-1.2.0.tgz. Decompress the tarball with the following commands.

Listing 2. Decompress the tarball

                
$ tar xzf apache-solr-1.2.0.tgz

$ ls -F apache-solr-1.2.0
CHANGES.txt NOTICE.txt dist/ lib/
KEYS.txt README.txt docs/ src/
LICENSE.txt build.xml example/

In the newly created directory, the dist folder contains the SOLR code packaged as Java archives (JAR files). The example/exampledocs subdirectory contains sample data, typically formatted as XML, that is ready for SOLR to index.

The example directory contains a complete sample SOLR application. To run it, you only need to launch the Java engine with the application archive start.jar.

Listing 3. Start the Java Engine

                
$ cd apache-solr-1.2.0/example
$ java -jar start.jar
2007-11-10 15:00:16.672::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2007-11-10 15:00:16.866::INFO: jetty-6.1.3
...
INFO: SolrUpdateServlet.init() done
2007-11-10 15:00:18.694::INFO: Started SocketConnector @ 0.0.0.0:8983

The application is now available on port 8983. Open a browser and go to http://localhost:8983/solr/admin/. This is the interface used to administer SOLR. (To stop the SOLR server, press Ctrl+C.)

At this point, however, the SOLR index contains no data to manage or query.
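
The administration URL above can also be checked from a script, which is handy when you restart the server often. The following Python sketch is an illustration only: the function name is hypothetical, and it simply assumes that an HTTP 200 response from the admin page means the server is up.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Admin URL taken from the text above.
ADMIN_URL = "http://localhost:8983/solr/admin/"

def solr_is_up(url=ADMIN_URL, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False

if __name__ == "__main__":
    print("SOLR reachable:", solr_is_up())
```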




Load data into SOLR

SOLR is flexible: it supports a wide range of data types and rules for building effective indexes. If the standard components are not enough, you can further customize SOLR by writing new Java classes.

Given a set of data types and rules, you can create a SOLR schema to describe your data and control how the index should be built. Then export your data to match the schema and load it into SOLR. SOLR builds the index dynamically and updates it whenever a record is created, modified, or deleted.

You can find the default SOLR schema in the SOLR source code repository at apache.org. For reference, the following listing shows a snippet of the default schema.

Listing 3. Default SOLR schema snippet

                
<schema name="example" version="1.1">
...
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true"/>
<field name="nameSort" type="string" indexed="true" stored="false"/>
<field name="cat" type="text" indexed="true" stored="true" multiValued="true"/>
...
</fields>

<uniqueKey>id</uniqueKey>
...
<copyField source="name" dest="nameSort"/>
...
</schema>

You do not need to understand most of the schema, but the following points are worth noting:

  • As shown, the id field is a string (type="string") and should be indexed (indexed="true"). It is also a required field (required="true"): under this schema, every record loaded into SOLR must provide a value for this field. The <uniqueKey>id</uniqueKey> declaration states that the id field must be unique (SOLR itself does not require a unique ID field; this is only a rule established by the default schema). The attribute stored="true" indicates that the id field should be retrievable, that is, its original value is kept and can be returned in query results.

    Why would you ever set stored to false? You can use non-retrievable fields to sort results in different ways. For example, you can use nameSort, which is a copy of the name field (see the copyField command near the end of the listing) but behaves differently. Note that nameSort is a string, whereas name is text. The default schema processes the two types slightly differently.

  • The cat field is multiValued, so you can define more than one value for it. For example, if an application manages content, you might assign several topics to an article. You could use the cat field (or a similar custom field) to capture all of them.
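
To see the attributes discussed above side by side, you can parse the schema snippet and summarize each field. The following Python sketch is only an illustration: it embeds the fields from the listing with the elided parts left out, and the function name describe_fields is hypothetical. Treating a missing multiValued or required attribute as false is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# The fields shown in the default schema snippet, without the elided parts.
SCHEMA_SNIPPET = """
<schema name="example" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="nameSort" type="string" indexed="true" stored="false"/>
    <field name="cat" type="text" indexed="true" stored="true" multiValued="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <copyField source="name" dest="nameSort"/>
</schema>
"""

def describe_fields(schema_xml):
    """Map each field name to its type and its boolean attributes."""
    root = ET.fromstring(schema_xml)
    summary = {}
    for field in root.iter("field"):
        summary[field.get("name")] = {
            "type": field.get("type"),
            "stored": field.get("stored") == "true",
            # Assumption: an absent attribute is treated as false here.
            "multiValued": field.get("multiValued", "false") == "true",
            "required": field.get("required", "false") == "true",
        }
    return summary

for name, attrs in describe_fields(SCHEMA_SNIPPET).items():
    print(name, attrs)
```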

Listing 4 shows the example/exampledocs/ipod_other.xml file, which contains two entries from the iPod accessories category.

Listing 4. Data formatted for the default SOLR schema

                
<add>
<doc>
<field name="id">F8V7067-APL-KIT</field>
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter, white</field>
<field name="weight">4</field>
<field name="price">19.95</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>

<doc>
<field name="id">IW-02</field>
<field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter for iPod, white</field>
<field name="weight">2</field>
<field name="price">11.50</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>
</add>

The add element is a SOLR command that adds the enclosed records to the index. Each record is captured in a doc element, which uses a set of field elements to specify the field values. The weight, price, inStock, manu, features, and popularity fields are all additional fields defined in the default SOLR schema. The features field has the same attributes as cat, but its meaning is different: it lists the features of a product, of which there may be many.
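
The add/doc/field structure described above is regular enough to generate from ordinary records. The following Python sketch shows one way to do it; the function name build_add_message and the sample record are hypothetical. A list value is written as repeated field elements (for multi-valued fields such as cat), and ElementTree escapes characters such as & automatically.

```python
import xml.etree.ElementTree as ET

def build_add_message(records):
    """Serialize a list of dicts as a SOLR <add> message."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # A list becomes several <field> elements with the same name.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", {"name": name})
                if isinstance(item, bool):
                    field.text = "true" if item else "false"
                else:
                    field.text = str(item)
    return ET.tostring(add, encoding="unicode")

record = {
    "id": "IW-02",
    "name": "iPod & iPod Mini USB 2.0 Cable",
    "cat": ["electronics", "connector"],
    "inStock": False,
}
print(build_add_message([record]))
```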




Search for Auto Parts

In this example, a collection of auto parts is indexed. Each part has multiple fields. Table 1 shows a sample of the most important ones. The first column lists the field name, the second gives a brief description, the third lists the logical type, and the fourth shows the SOLR field used to represent the data (as defined in the schema in Listing 5).

Table 1. Fields of an auto parts record

Name                             Description               Type            SOLR field
Part number (unique, mandatory)  ID number                 String          partno
Name                             Brief description         String          name
Model (required, multi-value)    Model, such as "Camaro"   String          model
Model year (multi-value)         Model year, such as 2001  String          year
Price                            Unit price                Floating point  price
Inventory                        In stock?                 Boolean         inStock
Features                         Features of the part      String          features
Timestamp                        Activity record           String          timestamp
Weight                           Shipping weight           Floating point  weight

Listing 5 shows the portion of the SOLR schema used for the auto parts index. Most of it is based on the default SOLR schema. The specific fields (their names and attributes) simply replace those found in the fields element of the default schema snippet shown earlier.

Listing 5. Auto parts index schema

                
<?xml version="1.0" encoding="utf-8" ?>
<schema name="autoparts" version="1.0">
...
<fields>
<field name="partno" type="string" indexed="true"
stored="true" required="true" />

<field name="name" type="text" indexed="true"
stored="true" required="true" />

<field name="model" type="text_ws" indexed="true" stored="true"
multiValued="true" required="true" />

<field name="year" type="text_ws" indexed="true" stored="true"
multiValued="true" omitNorms="true" />

<field name="price" type="sfloat" indexed="true"
stored="true" required="true" />

<field name="inStock" type="boolean" indexed="true"
stored="true" default="false" />

<field name="features" type="text" indexed="true"
stored="true" multiValued="true" />

<field name="timestamp" type="date" indexed="true"
stored="true" default="NOW" multiValued="false" />

<field name="weight" type="sfloat" indexed="true" stored="true" />
</fields>

<uniqueKey>partno</uniqueKey>

<defaultSearchField>name</defaultSearchField>
</schema>
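
Before exporting records, it can help to check them against the rules in this schema. The following Python sketch is an illustration only: the required and multi-valued field names are copied by hand from Listing 5, and the function name check_part is hypothetical. It does not replace the checks that SOLR itself performs.

```python
# Field rules copied by hand from the schema in Listing 5.
REQUIRED_FIELDS = ("partno", "name", "model", "price")
MULTI_VALUED_FIELDS = ("model", "year", "features")

def check_part(record):
    """Return a list of ways in which the record conflicts with the schema."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        is_empty = hasattr(value, "__len__") and len(value) == 0
        if value is None or is_empty:
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        if isinstance(value, (list, tuple)) and name not in MULTI_VALUED_FIELDS:
            problems.append(f"field is not multi-valued: {name}")
    return problems

good = {"partno": "1", "name": "Spark plug", "model": ["Boxster", "924"], "price": 25.0}
bad = {"partno": "3", "name": ["Wiper", "Blade"], "model": []}
print(check_part(good))
print(check_part(bad))
```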

Given the fields above, you need to export the auto parts database, format it as shown in Listing 6, and upload it to SOLR.

Listing 6. Auto parts database formatted for indexing

                
<add>
<doc>
<field name="partno">1</field>
<field name="name">Spark plug</field>
<field name="model">Boxster</field>
<field name="model">924</field>
<field name="year">1999</field>
<field name="year">2000</field>
<field name="price">25.00</field>
<field name="inStock">true</field>
</doc>
<doc>
<field name="partno">2</field>
<field name="name">Windshield</field>
<field name="model">911</field>
<field name="year">1991</field>
<field name="year">1999</field>
<field name="price">15.00</field>
<field name="inStock">false</field>
</doc>
</add>

Let's install the new schema and load the data into SOLR. First, press Ctrl+C to stop the SOLR daemon (if it is still running). Make a backup copy of the existing SOLR schema in example/solr/conf/schema.xml. Next, create a text file containing the schema in Listing 5, save it as /tmp/schema.xml, and copy it to example/solr/conf/schema.xml. Create another file for the data shown in Listing 6. Now you can restart SOLR and use the posting utility provided with the example.

Listing 7. Start SOLR with the new schema

                
$ cd apache-solr-1.2.0/example
$ cp solr/conf/schema.xml solr/conf/default_schema.xml
$ chmod a-w solr/conf/default_schema.xml

$ vi /tmp/schema.xml
...
$ cp /tmp/schema.xml solr/conf/schema.xml

$ vi /tmp/parts.xml
...

$ java -jar start.jar
...
2007-11-11 16:56:48.279::INFO: Started SocketConnector @ 0.0.0.0:8983

$ java -jar exampledocs/post.jar /tmp/parts.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8,
other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update...
SimplePostTool: POSTing file parts.xml
SimplePostTool: COMMITting Solr index changes...
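
The posting utility does two things: it sends the XML file to the update URL shown in its output, and then it commits the changes. The following Python sketch does the same over HTTP. It is an illustration under assumptions, not the utility's actual implementation: the function names are hypothetical, and the text/xml content type and the <commit/> message are assumed to be what the update handler expects.

```python
from urllib.request import Request, urlopen

# Update URL as printed by the posting utility in Listing 7.
UPDATE_URL = "http://localhost:8983/solr/update"

def make_update_request(xml_message, url=UPDATE_URL):
    """Build (but do not send) a POST request carrying one XML message."""
    return Request(
        url,
        data=xml_message.encode("utf-8"),
        # Assumption: the update handler accepts UTF-8 encoded text/xml.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )

def post_file_and_commit(path, url=UPDATE_URL):
    """Send the file's contents, then a commit, to a running SOLR server."""
    with open(path, encoding="utf-8") as handle:
        payload = handle.read()
    for message in (payload, "<commit/>"):
        with urlopen(make_update_request(message, url)) as response:
            response.read()

# Example (needs a running server): post_file_and_commit("/tmp/parts.xml")
```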

Success! To verify that the index exists and contains two documents, point your browser to http://localhost:8983/solr/admin/ again. You should see "(autoparts)" at the top of the page. If you do, click in the query box in the middle of the page and type partno:1 OR partno:2.

The result should be similar to the following:

3 on 10 0 partno:1 OR partno:2 2.2
true Boxster 924 Spark plug 1 25.0 2007-11-11T21:58:45.899Z 1999 2000
false 911 Windshield 2 15.0 2007-11-11T21:58:45.953Z 1991 1999

Try other queries. The Lucene wiki describes the syntax of Lucene queries (Lucene is the search engine inside SOLR; see references).
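
Queries typed into the admin page can also be sent from a program as URL parameters. The following Python sketch only builds the URL. The /solr/select path and the q, start, and rows parameter names are assumptions that do not appear in this article, and the function name build_query_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the example server answers queries at this path.
SELECT_URL = "http://localhost:8983/solr/select"

def build_query_url(query, start=0, rows=10, base=SELECT_URL):
    """Return a query URL with the Lucene query string safely encoded."""
    params = urlencode({"q": query, "start": start, "rows": rows})
    return f"{base}?{params}"

print(build_query_url("partno:1 OR partno:2"))
```

Encoding the query with urlencode matters because Lucene queries routinely contain colons, spaces, and quotation marks.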

You should also try editing and reloading the data. Because the partno field is declared unique, uploading the same part number again simply replaces the old index record with the new one. Besides add, you can also use the commit, optimize, and delete commands. The last of these can delete a specific record by ID, or delete multiple records by query.
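
The two forms of delete can be sketched as small XML messages, in the same spirit as the add message. The following Python sketch is an illustration: the function names are hypothetical, and the id and query child elements are assumed to be the forms the update handler accepts for this SOLR version.

```python
import xml.etree.ElementTree as ET

def delete_by_id(record_id):
    """Build a message that deletes the record with the given unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(record_id)
    return ET.tostring(delete, encoding="unicode")

def delete_by_query(query):
    """Build a message that deletes every record matching a Lucene query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

print(delete_by_id(1))
print(delete_by_query("model:911"))
```

As with add, a delete does not become visible to searches until a commit follows it.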

 
