Whether they are looking for the nearest café on a GPS-enabled smartphone, finding nearby friends through a social networking site, or tracking all the trucks that carry certain goods in a particular city, more and more people and businesses are using location-based search services. Building a location-aware search service has usually been part of an expensive, dedicated solution, and has typically been done by geospatial experts. However, the popular open source search library Apache Lucene and the powerful Lucene-based search server Apache Solr have recently added spatial search capabilities.
Location is essential in search. Location is not only king in real estate; it can also be used in search to help users in a specific place quickly find useful information. For example, if you are a business directory provider (such as a "Yellow Pages" site), when a user needs a plumber, the site must return plumbers near the user's home. If you run a travel site, you must let travelers search for places of interest near their location to help them fill out their itinerary. If you are building a social networking site, you will want to use location information to help users connect with nearby friends. The popularity of location-aware devices (such as car navigation systems and GPS-enabled cameras) and the large amount of freely available map data offer many opportunities for building geographic information systems (GIS) that deliver better search results to end users.
Spatial information can also be leveraged outside of search, but in this article I will focus primarily on how to use spatial information to improve search applications with Apache Lucene and Apache Solr. Why use a search engine? Not because you have to: many good (and even free) GIS tools are available. However, building an application on top of a search engine provides several powerful features that more traditional approaches cannot easily match. Search systems are very good at combining structured and unstructured data, which lets users enter free-form queries that search free text such as descriptions and titles while restricting or modifying the results based on geographic data. For example, a travel site could implement a feature that lets users find four-star hotels in Boston, Massachusetts, whose descriptions mention comfortable beds and 24-hour service. Some search systems, such as Apache Solr, also provide faceting, highlighting, and spell checking on result sets, so that applications can help users find the desired results efficiently.
I'll start with a brief introduction to some of Lucene's key concepts, leaving the in-depth details for the reader to explore. Next, I'll introduce some basic geospatial search concepts. GIS is a broad field that is hard to cover in detail, so I will concern myself only with the basic concepts needed to find services, people, and other items of everyday interest. The article concludes with a discussion of ways to index and search spatial information with Lucene and Solr. I will illustrate these concepts with a real but simple example that uses data from the OpenStreetMap (OSM) project.
Review the key Lucene concepts
Apache Lucene is a high-performance search library based on Java™. Apache Solr is a search server that uses Lucene to provide search, faceting, and more over HTTP. Both are available under the permissive Apache Software License.
In essence, both Solr and Lucene represent content as documents. A document consists of one or more fields and an optional boost value that indicates the importance of the document. A field consists of the actual content to be indexed or stored, metadata that tells Lucene how to handle the content, and a boost value that indicates the importance of the field. It is up to you to decide how to represent content as documents and fields, depending on how you want to search or access the information in your documents. The relationship between a unit of content and a document can be one-to-one or one-to-many. For example, I might choose to represent a Web page as a single document with several fields, such as title, keywords, and body. For a book, I might choose to represent each page as a separate document. As you'll see later, this distinction is important when encoding spatial data for search. Field content can be indexed, or simply stored as is for use by the application. If the content is indexed, the application can search it. Indexed content can also be analyzed to produce terms (often called tokens). Terms are the basis for what is looked up and used during search. A term is often a single word, but it does not have to be.
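To make the document, field, and term vocabulary concrete, here is a minimal Python sketch of the model described above. It is a toy illustration, not Lucene's API: the `Field`, `Document`, `analyze`, and `build_index` names are hypothetical, the analyzer only lowercases and splits on whitespace, and boosts are carried as plain numbers without being used in scoring.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True   # searchable through the inverted index
    stored: bool = True    # kept as-is for the application to read back
    boost: float = 1.0     # relative importance of this field


@dataclass
class Document:
    fields: list
    boost: float = 1.0     # relative importance of the whole document


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace to produce terms."""
    return text.lower().split()


def build_index(docs):
    """Map (field name, term) -> set of document ids, for indexed fields only."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for f in doc.fields:
            if f.indexed:
                for term in analyze(f.value):
                    index[(f.name, term)].add(doc_id)
    return index


# Hypothetical sample content: one document per hotel listing.
docs = [
    Document([Field("title", "Harbor View Hotel"),
              Field("body", "Comfortable beds near the harbor"),
              Field("url", "http://example.com/harbor", indexed=False)]),
    Document([Field("title", "Airport Motel"),
              Field("body", "Cheap rooms and free parking")]),
]
index = build_index(docs)
print(sorted(index[("body", "comfortable")]))  # ids of matching documents
```

The `url` field is stored but not indexed, so the application can display it but no term lookup will ever match it.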
On the query side, Lucene and Solr provide rich capabilities for expressing user queries, from basic keyword queries to phrase and wildcard queries. Lucene and Solr also let you restrict the search space by applying one or more filters, which is important for spatial search. Range queries and range filters are key mechanisms for restricting that space. In a range query (or filter), the user specifies that all matching documents must fall between two values that have a natural sort order. For example, range queries are commonly used to find all documents from the past year or the past month. Under the hood, Lucene must enumerate indexed terms to identify all documents that fall within the range. As I'll show later, setting up range queries properly is one of the key factors in the query performance of a spatial search application.
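The following Python sketch shows the idea behind a range query over a sorted term dictionary. It is a simplified model under stated assumptions, not how Lucene is implemented internally: `build_term_dictionary` and `range_query` are hypothetical helpers, and the dates are stored as ISO strings so that lexicographic order matches chronological order.

```python
from bisect import bisect_left, bisect_right


def build_term_dictionary(values_by_doc):
    """Return a sorted list of (term, [doc ids]) pairs for a single field."""
    postings = {}
    for doc_id, value in values_by_doc.items():
        postings.setdefault(value, []).append(doc_id)
    return sorted(postings.items())


def range_query(term_dict, lower, upper):
    """Enumerate every term in [lower, upper] and collect its documents."""
    terms = [term for term, _ in term_dict]
    start = bisect_left(terms, lower)
    end = bisect_right(terms, upper)
    matches = set()
    for _, doc_ids in term_dict[start:end]:
        matches.update(doc_ids)
    # Return the matching documents and the number of terms visited.
    return matches, end - start


# Hypothetical data: document id -> publication date.
dates = {0: "2009-01-15", 1: "2009-06-30", 2: "2009-07-04", 3: "2010-02-01"}
term_dict = build_term_dictionary(dates)
hits, visited = range_query(term_dict, "2009-06-01", "2009-12-31")
print(sorted(hits), visited)
```

In this model the work grows with the number of distinct terms inside the range, which suggests why the way values are encoded as terms matters for range query performance.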
Lucene and Solr also provide the notion of function queries, which let you use the values of fields (such as latitude and longitude) as part of the scoring mechanism, rather than relying only on the internal collection statistics that make up the primary scoring mechanism. This capability comes into play later, when I demonstrate some of Solr's distance-based functions.
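As a rough illustration of letting field values influence the score, the Python sketch below combines a text relevance score with a great-circle distance computed from each document's latitude and longitude. Everything here is an assumption made for illustration: the `haversine_km` and `combined_score` helpers, the sample coordinates and text scores, and the `text_score / (1 + distance)` formula are hypothetical and are not Solr's built-in functions or its scoring formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def combined_score(doc, user_lat, user_lon):
    """Hypothetical scoring: damp the text score by distance from the user."""
    distance = haversine_km(doc["lat"], doc["lon"], user_lat, user_lon)
    return doc["text_score"] / (1.0 + distance)


# Hypothetical documents with a precomputed text relevance score.
docs = [
    {"name": "far hotel", "lat": 42.3736, "lon": -71.1097, "text_score": 1.2},
    {"name": "near hotel", "lat": 42.3554, "lon": -71.0605, "text_score": 1.0},
]
user_lat, user_lon = 42.3601, -71.0589  # assumed user position in Boston

ranked = sorted(docs,
                key=lambda d: combined_score(d, user_lat, user_lon),
                reverse=True)
for d in ranked:
    print(d["name"], round(combined_score(d, user_lat, user_lon), 3))
```

With these sample values the closer hotel ranks first even though its text score is lower, which is the kind of effect a distance-based function in the scoring can produce.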