POI (Point of Interest) Design and Name Search in LBS Applications
2015-12-26 22:15:18
The heavyweight merger between Meituan and Dianping, a deal involving 3 billion dollars, attracted a great deal of attention. What Meituan valued most in this merger was Dianping's store POI data (POI stands for point of interest, a general term for the places users care about). Dianping has 17 million store POIs, about twice as many as Meituan. Of Meituan's 8 million POIs, about 1.5 million have group-buying deals, so its POI-to-deal conversion rate is high. Dianping's POI conversion rate is lower than Meituan's, but its POI data was still a prize in Meituan's eyes and became an important bargaining chip in the merger.
Since POI data is so important, how should a POI system be designed, and how should POIs be retrieved?
Design of the POI System
Dianping obtains its POIs by brute force: it sends people out to "sweep the streets". Field staff visit each storefront to collect the data and use GPS to determine the location, and office staff then enter the field-collected data into the geographic database. NavInfo collects POIs in the same way.
Once the POIs have been collected, the next step is to design the POI system. The core goal of the design is to speed up POI retrieval. To achieve this, one can carefully design the storage and access architecture, sort the data linearly, and build a spatial index (see the first article in this series, "Spatial indexing methods for LBS data", in Programmer, October B issue).
"Storage and Access in the POI System"
POI names are often duplicated, so to reduce storage and speed up retrieval, all names can be stored once in a shared table. Many POIs can then refer to the same name entry, which shrinks the space needed for names.
The same holds for icons. Because POI icons also repeat, they can be stored together in one place. Fewer than 100 icons are enough for an entire map data set, and each POI only needs a reference to its icon.
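The shared-storage idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `StringTable` class, the `Poi` record, and the sample names, icon file names, and coordinates are all hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass


class StringTable:
    """Stores each distinct string once and hands out integer ids."""

    def __init__(self):
        self._ids = {}     # string -> id
        self.values = []   # id -> string

    def intern(self, text):
        # Reuse the existing id if this string has been seen before.
        if text not in self._ids:
            self._ids[text] = len(self.values)
            self.values.append(text)
        return self._ids[text]


@dataclass
class Poi:
    name_id: int   # reference into the shared name table
    icon_id: int   # reference into the shared icon table
    lon: float
    lat: float


names = StringTable()
icons = StringTable()


def make_poi(name, icon, lon, lat):
    return Poi(names.intern(name), icons.intern(icon), lon, lat)


# Hypothetical sample data: two branches share one name and one icon.
pois = [
    make_poi("Starbucks", "cafe.png", 116.4800, 39.9960),
    make_poi("Starbucks", "cafe.png", 116.3100, 39.9840),
    make_poi("Wangjing Commercial Building", "office.png", 116.4700, 39.9900),
]
```

Each POI record now carries two small integers instead of two strings, and the name text is recovered with `names.values[poi.name_id]`.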
Besides general information, a POI also carries in-depth information. General information is common data such as the store name or telephone number. In-depth information includes store reviews, business hours, and data that the map has to refresh (for example, POI data fetched over the network). Future map data will certainly be interactive and networked, so a large part of it may be interactive in-depth data, such as in-depth information from other sites (Taobao, Tmall). Baidu and AutoNavi (Gaode) are in fact already doing this: much of Baidu's data is gathered by web crawlers, and much of AutoNavi's POI data comes from Taobao/Tmall.
"Linear Sort"
If all geometry types are grouped into point, line, and surface data, a POI clearly belongs to the point type.
A point can be designed very simply, for example as nothing more than a longitude/latitude pair placed on the map. It can also be designed more elaborately: besides building a good spatial index for retrieval, the points themselves can be sorted so that they are found faster.
There are two main ways to sort points: Hilbert curve sorting and Z-order sorting. Either can be used to sort POIs and so shorten the time a name search takes.
Hilbert Curve Sort
The Hilbert curve is a fractal space-filling curve that fills a planar square. It was introduced by the great mathematician David Hilbert in 1891.
Because it fills the plane, its Hausdorff dimension (the Hausdorff-Besicovitch dimension, a standard measure of fractal dimension) is 2.
The Hilbert curve essentially defines an ordering, which can be written as an L-system as follows.
Variables: L, R
Constants: F, +, −
The rules are:
L → +RF−LFL−FR+
R → −LF+RFR+FL−
Here F means move forward, − means turn right 90°, and + means turn left 90°.
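To see how such rules produce a visiting order, the sketch below expands the L-system and traces the resulting commands on a grid. It uses the standard form of the two Hilbert rules, with ASCII `-` standing in for `−`. The axiom `L`, the starting cell (0, 0), and the initial heading (east) are assumptions made for this example; the function names `expand` and `trace` are likewise hypothetical.

```python
# Standard Hilbert-curve L-system rules (ASCII "-" replaces the minus sign).
RULES = {"L": "+RF-LFL-FR+", "R": "-LF+RFR+FL-"}


def expand(axiom, order):
    """Apply the rewrite rules `order` times to the axiom string."""
    s = axiom
    for _ in range(order):
        s = "".join(RULES.get(ch, ch) for ch in s)
    return s


def trace(commands):
    """Walk the command string and return the list of visited grid cells."""
    x, y = 0, 0
    dx, dy = 1, 0                 # assumed initial heading: east
    path = [(x, y)]
    for ch in commands:
        if ch == "F":             # move forward one cell
            x, y = x + dx, y + dy
            path.append((x, y))
        elif ch == "+":           # turn left 90 degrees
            dx, dy = -dy, dx
        elif ch == "-":           # turn right 90 degrees
            dx, dy = dy, -dx
        # "L" and "R" are variables only and draw nothing
    return path


order1 = trace(expand("L", 1))    # [(0, 0), (0, 1), (1, 1), (1, 0)]
order2 = trace(expand("L", 2))    # 16 cells of a 4 x 4 grid, each visited once
```

The position of a cell in the returned path is its rank in the Hilbert ordering for that grid size.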
The steps for sorting are shown in Figure 1.
Fig. 1 Hilbert sort
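In practice the ordering is usually computed directly from cell coordinates rather than by drawing the curve. The sketch below follows the widely published iterative conversion from (x, y) to distance along the curve. It assumes the POIs have already been snapped to an n x n grid with n a power of two; the function name `hilbert_index` and the sample POIs are hypothetical, and the curve orientation is not guaranteed to match Figure 1.

```python
def hilbert_index(n, x, y):
    """Distance of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has a standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Hypothetical POIs as (name, grid_x, grid_y) on a 4 x 4 grid.
pois = [("A", 3, 0), ("B", 0, 0), ("C", 1, 1), ("D", 0, 3)]
ordered = sorted(pois, key=lambda p: hilbert_index(4, p[1], p[2]))
ordered_names = [p[0] for p in ordered]   # ['B', 'C', 'D', 'A']
```

Once POIs are stored in this order, cells that are neighbors along the curve are also neighbors on the map, which keeps nearby POIs close together on disk.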
Z-Order Sort (Morton Sort)
Like the Hilbert curve, Z-order is a way to fill a multidimensional space with a curve (in practice, a way to order points).
Z-ordering is a very important method, which we described in "Spatial indexing methods for LBS data". The resulting order is shown in Figure 2.
Figure 2 Z Sort
Both methods map two-dimensional space onto one dimension, so that points in the plane can be sorted and POI names can then be retrieved quickly.
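A Z-order (Morton) key is obtained by interleaving the bits of the two grid coordinates. The sketch below is one common formulation, not necessarily the one used in the series: it assumes non-negative integer grid coordinates and puts the x bits in the even positions, which is a convention chosen for this example. The name `morton_index` is hypothetical.

```python
def morton_index(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order key.

    x occupies the even bit positions, y the odd ones (assumed convention).
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Sorting the four cells of a 2 x 2 grid traces the letter "Z".
cells = [(1, 1), (0, 1), (1, 0), (0, 0)]
z_sorted = sorted(cells, key=lambda c: morton_index(c[0], c[1]))
# z_sorted == [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Compared with the Hilbert ordering, the Morton key is cheaper to compute but occasionally jumps between distant cells.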
"Retrieval in the POI System"
Search is the main channel through which users interact with an LBS. LBS applications vary widely, but their search technology can be summed up as follows: POIs (points of interest) are the content, name search is the form of expression, and the recommendation system is the core technology (we will introduce recommendation technology in the next issue).
In fact, recommendation technology is itself search technology, because recommendation often amounts to ranking results, as in web page ranking (Google invented a page-ranking technique). Recommendation is therefore the most central part of search technology.
However, search must do more than surface the content users care about most (the job of the recommendation system); it must also be efficient. Two kinds of techniques are commonly used to make search efficient: large-scale hashing, such as the massive data storage technology used in Hadoop, and tree-building techniques. Large-scale hashing is more common in big search products (such as the Sogou input method or Baidu search). Typically the text is segmented into a word base, and a large hashed inverted index is then built over it. This kind of technology is fairly difficult to implement.
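The inverted-index idea can be shown at toy scale with an in-memory hash map. This is only a sketch under assumptions: real systems use proper word segmentation and distributed storage, whereas here `tokenize` simply lowercases and splits on whitespace, and the function names and sample POI names are hypothetical.

```python
from collections import defaultdict


def tokenize(name):
    # Stand-in for real word segmentation: lowercase and split on spaces.
    return name.lower().split()


def build_inverted_index(names):
    """Map each token to the set of POI ids whose name contains it."""
    index = defaultdict(set)
    for poi_id, name in enumerate(names):
        for token in tokenize(name):
            index[token].add(poi_id)
    return index


def search(index, query):
    """Return ids of POIs whose names contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


poi_names = ["Wangjing Commercial Building", "Wangjing Park", "Commercial Bank"]
index = build_inverted_index(poi_names)
hits = search(index, "wangjing commercial")   # {0}
```

Each lookup is a hash probe per query token followed by a set intersection, which is what makes the approach scale when the index is spread over many machines.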
Compared with large-scale hashing, tree-building techniques are simple and easy to implement. The most common ones are FTS (full-text search built on B-trees) and automatic prompts.
FTS
FTS is usually built with SQLite FTS3 or FTS4, or a similar technology. It essentially relies on B-trees and is fairly simple to use: in SQLite3, one or two SQL statements are enough to implement an FTS search.
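As a rough illustration, the sketch below uses Python's built-in `sqlite3` module with an FTS4 virtual table. It assumes the underlying SQLite library was compiled with FTS3/FTS4 support, which is common but not guaranteed; the table name `poi`, its single `name` column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One statement creates the full-text index (requires FTS3/FTS4 in the build).
conn.execute("CREATE VIRTUAL TABLE poi USING fts4(name)")
conn.executemany(
    "INSERT INTO poi(name) VALUES (?)",
    [("Wangjing Commercial Building",), ("Wangjing Park",), ("Commercial Bank",)],
)

# One statement runs the search; MATCH looks up whole tokens.
word_hits = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM poi WHERE name MATCH ? ORDER BY name", ("wangjing",)
    )
]

# A trailing * turns the term into a prefix query.
prefix_hits = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM poi WHERE name MATCH ? ORDER BY name", ("comm*",)
    )
]
```

The default tokenizer splits on non-alphanumeric characters and folds ASCII case, so Chinese names would need a different tokenizer or pre-segmented text.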
Auto Prompt
There are roughly four kinds of automatic prompt technology.
(1) Online technology (with word segmentation): this approach relies on a huge word segmentation database and the computing power of large servers, and generates the prompt vocabulary word by word (Chinese or English). It is commonly used in search engines such as Google and Baidu.
(2) Offline technology (with word segmentation): this approach resembles the online one, but the vocabulary is pre-generated or built up from the user's own input. The next word is suggested from the relationships between words in that vocabulary. The vocabulary produced this way does not map one-to-one to web pages or text addresses, so the technique is mostly used in input methods. If it is applied to an offline search engine, each entry has to be mapped to multiple pages or texts.
(3) Offline technology (without word segmentation): here the NVC (next valid character) technique is often used.
NVC is widely used abroad. It predates word segmentation, was first applied in in-vehicle navigation units, and has long dominated there. The reason is that NVC can produce the unique index of a POI or road at the same moment it finds the word. Compared with methods (1) and (2), method (3) is therefore precise, simple to implement, and simple to maintain.
The NVC method for road or POI names works as follows:
The names of all roads or POIs are stored in the database, and the NVC data is generated and stored ahead of time. Because the next valid character for any input is known in advance, the prompt for the next character can be produced in a very short time. The relationship between the characters of each name is shown in Figure 3:
Fig. 3 Structure of NVC
(4) Full-text prompts for the name. This method is close to auto-completion: from the first few characters the user types, it suggests one or more complete road or POI names in a drop-down box. For example:
Figure 4 Auto Tip
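Methods (3) and (4) can both be illustrated with a character trie. Using a trie here is an assumption made for this sketch; the article does not specify how NVC data is actually stored. The class names, the uppercase normalization, and the sample road names are all hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.poi_ids = []    # ids of names that end exactly at this node


class NameTrie:
    def __init__(self):
        self.root = Node()

    def add(self, name, poi_id):
        node = self.root
        for ch in name.upper():
            node = node.children.setdefault(ch, Node())
        node.poi_ids.append(poi_id)

    def _walk(self, prefix):
        """Return the node reached by `prefix`, or None if no name starts with it."""
        node = self.root
        for ch in prefix.upper():
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def next_valid_chars(self, prefix):
        """Method (3): characters that can legally follow the typed prefix."""
        node = self._walk(prefix)
        return sorted(node.children) if node else []

    def complete(self, prefix, limit=5):
        """Method (4): up to `limit` full names (with ids) that start with prefix."""
        start = self._walk(prefix)
        results = []
        stack = [(start, prefix.upper())] if start else []
        while stack and len(results) < limit:
            node, text = stack.pop()
            for poi_id in node.poi_ids:
                results.append((text, poi_id))
            # Push in reverse order so names come out alphabetically.
            for ch in sorted(node.children, reverse=True):
                stack.append((node.children[ch], text + ch))
        return results[:limit]


trie = NameTrie()
for poi_id, name in enumerate(["Main St", "Maple Ave", "Market St", "Mill Rd"], start=1):
    trie.add(name, poi_id)

keys = trie.next_valid_chars("ma")        # ['I', 'P', 'R']
suggestions = trie.complete("ma")
```

A keyboard in a navigation unit can grey out every key not returned by `next_valid_chars`, and each completed name arrives together with its POI id, which reflects the one-to-one property described for method (3).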
In general, methods (1) and (2) work very well and are widely used in large search products such as Baidu, Google, and the Sogou input method. However, they are expensive to develop and place high demands on hardware or the network, so they are not widely used in high-end vehicle navigation products or in low-end LBS products.
For low-end LBS products, the main obstacle is development cost. For high-end car navigation, the obstacle is that the product must be accurate, fast, and offer a friendlier interface. With methods (1) and (2), unless a great deal is spent on research and development, it is often impossible to guarantee that the words the user enters map one-to-one to web or text addresses. For example, suppose a user searches for "Wangjing Commercial Building". Many POIs may match, and if the one the user really wants is unpopular, methods (1) and (2) may not place it on the first page, or may leave it out altogether. This is unacceptable for high-end car navigation, especially while carmakers' thinking remains relatively conservative.
Each of the four methods therefore has its own strengths when used in an LBS, and developers should choose a technical solution that fits their own research and development capabilities and needs.
To sum up, we have now covered POI design and name search. Both exist to serve search, and the recommendation system is the core technology of search, so we will cover the recommendation system in the next issue.
Author Introduction
Jia Dicheng, senior engineer at Alibaba, specializes in data compilation, data mining system analysis, and architecture design, and has led the development of several data compilers for high-end vehicle navigation and ADAS. He has published more than 40 invention patents and papers and is the author of "LBS Core Technology" and "Data Mining Core Technology Secrets".