Spatial Index: The Geohash Algorithm, Its Implementation and Optimization

Source: Internet
Author: User

Objective

The previous post covered how to use spatial indexes and which databases support them, so it is worth looking beneath the application layer at how a spatial index is actually implemented.

Current spatial index implementations include the R-tree and its GiST variant, the quadtree, the grid index and so on. The grid index needs little explanation: it uses an ordinary hash table to store the mapping between grid cells and the locations inside them. Today's topic is the Geohash algorithm. A Geohash index is stored in a B-tree, but I think it also borrows part of the idea of a grid index.

Geohash principle

The principle of the Geohash algorithm is simple to state. Picture a red dot on a sheet of square paper:

    1. Divide the whole sheet into a left half and a right half; mark the left half 0 and the right half 1;
    2. Divide the half that contains the red dot into left and right again and mark it the same way; the red dot ends up with the horizontal code 10;
    3. Divide the sheet in the vertical direction the same way, marking the two halves 0 and 1; the red dot's vertical code is 01;
    4. Merge the horizontal and vertical codes. The rule is that vertical bits go in the odd positions and horizontal bits in the even positions, counting positions from 0 (or the opposite, as long as it is consistent throughout the system), so the red dot on the sheet is identified as 1001;
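The merge in step 4 can be sketched in a few lines of PHP. This is only an illustration: the function name `interleaveBits` is made up for this example and does not come from the original implementation.

```php
<?php
// Merge two equal-length bit strings by alternating their bits.
// $first fills positions 0, 2, 4, ... and $second fills positions 1, 3, 5, ...
// (hypothetical helper, named for this example only).
function interleaveBits(string $first, string $second): string
{
    $merged = '';
    $length = min(strlen($first), strlen($second));
    for ($i = 0; $i < $length; $i++) {
        $merged .= $first[$i] . $second[$i];
    }
    return $merged;
}

// Red dot from the steps above: horizontal code 10, vertical code 01.
echo interleaveBits('10', '01'), PHP_EOL; // 1001
```

Because the first bit of each axis lands at the front of the merged string, squares that agree on their first horizontal and vertical bits also agree on the first two merged bits.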

Marking a single square reveals no pattern, but if we label every square we see that the four squares grouped into the same corner share the same prefix.

A shared prefix means a B-tree index can find nearby points by looking for points with the same prefix, and this shared-prefix property is exactly what the Geohash algorithm relies on.

Mercator projection

The Mercator projection is a normal-aspect conformal cylindrical projection, devised by the Flemish cartographer Gerardus Mercator (G. Mercator) in 1569. Imagine a cylinder whose axis coincides with the Earth's axis and which is tangent or secant to the globe; the graticule is projected onto the cylinder under the conformal (equal-angle) condition, and the cylinder is then unrolled into a plane. Of the tangent and secant variants, the tangent cylindrical projection is the earliest and the most commonly used.

Put simply, the Mercator projection treats the whole surface of the Earth as a square. Of course the Earth's surface is not strictly a square, and the projection is distorted near the poles. This article focuses on the principle and does not cover the correction (I do not understand it myself, so I will run away).

Implementation

On the plane produced by the Mercator projection, we can divide the whole surface of the Earth into small squares in the same way as we divided the square paper above.

Take the point (116.276349, 40.040875) and divide by longitude first:

    1. A longitude in [-180, 0] is marked 0, and a longitude in [0, 180] is marked 1;
    2. Continue dividing: a longitude in [0, 90] is marked 0, and a longitude in [90, 180] is marked 1;
    3. Dividing 20 times in this way brings the accuracy of the square to about ±19 m (±0.019 km, see the table at the end of this article) and gives the longitude's binary code 11010010101011110111;
    4. Divide the latitude in the same way, which gives the latitude's binary code 10111000111100100111;
    5. Merge the two, taking the latitude bit first in each pair, to get the 40-bit binary string 11011 01110 00010 01110 11100 10111 01001 11111;
    6. Encode this binary string in Base32 (the principle is the same as Base64; see my other article, "Character sets and encodings in web development"; the mapping table is at the end of this article) to get the Geohash code 3OCO4XJ7;

Any point whose Geohash code has the prefix 3OCO4XJ7 lies in the same square as (116.276349, 40.040875), that is, within a few tens of meters of it. If we store each location together with its Geohash code in the database, finding the points near this one only requires the condition geo_code LIKE '3OCO4XJ7%'.
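The six steps above can be reproduced with a short, self-contained PHP sketch. The function names `coordinateBits` and `encodeExample` are invented for this illustration. Two choices are taken from the worked example rather than from the standard Geohash definition: the latitude bit comes first in each pair, and the alphabet is the Base32 table printed at the end of this article.

```php
<?php
// Base32 alphabet from the mapping table at the end of the article (A-Z, 2-7).
const BASE32_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

// Repeatedly halve [$min, $max]: lower half gives 0, upper half gives 1.
function coordinateBits(float $value, float $min, float $max, int $count): string
{
    $bits = '';
    for ($i = 0; $i < $count; $i++) {
        $mid = ($min + $max) / 2;
        if ($value < $mid) {
            $bits .= '0';
            $max = $mid;
        } else {
            $bits .= '1';
            $min = $mid;
        }
    }
    return $bits;
}

// Hypothetical helper: encode a point the way the worked example does.
function encodeExample(float $lng, float $lat, int $bitsPerAxis = 20): string
{
    $lngBits = coordinateBits($lng, -180.0, 180.0, $bitsPerAxis);
    $latBits = coordinateBits($lat, -90.0, 90.0, $bitsPerAxis);

    // Latitude bit first in each pair, matching the 40-bit string in step 5.
    $merged = '';
    for ($i = 0; $i < $bitsPerAxis; $i++) {
        $merged .= $latBits[$i] . $lngBits[$i];
    }

    // Map every 5-bit group to one Base32 symbol.
    $code = '';
    foreach (str_split($merged, 5) as $group) {
        $code .= substr(BASE32_ALPHABET, (int) bindec($group), 1);
    }
    return $code;
}

echo encodeExample(116.276349, 40.040875), PHP_EOL; // 3OCO4XJ7
```

Swapping the order inside the loop, or using a different alphabet, gives a different but equally valid code, which is why the order has to stay consistent across the whole system.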

Boundary Point problem

But the simplest version of Geohash also has a weakness.

Suppose each square is 2 km on a side and a red dot sits close to the edge of its square. If we query by prefix for the points within 2 km of the red dot, we will not find a black dot that is right next to it but falls in the adjacent square.

To solve this problem we need to consider the eight squares around the red dot's square as well, iterate over the points in that square and its eight neighbors, and return the ones that meet the requirement. So how do we know the prefixes of the surrounding squares?

Looking closely at adjacent squares, we find that the binary codes of two neighbors differ by 1 in longitude or in latitude. So we decode the Geohash code back into its longitude and latitude binary codes, add or subtract 1 on the longitude code, the latitude code, or both, and then merge the result into a Geohash code again.
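A minimal PHP sketch of that neighbor calculation follows. All three function names are hypothetical, and the sketch makes two simplifying assumptions: bits at even positions (counting from 0) belong to the first axis, and a square on the edge of the grid simply has fewer neighbors (a real implementation would wrap longitude around at ±180 degrees).

```php
<?php
// Split a merged code into its two per-axis bit strings.
// Assumes even positions (0, 2, ...) hold the first axis.
function splitBits(string $code): array
{
    $first = '';
    $second = '';
    foreach (str_split($code) as $i => $bit) {
        if ($i % 2 === 0) {
            $first .= $bit;
        } else {
            $second .= $bit;
        }
    }
    return [$first, $second];
}

// Add $delta (-1, 0 or 1) to a bit string; null means "outside the grid".
function shiftBits(string $bits, int $delta): ?string
{
    $value = (int) bindec($bits) + $delta;
    if ($value < 0 || $value >= 2 ** strlen($bits)) {
        return null;
    }
    return str_pad(decbin($value), strlen($bits), '0', STR_PAD_LEFT);
}

// Merged codes of the (up to) eight squares around $code.
function neighborCodes(string $code): array
{
    [$first, $second] = splitBits($code);
    $neighbors = [];
    foreach ([-1, 0, 1] as $d1) {
        foreach ([-1, 0, 1] as $d2) {
            if ($d1 === 0 && $d2 === 0) {
                continue; // the square itself
            }
            $a = shiftBits($first, $d1);
            $b = shiftBits($second, $d2);
            if ($a === null || $b === null) {
                continue; // neighbor would fall off the grid
            }
            $merged = '';
            for ($i = 0; $i < strlen($a); $i++) {
                $merged .= $a[$i] . $b[$i];
            }
            $neighbors[] = $merged;
        }
    }
    return $neighbors;
}

print_r(neighborCodes('1001'));
```

For the red dot's square 1001 on the 4 x 4 sheet, several of the eight neighbors (0110 and 1100, for example) do not share the prefix 10, which is the boundary point problem in miniature.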

The GEO function problem of Redis

Our usual requirement is to find the points within N meters, so how do we map N meters to a Geohash code? Each Geohash character stands for 5 binary bits, so every character we drop coarsens the precision by a factor of 2^(5/2) (about 5.7) per direction.

One way out, of course, is to index the binary Geohash code directly, but such a long index key makes B-tree query efficiency drop quickly.

Solution

So we keep looking for a solution. Converting to Base32 gives poor control over precision, while keeping the raw binary makes the index key too long. Is there a balance between the number of binary bits and the index length?

Redis's sorted set supports a 64-bit double as the score. If we convert the binary Geohash code to a decimal number and store it as the score in a Redis sorted set, queries run with O(log N) efficiency.

To be honest, the first time I saw Redis's GEO functions my heart sank: a design I was rather proud of had already been implemented (though this happens a lot) ...

Of course I could not just leave it at that, so I reinvented the wheel in PHP ...

The main steps are as follows:

Code implementation

In my implementation I set the maximum precision of the Geohash to 26 bits per coordinate (52 bits in total), at which point the distance accuracy is about 0.3 m. That total also suits the score of a Redis sorted set: the score is a double, which represents integers exactly only up to 2^53, so a precision of 32 bits per coordinate (64 bits in total) would not survive as a score.

Storing the data:

Run the longitude and latitude through the Geohash algorithm to get the binary Geohash code, convert it to decimal, and store it in a Redis sorted set as the score of the point.

    // Core GeoHash method: takes a degree value (float) and its range; shared by longitude and latitude.
    public function getBits($loc, $range, $level = self::LEVEL_MAX) {
        $bits = '';
        for ($i = 0; $i < $level; $i++) {
            $mid = ($range['min'] + $range['max']) / 2;
            if ($loc < $mid) {
                // Lower half: emit 0 and keep [min, mid].
                $bits .= '0';
                $range = ['min' => $range['min'], 'max' => $mid];
            } else {
                // Upper half: emit 1 and keep [mid, max].
                $bits .= '1';
                $range = ['min' => $mid, 'max' => $range['max']];
            }
        }
        return $bits;
    }

In addition, PHP's bindec($bin_str) function quickly converts a binary string into a decimal number.
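Putting the storage step together, a sketch of the score calculation might look like the following. The names `axisBits` and `geoScore` are invented for this example, the latitude-first bit order is an assumption carried over from the worked example earlier, and the code assumes 64-bit PHP so that bindec() returns an integer for a 52-bit string.

```php
<?php
const LEVEL_MAX = 26; // bits per coordinate, the maximum precision chosen above

// Same halving procedure as getBits, written as a standalone function.
function axisBits(float $value, float $min, float $max, int $count): string
{
    $bits = '';
    for ($i = 0; $i < $count; $i++) {
        $mid = ($min + $max) / 2;
        if ($value < $mid) {
            $bits .= '0';
            $max = $mid;
        } else {
            $bits .= '1';
            $min = $mid;
        }
    }
    return $bits;
}

// Hypothetical helper: 52-bit merged code as a decimal score.
function geoScore(float $lng, float $lat): int
{
    $lngBits = axisBits($lng, -180.0, 180.0, LEVEL_MAX);
    $latBits = axisBits($lat, -90.0, 90.0, LEVEL_MAX);
    $merged = '';
    for ($i = 0; $i < LEVEL_MAX; $i++) {
        $merged .= $latBits[$i] . $lngBits[$i]; // assumed order: latitude first
    }
    return (int) bindec($merged);
}

$score = geoScore(116.276349, 40.040875);
echo $score, PHP_EOL;
// The score would then be written with the Redis command
//   ZADD <key> <score> <member>
// where the key and member names are up to the application.
```

The first 40 bits of this 52-bit score are the same 40-bit string as in the worked example, so a shallower level is always a prefix of a deeper one.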

Get precision based on query range radius

As mentioned above, precision is determined by how many times the map is divided. Divide too many times and the square is too small, so the query misses data; divide too few times and the square is too large, so we waste too much effort filtering data.

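    private function getLevel($range_meter) {
        $level = 0;
        $global = self::MERCATOR_LENGTH;
        // Keep halving only while the next square side would still be longer
        // than the query radius, so the final side is just longer than it.
        while ($global / 2 > $range_meter) {
            $global /= 2;
            $level++;
        }
        return $level;
    }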

The idea of the above code comes from the Redis Geo function source code, which is really ingenious.

Under the Mercator projection the Earth's surface can be viewed as a square whose side is the longest circumference of the Earth. From junior high school geography we know that "the Earth is a slightly flattened sphere that bulges slightly at the equator", so its longest circumference is the equator, and the side of the Mercator square is 2 * PI * R = 40075452.74 m.

So we keep halving one side of the square until the result is just longer than the query radius; the block of that size is the square we need.
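As a worked example of this halving rule, the sketch below prints the square side at two neighboring levels and picks the deepest level whose side is still longer than the radius. The helper names `cellSide` and `deepestLevelCovering` are hypothetical; only the constant 40075452.74 comes from the text.

```php
<?php
const MERCATOR_LENGTH = 40075452.74; // meters, 2 * PI * R from the text

// Side length of one square after $level halvings.
function cellSide(int $level): float
{
    return MERCATOR_LENGTH / (2 ** $level);
}

// Deepest level whose square side is still longer than the radius.
function deepestLevelCovering(float $radiusMeters, int $maxLevel = 26): int
{
    $level = 0;
    while ($level < $maxLevel && cellSide($level + 1) > $radiusMeters) {
        $level++;
    }
    return $level;
}

printf("%.1f\n", cellSide(15));           // 1223.0
printf("%.1f\n", cellSide(16));           // 611.5
echo deepestLevelCovering(1000.0), PHP_EOL; // 15
```

For a 1000 m radius the side is about 1223 m at level 15 and about 611.5 m at level 16, so level 15 is the last one whose square is longer than the radius.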

Data query

For a query we need the minimum score of the center square and the width of its score range. The minimum score is simple: pad the bits with trailing 0s until they are 52 bits long.

Furthermore, to avoid the boundary point problem, we also need the score ranges of the eight surrounding squares.

Each time we divide the map we add two bits, one for longitude and one for latitude, and at the highest precision the maximum and minimum scores of a square differ by 1. The difference between the maximum and minimum scores of a square at a given level therefore follows from the method below.

    private function getLevelRange($level) {
        $range = pow(2, 2 * (self::LEVEL_MAX - $level));
        return $range;
    }
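To make the padding and the range concrete, here is a small sketch that computes both for the 40-bit example string from earlier in the article (level 20). The function names are made up for this illustration, and 64-bit PHP is assumed.

```php
<?php
const LEVEL_MAX = 26; // bits per coordinate, 52 bits in total

// Smallest score inside a square: pad the merged prefix with trailing 0s to 52 bits.
function cellMinScore(string $prefixBits): int
{
    return (int) bindec(str_pad($prefixBits, 2 * LEVEL_MAX, '0', STR_PAD_RIGHT));
}

// Width of a square's score range at $level (same formula as getLevelRange).
function cellScoreSpan(int $level): int
{
    return 2 ** (2 * (LEVEL_MAX - $level));
}

$prefix = '1101101110000100111011100101110100111111'; // 40 bits, level 20
$min  = cellMinScore($prefix);
$span = cellScoreSpan(20); // 2^12 = 4096

// Every point in this square has a score in [$min, $min + $span - 1].
echo $min, ' ', $min + $span - 1, PHP_EOL;
```

ZRANGEBYSCORE treats both bounds as inclusive by default, so a query from `$min` to `$min + $span` also touches the first score of the next square; using `$min + $span - 1`, or an exclusive upper bound written with a leading `(`, avoids that.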

Using the solution to the boundary point problem described above, we obtain the minimum score of each of the eight surrounding squares.

Use the Redis command ZRANGEBYSCORE key hashInt hashInt+range to fetch all the points in the nine squares, then iterate over those points and filter out the ones that do not meet the distance requirement.
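The final filtering step is not spelled out in this article. One common way to do it is the haversine great-circle distance, sketched below as an assumption rather than as the method the original code uses. The function names, the candidate array layout and the mean Earth radius of 6371000 m are all choices made for this example.

```php
<?php
// Great-circle distance in meters (haversine formula).
// 6371000 m is an approximate mean Earth radius, assumed for this sketch.
function haversineMeters(float $lat1, float $lng1, float $lat2, float $lng2): float
{
    $radius = 6371000.0;
    $dLat = deg2rad($lat2 - $lat1);
    $dLng = deg2rad($lng2 - $lng1);
    $a = sin($dLat / 2) ** 2
        + cos(deg2rad($lat1)) * cos(deg2rad($lat2)) * sin($dLng / 2) ** 2;
    return 2 * $radius * asin(min(1.0, sqrt($a)));
}

// Keep only the candidates that are really within $radiusMeters of the center.
// Each candidate is assumed to be an array with 'lat' and 'lng' keys.
function filterByRadius(array $candidates, float $lat, float $lng, float $radiusMeters): array
{
    $kept = [];
    foreach ($candidates as $point) {
        if (haversineMeters($lat, $lng, $point['lat'], $point['lng']) <= $radiusMeters) {
            $kept[] = $point;
        }
    }
    return $kept;
}

$candidates = [
    ['id' => 'a', 'lat' => 40.040875, 'lng' => 116.276349],
    ['id' => 'b', 'lat' => 40.05,     'lng' => 116.276349], // roughly 1 km north
];
print_r(filterByRadius($candidates, 40.040875, 116.276349, 500.0));
```

With a 500 m radius only point a survives, even though both points could come back from the nine-square query.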

Summary

It took more than 10 hours to finish the Geohash implementation, and I now fully understand that Geohash is not as simple as I had imagined. Besides Geohash, quadtrees and R-trees are said to offer higher query efficiency; I will study them when I have time.


Reference:

Geohash Core Principle Analysis

Redis GEO Source Notes

Geohash length and accuracy comparison table (from Wikipedia):

    Geohash length   Lat bits   Lng bits   Lat error (deg)   Lng error (deg)   Error (km)
    1                2          3          ±23               ±23               ±2500
    2                5          5          ±2.8              ±5.6              ±630
    3                7          8          ±0.70             ±0.70             ±78
    4                10         10         ±0.087            ±0.18             ±20
    5                12         13         ±0.022            ±0.022            ±2.4
    6                15         15         ±0.0027           ±0.0055           ±0.61
    7                17         18         ±0.00068          ±0.00068          ±0.076
    8                20         20         ±0.000085         ±0.00017          ±0.019

Base32 encoding mapping table:

    Value  Symbol    Value  Symbol    Value  Symbol    Value  Symbol
    0      A         9      J         18     S         27     3
    1      B         10     K         19     T         28     4
    2      C         11     L         20     U         29     5
    3      D         12     M         21     V         30     6
    4      E         13     N         22     W         31     7
    5      F         14     O         23     X
    6      G         15     P         24     Y
    7      H         16     Q         25     Z
    8      I         17     R         26     2
