Geographic Queries on Google App Engine
Google App Engine is an interesting platform. It allows for you to run Python applications on Google’s platform, using BigTable based datasources as a backing store. Publishing applications to App Engine is trivial, and if you’ve written your application to support WSGI, porting to getting things running on App Engine is pretty easy.
However, there are some limitations placed on BigTable datastources that make doing geographic queries more difficult than it might otherwise be. Understanding of querying/indexing two-dimensional/geographical data is something that is typically a database extension — in Postgres, this is the PostGIS extension to the database, for example. BigTable has no inherent understanding of this type of data, so we have to drop to a more naive implementation. A first pass approximation at bounding box queries might be to create a four-tuple of columns in your datasource: left, bottom, right, top, and store the extent in that datasource. In a typical relational database, this would work reasonably well, though your database wouldn’t truly be indexing both dimensions at once: instead, you’d be using a single dimensional index, and then filtering the result as a second pass.
App Engine doesn’t allow this type of query, however: an application can’t perform inequality queries on more than one field at once. The naive implementation, of x < right and x > left, then, will fail: left and right are different fields, and you can’t do queries of that type.
So, the key is clearly to encode the bounding box of features in a way that embeds the data into a single string, which represents the bounding box accurately, but is also ‘ordered’ — such that a point represented by ‘123′ is near a point represented by ‘1234′.
The default response to this — for points — is to use Geohash (Wikipedia). Geohash embeds location information for a point into a single, ordered string, for which you can drop precision by simply dropping bits off the end of the string. For points, this actually works relatively well: a given point will be between two geohash values for the corners of a bounding box. So, if you have -71,42, and a Geohash of ‘drmwbmg7sp6yf’, and you do a query for -72,41 -> -70, 43, you have limits of ‘drkc1xg4q3uy8′ and ‘drwhx5gt1ubzk’. Clearly, the former is between the latter two: You can simply do a database query of (hash < 'drwhx5gt1ubzk' AND hash > ‘drkc1xg4q3uy8′), which App Engine has no problem with.
However, this doesn’t scale equally to non-point features. You can find the ‘union’ of two Geohashes by taking the bits that are the same from each and putting them together: in the case above, the bounding box of -72,41,-70,43 is the bitstring ‘01100101111′: however, because bboxes are not infinitely precise (/infinitely small) as points are, there is no way to assign precision to the string. This means that the ‘cutoff’ results in queries that don’t match up:
>>> a = str(geohash.Geostring((-72.01,40.99))) >>> b = str(geohash.Geostring((-69.99,43.01))) >>> c = str(geohash.Geostring((-72,41)) + geohash.Geostring((-70,43))) >>> c >= a False >>> c <= b True
Clearly, these should both return true: the bounding box of C is contained within the points, but because of the ‘infinite accuracy’ of points, this doesn’t actually work.
Instead, we need to add an additional mode to each of our bits: instead of just storing 0 or 1, we need to store an ‘inbetween’ bit to indicate that there is ‘no known data’. (This fixes the problem that the null string is less than 0.) In order to do sorting properly, we need this to be between ‘0′ and ‘1′, so we instead move to ‘0′, ‘2′, and ‘1′ for inbetween. We call this class the Geoindex class. For our ‘c’ above, this turns 01100101111 into:
0220020222211111111111111111111111111111111111111111111111111111
Now, we do the same test as before, with the new Geoindex class:
>>> a = str(geohash.Geoindex((-72.01,40.99))) >>> b = str(geohash.Geoindex((-69.99,43.01))) >>> c = str(geohash.Geoindex((-72,41)) + geohash.Geoindex((-70,43))) >>> c >= a True >>> c <= b True
Hooray! It worked as we expect.
Geohash here is a public-domain Python module created by Schuyler Erle, available from Mapping hacks Code: Geohash. Schuyler was the brainchild behind all the Geohash work: I just told him what did and didn’t work
In this way, I was able to put together a geographic bounding box query, on top of Google App Engine, using a Geohash-like algorithm as a storage format, and use that query to power a FeatureServer Demo App Engine application, doing geographic queries of non-point features on top of App Engine/BigTable. Simply create a Geoindex object of the bounding box of your feature, and then use lower-left/upper-right points as bounds for your Geohash when querying.
(A later post will detail how to set up your own copy of this App Engine app: I’m still working out a number of kinks with it.)
App Engine provides a number of interesting capabilities; for geographic applications, bounding box queries are very important, and using this solution, you can do queries against this type of data with some success.
May 28th, 2008 at 1:31 pm
The Geohash concept looks like it is isomorphic to the concept of the Z-order linearization of space.
This indexing scheme of “truncated geohashes” then is isomorphic to the long-established concept of using Quadtrees as a index structure.
Quadtrees is a viable spatial indexing technique for a planar space. There are issues with using it for indexing data on the sphere - it doesn’t handle objects which cross a pole, or which cross the data line.
May 28th, 2008 at 4:20 pm
Really, I only think in planar space anyway. Spheres are for losers!
May 29th, 2008 at 6:34 pm
I’m super-intrigued by all this stuff, but I’m not sure I’ve fully wrapped my head around it. How /does/ the bounding box query handle the equator, date-line and meridian?
May 29th, 2008 at 6:43 pm
Dateline isn’t handled at all: it’s a planar map, and the plane doesn’t cross the dateline.
Equator and Meridian are problematic. In the end, this is an approximation: In order to determine features which actually intersect, you have to do a post-query limit. (This is also true in PostGIS: You do a “the_geom && bbox AND ST_Intersect(the_geom, bbox)” to get features which actually intersect.) Features which cross a top-level boundary like the equator or dateline will cover the entire range of the world in that dimension. For this demo, this works out okay because such features are usually excluded in the other direction: you end up with (for example):
10101210where the ‘1111′ is the dimension in which the feature has crossed the top-level boundary of the meridian or equator. Then, because you query is looking only for features > ‘01′ and less than ‘22′ (for example), you won’t hit that feature.
This isn’t a perfect solution: it doesn’t push all the things into the datasource that you actually need. But it provides a first-order limit, which you can use to pull out a mostly-correct list that you can filter on.