Geographic Queries on Google App Engine
Wednesday, May 28th, 2008Google App Engine is an interesting platform. It allows for you to run Python applications on Google’s platform, using BigTable based datasources as a backing store. Publishing applications to App Engine is trivial, and if you’ve written your application to support WSGI, porting to getting things running on App Engine is pretty easy.
However, there are some limitations placed on BigTable datastources that make doing geographic queries more difficult than it might otherwise be. Understanding of querying/indexing two-dimensional/geographical data is something that is typically a database extension — in Postgres, this is the PostGIS extension to the database, for example. BigTable has no inherent understanding of this type of data, so we have to drop to a more naive implementation. A first pass approximation at bounding box queries might be to create a four-tuple of columns in your datasource: left, bottom, right, top, and store the extent in that datasource. In a typical relational database, this would work reasonably well, though your database wouldn’t truly be indexing both dimensions at once: instead, you’d be using a single dimensional index, and then filtering the result as a second pass.
App Engine doesn’t allow this type of query, however: an application can’t perform inequality queries on more than one field at once. The naive implementation, of x < right and x > left, then, will fail: left and right are different fields, and you can’t do queries of that type.
So, the key is clearly to encode the bounding box of features in a way that embeds the data into a single string, which represents the bounding box accurately, but is also ‘ordered’ — such that a point represented by ‘123′ is near a point represented by ‘1234′.
The default response to this — for points — is to use Geohash (Wikipedia). Geohash embeds location information for a point into a single, ordered string, for which you can drop precision by simply dropping bits off the end of the string. For points, this actually works relatively well: a given point will be between two geohash values for the corners of a bounding box. So, if you have -71,42, and a Geohash of ‘drmwbmg7sp6yf’, and you do a query for -72,41 -> -70, 43, you have limits of ‘drkc1xg4q3uy8′ and ‘drwhx5gt1ubzk’. Clearly, the former is between the latter two: You can simply do a database query of (hash < 'drwhx5gt1ubzk' AND hash > ‘drkc1xg4q3uy8′), which App Engine has no problem with.
However, this doesn’t scale equally to non-point features. You can find the ‘union’ of two Geohashes by taking the bits that are the same from each and putting them together: in the case above, the bounding box of -72,41,-70,43 is the bitstring ‘01100101111′: however, because bboxes are not infinitely precise (/infinitely small) as points are, there is no way to assign precision to the string. This means that the ‘cutoff’ results in queries that don’t match up:
>>> a = str(geohash.Geostring((-72.01,40.99))) >>> b = str(geohash.Geostring((-69.99,43.01))) >>> c = str(geohash.Geostring((-72,41)) + geohash.Geostring((-70,43))) >>> c >= a False >>> c <= b True
Clearly, these should both return true: the bounding box of C is contained within the points, but because of the ‘infinite accuracy’ of points, this doesn’t actually work.
Instead, we need to add an additional mode to each of our bits: instead of just storing 0 or 1, we need to store an ‘inbetween’ bit to indicate that there is ‘no known data’. (This fixes the problem that the null string is less than 0.) In order to do sorting properly, we need this to be between ‘0′ and ‘1′, so we instead move to ‘0′, ‘2′, and ‘1′ for inbetween. We call this class the Geoindex class. For our ‘c’ above, this turns 01100101111 into:
0220020222211111111111111111111111111111111111111111111111111111
Now, we do the same test as before, with the new Geoindex class:
>>> a = str(geohash.Geoindex((-72.01,40.99))) >>> b = str(geohash.Geoindex((-69.99,43.01))) >>> c = str(geohash.Geoindex((-72,41)) + geohash.Geoindex((-70,43))) >>> c >= a True >>> c <= b True
Hooray! It worked as we expect.
Geohash here is a public-domain Python module created by Schuyler Erle, available from Mapping hacks Code: Geohash. Schuyler was the brainchild behind all the Geohash work: I just told him what did and didn’t work
In this way, I was able to put together a geographic bounding box query, on top of Google App Engine, using a Geohash-like algorithm as a storage format, and use that query to power a FeatureServer Demo App Engine application, doing geographic queries of non-point features on top of App Engine/BigTable. Simply create a Geoindex object of the bounding box of your feature, and then use lower-left/upper-right points as bounds for your Geohash when querying.
(A later post will detail how to set up your own copy of this App Engine app: I’m still working out a number of kinks with it.)
App Engine provides a number of interesting capabilities; for geographic applications, bounding box queries are very important, and using this solution, you can do queries against this type of data with some success.