Using Rtree spatial indexing with OGR

Spatial indexing can significantly reduce the time required to perform geoprocessing operations involving intersection. This includes intersections, unions, dissolves, and nearest neighbour analysis.

Spatial indexing speeds up queries by reducing the number of features that need to be evaluated with computationally expensive geometic calculations. It does this by performing a simplified query using the bounding boxes (a.k.a., envelopes) of the features only. A bounding box can always be described by four parameters -- the x,y coordinates of the lower left and upper right corners -- and as such these first pass queries can be done very efficiently using regular database indexing methods.

The Rtree module provides a Python interface to the spatial indexing functions in libspatialindex. The Rtree tutorial is clear, succinct and worth reading. See also the Wikipedia page on R-trees.

This post gives examples of how Rtree can be used with both the OSGeo OGR Python module and the Fiona and Shapely modules to speed up spatial queries.

Using a spatial index with osgeo.ogr

In the code below two polygon shapefiles are opened: layer 1 (districts) and layer 2 (building areas). Each district may contain many buildings, and a building may be in two districts. A spatial index is then created for layer 1, by adding the bounding box and FID of each feature to an Rtree Index instance. When creating an Rtree index object setting the interleaved keyword argument to False allows you to specify bounding boxes in the same order that OGR outputs.

from osgeo import ogr, gdalconst
import rtree

filename1 = 'districts.shp'
filename2 = 'building_areas.shp'

if __name__ == '__main__':    
    ds1 = ogr.Open(filename1, gdalconst.GA_ReadOnly)
    ds2 = ogr.Open(filename2, gdalconst.GA_ReadOnly)
    layer1 = ds1.GetLayer()    
    layer2 = ds2.GetLayer()

    index = rtree.index.Index(interleaved=False)
    for fid1 in range(0, layer1.GetFeatureCount()):
        feature1 = layer1.GetFeature(fid1)
        geometry1 = feature1.GetGeometryRef()
        xmin, xmax, ymin, ymax = geometry1.GetEnvelope()
        index.insert(fid1, (xmin, xmax, ymin, ymax))

The code below uses the spatial index to speed up a search for all features in layer 1 that intersect each feature in layer 2 (i.e., which district(s) is a building in). The envelope of each feature in layer 2 is calculated and checked for intersection against the spatial index. A list of FIDs for features in layer 1 is returned, which can then be checked for intersection more precisely using the Intersects method.

    for fid2 in range(0, layer2.GetFeatureCount()):
        feature2 = layer2.GetFeature(fid2)
        geometry2 = feature2.GetGeometryRef()
        xmin, xmax, ymin, ymax = geometry2.GetEnvelope()
        for fid1 in list(index.intersection((xmin, xmax, ymin, ymax))):
            feature1 = layer1.GetFeature(fid1)
            geometry1 = feature1.GetGeometryRef()
            if geometry2.Intersects(geometry1):
                print '{} intersects {}'.format(fid2, fid1)

The code below uses the same spatial index to speed up a search for all features in layer 2 that are within 500 metres of each feature in layer 1. To do this the envelopes of features in layer 1 are expanded by 500 metres in each direction to create a "search envelope". The search envelope is then checked for intersections against the spatial index. The exact distance to each feature returned is then calculated exactly using the Distance method.

    distance = 500.0
    for fid1 in range(0, layer1.GetFeatureCount()):
        feature1 = layer1.GetFeature(fid1)
        geometry1 = feature1.GetGeometryRef()
        xmin, xmax, ymin, ymax = geometry1.GetEnvelope()
        search_envelope = (xmin-distance, xmax+distance, ymin-distance, ymax+distance)
        for fid2 in list(index.intersection(search_envelope)):
            feature2 = layer2.GetFeature(fid2)
            geometry2 = feature2.GetGeometryRef()
            if geometry1.Distance(geometry2) <= distance:
                print '{} is near {}'.format(fid1, fid2)

Using a spatial index with fiona and shapely

The same algorithm as described above can be achieved using fiona and shapely. You'll need a recent version of Fiona, as accessing features by FID was only added in #c16f5fb.

Note that the bounding box returned by shapely is in the form [xmin, ymin, xmax, ymax] and so the interleaved argument to the index can be omitted (the default is True).

import fiona
from shapely.geometry import shape
import rtree

filename1 = 'districts.shp'
filename2 = 'building_areas.shp'

if __name__ == '__main__':
    with fiona.open(filename1, 'r') as layer1:
        with fiona.open(filename2, 'r') as layer2:

            index = rtree.index.Index()
            for feat1 in layer1:
                fid = int(feat1['id'])
                geom1 = shape(feat1['geometry'])
                index.insert(fid, geom1.bounds)

            for feat2 in layer2:
                geom2 = shape(feat2['geometry'])
                for fid in list(index.intersection(geom2.bounds)):
                    feat1 = layer1[fid]
                    geom1 = shape(feat1['geometry'])
                    if geom1.intersects(geom2):
                        print '{} intersects {}'.format(feat2['id'], feat1['id'])

Multiple filesystem reads vs memory caching

The observant reader will haved noticed that the examples above re-request a feature each time it's bounding box matches a query, using feature = layer.GetFeature(fid) with ogr or feature = layer[fid] with fiona. An alternate approach to this is to cache the features added to the index in memory, using a dictionary:

features = {}
index = rtree.index.Index()
for feat1 in layer1:
    fid = int(feat1['id'])
    geom1 = shape(feat1['geometry'])
    index.insert(fid, geom1.bounds)
    features[fid] = (feat1, geom1,)

The feature can then be quickly retrieved from the dictionary, like so:

for fid in list(index.intersection(geom2.bounds)):
    feat1, geom1 = features[fid]

The choice of which approach to use will depend on the situation, but my preference is towards the multiple-reads approach. The in-memory approach has the potential to be faster, but you may hit memory limits when processing large datasets. The operating system should be doing some caching of filesystem accesses anyway, so the impact of multiple requests should be minimal.

Getting the most out of your index

To get the most out of a spatial index it is important to select the correct layer to build the index of. This will depend on the data you have - if it's a query you're likely to repeat (e.g. as part of an application which processes similar datasets for different areas) some trials should give you an idea of where the most savings are to be made.

Spatial indexing works best when the indexed features are small enough to only return a few matches at most for each query. The more features returned, the more expensive calculations required and the less benefit you get from using the spatial index. This behaviour can sometimes be improved by 'chopping up' larger features into several smaller features. This reduces the number of possible matches, and also reduces the complexity of the geometry used in the more expensive calculations. More on this is a later post...