Speedup Fiona with LRU caching

Posted on

At some point you’re going to care about how long it takes your geoprocessing algorithm to run. While the Fiona and Shapely libraries make efforts to be as efficient as possible, a lot will depend on the algorithms you use them in.

I’ve written before about how spatial indexing can speed up intersection and nearest neighbour queries. At the end of the previous article I mentioned a trade-off between speed and memory usage when the same features are requested more than once. Storing data in fast Random Access Memory (RAM) is quicker than reading it from disk each time, especially if there is some processing required to convert between the representation on disk and in memory. However it isn’t always practical to store an entire dataset in memory due to it’s size.

One solution to this problem is to only cache part of the dataset; but which part? One solution is to use a Least Recently Used (LRU) cache. An LRU cache has a limited size, discarding items that have been used least recently to make space for new items.

There are many implementations of LRU caching available for Python. I’ve had good experiences with lru-dict (available on PyPI or GitHub). Note that lru-dict is a C extension and therefore requires a working compiler to install.

The code below subclasses fiona.Collection to provide LRU caching of features when they are indexed individually (e.g. collection[42]). If the feature has already been accessed it is returned from the cache. Otherwise it is read from the collection normally and stored in the cache for later use. The cache isn’t used when iterating over the entire collection (e.g. for feature in collection: ... as this generally doesn’t benefit from this kind of caching. The size of the cache has been hard-coded to 200 items.

from lru import LRU

class CollectionLRU(fiona.Collection):
    """Wrapper for Fiona Collection that provides LRU read caching
    """
    def __init__(self, *args, **kwargs):
        fiona.Collection.__init__(self, *args, **kwargs)
        self.cache = LRU(200)
    def __getitem__(self, key):
        try:
            feature = self.cache[key]
        except KeyError:
            feature = fiona.Collection.__getitem__(self, key)
            self.cache[key] = feature
        return feature

To simplify opening collections with the CollectionLRU wrapper I wrote a modified version of fiona.open, which should act as a drop in replacement.

def fiona_cache_open(path, mode='r', driver=None, schema=None, crs=None,
                     encoding=None, layer=None, vfs=None, enabled_drivers=None,
                     crs_wkt=None):
    """Open a new Fiona collection with LRU read caching
    """
    assert(mode == 'r')
    path, vsi, archive = fiona.parse_paths(path, vfs)
    collection = CollectionLRU(path, mode, driver=driver,
            encoding=encoding, layer=layer, vsi=vsi, archive=archive,
            enabled_drivers=enabled_drivers)
    return collection

In terms of the speedup to expect from using LRU caching, it depends on the specifics of the algorithm and the dataset. I’ve recently applied it to a couple of my projects which perform intersects on datasets with a large number of a small features and seen a 30% reduction in processing time. YMMV.