Pickling Cython classes

At the time of writing, automatic pickle support in Cython was still a pending feature (later releases generate it for cdef classes whose attributes can all be converted to Python objects, which excludes raw pointers). To support pickling of cdef classes yourself, you must implement the pickle protocol by defining the __getstate__ and __setstate__ methods. Although the official documentation is quite clear, it lacks a simple example and does not explain how to handle objects that can’t be pickled directly.

A minimal example is given below for the Person class which stores a name (string) and age (integer).

cdef class Person:
    cdef public str name
    cdef public int age

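    def __init__(self, name, age):
        # Runs on normal construction only, not during unpickling
        self.name = name
        self.age = age
        print('Person.__init__')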

    def __getstate__(self):
        return (self.name, self.age,)

    def __setstate__(self, state):
        name, age = state
        self.name = name
        self.age = age

The __getstate__ method returns an object – in this case, a tuple – which represents the state of the instance and is pickled instead of the contents of the instance’s __dict__ (which is not defined in this class). The __setstate__ method receives the state object and applies it to the instance. Note that the __init__ method of the instance is not called during unpickling.
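
The same protocol can be tried out in plain Python without compiling anything. The sketch below is a pure-Python stand-in for the Person class, not the Cython extension type itself (so, unlike the cdef class, it does have a __dict__). The init_calls counter is a hypothetical addition, used only to show that __init__ is skipped during unpickling.

```python
import pickle


class Person:
    # Hypothetical counter, only here to observe calls to __init__
    init_calls = 0

    def __init__(self, name, age):
        Person.init_calls += 1
        self.name = name
        self.age = age

    def __getstate__(self):
        # The state object that gets pickled in place of __dict__
        return (self.name, self.age)

    def __setstate__(self, state):
        # Applied to a freshly created instance; __init__ has not run
        self.name, self.age = state


dave = Person("Dave", 30)
restored = pickle.loads(pickle.dumps(dave))
```

After the round trip, restored carries the same name and age as dave, while Person.init_calls is still 1: the state was applied by __setstate__ alone.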

The example above is simple because the string and integer objects in the state tuple can be serialized automatically. But what about more complex structures, such as a malloc’ed array of structs with a variable length?

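The next example achieves this by serializing the array to a Python bytes object, which can be pickled. This is done by casting the _data pointer to char* (a free operation) and slicing it to the array’s length in bytes, which yields a bytes object (Cython generates a call to PyBytes_FromStringAndSize for this). Deserialization is done by casting the bytes object back to char* (which gives a pointer to the object’s internal buffer, much as PyBytes_AsString does) and then using memcpy to copy the data into a newly allocated array. The array is exposed to Python as a list of (gears, price) tuples using a Cython property. Because the bytes are a raw copy of the structs in memory, the pickle is only portable between builds that share the same struct layout and byte order.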
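
The round trip through bytes can also be tried from plain Python. The sketch below uses the standard-library ctypes module as a stand-in for the Cython code that follows: ctypes.string_at plays the role of the char* slice and ctypes.memmove the role of memcpy. The helper names array_to_bytes and bytes_to_array are made up for this illustration.

```python
import ctypes
import pickle


class Bicycle(ctypes.Structure):
    # Mirrors the C struct used in the Cython example
    _fields_ = [("gears", ctypes.c_int), ("price", ctypes.c_double)]


def array_to_bytes(arr):
    """Copy the raw memory of a ctypes array into a bytes object."""
    return ctypes.string_at(ctypes.addressof(arr), ctypes.sizeof(arr))


def bytes_to_array(data, size):
    """Allocate a new array of `size` structs and copy `data` into it."""
    nbytes = ctypes.sizeof(Bicycle) * size
    if len(data) != nbytes:
        raise ValueError("data length does not match size")
    arr = (Bicycle * size)()
    ctypes.memmove(arr, data, nbytes)
    return arr


bikes = (Bicycle * 3)(Bicycle(1, 50.0), Bicycle(7, 199.0), Bicycle(21, 399.0))

# bytes objects pickle without extra work, so the state is (bytes, size)
state = pickle.loads(pickle.dumps((array_to_bytes(bikes), len(bikes))))
copy = bytes_to_array(*state)
restored = [(b.gears, b.price) for b in copy]
```

The length check in bytes_to_array is a deliberate extra: copying more bytes than the source holds would read past the end of the buffer, which is just as true for memcpy in the Cython version.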

from cpython.mem cimport PyMem_Malloc, PyMem_Free
from libc.string cimport memcpy

cdef struct Bicycle:
    int gears
    double price

cdef class MyClass:
    cdef Bicycle *_data
    cdef long size

    def __init__(self):
        print('MyClass.__init__')

    cpdef bytes get_data(self):
        """Serializes array to a bytes object"""
        if self._data == NULL:
            return None
        return <bytes>(<char *>self._data)[:sizeof(Bicycle) * self.size]

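    cpdef set_data(self, bytes data, long size):
        """Deserializes a bytes object to an array"""
        # Declared without a `void` return type so exceptions reach the caller
        cdef Py_ssize_t nbytes = <Py_ssize_t>sizeof(Bicycle) * size
        if data is None:
            data = b''  # get_data() returns None for an empty instance
        if size < 0 or len(data) != nbytes:
            raise ValueError('data length does not match size')
        PyMem_Free(self._data)
        self.size = 0
        self._data = <Bicycle*>PyMem_Malloc(nbytes)
        if not self._data:
            raise MemoryError()
        memcpy(self._data, <char *>data, nbytes)
        self.size = size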

    property data:
        """Python interface to array"""
        def __get__(self):
            return [(self._data[i].gears, self._data[i].price)
                    for i in range(0, self.size)]
        def __set__(self, values):
            # Free any existing array first, otherwise it would leak
            PyMem_Free(self._data)
            self.size = len(values)
            self._data = <Bicycle*>PyMem_Malloc(sizeof(Bicycle) * self.size)
            if not self._data:
                raise MemoryError()
            for i, (gears, price) in enumerate(values):
                self._data[i].gears = gears
                self._data[i].price = price

    def __getstate__(self):
        return (self.get_data(), self.size)

    def __setstate__(self, state):
        self.set_data(*state)

    def __dealloc__(self):
        PyMem_Free(self._data)

PyMem_Malloc and PyMem_Free are used instead of malloc and free, as recommended in the Cython documentation on memory allocation. We need to keep track of the array’s size ourselves, because a C pointer carries no length information and the size cannot be retrieved from the array later [reference].

The class can be subclassed without any difficulty. The MySubclass class below adds a new method (get_average_price) and the owner attribute which stores an instance of Person. The owner attribute needs to be added to the instance’s state. This is done by concatenating it with the state tuple returned by the superclass. As the Person class already implements the pickle protocol for itself it can be added directly to the state tuple.

cdef class MySubclass(MyClass):
    cdef public Person owner

    cpdef get_average_price(self):
        # Returns a Python object so that None can signal an empty array
        cdef double total = 0
        cdef long i
        if not self.size:
            return None
        for i in range(self.size):
            total += self._data[i].price
        return total / self.size

    def __getstate__(self):
        state = super(MySubclass, self).__getstate__()
        state = state + (self.owner,)
        return state

    def __setstate__(self, state):
        self.owner = state[-1]
        super(MySubclass, self).__setstate__(state[:-1])

The code below demonstrates the creation, pickling and unpickling of the classes.

import pickle
from example import Person, MySubclass

# create a new instance of Person
dave = Person(name="Dave", age=30)

# pickle the person
d = pickle.dumps(dave)
del dave

# unpickle the person
dave = pickle.loads(d)
assert dave.name == "Dave"
assert dave.age == 30

# create a new instance of MySubclass
c = MySubclass()
data = [(1, 50.0), (7, 199.0), (21, 399.0)]
c.data = data
c.owner = dave
assert c.data == data

# pickle the instance
d = pickle.dumps(c)
del c

# unpickle the instance
c = pickle.loads(d)
assert type(c) is MySubclass
assert c.data == data
assert c.owner.name == "Dave"
assert c.get_average_price() == 216.0

Also note that if you only need arrays of standard C types (doubles, integers, etc.), NumPy arrays already support the pickle protocol. In fact, they use the same approach of copying the data into a bytes object.
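
As a rough illustration of that alternative, the sketch below stores the same records in a NumPy structured array and pickles it directly. The name bicycle_dtype is invented for this example, and the field types (int32, float64) are an assumption about how the C struct would map.

```python
import pickle

import numpy as np

# Hypothetical dtype standing in for the Bicycle struct
bicycle_dtype = np.dtype([("gears", np.int32), ("price", np.float64)])

bikes = np.array([(1, 50.0), (7, 199.0), (21, 399.0)], dtype=bicycle_dtype)

# ndarray implements the pickle protocol itself, so no extra code is needed
restored = pickle.loads(pickle.dumps(bikes))

average_price = restored["price"].mean()
```

The array also tracks its own length, so there is no separate size attribute to keep in sync.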