Backing SQLAlchemy models with S3

19Jul09

Relational databases, and the object-relational mapping layers which abstract them, are not particularly well suited to storing large blobs of data: images, videos, compressed files and so on.

Far better than streaming megabytes of binary to the database is to instead keep a reference into a separate store, better suited to the task of saving and serving files.

At WebMynd, we use SQLAlchemy as our ORM and Amazon’s Simple Storage Service (S3) to store our files. We’ve used Boto to build a convenient, transparent way to attach file content to a SQLAlchemy model, with the data itself residing in S3. These files can then be served directly from S3, decreasing database size and I/O load, and potentially reducing bandwidth costs.

Transparent changes to file content

Suppose the objects we want backed by S3 have a content attribute, which holds the file body itself. What we’re aiming for is to be able to do something like:

file = session.query(File).get(file_id)
file.content = "new content"
session.save_or_update(file)
session.flush()

This can be achieved by creating a property on the SQLAlchemy model class:

    # Assumes boto is imported, Key is imported from boto.s3.key, and aws_id,
    # aws_key and s3_bucket are defined at module level.
    def _set_content(self, cont):
        s3     = boto.connect_s3(aws_id, aws_key)
        bucket = s3.get_bucket(s3_bucket)
        key    = bucket.get_key(self.key)
        if not key:
            # first write for this object: create the key under its name
            key = Key(bucket=bucket, name=self.key)
        key.set_contents_from_string(cont)
        # if you want to serve files directly from S3:
        key.make_public()

    def _get_content(self):
        s3     = boto.connect_s3(aws_id, aws_key)
        bucket = s3.get_bucket(s3_bucket)
        key    = bucket.get_key(self.key)
        if not key:
            pass # complain: the row exists in the database but not in S3
        else:
            return key.get_contents_as_string()

    content = property(_get_content, _set_content)
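
For completeness, the property above assumes a mapped File class with a key column holding the S3 key name, plus aws_id, aws_key and s3_bucket defined somewhere in scope. A minimal sketch of that scaffolding might look like the following; the table name, column names and uuid-based key scheme are illustrative assumptions rather than our actual schema:

# Illustrative scaffolding only: the table and column names, the uuid-based key
# scheme and the placeholder AWS settings are assumptions, not our actual setup.
import uuid

import boto                          # used by _get_content / _set_content above
from boto.s3.key import Key          # likewise
from sqlalchemy import Table, Column, Integer, String, MetaData

aws_id    = 'YOUR-ACCESS-KEY-ID'
aws_key   = 'YOUR-SECRET-ACCESS-KEY'
s3_bucket = 'your-bucket-name'

metadata = MetaData()

# only a reference to the S3 object is stored; the file body itself lives in S3
file_table = Table('files', metadata,
    Column('id', Integer, primary_key=True),
    Column('key', String(36), nullable=False, unique=True),
)

class File(object):
    def __init__(self):
        # a stable, unique S3 key name for this row
        self.key = str(uuid.uuid4())

    # _set_content, _get_content and the content property from above go here;
    # the mapper() call, with the cleanup extension, appears in the next section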

Cleaning up S3 artifacts

The task of keeping S3 synchronised with the database state seems like it would be awkward, perhaps involving database triggers and queues of reconciliation tasks. I was pleasantly surprised to find that SQLAlchemy has an excellent MapperExtension class, which gives you a bunch of hooks to hang custom code off. For example, to delete an S3 key when a SQLAlchemy File object is deleted, you would do something like:

import boto
from sqlalchemy import orm
from sqlalchemy.orm import MapperExtension, mapper

class CleanupS3(MapperExtension):
    def after_delete(self, mapper, conn, inst):
        # fires after the DELETE for this instance has been emitted
        s3     = boto.connect_s3(aws_id, aws_key)
        bucket = s3.get_bucket(s3_bucket)
        key    = bucket.get_key(inst.key)
        if key:
            key.delete()
        else:
            pass # complain: the S3 object was already missing
        return orm.EXT_CONTINUE

mapper(File, file_table, extension=CleanupS3())
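
With that in place, deleting the mapped object is all it takes: when the session flushes the DELETE, after_delete fires and the S3 key goes with it. A quick, hypothetical usage example:

doomed = session.query(File).get(file_id)
session.delete(doomed)
session.flush()  # the row and its S3 object are both removed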

A script with a working example can be found here. It requires Boto, SQLAlchemy and some AWS configuration. In real-world usage you’d want more error checking and handling of MIME types, and you might choose to stream the file content in with Boto’s set_contents_from_file method. You’ll also notice that we connect to S3 on every method invocation; if your file content changes frequently, reusing or pooling Boto connections might help performance.
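
As a rough sketch of those last two points (an assumption-laden sketch, not something we’ve benchmarked), the setter could reuse a single module-level connection and accept a file-like object, streaming it with set_contents_from_file rather than buffering a string:

# Sketch only: reuse one S3 connection and stream content from a file-like
# object. Assumes aws_id, aws_key and s3_bucket as before; thread safety of
# the shared connection is not addressed here.
import boto
from boto.s3.key import Key

_s3 = None

def _get_bucket():
    global _s3
    if _s3 is None:
        _s3 = boto.connect_s3(aws_id, aws_key)   # connect once, then reuse
    return _s3.get_bucket(s3_bucket)

# drop-in replacement for the _set_content method above, taking a file object
def _set_content(self, fileobj):
    bucket = _get_bucket()
    key    = bucket.get_key(self.key)
    if not key:
        key = Key(bucket=bucket, name=self.key)
    key.set_contents_from_file(fileobj)   # streams rather than holding a string
    key.make_public()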
