Most software projects start with a nice, clean, compartmentalised architecture, whether real or imagined. As implementation progresses, the lines between components tend to blur as unforeseen dependencies emerge and edge cases are dealt with.

However, by the time it comes to deployment, you’ll probably still have a number of separate packages, with some (hopefully acyclic) dependency graph binding them together.

At WebMynd, the web tier runs on Turbogears, a Python web framework. Turbogears is by its nature very modular, with various options for “plugging in” alternative tools and extensions, which has led us to be quite modular with our own code.

Dependencies between these packages are managed via the install_requires setuptools parameter, e.g.:

        "MiniMock >= 1.2.2",
        "Boto >= 1.5",

Here, the “WM…” packages are internal, and we don’t really want to share them on PyPI. So how best to get them installed onto the machines where they’re required?

One option is to grab the source code directly, then build and install it into place. Even if your code is in a DVCS, this process can be complex, and you’re going to have to store somewhere the URLs and/or levels that each package depends on from the others. But this information is already encoded in a much more concise and flexible way: the install_requires declarations!

We’ve found it convenient to take advantage of this version-controlled dependency graph by hosting our own little package index internally. It’s nice and easy: all that’s required is some easy_install configuration like this:

find_links =

Our internal_server is only accessible from a restricted set of IPs, but you could use other security measures – I’ve just tried basic HTTP authentication and it works: just prepend username:password@ to the domain.

There are a few places you can put this configuration, but we include it in setup.cfg in all our packages, so that install dependencies just take care of themselves, with no hassle and no changes required on the machines. Installing a package is as simple as:

easy_install WMWebTier

Rather than making sure that the right source is pulled down on the right machine at the right time, now you can safely push all your good builds up onto your internal package index and trust that the client selects the right one. You’ve already encoded dependencies in your package metadata – relax and let easy_install do the hard work for you!
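To make the "client selects the right one" step concrete, here is a toy sketch of the idea: given a requirement string and the versions an index offers, pick the newest version that satisfies it. This is not easy_install's actual algorithm; the helper names and the contents of the example index are assumptions made purely for illustration, and only the ">=" and "==" operators are handled.

```python
import re


def parse_version(text):
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def parse_requirement(requirement):
    """Split 'MiniMock >= 1.2.2' into ('MiniMock', '>=', (1, 2, 2))."""
    match = re.match(r"^\s*([A-Za-z0-9_.-]+)\s*(>=|==)\s*([0-9.]+)\s*$", requirement)
    if match is None:
        raise ValueError("unsupported requirement: %r" % requirement)
    name, operator, version = match.groups()
    return name, operator, parse_version(version)


def best_match(requirement, available):
    """Return the newest available version satisfying the requirement, or None."""
    name, operator, wanted = parse_requirement(requirement)
    candidates = [parse_version(v) for v in available.get(name, [])]
    if operator == ">=":
        candidates = [v for v in candidates if v >= wanted]
    else:
        candidates = [v for v in candidates if v == wanted]
    if not candidates:
        return None
    return ".".join(str(part) for part in max(candidates))


# Hypothetical contents of a package index.
index = {"MiniMock": ["1.0", "1.2.2", "1.2.3"], "Boto": ["1.4", "1.5"]}

minimock_choice = best_match("MiniMock >= 1.2.2", index)
boto_choice = best_match("Boto >= 1.5", index)
missing_choice = best_match("Boto >= 2.0", index)
```

Because the version constraints live in the package metadata, pushing a newer good build to the index is enough for the next install to pick it up.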

The seamless integration of doctest, Nose, Sphinx and MiniMock means that taking a little more time to write your Python doc strings can give you testable documentation, full of examples, in HTML or LaTeX markup, and main-line unit test coverage “for free”.

The bon mariage between these agile tools has worked so well for us that when it came to extending test coverage up to 100% using a full Nose test suite, we were really pining for the painless mock objects that MiniMock gives you.

MiniMock works by printing out your code’s actual usage of mock objects so that it can be compared with the expected usage you specify in the doc string. For example, this function reads a URL and writes it to a file-like object:

import urllib
def write_url(url, out_file):
    """
        >>> from minimock import mock, Mock
        >>> mock('urllib.urlopen', returns=Mock('urlopen_result'))
        >>> write_url('', Mock('out_file'))      #doctest: +ELLIPSIS
        Called urllib.urlopen('')
        Called urlopen_result.read()
        Called out_file.write(None)
        <Mock ... out_file>
    """
    # Fetch the page, then copy its body into the file-like object.
    page_content = urllib.urlopen(url)
    out_file.write(page_content.read())
    return out_file

The supplied doctest shows a couple of different mocking methods, and also doctest’s invaluable ELLIPSIS option, which allows for fuzzy matching of the expected output.
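The printing mechanism is simple enough to imitate with the standard library alone. The sketch below is a toy stand-in, not MiniMock's implementation: the PrintingMock class and its behaviour are assumptions for illustration. It writes a "Called ..." line for every call, and because the destination is a file-like object, the trace can be collected in memory rather than on the console.

```python
import io


class PrintingMock(object):
    """Toy mock that reports every call made on it (illustrative only)."""

    def __init__(self, name, out, returns=None):
        self._name = name
        self._out = out
        self._returns = returns

    def __getattr__(self, attr):
        # Only reached for attributes that do not exist yet.
        if attr.startswith("_"):
            raise AttributeError(attr)
        child = PrintingMock("%s.%s" % (self._name, attr), self._out)
        setattr(self, attr, child)
        return child

    def __call__(self, *args):
        rendered = ", ".join(repr(arg) for arg in args)
        self._out.write("Called %s(%s)\n" % (self._name, rendered))
        return self._returns


trace = io.StringIO()
fake_file = PrintingMock("out_file", trace)
fake_file.write("hello")
fake_file.close()

recorded = trace.getvalue()
```

Comparing the recorded text with the text you expected is then all a test has to do.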

When writing unit tests for this method, rather than a single simple doctest, there are two problems.

  1. there’s no convenient way to track the usage of MiniMock-ed objects
  2. the fuzzy matching tools in doctest aren’t particularly conveniently exposed for unit test usage

Tracking MiniMock usage

To track the usage of mocked objects, we subclass minimock.Printer to store the console output in a StringIO object, rather than printing it to sys.stdout:

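import doctest
from StringIO import StringIO   # Python 2, to match urllib.urlopen above

from minimock import Printer

class TraceTracker(Printer):
    def __init__(self, *args, **kw):
        # Collect MiniMock's messages here instead of printing them.
        self.out = StringIO()
        super(TraceTracker, self).__init__(self.out, *args, **kw)
        self.checker = doctest.OutputChecker()
        self.options =  doctest.ELLIPSIS
        self.options |= doctest.NORMALIZE_WHITESPACE
        self.options |= doctest.REPORT_UDIFF

    def check(self, want):
        return self.checker.check_output(want, self.dump(),
            optionflags=self.options)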

    def diff(self, want):
        return self.checker.output_difference(doctest.Example("", want),
            self.dump(), optionflags=self.options)

    def dump(self):
        return self.out.getvalue()

The check() method uses doctest’s OutputChecker to compare the observed and expected mock usage, while diff() returns a human-readable comparison of the observed and expected mock usage.

The basic idea is to store up the messages MiniMock would have printed in a convenient container, and provide some utilities to interrogate those messages.
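The doctest machinery that check() and diff() lean on can be tried on its own, without MiniMock installed. This is a minimal sketch using only the standard library; the observed trace is typed in by hand here, whereas in the class above it would come from dump().

```python
import doctest

checker = doctest.OutputChecker()
flags = doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE

# A trace as a tracker might have recorded it (written by hand for this sketch).
observed = "Called urllib.urlopen('')\nCalled out_file.write(None)\n"

# "..." stands for any text, and runs of whitespace are treated as equal.
loose_want = "Called urllib.urlopen(...)   Called out_file.write(None)"
strict_want = "Called out_file.close()"

loose_ok = checker.check_output(loose_want, observed, flags)
strict_ok = checker.check_output(strict_want, observed, flags)

# A readable report for the failing comparison.
report = checker.output_difference(
    doctest.Example("", strict_want), observed, flags)
```

The first comparison passes thanks to the fuzzy matching; the second fails, and the report shows both the expected and the observed text.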

Matching MiniMock usage

The TraceTracker class shown above already gives us all the functionality we need – all that is required is a convenient utility function:

def assert_same_trace(tracker, want):
    assert tracker.check(want), tracker.diff(want)

This function allows us to check the mock objects are being used as we expected, and prints out a human-readable diff of the expected and observed usage if applicable.

Usage Example

As a concrete example, I’ll convert the doctest for the write_url function to a Nose-style unit test:

def test_write_url():
    tt = TraceTracker()
    mock('urllib.urlopen', returns=Mock('urlopen_result', tracker=tt), tracker=tt)
    write_url('', Mock('out_file', tracker=tt))

    expected_output = """Called urllib.urlopen('')
Called urlopen_result.read()
Called out_file.write(None)"""
    assert_same_trace(tt, expected_output)

The definition of the expected MiniMock usage (called expected_output here) can feel a little clunky, but in our experience, these definitions are quite often common between test cases, so can be defined once and shared.

MiniMock is great for quickly faking out fairly complex external dependencies, with little, if any, compromise on the rigour of your tests. By adapting its usage for unit tests, as described here, you can have all that convenience and power in your more exhaustive test suites.

The code given above is available as MiniMockUnit on PyPI.

Starting with Sphinx version 0.5, you can now control and launch your documentation builds from within the warm fuzzy world of setuptools!


Run

python setup.py --help-commands

inside your setuptools project, and if you see a build_sphinx target in the “Extra commands” section, you’re in luck.

The Sphinx build can be configured from your setup.cfg in the same directory. Here are the available options (taken from here):

fresh-env: Discard saved environment
all-files: Build all files
source-dir: Source directory
build-dir: Build directory
builder: The builder to use. Defaults to “html”

For reference, here’s the relevant part of setup.cfg from one of our projects:

[build_sphinx]
source-dir = docs/source
build-dir  = docs/build
all_files  = 1

Note the lack of quotes around the directories – I found that including quotes confused the command.
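One plausible explanation for the quoting problem, assuming setup.cfg is read the way the standard library's INI parser reads files: quotes are not treated as string delimiters, so they become part of the value. The sketch below shows that behaviour with configparser on an in-memory file; it does not run Sphinx or setuptools.

```python
import configparser

# One value written plainly, one written with quotes, for comparison.
SETUP_CFG = '''
[build_sphinx]
source-dir = docs/source
build-dir = "docs/build"
'''

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

unquoted = parser.get("build_sphinx", "source-dir")
quoted = parser.get("build_sphinx", "build-dir")
```

The quoted value comes back with its quote characters still attached, which would make it a different (and probably non-existent) directory name.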

For large bodies of code, configuration can become fragmented and messy extremely quickly unless you’re very careful; little features like this can really help centralise your configuration, and keep you sane. Pre-requisites, source/binary distributions, unit-tests, documentation and distribution to PyPI all configured through one tool? Yes please!

We use ConfigObj configuration files pretty extensively at WebMynd; it would be nice to use the ConfigParser module available in Python’s standard library, but the extra features ConfigObj has, such as lists, multi-line strings and nested sections, make it hard to say no to the richer library…

Unfortunately, TextMate doesn’t come with support for ConfigObj syntax, but the editor’s excellent Bundle Editor allowed me to fix that pretty easily.

Here is an example ConfigObj file as I see it in TextMate, with two different “Font & Color” schemes:

TextMate language definitions use regular expressions to categorise text in a file (into keywords, constants, variables and so on). The regexes I’ve put together for this ConfigObj bundle are somewhat fragile – if you try to break it you probably will.

However, it should be good enough for the majority of configuration in the majority of files. As an added bonus, ConfigObj syntax is a superset of INI syntax, so you get the full poly-chromatic experience in .cfg and .ini files alike!

If you’re a TextMate user, download this file, unzip it and double-click on ConfigObj.tmbundle.

Scaling on EC2


Like any application developed for a platform, the success of a Firefox Add-on is closely tied to the popularity and distribution you get from the underlying delivery mechanism. So, when we honed down the WebMynd feature set, improving the product enough to get on Mozilla’s Recommended List, we were delighted by our increasing user numbers. A couple of weeks later, Firefox 3 was released, and we got a usage graph like this:

[Figure: WebMynd usage statistics]

With a product like WebMynd, where part of the service we provide is to save and index a person’s web history, this sort of explosive expansion brings with it some growing pains. Performance was a constant battle for us, even with the relatively low user numbers of the first few months. This was due mainly to some poor technology choices; thankfully, the underlying architecture we chose from the start has proven to be sound.

I would not say that we have completely solved the difficult problem in front of us – we are still not content with the responsiveness of our service, and we’re open about the brown-outs we still sometimes experience – but we have made huge progress and learned some invaluable lessons over the last few months.

What follows is a high level overview of some of the conclusions we’ve arrived at today, best practices that work for us and some things to avoid. In later weeks, I plan to follow up with deeper dives into certain parts of our infrastructure as and when I get a chance!

Scaling is all about removing bottlenecks

This sounds obvious, but should strongly influence all your technology and architecture decisions.

Being able to remove bottlenecks means you need to be able to swap out discrete parts which aren’t performing well enough, and swap in bigger, faster, better parts which will perform as required. This will move the bottleneck somewhere else, at which point you need to swap out discrete parts which aren’t performing well enough, and swap in bigger, faster, better parts… well you get the idea. This cycle can be repeated ad infinitum until you’ve optimised the heck out of everything and you’re just throwing machines at the problem.

At WebMynd, for our search backend, we’ve done this four or five times already in the five months we’ve been alive, and I think I still have some iterations left in me. Importantly, I wouldn’t say that any of these iterations were a mistake. In a parallel to the Y Combinator ethos of launching a product early, scaling should be an iterative process with as close a feedback loop as possible. Premature optimisation of any part of the service is a waste of time and is often harmful.

Scaling relies on having discrete pieces with clean interfaces, which can be iteratively improved.

Horizontal is better than vertical

One of the reasons Google triumphed in the search engine wars was that their core technology was designed from the ground up to scale horizontally across cheap hardware. Compare this with their competitors’ approach, which was in general to scale vertically – using larger and larger monolithic machines glued together organically. Other search engines relied on improving hardware to cope with demand, but when the growth of the internet outstripped available hardware, they had nowhere to go. Google was using inferior pieces of hardware, but had an architecture and infrastructure allowing for cheap and virtually limitless scaling.

Google’s key breakthroughs were the Google File System and MapReduce, which together allow them to horizontally partition the problem of indexing the web. If you can architect your product in such a way as to allow for similar partitioning, scaling will be all the more easy. It’s interesting to note that some of the current trends of Web2.0 products are extremely hard to horizontally partition, due to the hyper-connectedness of the user graph (witness Twitter).

The problem WebMynd is tackling is embarrassingly partitionable. Users have their individual slice of web history, and these slices can be moved around the available hardware at will. New users equals new servers.
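As a rough illustration of this kind of partitioning, the sketch below keeps a lookup table from each user to the server holding their slice. It is not WebMynd's implementation: the UserDirectory class, the server names and the capacity figure are all hypothetical.

```python
class UserDirectory(object):
    """Hypothetical table mapping each user to the server holding their slice."""

    def __init__(self, users_per_server):
        self.users_per_server = users_per_server
        self.servers = {}      # server name -> set of user ids
        self.location = {}     # user id -> server name

    def add_server(self, name):
        self.servers.setdefault(name, set())

    def assign(self, user_id):
        """Place a new user on the emptiest server that still has room."""
        if user_id in self.location:
            return self.location[user_id]
        open_servers = [name for name, users in self.servers.items()
                        if len(users) < self.users_per_server]
        if not open_servers:
            raise RuntimeError("all servers full: add another one")
        target = min(open_servers,
                     key=lambda name: (len(self.servers[name]), name))
        self.servers[target].add(user_id)
        self.location[user_id] = target
        return target

    def move(self, user_id, target):
        """Relocate one user's slice; no other user is affected."""
        self.servers[self.location[user_id]].discard(user_id)
        self.servers[target].add(user_id)
        self.location[user_id] = target


directory = UserDirectory(users_per_server=2)
directory.add_server("search-1")
placements = [directory.assign(user) for user in ("ann", "bob")]
directory.add_server("search-2")          # new users, new server
placements.append(directory.assign("cat"))
directory.move("ann", "search-2")         # slices can be moved at will
```

Because no user's data depends on any other user's, adding capacity only ever touches the table and the slices being placed or moved.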

Hardware is the lowest common denominator

By running your application on virtual machines using EC2, you are viewing the hardware you’re running on as a commodity which can be swapped in and out at the click of a button. This is a useful mental model to have, where the actual machine images you’re running on are just another component in your architecture which can be scaled up or down as demand requires. Obviously, if you’re planning on scaling horizontally, you need to be building on a substrate which has low marginal cost for creating and destroying hardware – marginal cost in terms of time, effort and capex.

A real example

To put the above assertions into context, I’ll use WebMynd’s current architecture:

[Figure: WebMynd architecture]

The rectangles represent EC2 instances. Their colour represents their function. The red arrow in the top right represents incoming traffic. Other arrows represent connectedness and flows of information.

This is a simplified example, but here’s what the pieces do in general terms:

  • All traffic is currently load balanced by a single HAProxy instance
  • All static content is served from a single nginx instance (with a hot failover ready)
  • Sessions are distributed fairly across lots of TurboGears application servers, on several machines
  • The database is a remote MySQL instance
  • Search engine updates are handled asynchronously through a queue
  • Search engine queries are handled synchronously over a direct TurboGears / Solr connection (not shown)

One shouldn’t be timid in trying new things to find the best solution; almost all of these parts have been iterated on like crazy. For example, we’ve used Apache with mod_python, Apache with mod_proxy and Apache with mod_wsgi. We’ve used TurboLucene, looked very hard at Xapian, and tried various configurations of Solr.

For the queue, I’ve written my own queuing middleware, I’ve used ActiveMQ running on an EC2 instance and I’m now in the process of moving to Amazon’s SQS. We chose to use SQS as although ActiveMQ is free as in beer and speech, it has an ongoing operations cost in terms of time, which is one thing you’re always short of during hyper-growth.
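The asynchronous update path can be sketched with the standard library's in-process queue. This only illustrates the pattern, under the assumption that the web tier hands off an update and returns without waiting; it is not our middleware, ActiveMQ or SQS, and every name in it is made up for the example.

```python
import queue
import threading

search_index = {}                 # stand-in for the search tier
pending_updates = queue.Queue()


def index_worker():
    """Drain the queue in the background, one update at a time."""
    while True:
        update = pending_updates.get()
        try:
            if update is None:    # sentinel: shut down
                return
            user_id, url, text = update
            search_index.setdefault(user_id, {})[url] = text
        finally:
            pending_updates.task_done()


def record_page(user_id, url, text):
    """Called from the web tier: enqueue the update and return immediately."""
    pending_updates.put((user_id, url, text))


worker = threading.Thread(target=index_worker)
worker.daemon = True
worker.start()

record_page("ann", "http://example.com/", "example text")
record_page("ann", "http://example.com/about", "about text")

pending_updates.join()            # wait until the worker has caught up
pending_updates.put(None)         # ask the worker to stop
worker.join()
```

The request handler never waits on the indexer, so a slow search tier shows up as a longer queue rather than as slow page loads.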

The two parts which are growing the fastest are the web tier (the TurboGears servers) and the search tier (the Solr servers). However, as we can iterate on our implementations and rapidly horizontally scale on both of those parts, that growth has been containable, if not completely pain free.

Amazon’s Web Services give growing companies the ideal building blocks to scale and keep up with demand. By iteratively improving the independent components in our architecture, we have grown to meet the substantial challenge of providing the WebMynd service to our users.

Based on initial user feedback on our forum, this blog and TechCrunch, we have made a couple of changes to the WebMynd extension for Firefox.

The latest version is 0.2.6 and is available from our homepage. If you’ve already installed the earlier version or trial versions you may be automatically prompted to update it the next time you restart Firefox.

The changes include:

– A bug fix which means we can now support Linux

– A change to the way we take a snapshot of the page which should improve browser performance

There has been some confusion over our charging model and where data is stored so, to reiterate:

– The thumbnails and full images of webpages are stored locally on your hard-drive. The text content is sent up to our servers for indexing so we can offer full text search now and social features in the future. You can ‘playback’ your browse by hitting the WebMynd icon to the right of the URL toolbar. This loads a page from our website to give two playback modes: reeler and grid.

– You will be able to view all browser history and WebMarks (take a WebMark using the star icon next to the url bar) through our interfaces for free. You will be able to search your entire browser history for the last 7 days for free, and your WebMarks indefinitely. However, the index does take up storage space on our servers which is why we offer upgrades if you want to be able to search your whole browser history through our interfaces for longer periods.

Please do keep the feedback coming in and we will respond as quickly as we can. We would love to hear from you.

Here at WebMynd we are taking a new approach to the way you organize what you see on the web. We say don’t organize! Just save everything. When you want to see it again, just peek into your WebMynd: what you’re looking for will be there waiting for you. All of a sudden the internet is an extension of your own memory!

The product and concept have just been covered on TechCrunch; you can read about it here.

We are only at the beginning of what you will be able to do with your WebMynd. Stay tuned for much more to come…

New buttons


We have just launched a new version (v0.2) of the extension with new buttons and menu options based on your feedback. This one is a candidate for a public beta launch so I would really appreciate your comments and feedback.

Hopefully you will notice plenty of changes for the better in this version as well as in the reeler and grid playback pages.

– Amir

Support Forum


We now have a support forum.

So if you have any questions about the WebMynd service, feature requests or technical issues, do post them there.

– Amir

A new release of the WebMynd plug-in is available (v0.1.12) so please do download it and check for updates as it includes a number of fixes and new features which may be useful to you. Do check back regularly as we are releasing updates at least once a day at the moment!