Scaling on EC2
Like that of any application developed for a platform, the success of a Firefox add-on is closely tied to the popularity and distribution you get from the underlying delivery mechanism. So, when we honed the WebMynd feature set, improving the product enough to get onto Mozilla's Recommended List, we were delighted by our growing user numbers. A couple of weeks later, Firefox 3 was released, and we got a usage graph like this:
With a product like WebMynd, where part of the service we provide is to save and index a person's web history, this sort of explosive expansion brings with it some growing pains. Performance was a constant battle for us, even with the relatively low user numbers of the first few months. This was due mainly to some poor technology choices; thankfully, the underlying architecture we chose from the start has proven to be sound.
I would not say that we have completely solved the difficult problem in front of us – we are still not content with the responsiveness of our service, and we’re open about the brown-outs we still sometimes experience – but we have made huge progress and learned some invaluable lessons over the last few months.
What follows is a high-level overview of some of the conclusions we've arrived at so far: best practices that work for us and some things to avoid. In later weeks, I plan to follow up with deeper dives into certain parts of our infrastructure as and when I get a chance!
Scaling is all about removing bottlenecks
This sounds obvious, but should strongly influence all your technology and architecture decisions.
Being able to remove bottlenecks means you need to be able to swap out discrete parts which aren't performing well enough, and swap in bigger, faster, better parts which will perform as required. This will move the bottleneck somewhere else, at which point you need to swap out discrete parts which aren't performing well enough, and swap in bigger, faster, better parts… well, you get the idea. This cycle can be repeated ad infinitum until you've optimised the heck out of everything and you're just throwing machines at the problem.
At WebMynd, for our search backend, we've done this four or five times already in the five months we've been alive, and I think I still have some iterations left in me. Importantly, I wouldn't say that any of these iterations were a mistake. In a parallel to the Y Combinator ethos of launching a product early, scaling should be an iterative process with as tight a feedback loop as possible. Premature optimisation of any part of the service is a waste of time and is often harmful.
Scaling relies on having discrete pieces with clean interfaces, which can be iteratively improved.
Horizontal is better than vertical
One of the reasons Google triumphed in the search engine wars was that their core technology was designed from the ground up to scale horizontally across cheap hardware. Compare this with their competitors’ approach, which was in general to scale vertically – using larger and larger monolithic machines glued together organically. Other search engines relied on improving hardware to cope with demand, but when the growth of the internet outstripped available hardware, they had nowhere to go. Google was using inferior pieces of hardware, but had an architecture and infrastructure allowing for cheap and virtually limitless scaling.
Google's key breakthroughs were the Google File System and MapReduce, which together allowed them to horizontally partition the problem of indexing the web. If you can architect your product in such a way as to allow for similar partitioning, scaling will be all the easier. It's interesting to note that some current Web 2.0 products are extremely hard to partition horizontally, due to the hyper-connectedness of their user graphs (witness Twitter).
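To make that concrete, here's a toy illustration of the MapReduce pattern in plain Python (nothing like Google's distributed implementation, of course). The point is that each map call touches only its own slice of the input, which is exactly why the slices can live on different machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs; each call only needs its own slice of input."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Sum counts per word; the pairs for any one key can be reduced
    independently of every other key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return counts

# Two "machines", each holding its own partition of the corpus.
partition_a = ["the quick brown fox", "the lazy dog"]
partition_b = ["the fox jumps"]

# Map runs independently on each partition; only the merge is shared work.
merged = reduce_phase(list(map_phase(partition_a)) + list(map_phase(partition_b)))
print(merged["the"])  # 3
```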
The problem WebMynd is tackling is embarrassingly partitionable. Users have their individual slice of web history, and these slices can be moved around the available hardware at will. New users equals new servers.
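As a sketch of what that partitioning can look like (the shard names here are invented for illustration, not our real topology), mapping a user to a search server can be as simple as a stable hash:

```python
import hashlib

# Hypothetical pool of search servers; in this model, adding users
# eventually means appending to this list.
SEARCH_SHARDS = ["solr-1.internal", "solr-2.internal", "solr-3.internal"]

def shard_for_user(user_id):
    """Stable user -> server mapping: a given user's slice of web history
    always lives on the same shard."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return SEARCH_SHARDS[int(digest, 16) % len(SEARCH_SHARDS)]

print(shard_for_user(42))
```

A plain modulo scheme like this does reshuffle some users whenever the pool grows; consistent hashing is the usual refinement if those migrations become painful.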
Hardware is the lowest common denominator
By running your application on virtual machines using EC2, you are viewing the hardware you're running on as a commodity which can be swapped in and out at the click of a button. This is a useful mental model to have, where the actual machine images you're running on are just another component in your architecture which can be scaled up or down as demand requires. Obviously, if you're planning on scaling horizontally, you need to be building on a substrate which has a low marginal cost for creating and destroying hardware: marginal cost in terms of time, effort and capex.
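As an illustration of just how low that marginal cost is, here's roughly what launching and destroying a machine looks like with the boto library (the credentials and AMI id below are placeholders):

```python
from boto.ec2.connection import EC2Connection

# Placeholder credentials and AMI id.
conn = EC2Connection("ACCESS_KEY", "SECRET_KEY")

# "Buy" a new machine: one API call, billed by the hour.
reservation = conn.run_instances("ami-12345678", instance_type="m1.small")
instance = reservation.instances[0]

# ... use it for as long as demand requires ...

# Then throw it away just as cheaply.
conn.terminate_instances([instance.id])
```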
A real example
The rectangles represent EC2 instances. Their colour represents their function. The red arrow in the top right represents incoming traffic. Other arrows represent connectedness and flows of information.
This is a simplified example, but here’s what the pieces do in general terms:
- All traffic is currently load balanced by a single HAProxy instance
- All static content is served from a single nginx instance (with a hot failover ready)
- Sessions are distributed fairly across lots of TurboGears application servers, on several machines
- The database is a remote MySQL instance
- Search engine updates are handled asynchronously through a queue
- Search engine queries are handled synchronously over a direct TurboGears / Solr connection (not shown; see the sketch below)
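To give a flavour of that last bullet: a synchronous Solr query is just an HTTP GET against the /select handler. Here's a minimal sketch, where the host name and the user_id schema field are illustrative rather than our actual setup:

```python
import json
import urllib
import urllib2

def search(user_id, query_text):
    """Query a Solr shard's /select handler directly. The host and the
    'user_id' field are illustrative, not our real schema."""
    params = urllib.urlencode({
        "q": "user_id:%s AND %s" % (user_id, query_text),
        "wt": "json",   # ask Solr for a JSON response
        "rows": 10,
    })
    url = "http://solr-1.internal:8983/solr/select?%s" % params
    response = json.load(urllib2.urlopen(url))
    return response["response"]["docs"]
```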
One shouldn't be timid about trying new things to find the best solution; almost all of these parts have been iterated on like crazy. For example, we've used Apache with mod_python, with mod_proxy and with mod_wsgi. We've used TurboLucene, looked very hard at Xapian, and tried various configurations of Solr.
For the queue, I've written my own queuing middleware, used ActiveMQ running on an EC2 instance, and I'm now in the process of moving to Amazon's SQS. We chose SQS because, although ActiveMQ is free as in beer and as in speech, it has an ongoing operational cost in terms of time, which is the one thing you're always short of during hyper-growth.
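For the curious, the SQS flow looks roughly like the sketch below, again with boto; the queue name and message payload are illustrative, not our actual wire format:

```python
import json
from boto.sqs.connection import SQSConnection
from boto.sqs.message import Message

# Placeholder credentials; queue name is illustrative.
conn = SQSConnection("ACCESS_KEY", "SECRET_KEY")
queue = conn.create_queue("index-updates")  # idempotent if it already exists

def enqueue_page(user_id, url, content):
    """Web tier: fire and forget; the search tier indexes it later."""
    msg = Message()
    msg.set_body(json.dumps({"user_id": user_id, "url": url, "content": content}))
    queue.write(msg)

def drain_once():
    """Search tier: pull a message, index it, then delete it. If indexing
    crashes, the message reappears after the visibility timeout."""
    msg = queue.read(visibility_timeout=60)
    if msg is not None:
        update = json.loads(msg.get_body())
        # ... post 'update' to the right Solr shard here ...
        queue.delete_message(msg)
```

Part of the appeal is exactly this: the queue itself is now someone else's operational problem.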
The two parts which are growing the fastest are the web tier (the TurboGears servers) and the search tier (the Solr servers). However, because we can iterate on our implementations and rapidly scale horizontally in both of those parts, that growth has been containable, if not completely pain-free.
Amazon’s Web Services give growing companies the ideal building blocks to scale and keep up with demand. By iteratively improving the independent components in our architecture, we have grown to meet the substantial challenge of providing the WebMynd service to our users.