Saturday, 16 February 2008

Be My Anti-Valentine - a technical perspective

After many hours of toil, Meg and I opened the doors to this year's Be My Anti-Valentine site a couple of weeks ago. Redesigned, revamped and relaunched, with the hope that it won't fall over under the strain of the traffic it generates on Valentine's Day itself, as it has always done previously.

Given that my involvement in the project is one of technical architect, developer and sysadmin (Meg is the ideas, UX and design brains of the effort), it's probably worth me explaining a bit about the technology that drives the site and the problems that we came across last year and have tried to solve (or at least mitigate) this year.

By 2007, Meg had been running the VD cards site seasonally for six years, driven by a variety of successive flat-file systems and ultimately a generic PHP-based freeware card sending application (Sendcard, I think) that just about did the job but frequently caused her server to go belly-up once the traffic to the site reached its inevitable peak on the 14th February.

The problem was that the "choose card, send email to recipient" workflow was completely synchronous and linear within the application, so the emails were dispatched to the outgoing email server within the same web-spawned PHP Apache process. Naturally, once a few hundred people at a time were hitting the 'Send Card' button, the host's SMTP service would run out of available connections and stop responding. Also, PHP 4's rather, ahem, unpredictable DB connection handling under load would generally explode along similar lines. Cue lots of server errors and general mayhem as the whole thing fell apart.

So, when I stepped in last year and volunteered to make the problems go away with a bespoke web app written in Perl to suit Meg's desired functionality, I made a few sensible design decisions. Over the years I've had to deal with my fair share of scaling problems and integration issues, so I built in a few tricks of the trade.

As it turned out, the site worked nicely for the few weeks before the 14th of February.. until it didn't. The traffic the site got was so enormous that even moving it to its own dedicated, beefy server (kudos to Pair, Meg's hosting people, for being extraordinarily quick to do this) couldn't make it cope. Even extensive caching couldn't save the server from falling over.

So for 2008, whilst reconsidering our options, I decided to give Amazon EC2 - their Xen virtualisation cloud - a whirl. I'd heard plenty about it and had tinkered with virtualisation at work, but never gone the whole hog, and the cynic in me didn't buy the spiel that spinning up working application instances was as easy as a web service call.. but yes, by gum, it works a treat. Initial setup is a bit fiddly, but tools like the EC2 Firefox UI and S3Fox are a godsend when you're trying to do stuff quickly.

The site runs on a group of replicated server instances running a heavily customised version of Ubuntu 7.10 Gutsy, all identical with the exception of one (which I call the 'master', though not as a Doctor Who homage) that also runs MySQL and the card-sending backend app. All the other instances connect to the master database when creating a new card or retrieving one for display. This is a pretty naïve architecture, but it works because the percentage of card senders against the overall site traffic is pretty small (fewer than 7% of visitors choose to send a card). Therefore MySQL isn't doing very much at all, and I didn't need to spin up a separate MySQL server instance to cope, which reduced the overall cost and meant one less server instance to manage.

The rest of the application stack is fairly standard - Apache 2, FastCGI, Perl.

So what happened on the day itself? Well, we had a *tiny* bit of a wobble at around 13:00 GMT, when the east coast of America woke up (55% of users come from the USA), but I coped by spinning up another server instance and offloading some of the remaining site furniture onto Amazon S3 with some 302 redirects - which made a huge difference. Once the traffic started to drop off on the 15th, I was able to decommission three instances and balance the load across the remaining two, which coped admirably.
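
For the curious, the redirect trick boils down to something like the following. This is only a rough sketch in Python rather than the Apache configuration the site actually relies on, and the bucket URL, the path prefixes and the function name are all made up for illustration.

```python
from urllib.parse import quote

# Hypothetical values: neither the bucket nor the prefixes come from the real site.
S3_BASE_URL = "http://example-bucket.s3.amazonaws.com"
OFFLOADED_PREFIXES = ("/images/", "/css/")


def redirect_for(path):
    """Return (302, location) for offloaded static paths, or None otherwise."""
    if path.startswith(OFFLOADED_PREFIXES):
        # A 302 is temporary, so browsers keep asking the origin and the
        # offload can be switched off again once the traffic dies down.
        return 302, S3_BASE_URL + quote(path)
    return None
```

Anything that returns None carries on to the application as normal, so only the heavy static furniture ever leaves the app servers.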

Paul's Top N Web Scalability Tips

  • Never maintain state on your web server. The moment you have to keep a user on a specific application instance for their state to be persisted, you lose all hope of ever scaling the sharp end horizontally. Keep state in your database (or on a shared filesystem if you're a masochist).

  • Decouple the web interface from backend systems. You don't send emails straight from a web-spawned process. You just don't. In fact, you shouldn't connect to anything other than a database server. Email servers are strange, unpredictable beasts and they generally need to be handled with care. Don't expose one to a process pipeline that you can't explicitly control (namely, a public web server!).

    So, when a card instance is 'created' by a user on the VD site, an email object is created and stashed into a database table that effectively acts as a queue. This gets polled by a cron job at regular intervals and the emails get despatched in series. This means the email server is only ever under a constant load, which can be managed depending on what sort of performance you need.

  • Cache everything you can cache. If a page never changes, save it out to a cache directory on your filesystem and have the web server serve that instead of building a dynamic page on every hit. Thankfully the VD site only really needs to invoke the application when someone sends a card or retrieves one from a link they've been sent, so I took advantage of that and had easy cache-building methods that I could call to generate flat HTML pages. ('Proper' setups have caching loadbalancers like ZXTMs in front of their servers that can do this transparently, but of course I don't have that luxury and was trying to KISS). Which all leads me on to...

  • mod_rewrite is your friend. I can't rave about mod_rewrite enough. It's easy to get scared by URL rewriting if you're not comfortable with regexes and the general HTTP request handling mechanism in Apache, but it's actually incredibly simple to use with a bit of messing around and most modern web frameworks rely on some form of rewrite engine. The ability to abstract your URL scheme from the physical filesystem brings endless possibilities.

    The VD site uses it not only to map its tidy URLs to the application endpoint (what we old-schoolers still call a CGI script), but also to rewrite certain URLs so that flat files are served from the cache directory while the external URL stays the same, so turning it off has no effect on the end user. Also, with the flick of a hash character in the global .htaccess file, I can have all the bandwidth-heavy images served from a CDN - Amazon S3 in this case - rather than the app servers.

  • Always use INSERT DELAYED when you don't need the row back - Class::DBI, the Perl ORM engine the VD site uses, isn't very good at this, so I had to roll my own methods here, but it was worthwhile - you save a few vital queries when you're stashing your state away and trying to get back to completing the user's request as quickly as possible.

    In retrospect I wouldn't have used Class::DBI in the first place, but that's another story..
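
To make the queueing tip above a bit more concrete, here is a minimal sketch of the "database table as a queue" idea. It is not the site's actual code: it uses Python with SQLite standing in for Perl and MySQL, and the table, column and function names are assumptions invented for the example.

```python
import sqlite3


def create_queue(conn):
    """Create the (hypothetical) queue table if it doesn't exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS email_queue (
               id        INTEGER PRIMARY KEY AUTOINCREMENT,
               recipient TEXT NOT NULL,
               body      TEXT NOT NULL,
               sent      INTEGER NOT NULL DEFAULT 0
           )"""
    )
    conn.commit()


def enqueue_email(conn, recipient, body):
    """Called from the web request: stash the email and return straight away."""
    conn.execute(
        "INSERT INTO email_queue (recipient, body) VALUES (?, ?)",
        (recipient, body),
    )
    conn.commit()


def drain_queue(conn, send, batch_size=50):
    """Called from the cron job: send up to batch_size pending emails in series."""
    rows = conn.execute(
        "SELECT id, recipient, body FROM email_queue "
        "WHERE sent = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, recipient, body in rows:
        send(recipient, body)  # the only place that talks to the mail server
        conn.execute("UPDATE email_queue SET sent = 1 WHERE id = ?", (row_id,))
        conn.commit()
    return len(rows)
```

The web-facing code only ever calls enqueue_email, while the cron job calls drain_queue with whatever function actually talks to the mail server. The batch_size cap is what keeps the load on the email server steady no matter how many people hit 'Send Card' at once.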

Phew! That's it over for another year, then. There's a whole bunch of other stuff I could go into (the joys of dealing with Spamcop..), but that's the story thus far. I'm chuffed at having got through a Valentine's Day without having to watch a single server helplessly grind to a halt under the weight of the traffic burst that a site like BMAV inevitably generates, and incredibly impressed that Amazon EC2 makes virtualisation accessible and affordable for projects like this. EC2 FTW!
