Monday, September 5, 2011

How fast is your website? Setting (and keeping) Web performance SLAs

This is a cross post from the Yottaa Performance Blog


I’m sure many of you have heard people at the office ask the question “How fast is our site?”  or some variation of it.  Many of you have probably also realized that this question is largely meaningless.  Anybody who can respond to this question with a single number is trying to sell you something.  There are a huge number of variables that go into measuring the load time of a website:
  • What kind of monitoring are you using, synthetic or real user?
  • If synthetic, where was the test run from, close to your datacenter or across the world?
  • What percentile are you looking at in the data?
  • How many tests (or people for RUM data) are you aggregating to get this number?
  • Which pages are you monitoring?


If you are responsible for creating or conforming to a performance SLA for your website, make sure this SLA is as specific as possible.  Treat it like a SMART goal (specific, measurable, attainable, realistic, and time bound).  To be more explicit, this is what a good SLA looks like:
“The homepage of our site will load in under 3 seconds measured at the 80th percentile via synthetic tests running in New York, LA, Seattle, and Miami every 30 minutes.  We will measure this SLA at 8:00AM every morning and base it off the last 24 hours of data.”
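To make that concrete, here is a minimal sketch of how a daily check against an SLA like this one could look.  The load times are made up for illustration, the function names are hypothetical, and the nearest-rank percentile method is an assumption (monitoring tools differ in how they compute percentiles, so check what yours does).

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples to aggregate")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def meets_sla(load_times_by_city, threshold_s=3.0, pct=80):
    """Pool every test in the window and compare the chosen percentile to the threshold."""
    pooled = [t for times in load_times_by_city.values() for t in times]
    observed = percentile(pooled, pct)
    return observed < threshold_s, observed

# Hypothetical homepage load times in seconds (a real 24-hour window at one
# test per city every 30 minutes would hold 48 samples per city).
sample_window = {
    "New York": [1.8, 2.1, 2.4, 2.0, 3.6],
    "LA": [2.2, 2.5, 2.3, 2.9, 2.7],
    "Seattle": [2.0, 2.6, 2.8, 2.4, 2.2],
    "Miami": [2.3, 2.1, 3.2, 2.5, 2.6],
}

ok, observed = meets_sla(sample_window)
print(f"80th percentile: {observed:.1f}s, SLA met: {ok}")
```

With these made-up numbers the 80th percentile is 2.7 seconds, which passes, while the 95th percentile of the same data is 3.2 seconds, which would not.  That gap is exactly why the percentile has to be written into the SLA.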


How to pick what that number should be

Picking the right number for your SLA can be challenging.  The first step in this process is to measure where your site currently stands.  If your site loads in 5 seconds under the conditions of your SLA, it might not be wise to set the target at 2 seconds and promise to meet it in a month.  No one rule works for all companies, but this is a good place to start:
Benchmark your top 6-8 competitors using the same conditions that you will use for your site (this will require using synthetic tests instead of real user monitoring) and make sure that you are in the top half.  Once you meet that goal, try to keep moving up.
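As a rough sketch of that "top half" check, assuming you already have one comparable load time per site (for example the same percentile from the same synthetic tests), something like the following would do.  The competitor names and timings here are invented placeholders, and the function name is hypothetical.

```python
def rank_among_competitors(our_time_s, competitor_times_s):
    """Return (position, field_size, in_top_half), where position 1 is the fastest site."""
    faster = sum(1 for t in competitor_times_s.values() if t < our_time_s)
    position = faster + 1
    field_size = len(competitor_times_s) + 1  # competitors plus our own site
    return position, field_size, position <= field_size / 2

# Hypothetical load times in seconds, all measured under identical conditions.
competitors = {
    "competitor-a": 2.4,
    "competitor-b": 3.9,
    "competitor-c": 2.9,
    "competitor-d": 4.6,
    "competitor-e": 3.3,
    "competitor-f": 5.1,
}

position, field_size, top_half = rank_among_competitors(3.1, competitors)
print(f"Ranked {position} of {field_size}, top half: {top_half}")
```

Rerunning this each quarter with fresh measurements gives the performance review meeting a simple number to track over time.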
I think it’s also important to review these agreements regularly.  Try setting up a quarterly performance review meeting where engineers and business owners get together and talk about performance.  Review the data from the quarter, establish where you are in relation to the rest of your industry, and then talk about whether there is a positive ROI in spending the time to improve.

What to do when your SLA is violated

Let’s say you have done all of the above, you have a really specific SLA, you have buy-in from the business side, and you’ve been meeting your targets for 6 months.  What happens when you come in on a Tuesday morning and you have 50 alerts sitting in your inbox saying that you’ve been violating your SLA for the last 12 hours?  When this happens you have a few options:
  1. Roll back all changes that might have caused the violation (this could be many if you are a Continuous Integration shop, or zero if you have a strict release cycle).
  2. Try to optimize the page(s) that you are monitoring to get back within your SLA.
  3. Change/increase the SLA.
Sometimes doing #1 won’t work, since there are no changes to roll back (or the failures didn’t result from a code change).  Maybe you are just over capacity and the solution is to buy more servers.  Maybe some piece of hardware failed in your datacenter and all you have to do is replace it.  Or maybe a re-index job went haywire and running it again will bring you back into compliance.
The key here is to have enough monitoring that you can track down the problem quickly, and enough forethought that you know how you are going to handle it.  If something is physically broken then the answer is easy (get a new one).  If the violation happened because someone added a new feature that the business really wants, then things get a little more complicated.  There’s not really a “right” answer here.  Just talk it over with the performance review team before it happens, and make sure that you HAVE an answer that everyone is happy with (get it in writing if possible).
Creating and adhering to good SLAs is a significant amount of work, but it pays off.  If you are responsible for the performance of your site, it really is the only sane way to set realistic expectations (especially if your bonus is based on site performance or something along those lines).  It can also be a “feel good” process: if you have really great monitoring and firm performance requirements, you can be happy about meeting them day in and day out.  Good luck!