Thursday, July 19, 2012

Measuring CDN Performance With Real Users

This is cross-posted on the Wayfair Engineering Blog.

A couple of weeks ago I ran a test with WebPagetest that was designed to quantify how much a CDN improves performance for users that are far from your origin.  Unfortunately, the test indicated that there was no material performance benefit to having a CDN in place.  This conclusion sparked a lively discussion in the comments and on Google+, with the overwhelming suggestion being that Real User Monitoring data was necessary to draw a firm conclusion about the impact of CDNs on performance.  To gather this data I turned to the Insight product and its "tagging" feature.

Before I get into the nitty-gritty details I'll give you the punch line: the test with real users confirmed the results from the synthetic one, showing no major performance improvement due to the use of a CDN.

Implementation Details: 

Prior to this test we served our static content (CSS, JS, and images) from three domains:

common.csnimages.com
common1.csnimages.com
common2.csnimages.com


The first domain is where our CSS/JS comes from, and the other two are domains that we shard images across to boost parallel downloading.  To test performance without a CDN we needed subdomains that bypass it.  Luckily we already had two other subdomains set up from previous testing, so we configured them to always hit origin:

common3.csnimages.com
common4.csnimages.com


To ensure that this test was comparing apples to apples, I switched all requests hitting common.csnimages.com to common1.csnimages.com, so when a customer is hitting our CDN they only use the common1 and common2 domains.
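To make the domain setup concrete, here is a minimal sketch of how asset URLs could be sharded across either pair of hostnames. The hostnames come from the lists above; the function names, the hash, and the protocol-relative URL form are assumptions for illustration, not our production code.

```javascript
// Hostname pools from the post. Everything else here is illustrative.
const CDN_HOSTS = ['common1.csnimages.com', 'common2.csnimages.com'];
const ORIGIN_HOSTS = ['common3.csnimages.com', 'common4.csnimages.com'];

// Simple deterministic string hash (not cryptographic). A given path
// always maps to the same hostname, so browser caching keeps working.
function hashPath(path) {
  let hash = 0;
  for (let i = 0; i < path.length; i++) {
    hash = (hash * 31 + path.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Build a protocol-relative URL for a static asset, picking one host
// from the supplied pool based on the hash of the asset path.
function staticUrl(path, hosts) {
  const host = hosts[hashPath(path) % hosts.length];
  return '//' + host + path;
}
```

Calling `staticUrl('/img/sofa.jpg', CDN_HOSTS)` returns a URL on common1 or common2, while passing `ORIGIN_HOSTS` returns one on common3 or common4. The path `/img/sofa.jpg` is a made-up example.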

Once these were set up I wrote some code to use our "feature knobs" to put someone either fully on the CDN or fully off the CDN, with the ability to adjust what percentage of overall users were using the new subdomains.  I also made the feature cookie based so once you got assigned to a group you stayed there for the remainder of your session (well, really for the life of the session cookie that I set).  Finally, I tagged each page with either "cdn" or "no_cdn" in Insight's page tracking tags (each page is also tagged with a page type):

TBRUM.q.push(['tags','Home, cdn']);
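The assignment logic described above could look roughly like the following sketch. The cookie name, the helper names, and the injectable random source are all hypothetical. Only the "cdn" / "no_cdn" tag values and the tag string format come from the snippet above.

```javascript
// Hypothetical cookie name; the real "feature knobs" code is not shown in the post.
const COOKIE_NAME = 'cdn_test_group';

// Read one cookie value out of a raw "a=1; b=2" cookie header string.
function readCookie(cookieHeader, name) {
  const pairs = (cookieHeader || '').split(';');
  for (const pair of pairs) {
    const idx = pair.indexOf('=');
    if (idx === -1) continue;
    if (pair.slice(0, idx).trim() === name) {
      return pair.slice(idx + 1).trim();
    }
  }
  return null;
}

// Return the visitor's group, reusing an existing assignment if present.
// noCdnFraction is the share of new sessions sent to origin (0 to 1).
// random is injectable so the function can be tested deterministically.
function assignGroup(cookieHeader, noCdnFraction, random = Math.random) {
  const existing = readCookie(cookieHeader, COOKIE_NAME);
  if (existing === 'cdn' || existing === 'no_cdn') {
    return { group: existing, setCookie: null };
  }
  const group = random() < noCdnFraction ? 'no_cdn' : 'cdn';
  // No Expires or Max-Age attribute, so browsers treat this as a session cookie.
  return { group, setCookie: COOKIE_NAME + '=' + group + '; Path=/' };
}

// Build the tag string in the same "PageType, group" shape as the snippet above.
function rumTags(pageType, group) {
  return pageType + ', ' + group;
}
```

With this in place, the page would call something like `TBRUM.q.push(['tags', rumTags('Home', result.group)])`, where `result` is the value returned by `assignGroup`. The returned group also decides which pair of subdomains the page's static assets use.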

After some testing to make sure that everything worked, I cranked the "no_cdn" users up to 25% and let the test run for a few days.

Results:

As I mentioned above, we didn't see any appreciable improvement in performance from the use of a CDN.  More specifically, the median load time, average load time, and 90th and 99th percentiles were all slightly worse with the CDN (though close enough to call basically the same), while the 95th percentile was marginally faster.  Here are the full results:

Performance for users hitting our CDN

Performance for users hitting our origin


Notice that the difference in pageviews matches the percentage distribution quite closely: with a 75/25 split we would expect to see three times as many pageviews in the "CDN" group, which we do.
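If you want to produce the same summary from raw load-time samples rather than from a RUM dashboard, a sketch along these lines would do it. The function names and the nearest-rank percentile method are my assumptions. The RUM product may compute its percentiles differently, so small discrepancies are to be expected.

```javascript
// Nearest-rank percentile over an ascending-sorted array: the smallest
// sample such that at least p percent of all samples are <= it.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Summarize one group's page load times (in milliseconds).
function summarize(loadTimesMs) {
  const sorted = loadTimesMs.slice().sort((a, b) => a - b);
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  return {
    pageviews: sorted.length,
    mean: sorted.length ? sum / sorted.length : NaN,
    median: percentile(sorted, 50), // nearest-rank, so no interpolation
    p90: percentile(sorted, 90),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Expected ratio of CDN pageviews to origin pageviews for a given
// fraction of sessions assigned to the "no_cdn" group.
function expectedPageviewRatio(noCdnFraction) {
  return (1 - noCdnFraction) / noCdnFraction;
}
```

For the split used here, `expectedPageviewRatio(0.25)` is 3, which is the "three times as many" sanity check. Running `summarize` once per group gives the two sets of numbers to compare.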

What Does This Mean?

This is definitely a surprise, but before we get too down on CDNs, let's talk about the other value they provide:
  1. Origin bandwidth offload
  2. Ability to tolerate spikes in traffic
Both of these points are extremely important, and at Wayfair they justify the expense of a CDN by themselves.  That being said, we were also expecting to see some performance benefit from our CDN, and it is disappointing that we aren't getting one.

It is also important to note that these results are from our CDN and our application, and thus should not be extrapolated to the industry as a whole.  You should run your own tests if you think that your CDN could be outperforming ours.

Conclusion:

I'm happy that we ran this test, because it gives us a better frame of reference when we approach content delivery vendors going forward.  It also forces us to focus on the real benefits we are getting for our dollar, and speed is not one of them at the moment.

In my mind the major takeaways from this experience are the following:
  1. If you are really curious about how something is affecting your real users, you have to instrument your site, do the work to create a split test, and actually measure what's happening.
  2. Depending on the distribution of your users, a CDN may not provide a huge performance benefit.  Make sure that you know what you are paying for and what you are getting with a CDN contract.
  3. If you have the right tools and a decent amount of traffic, these kinds of tests can be run very quickly.
If you have any questions about methodology or results let me know in the comments!

7 comments:

  1. Great write-up and interesting topic. There are numerous performance best practices. Not all of them apply to every site. But that doesn't mean the best practice is bad - it just might not be relevant at that time for that particular site.

    I analyzed http://www.wayfair.com/ and expected to see some other issue (e.g., blocking JavaScript, daisy-chained stylesheets) on the critical path making CDN performance a lesser issue. But instead the page's onload event is blocked by images - this is rare.

    There are 5 images each over 200K. That's big, but not outlandish. What's strange is that they are served via Akamai but take 5-10 seconds to download. That's really slow!

    I ran this in WPT on IE9 from San Jose and DC: http://www.webpagetest.org/result/120720_RX_S87/ http://www.webpagetest.org/result/120720_8E_S2Z/

    It makes me think something is wrong with Akamai - that's way too long. It appears that it hit appropriate Akamai edge servers (DC hit an Akamai server in VA, San Jose hit an Akamai server in CA). My suggestions: 1. Figure out why it takes 10 seconds to download 200K from Akamai. 2. Try downloading those large images first rather than in the middle. E.g., if you know the images that are going to be needed later, start them downloading using JS at the top of HEAD.

    Replies
    1. Thanks for the feedback Steve! I'll reach out to Akamai and see if they have any ideas about why the download is taking so long.

      I will also pass the image pre-fetching suggestion on to our Frontend Engineering team.

      Cheers,

      Jonathan

  2. I'd love to get a packet capture of regular user in US trying to open the site via Akamai...

    I saw a situation where (at least in WPT) a particular CDN was 2 times slower than all other CDNs for a particular site. In the case of WPT it was loads of re-transmits due to buffer bloat. I could not re-produce this on servers I have access to. The page in question seems to be similar to yours. Domain sharding + few large images.

    I think that Akamai (or at least your account's configuration) is causing re-transmits.

  3. Certainly the 10s download of 200k indicates something wrong.

    For CDN configurations in which *that* is not happening, your results are still common for small sites where the CDN often doesn't have the content cached in most of their POPs, and when most of your traffic is in the same country you're in.

    If you care about latency for people across the globe, that too adds a tick in the CDN's column.

    But generally they don't add value until you have enough transfer that bandwidth limits are a consideration, that traffic spikes are a consideration, that global latency is interesting, and that POPs generally have hot content cached all the time.

  4. This will add to my seo terminology knowledge now. It'll add to the seo info too.

  5. Really interesting article Jonathan. I'd be interested to know what you think a similar analysis of video delivery by CDN might turn up. I'm currently trying to set up a video on demand site (on a shoestring!). We have spoken to several CDNs and have been given a great offer but have not yet signed. If what your analysis has shown also applies to video delivery then I would probably be wasting money until we got into serious bandwidth issues some time after launch. Many thanks - looking forward to reading many more of your blog posts.
    Richard Gammons, CMor.tv Ltd

    Replies
    1. I've never done a thorough analysis of video delivery through CDNs, but from what I understand the main value add there is that they can handle the traffic for you, thus preventing you from needing to have a huge amount of bandwidth to your origin. Getting your content closer to your end users is less important for videos, since you only pay the latency cost once, at the beginning of the connection. Once the data is in flight the latency is largely meaningless. This applies to all large files - bandwidth is more important than latency when you are dealing with a single massive file.
