Thursday, July 19, 2012

Measuring CDN Performance With Real Users

This is cross-posted on the Wayfair Engineering Blog

A couple of weeks ago I ran a test with WebPagetest that was designed to quantify how much a CDN improves performance for users that are far from your origin.  Unfortunately, the test indicated that there was no material performance benefit to having a CDN in place.  This conclusion sparked a lively discussion in the comments and on Google+, with the overwhelming suggestion being that Real User Monitoring data was necessary to draw a firm conclusion about the impact of CDNs on performance.  To gather this data I turned to the Insight product and its "tagging" feature.

Before I get into the nitty-gritty details I'll give you the punch line: the test with real users confirmed the results from the synthetic one, showing no major performance improvement due to the use of a CDN.

Implementation Details: 

Prior to this test we served our static content (CSS, JS, and images) from three domains:

common.csnimages.com
common1.csnimages.com
common2.csnimages.com


The first domain is where our CSS/JS comes from, and the other two are domains that we shard images across to boost parallel downloading.  To test performance without a CDN we needed subdomains that bypass it, and luckily we already had two set up from previous testing.  We configured these subdomains to always hit origin (a rough sketch of that DNS setup follows the list):

common3.csnimages.com
common4.csnimages.com
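To make that concrete, here is a rough sketch of how the split might look in a DNS zone file.  The CDN edge hostname is invented for illustration; the origin IP matches the one used in the setDns script further down the page:

 ; hypothetical zone-file sketch -- the CDN hostname is made up
 common1.csnimages.com.  IN  CNAME  wayfair.cdn-provider.example.com.
 common2.csnimages.com.  IN  CNAME  wayfair.cdn-provider.example.com.
 ; test subdomains that always resolve straight to origin
 common3.csnimages.com.  IN  A      209.202.142.30
 common4.csnimages.com.  IN  A      209.202.142.30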


To ensure that this test was comparing apples to apples, I switched all requests hitting common.csnimages.com over to common1.csnimages.com, so a customer on our CDN uses only the common1 and common2 domains and both groups shard their static content across exactly two domains.

Once these were set up I wrote some code that uses our "feature knobs" to put someone either fully on the CDN or fully off it, with the ability to adjust what percentage of overall users got the new subdomains.  I also made the feature cookie-based, so once you were assigned to a group you stayed there for the remainder of your session (well, really for the life of the session cookie that I set).  Finally, I tagged each page with either "cdn" or "no_cdn" in Insight's page tracking tags, alongside the page type each page already gets (a sketch of the assignment logic follows the example below):

TBRUM.q.push(['tags','Home, cdn']);
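
For anyone curious what that plumbing looks like, here is a minimal sketch of the assignment logic.  The function and cookie details are invented for illustration; our real implementation lives in our server-side feature-knob code rather than in a standalone script like this:

 // Hypothetical sketch of the split-test bucketing (names invented for illustration).
 // A visitor is bucketed once, the choice is stored in a session cookie, and each
 // page render picks its static-content domains and Insight tag from that bucket.
 var NO_CDN_PERCENT = 25; // the "knob": percentage of users sent straight to origin

 function assignBucket(existingCookieValue) {
   // Stick with a previous assignment for the life of the session cookie
   if (existingCookieValue === 'cdn' || existingCookieValue === 'no_cdn') {
     return existingCookieValue;
   }
   return (Math.random() * 100 < NO_CDN_PERCENT) ? 'no_cdn' : 'cdn';
 }

 function staticDomainsFor(bucket) {
   // CDN group shards across common1/common2; origin group across common3/common4
   return bucket === 'cdn'
     ? ['common1.csnimages.com', 'common2.csnimages.com']
     : ['common3.csnimages.com', 'common4.csnimages.com'];
 }

 // The per-page Insight tag then combines the page type with the bucket, e.g.
 // TBRUM.q.push(['tags', 'Home, ' + bucket]);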

After some testing to make sure that everything worked, I cranked the "no_cdn" users up to 25% and let the test run for a few days.

Results:

As I mentioned above, we didn’t see any appreciable improvement in performance from the use of a CDN.  More specifically, the median load time, average load time, and 90th/99th percentiles were slightly worse (essentially the same) with the CDN, while the 95th percentile was marginally faster.  Here are the full results:

Performance for users hitting our CDN

Performance for users hitting our origin


Notice that the difference in pageviews matches the percentage distribution quite closely: with a 75/25 split we would expect roughly three times as many pageviews in the "cdn" group as in the "no_cdn" group, which is what we see.

What Does This Mean?

This is definitely a surprise, but before we get too down on CDNs, let's talk about the other value they provide:
  1. Origin bandwidth offload
  2. Ability to tolerate spikes in traffic
Both of these points are extremely important, and at Wayfair they justify the expense of a CDN by themselves.  That being said, we were also expecting to see some performance benefit from our CDN, and it is disappointing that we aren't getting one.

It is also important to note that these results are from our CDN and our application, and thus should not be extrapolated to the industry as a whole.  You should run your own tests if you think that your CDN could be outperforming ours.

Conclusion:

I'm happy that we ran this test, because it gives us a better frame of reference when approaching content delivery vendors going forward.  It also forces us to focus on the real benefits we are getting for our dollar – speed not being one of them at the moment.

In my mind the major takeaways from this experience are the following:
  1. If you are really curious about how something is affecting your real users, you have to instrument your site, do the work to create a split test, and actually measure what's happening.
  2. Depending on the distribution of your users, a CDN may not provide a huge performance benefit.  Make sure that you know what you are paying for and what you are getting with a CDN contract.
  3. If you have the right tools and a decent amount of traffic, these kinds of tests can be run very quickly.
If you have any questions about methodology or results let me know in the comments!

Friday, July 6, 2012

Measuring CDN Performance With WebPagetest

When I was at Velocity I heard about a quick and useful trick you can do with WebPagetest to measure the effectiveness of your CDN.  The steps are pretty simple:
  1. Test a URL on your site from a location that is far from your origin.
  2. Using the scripting tab in WebPagetest, point your CDN domains at your origin IP, and run another test.
  3. Compare the results, and see how much your CDN is helping you!
Let's break this down for a Wayfair URL.

Step 1:  Test a URL on Your Site Normally

Since we only have our static content on our CDN, I chose a URL that has a lot of images on it, what we call a "superbrowse" page - http://www.wayfair.com/Outdoor-Wall-Lights-C416512.html.  Since our origin is in the Boston area, I picked the LA node in WebPagetest to give our CDN the best chance of success.  To try to smooth out the results, I configured the test to run 10 times.  It also helps to log in and label your test so you can easily find it later.


While this was running I moved on to step 2...

Step 2:  Use the WebPagetest Scripting Engine to Point to Your Origin

I set this test up in almost exactly the same way, except in the script tab I entered the following:

 setDns common.csnimages.com 209.202.142.30  
 setDns common1.csnimages.com 209.202.142.30  
 setDns common2.csnimages.com 209.202.142.30  
 navigate http://www.wayfair.com/Outdoor-Wall-Lights-C416512.html  

This points all of our image domains (we are using domain sharding for performance) at our origin IP address and bypasses our CDN entirely.

At this point I just had to wait for the tests to complete and compare the results.
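
As an aside, you don't have to click through the web UI for this (I did for these tests).  Here is a rough sketch of submitting the same scripted run through WebPagetest's HTTP API; the API key and location name are placeholders (check getLocations.php on your instance for real location names), and it assumes a runtime with a built-in fetch, such as Node 18+:

 // Rough sketch: submit the scripted run via WebPagetest's runtest.php endpoint.
 // API key and location are placeholders -- adjust for your own account/instance.
 const script = [
   'setDns common.csnimages.com 209.202.142.30',
   'setDns common1.csnimages.com 209.202.142.30',
   'setDns common2.csnimages.com 209.202.142.30',
   'navigate http://www.wayfair.com/Outdoor-Wall-Lights-C416512.html',
 ].join('\n');

 const params = new URLSearchParams({
   k: 'YOUR_API_KEY',              // placeholder API key
   f: 'json',                      // ask for a JSON response
   runs: '10',                     // match the 10 runs used in this post
   location: 'LosAngeles:Chrome',  // placeholder location name
   script: script,
 });

 fetch('https://www.webpagetest.org/runtest.php?' + params.toString())
   .then(function (res) { return res.json(); })
   .then(function (data) { console.log(data); }); // response includes links to the eventual results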

Step 3:  Comparing and Interpreting Results

Here are the test results over 10 runs:

Test Type      Mean (s)    Median (s)   Std. Dev. (s)
With CDN       6.7055      5.744        1.74693
Without CDN    6.758       6.221        1.61
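
If you want to reproduce these summary numbers from your own runs, the math is simple; here is a small helper sketch (the function names are mine, and you would feed in the ten first-view load times from the test results):

 // Helpers for summarizing a set of first-view load times (in seconds).
 function mean(xs) {
   return xs.reduce(function (sum, x) { return sum + x; }, 0) / xs.length;
 }

 function median(xs) {
   var s = xs.slice().sort(function (a, b) { return a - b; });
   var mid = Math.floor(s.length / 2);
   return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
 }

 function stddev(xs) {
   // population standard deviation; a sample std dev would divide by n - 1 instead
   var m = mean(xs);
   var variance = xs.reduce(function (sum, x) { return sum + (x - m) * (x - m); }, 0) / xs.length;
   return Math.sqrt(variance);
 }

 // e.g. comparing Math.abs(median(cdnRuns) - median(noCdnRuns)) against stddev(cdnRuns)
 // is the "within one standard deviation" check discussed below.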

Based on these results, it appears that our CDN isn't actually providing much benefit! There is almost no difference in the mean for the two tests, and the difference in the median is well within one standard deviation.  Are we sure this worked?  Let's check the details:

With the CDN in place we get external IP addresses for every image (even though the location is an ambiguous "United States"):



With the scripted test we are hitting the origin in Boston, MA for every image:


So the test is working as expected.  Let's look a little closer... the column on the left is "Time to First Byte" (TTFB) and the next column over is "Content Download".  With a CDN you expect TTFB to be comparable to or better than your origin's, since CDNs should have highly optimized servers and the first packet shouldn't have too far to go.  The content download should be MUCH faster, since the edge nodes are closer to the end user than your origin is.  As we scan down these screenshots we can see that for the CDN the content download times vary widely, from 2ms to 1500ms, and that's for images that are less than 10KB!  This is not great performance.  I'm also surprised that the TTFBs are so high; I would expect the first byte of an image to be served off an edge node in well under 100ms.  The origin requests are slower on the whole, but more consistent.  These two factors combine to make the difference between the two tests negligible.

Based on this brief analysis of one URL, it looks like our CDN isn't providing a whole lot of benefit.  Granted, these images are on the smaller side, and CDNs should perform much better for heavier content, but the reality is that most of our static content is product images that are less than 10KB, so this is our primary use case.  I'm going to take this data back to our CDN and have a conversation, but in the meantime I hope you use this technique to see how your CDN is impacting your site's performance.