Learning from A/B tests

I’ve often seen companies start experimenting on their product flows, landing pages and user experience, yet struggle to isolate and pinpoint why a certain test performed better (or worse). This happens most with large-scale changes that affect the product in a big way, such as a brand new landing page or a new feature.

I’ve observed this most often among marketers, product managers and developers who don’t come from a (for lack of a better expression) “classic engineering” or statistical background. That is, they have never done real research that requires statistical hypothesis testing. This includes someone like me: a self-taught programmer, but not a true engineer or researcher.

We conduct business in a world where we should constantly strive to understand our users’ behaviours as much as possible. Doing so enables us as PMs, marketers, and designers to make stronger, more informed decisions about our products. Experimentation is a powerful way to aid our understanding of user behaviours and one should take full advantage of it.

To help you better understand the outcomes of tests you might be running, I’d like to share an anecdote about an A/B test we recently ran at Frank & Oak.


A while back, we were looking at our product page on the mobile website and decided that we wanted to do a test that had the potential to increase our activation and conversion rates - while also bringing a better user experience to our customers.

Here’s what the original product page (baseline) looked like:

As you can see, there were a lot of opportunities for improvement. So, we decided to tackle two things (red flag #1): product images and sizing buttons.

Product Images:
We knew the images were smaller than they needed to be – there was wasted space around the images and the high resolution of our product images wasn’t being used to our advantage. The hypothesis was that bigger, higher resolution images will attract more purchases.

Sizing Buttons:
We had gotten some feedback that our size buttons confused customers. Some thought red buttons, which were the selected state of a size button, stood for the product being out of stock. The hypothesis was that by changing the colour scheme, customers will be less confused and purchase more.

While we were designing the variant for this test, we realized that the extra product images beneath the Add to Cart button just didn’t fit there and concluded that they weren’t providing value (red flag #2). So we decided to remove the extra images altogether for our test; here’s what the variation of our test looked like:

Aesthetically speaking, the new variation of our product page both looked and felt better. We designed and launched the test within a couple of hours. It was the perfect test: short design and development time, and lots of potential for payback. I looked at it before we launched it and thought a couple of things:

  1. I work with a kick-ass team.
  2. This is why I come in every morning and don’t leave until I’m falling asleep.
  3. This is going to be a nice improvement and will be a great story to tell when it’s finished.

 Test Results

As you might have guessed by now, the results of the test were not what we expected (welcome to the world of A/B tests). The variation wasn’t winning, or even matching the previous conversion rates – it was doing much worse.

I can live with failure. In fact, learning from one’s failures is one of the most important takeaways from experimentation. However, the worst part of this experiment was that we didn’t have a single hypothesis around why this great improvement on UI/UX failed.*

Which brings us to what we did wrong in the first place:

 Red Flags

1. Testing multiple changes at once.
We were so passionate about improving the experience that we forgot to take it slow and test one assumption at a time. By testing unrelated things at the same time (product image size, button colours, and extra product images), we increased the number of variables we were playing with. This not only made it harder to understand why the test failed, but also prevented us from learning what matters to users and what really moves the needle.

This is not to say that multiple changes should never be made, but rather that they should all fall within the same theme and test a single hypothesis.
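As a rough sketch of what "one assumption at a time" could look like in practice, the snippet below splits a bundled variation into variants that each differ from the baseline in exactly one change. The factor names are hypothetical labels for the three changes described above, not identifiers from any real testing tool, and the full factorial helper is only there to show how quickly the number of variants grows if you also want to measure how changes interact.

```python
from itertools import product

# Hypothetical labels for the three changes bundled into the original variation.
FACTORS = ["bigger_images", "new_size_button_colours", "remove_extra_images"]


def single_change_variants(factors):
    """Return the baseline plus one variant per factor.

    Each non-baseline variant turns on exactly one change, so a difference
    in conversion can be attributed to that change alone.
    """
    baseline = {factor: False for factor in factors}
    variants = {"baseline": baseline}
    for factor in factors:
        variants[f"only_{factor}"] = {**baseline, factor: True}
    return variants


def full_factorial_variants(factors):
    """Return every on/off combination of the factors.

    This is what you would need in order to measure interactions between
    changes; the number of cells doubles with every factor added.
    """
    return [
        dict(zip(factors, combo))
        for combo in product([False, True], repeat=len(factors))
    ]


variants = single_change_variants(FACTORS)
factorial = full_factorial_variants(FACTORS)

for name, flags in variants.items():
    print(name, flags)
print(len(factorial), "cells in a full factorial design")
```

Every extra variant splits your traffic further, so each one takes longer to reach a conclusive result; that trade-off is part of why keeping tests narrow pays off.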

2. Making assumptions without any quantifiable data.
We should not have accepted that the extra images – as ugly as they seemed to us – were not providing value to the experience. Assumptions, even if they might seem obvious, should not be treated as facts. They should be treated as exactly what they are: assumptions that need to be tested.


The premise of statistical hypothesis testing is to show that a positive (or negative) change in the data is unlikely to have occurred by chance alone; that is, the change is statistically significant and can reasonably be expected to repeat. And in order to learn from these changes, the breadth of each test should be minimized.
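To make "not only based on chance" a little more concrete, here is a minimal sketch of one common way to compare two conversion rates: a two-sided, pooled two-proportion z-test. This is not necessarily the method our testing tool used, and the visitor and conversion counts below are invented purely for illustration; they are not the results of the test described in this post.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / n_a: conversions and visitors for the baseline.
    conv_b / n_b: conversions and visitors for the variation.
    Returns (z, p_value), using the pooled-proportion standard error.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 5,000 visitors per arm, 200 vs 150 conversions.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=150, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")

# Identical rates should show no evidence of a difference.
z_same, p_same = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=100, n_b=1000)
```

A small p-value only says the difference is unlikely to be noise. It says nothing about which of several bundled changes caused it, which is exactly the problem we ran into.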

Steering clear of testing too many things at once and focusing your efforts on learning from every test is paramount. Each subsequent test will build on its predecessors and help move the needle towards your goal.

If an experiment produces worse results, it still counts as a successful experiment. You should hope to have learned something actionable from it so you can increase your chances of succeeding with the next one.

Subscribe to my mailing list for more posts around mobile, experimentation and product management here.

Note: we’re now running the same test in around 5 variations – feel free to tweet me if you want to know which one is winning.

