Calculating Statistical Significance

When you conduct a test, you need to be sure that you’ve tested a high enough number of users (a large enough sample size) to be confident you can expect the same results across all users. You want to understand, and minimize, the likelihood that your test result was a fluke.

[Image: stat-sig.png]

Here’s what you do need to know: the greater your sample size, the quicker you’ll achieve a statistically significant result. The smaller your sample size, the longer it will take.

This is important because, in a business context, you may want to see results fast. You might well be asked to test copy on a specific page with low traffic (and therefore a small sample size). You’ll have to warn your manager or colleagues that it might take weeks to see a statistically significant result.

At that point you have a dilemma: do you stop your test early and take the results as they are, even though the sample size is not large enough to be statistically significant?

We’ll talk a little bit about this in the upcoming lessons on interpreting and presenting your results, but for now you just need to understand that the greater your test audience, the faster you’ll achieve significance.

This is why it’s impossible to say what the “right” timeline for a test should be. It could be as quick as a few days depending on your traffic. On the other hand, it could take weeks.

In general, you should take into account activities that are happening around your testing period. For instance, is it a particularly busy time for your company? That could influence results. So could paid search campaigns that drive more traffic than usual—a different type of traffic that could influence the results.

We’ll talk about this soon when we actually craft a test together.

If you’d like to try calculating it yourself, there are some handy calculators online.
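If you’re curious what those calculators compute under the hood, here’s a minimal sketch in Python of a two-proportion z-test, the standard test behind most A/B significance calculators. The function name and inputs are our own, for illustration only:

```python
from math import sqrt, erf

def z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns the z-score and two-sided p-value.

    conv_a / conv_b: number of conversions in each group
    n_a / n_b: number of visitors in each group
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no real difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

A common convention is to call a result significant when the p-value falls below 0.05. Note that with 5,000 visitors per group, even a 10% relative lift (say, 500 vs. 550 conversions) produces a p-value above 0.05 — which is exactly why larger samples, gathered over more time, matter.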

Let’s take two examples. The first is the Handshake app.

Example 1:

Let’s say the Handshake app has 10,000 active users per day. That’s a lot! If we wanted to test a change to the copy encouraging users to create an invoice, we would split the audience so that 5,000 active users see the Control Group, or the “A” design, and 5,000 users see the Variation, or the “B” design.

It usually takes a consistent stream of users to achieve statistical significance, and there is no magic number. But given this steady stream of users, we would want to run our test for at least a few weeks to make sure enough users had seen the change.

But volume isn’t enough! The other part of statistical significance is analyzing the change in behavior. If, for instance, we saw a 10% increase in invoice creation after a few weeks of testing, we could be fairly confident that the change came from our test, given both the dramatic shift in behavior and the volume of users who had seen it.
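To make this concrete, here’s a rough back-of-the-envelope check for this scenario. The baseline invoice-creation rate (10%) and the three-week test length are our own assumptions for illustration; only the 5,000-users-per-group split comes from the example:

```python
from math import sqrt

# Assumed for illustration: 10% baseline invoice-creation rate,
# a 10% relative lift (10% -> 11%), three weeks at 5,000 users/day per group
n_per_group = 5000 * 21           # 105,000 users in each group
p_control = 0.10                  # assumed baseline conversion rate
p_variant = 0.11                  # 10% relative increase

conv_control = round(p_control * n_per_group)
conv_variant = round(p_variant * n_per_group)

# Two-proportion z-score: how many standard errors apart the two rates sit
p_pool = (conv_control + conv_variant) / (2 * n_per_group)
se = sqrt(p_pool * (1 - p_pool) * (2 / n_per_group))
z = (p_variant - p_control) / se
```

Under these assumptions the z-score comes out far above the 1.96 threshold for 95% confidence, which is why, at Handshake’s volume, a 10% lift after a few weeks would be convincing.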

Example 2:

On the other hand, let’s say a company wants us to run an A/B test on its website, in one of its low-traffic areas that receives only 300 visits a week.

After running the test for 12 weeks (longer, because the volume of users is so low), we saw only a 2% uplift in conversion for this particular area of the website.

That’s a low uplift, and we wouldn’t necessarily be confident that our change was the only element that influenced the higher conversion.

One of the ways we can check statistical significance is to look at variation in the previous corresponding period in which no tests were running. What fluctuations do you see in the conversion rates there?

If you regularly record a 2% variation in conversion rates, a test that produces another 2% change may suggest your test was not the deciding factor in conversion.
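As a rough illustration of why a page like this is so hard to test, here’s a standard sample-size calculation for two proportions. The 5% baseline conversion rate is a hypothetical assumption; the 300 visits/week (150 per arm) comes from the example:

```python
from math import sqrt, ceil

# Assumed for illustration: 5% baseline conversion, a 2% relative uplift
# (5.0% -> 5.1%), 95% confidence, and 80% power
p1, p2 = 0.050, 0.051
z_alpha = 1.96        # 95% confidence (two-sided)
z_beta = 0.8416       # 80% power

p_bar = (p1 + p2) / 2
# Standard sample-size formula for comparing two proportions
n_per_arm = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
              + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
             / (p2 - p1) ** 2)

weeks = ceil(n_per_arm / 150)   # 150 visitors/week in each arm
```

Under these assumptions, reliably detecting such a tiny uplift would require hundreds of thousands of visitors per arm — many years of traffic at 300 visits a week. That’s why, after only 12 weeks, a 2% uplift on this page tells us very little.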

Controlling for circumstances

As a UX writer, it won’t necessarily be your job to protect statistical significance by controlling for elements that could influence your test.

On the other hand, you should at least discuss them with your optimization manager and consider their influence. For instance, promotions that bring new demographics to the website during your test could influence the result.

Here’s an optional video from Moz.com to help solidify your understanding of building a hypothesis and testing with the right level of statistical significance.

-UX Writer’s Collective, Content Research & Testing Course
