implement an MVT winner deployer #546

Open · brianboyer opened this issue Mar 9, 2015 · 12 comments

@brianboyer (Contributor)

would love to be able to make a choice about a test while a story is still hot -- not just wait for the next project.

sort of like this:

  • launch the app
  • keep an eye on the numbers for our tests
  • when we reach our set level of confidence, we choose
  • run a command like "fab production deploy test-one" that deploys a patch to the site (the settings JSON doc?) to collapse the test to one option -- see the sketch below

it seems like, on a hot piece, we could hit our confidence level pretty quickly.
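
Very roughly, something like this -- the task name, settings file path, and JSON key here are all made up, just a sketch of the shape:

```python
# fabfile.py -- sketch only; adjust the path/key to wherever the template
# actually keeps its deployed settings doc.
import json

from fabric.api import task


@task
def pick_winner(slug):
    """
    Usage: fab pick_winner:test-one production deploy

    Collapse a running MVT to one variant by patching the settings JSON
    before the usual render/deploy tasks run.
    """
    path = 'data/settings.json'  # hypothetical location of the settings doc

    with open(path) as f:
        settings = json.load(f)

    # The client-side test code would read this key and only show the winner.
    settings['mvt_winner'] = slug

    with open(path, 'w') as f:
        json.dump(settings, f, indent=4)
```

Then the regular deploy pushes the patched doc and the test collapses, without touching the rest of the app.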

@eads (Contributor) commented Apr 6, 2015

@livlab @TylerFisher -- @dannydb and I have some questions about this!

We understand the issue is that we haven't re-deployed when there's a clear winner for an MVT. Can you tell us a bit about why that is? What do you think will close the feedback loop? There are a lot of tricky details in implementing something like this, and we want to home in on the most important part of the problem.

@TylerFisher (Contributor)

The trickiest part of this is picking a winner in time, since we don't get a full readout on event stats until the next day. Basically, in a simple A/B test, we need to figure out whether the control or the hypothesis scenario was more successful within a reasonable confidence interval. For that, we need the raw number of possible conversions and the raw number of actual conversions in each scenario.
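
For the record, the math on those raw counts is just a two-proportion z-test. A rough sketch, with made-up function and argument names:

```python
from math import erf, sqrt


def confidence_of_difference(control_trials, control_conversions,
                             variant_trials, variant_conversions):
    """Two-sided two-proportion z-test, returned as a confidence level (0-1)."""
    p_control = control_conversions / float(control_trials)
    p_variant = variant_conversions / float(variant_trials)

    # Pooled conversion rate under the null hypothesis of "no difference"
    pooled = (control_conversions + variant_conversions) / \
        float(control_trials + variant_trials)
    std_err = sqrt(pooled * (1 - pooled) *
                   (1.0 / control_trials + 1.0 / variant_trials))

    if std_err == 0:
        return 0.0

    z = abs(p_control - p_variant) / std_err

    # 1 minus the two-sided p-value, via the standard normal CDF
    return erf(z / sqrt(2))
```

With counts like confidence_of_difference(5000, 400, 5000, 470) that comes out just under 0.99, which would clear a 95% bar -- but only if we can feed it full, unsampled numbers.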

We could sit and look at the live events tracker and have some way of tabulating on our end, but that seems pretty difficult to me.

As for the fab command, we would need a way of injecting code from a flag of sorts. Every test is a little different -- it shows and hides different divs, logs different analytics events, might trigger some sort of animation -- so the command has to be really general to handle all of those cases.

@eads (Contributor) commented Apr 6, 2015

Can we get the full event numbers from Google Analytics via the API in a timely fashion (e.g. on the order of hours after launch)?

My inclination is to focus on closing the information loop and leave it up to the developer(s) to decide how to "pick the winner".

There's a further issue about the fab command: memorializing the winner. If test-one is the winner and you start doing fab production deploy test-one and then you're out sick and I get called in to deploy a bug fix, how will I know to use test-one?

@TylerFisher (Contributor)

Hm. The trick is getting full numbers, not a sample, from the API. I think there might be a way, but I don't know if they'll get you full numbers in time.
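
At minimum, the v3 Core Reporting API will tell you whether a query came back sampled. Very rough sketch -- the token, profile ID and event names are placeholders, and auth is hand-waved entirely:

```python
import requests

ACCESS_TOKEN = 'ya29.placeholder'  # OAuth access token, however we end up getting one
PROFILE_ID = 'ga:12345678'         # the GA view (profile) ID

response = requests.get(
    'https://www.googleapis.com/analytics/v3/data/ga',
    params={
        'ids': PROFILE_ID,
        'start-date': 'today',
        'end-date': 'today',
        'metrics': 'ga:totalEvents',
        'dimensions': 'ga:eventLabel',
        # Made-up event category for one of our tests
        'filters': 'ga:eventCategory==mvt-test-one',
    },
    headers={'Authorization': 'Bearer %s' % ACCESS_TOKEN},
)

report = response.json()

# containsSampledData tells us whether we got a sample or the real thing.
print(report.get('containsSampledData'))
print(report.get('rows'))
```

If containsSampledData comes back true for the window we care about, the rest of this doesn't help us.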

I think maybe an argument in the fab command isn't the right pattern. It's probably a flag in app_config that you set once and then gets committed.
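
Something like this (flag name made up) would also cover the question above about memorializing the winner, since the choice gets committed with the project:

```python
# app_config.py -- hypothetical flag.
# None means the test is still running and all variants get served;
# set it to a variant slug and commit to collapse the test.
# The templates/JS would read it out of the rendered config and skip
# the losing branches.
MVT_WINNER = None  # e.g. 'test-one'
```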

@livlab commented Apr 6, 2015

Based on our discussion with the Insights team, we could move to our own GA instance; then, given our volume of traffic, our numbers would be reported fully, not sampled. We lose a lot of other things with that, obviously (the relationship with all other NPR things). I don't know if there is any other way to get the full dataset via the API; all answers so far point to no. (We definitely should not make test decisions based on a sample -- I think everyone is already in agreement on that, but writing it down for the record!)

If we can figure out the above issue, then we also need to build in the math that we usually do to calculate the confidence level. We do this manually today, so if it's done programmatically it would need to happen (maybe on carebot?) as a monitoring option: only when confidence goes above 95% should a decision be made about promoting a winner to production (regardless of whether that's automatic or requires a human decision).
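
The shape of that monitoring piece could be as simple as the sketch below -- the count-fetching, confidence math and notification hooks are all placeholders for whatever actually gets built:

```python
import time


def monitor_test(fetch_counts, confidence_of_difference, notify,
                 threshold=0.95, interval_seconds=3600):
    """Poll the numbers and nag a human once confidence clears the threshold.

    All three callables are placeholders:
      fetch_counts() -> (control_trials, control_conversions,
                         variant_trials, variant_conversions)
      confidence_of_difference(...) -> a confidence level between 0 and 1
      notify(message) -> however carebot (or a person) gets told
    """
    while True:
        c_trials, c_conv, v_trials, v_conv = fetch_counts()
        confidence = confidence_of_difference(c_trials, c_conv, v_trials, v_conv)

        if confidence >= threshold:
            notify('Confidence is %.1f%% -- time to pick a winner.' % (confidence * 100))
            return

        time.sleep(interval_seconds)
```

Note it only notifies; whether the promotion itself is automatic or a human call stays a separate decision.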

After that, there's also everything else Tyler said about how the test is coded.

@brianboyer (Contributor, Author)

Can we track in two GA instances? One that's ours, for better data, and one that's for all of NPR, so we can know how our stuff relates to other stuff?

@livlab commented Apr 6, 2015

You can.

Note: if you are using Universal Analytics you can have multiple trackers (analytics.js) per page, but if you are still on Classic (ga.js) you can only have one per page.

If I recall correctly from our conversation with Dan, they were still on Classic Analytics with plans to move to Universal. Do you remember this Tyler?

So, just something to check if you wanted to implement redundant tracking.

@onyxfish (Contributor) commented Apr 6, 2015

That's correct, they have been planning to move to universal, but it hasn't happened yet, so we're stuck with a single tracking code for the moment. 👎

@livlab commented Apr 6, 2015

Well, I'm not sure how GA accounts would be procured here, but what if "our" new instance was started on Universal? A single Classic ga.js tracker can co-exist with a new Universal analytics.js tracker. Possible, you think?

@onyxfish (Contributor) commented Apr 6, 2015

That could work, although I'm a little leery of the overhead of double-implementing every event...

@onyxfish (Contributor) commented Apr 6, 2015

On the plus side, we could wire this into analytics.js, which would mean less programming work. It's just the network overhead we'd have to worry about, especially on single-pipe mobile devices like the iPhone.

@eads (Contributor) commented Apr 6, 2015

It's definitely worth testing the overhead to see if it's an issue. The messages are small and GA is asynchronous, so it doesn't seem like it will have a noticeable effect, but I could still see it messing with latency on single-pipe devices.
