AB Testing completely broken
Nicky Toma
Hey Christoph!
I will try to answer all your questions in one post:
- "Retention on the same day shows in AB at 59% and 50% while in the dash at 48%. How is that possible?"
A: I'm not 100% sure. Are you looking at the explore tool in the first screenshot, the one with retention? A/B testing only accepts new users, and those are split into groups. So if you're looking at retention in the explore tool, it won't correspond at all with the A/B test.
- "And another problem, on Android it's not even picking up all the users. Like only 1/5 or so are being entered the A/B test. Why?"
A: It could be that only about 1/5 of your Android users are new users, so only those are added.
- "Looked at playtime per user per day and yes, off as well, for the same day, it shows completely different data."
A: I checked and it was indeed the explore tool. A/B testing accepts only new users, so comparing playtime like that will not yield the same numbers: the explore tool figure is aggregated over new and returning users, including people who have not participated in the test.
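To illustrate that aggregation difference, here is a minimal sketch with made-up numbers (purely hypothetical values, not Christoph's actual data) showing why the two tools can report different playtime averages:

```python
# Hypothetical numbers, only to illustrate the aggregation difference between
# the A/B test population (new users only) and the explore tool (all users).
new_user_playtime = [4, 5, 6, 5]        # minutes/user/day, users eligible for the A/B test
returning_playtime = [15, 20, 18, 25]   # minutes/user/day, users the A/B test never sees

ab_test_avg = sum(new_user_playtime) / len(new_user_playtime)
explore_avg = (sum(new_user_playtime) + sum(returning_playtime)) / (
    len(new_user_playtime) + len(returning_playtime)
)

print(f"A/B test population:      {ab_test_avg:.1f} min/user/day")  # 5.0
print(f"Explore tool (all users): {explore_avg:.1f} min/user/day")  # 12.2
```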
I strongly believe this is why you see the differences that you've mentioned, but I have notified our backend team and they'll look over the test and see if something looks weird.
Other than that, it seems you are not having a good time with the new version, Christoph. If you want and have the time, we would like to have a conversation with you about what changes you'd like to see implemented, as we do plan our roadmaps based on what users say. Let me know if you'd like to talk!
Christoph Müller
Nicky Toma: We already had a talk, and none of my suggestions have been implemented in the last few months, not even the simplest of them.
Regarding the AB tool:
1. Retention:
Yes, I was looking at the explore tool, and I was looking at new users for that particular day, which definitely should have matched. There is no difference between new users and new users when I look at D0. Or is there? If so, can you please explain the difference between new users in the explore tool and new users in the A/B tool?
2. Android:
Again, I almost feel offended by your reply. How can you even say something like that? Obviously we are talking only about new users, and I'm comparing those with new users from the explore tool, which also shows only new users when looking at the cohort. And this is where it's completely off: I have 500 new users in the cohort yet only 120 in the A/B tool for the same day. Can you please explain why this is happening?
Taking the above into consideration, why do you say you strongly believe this is the reason for the differences? Wow, that is your reply when I am clearly showing that the tool is completely broken and displaying completely different data between the two tools, instead of fixing it ASAP?
You are right, I'm far from having a good time with the new version. Such bad service, to be honest, and such a shame, as the old one was pretty good. But for me this is probably the final reason to switch to the competition, ByteBrew. I simply can't believe it.
Nicky Toma
Christoph Müller: Sorry if my previous reply made it seem like I'm not investigating or that I'm brushing off your concerns; that was really not my intention. I was just trying to think through what could be happening. Sorry that I presumed you weren't looking at new users. I get a lot of tickets, and sometimes users aren't; it was nothing more than that. I have been investigating further since I left that message, and again today, and you're right: on the Android version only about 1/5 of users are getting added to the test.
I will ask the backend team to investigate why only this number of users gets put into the tests. I can see the iOS version has substantially more users in the test, with about the same DAU.
It could be that the other metrics in the A/B test are affected by this to some degree.
Again, sorry, it's not my intention to sound annoying or dismissive, and I can see how my previous reply looked that way.
Nicky Toma
Hey Christoph. I have more information about this. Our backend and SDK teams noticed you are using SDK 7.4.1 for the game, which we have identified as having some issues with session_num on Android that could lead to this. This has been fixed in subsequent versions. If you have the time and willingness to update, that would be ideal.
So to summarize, the best option would be to update the SDK and run the test again, but I can see how that might be annoying and time-consuming. Sorry for the inconvenience. If you do upgrade the SDK and keep the same build, let me know. I can ask our backend team to keep an eye on the data and see if the issue still occurs.
One last thing: you can still use the legacy version of GA if you prefer it, by selecting the following button.
Christoph Müller
Nicky Toma: Yeah, Android has been wrong all over. And I'd definitely have to repeat the test. I'll need to see if I still have the resources, though.
The legacy tool would be nice if the choice were saved, so we automatically get back into legacy when we log back in. It's annoying to have to switch every time.
If your team can look at the iOS A/B test: I have now let it run for 30 days, and in my test variant B is suggested to be the winner with a 60%+ chance. I can't understand why, though, since retention and ad revenue per user per day (which was the main goal) have worse results than variant A... is this a broken feature too, or am I just too stupid to read the results/data?
Nicky Toma
Christoph Müller: Hey! You're definitely reading the data right. Variant 1 does seem to have slightly worse revenue, at the moment.
What I've noticed, though, is that the number of users exposed when the A/B test results were calculated is significantly lower than what's in the table now. This is because the A/B test ends automatically when it has enough data (perhaps we should allow users to set a timeframe they want the test to run for, without it automatically ending once it has enough data; I have asked our teams when exactly it considers that it has enough data). For the model there were around 2,000 users recorded, and in the table right now there are around 6,000. Because you let it run for more time, it turns out that for ad revenue per user the variant that was considered the winner is now performing slightly worse.
It's very good that you let it run for more time after the model finished picking the winning variant. Something to mention: in this case retention doesn't count towards the model picking a winning variant, as it only tests for ad revenue per user here.
Hope this makes sense!
EDIT: The winning variant is picked 7 days after the start of the test if the number of users in each variant is >500.
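As an illustrative sketch, here is roughly what that stopping rule amounts to; only the 7-day / >500-users-per-variant threshold comes from this thread, and the function name and structure are hypothetical, not GA's actual backend logic:

```python
from datetime import datetime, timedelta

# Only the 7-day / >500-users-per-variant threshold comes from the thread above;
# everything else is a hypothetical sketch, not GameAnalytics' actual backend code.
MIN_DAYS = 7
MIN_USERS_PER_VARIANT = 500

def should_pick_winner(start_date: datetime, users_per_variant: dict) -> bool:
    """True once the test has run at least 7 days and every variant has >500 users."""
    ran_long_enough = datetime.now() - start_date >= timedelta(days=MIN_DAYS)
    enough_users = all(n > MIN_USERS_PER_VARIANT for n in users_per_variant.values())
    return ran_long_enough and enough_users

# Example: started 10 days ago, both variants above the threshold -> True
print(should_pick_winner(datetime.now() - timedelta(days=10), {"A": 620, "B": 580}))
```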
Christoph Müller
Nicky Toma: I do think that when you let the A/B test keep running, the results should still be updated with the new data. Otherwise it wouldn't make sense to even be able to do that.
Question: Is ARPAU only considering IAPs? Or does it include advertising too as it should?
Nicky Toma
Christoph Müller: Right now only IAPs, but this has already sparked a conversation about changing it to reflect revenue from ads as well.
Thanks for the feedback!
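To make the naming point concrete, here is a tiny sketch with hypothetical numbers showing the difference between the metric as computed today (IAP only) and the same metric with ad revenue included:

```python
# Hypothetical daily numbers, only to show why the label matters.
iap_revenue = 120.0      # $ from in-app purchases
ad_revenue = 480.0       # $ from advertising
active_users = 6000

arpau_iap_only = iap_revenue / active_users                 # what the dashboard reports today
arpau_with_ads = (iap_revenue + ad_revenue) / active_users  # what the name "ARPAU" suggests

print(f"IAP only:  ${arpau_iap_only:.3f} per active user")  # $0.020
print(f"IAP + ads: ${arpau_with_ads:.3f} per active user")  # $0.100
```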
Christoph Müller
Nicky Toma: If ARPAU only includes IAP, please at least label it ARPPAU. Otherwise it's just misleading. But honestly, this metric doesn't make any sense if you don't include advertising, so I'm not sure what the conversation is about. It's literally useless if advertising is not included.
And I insist: please make A/B testing consider the entire time span, otherwise there is no point in keeping a test running beyond 7 days/500 users. Which, frankly, makes the A/B test useless as well. Google Firebase runs for 14 days.
In short: GA is useless compared to other analytics tools, and I will be moving on to ByteBrew and Firebase entirely. In the 30+ days I have been running this A/B test, you didn't even consider fixing or changing something based on my experience. I don't see your commitment to your user base, like at all. I guess it's a resources problem, but as a client I honestly don't care. I need a tool that works, not one that is half-baked and can't take action when action is needed.
Thanks but no thanks.
Christoph Müller
Nicky Toma: Also funny that you hide my thread entirely when searching for AB testing... but I guess that says it all.
Christoph Müller
And another problem, on Android it's not even picking up all the users. Like only 1/5 or so are being entered into the A/B test. Why?
Christoph Müller
So this feature is completely broken. I doubt that ad revenue metrics are correct either... what a fucking mess.
Now I need to redo the entire test with Firebase.
Christoph Müller
If retention is off like this, I'm sure other metrics are as well. I tried to find playtime per session but can't find it anywhere in the new dash...
Looked at playtime per user per day and yes, it's off as well; for the same day it shows completely different data. Screenshots below: