I don't speak often at marketing conferences, because my message is not easy to take. For example, one of my talks is titled “The Accountability Paradox in Big Data Marketing.” Google and other digital marketers claim that the ad-tech world is more measurable, and thus more accountable, than the old world of TV advertising; they claim that advertisers save money by going digital. The reality is otherwise. There has been some attention to this problem recently, but far from enough.
Let me illustrate the problems by describing my recent experience running ads on Facebook for Principal Analytics Prep, the analytics bootcamp I recently launched. For a small-time advertiser like us, Facebook presents a channel to reach large numbers of people to build awareness of our new brand.
So far, the results from the ads have been satisfactory but not great. We are reasonably content with the effectiveness but wanted to run experiments to get a higher volume of “conversions.” This last week, we ran an A/B test to see if different images result in more conversions. We designed a four-way split, so in reality an A/B/C/D test. One of the test cells (call it D) is the “champion,” i.e. the image that had performed well prior to the test; the other images are new. We launched the test on a Friday.
Two days later, I checked the interim results. Only one of the test cells (A) had any responses. Surprisingly, test cell A had received about 90% of all “impressions.” Said differently, test cell A received roughly nine times as many impressions as the other three cells combined. The other test cells were getting such a measly allocation that I lost all confidence in this test.
It turns out that an automated algorithm (what is now labeled A.I.) was behind this craziness. Apparently, this is a well-known problem among people who have tried to do so-called split testing on the Facebook Ads platform. See this paragraph from the AdEspresso blog:
This often results in an uneven distribution of the budget where some experiments will receive a lot of impressions and consume most of the budget leaving others under-tested. This is due to Facebook being over aggressive determining which ad is better and driving to it most of the Adset’s budget.
One day later, I was shaken again when checking the interim report. Suddenly, test cell C got almost all the impressions, due to one conversion that showed up overnight for the C image. Clearly, anyone using this split-testing feature is just fooling themselves.
This is a great example of interesting math that looks good on paper but fails spectacularly in practice. The algorithm driving this crazy behavior is most likely something called multi-armed bandits. This method has traditionally been used to study casino behavior, but some academics have recently written many papers arguing that it is suitable for A/B testing. The testing platform in Google Analytics used to do a similar thing; it might still, but I wouldn't know because I avoid that one like the plague as well.
The problem setup is not difficult to understand: in traditional testing as developed by statisticians, you need a certain sample size to be confident that any difference observed between the A and B cells is “statistically significant.” The analyst waits for the entire sample to be collected before making a judgment on the results. No one wants to wait, especially when the interim results are pointing in one's favor. This is as true in business as in medicine. The pharmaceutical company that is running a clinical trial on a new drug it spent gazillions to develop would love to declare the new drug successful based on positive interim results. Why wait for the entire sample when the first part of the sample gives you the answer you want?
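To make “a certain sample size” concrete, here is a minimal sketch of the textbook calculation for comparing two conversion rates with a two-sided z-test. The function name, the 5% significance level, the 80% power, and the example rates are all my own illustrative assumptions; none of them come from the Facebook test described above.

```python
import math
from statistics import NormalDist


def sample_size_per_cell(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate units needed in EACH cell to detect p_base vs p_new.

    Uses the standard normal-approximation formula for a two-sided
    test of two proportions. All inputs are illustrative assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_power) ** 2 * variance / (p_base - p_new) ** 2
    return math.ceil(n)


# Hypothetical rates: a 1.0% baseline versus a hoped-for 1.2%.
print(sample_size_per_cell(0.010, 0.012))
```

With made-up rates like these, the answer runs to tens of thousands of impressions per cell, which gives a sense of how little a couple of days of low-volume traffic can tell you.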
So people come up with justifications for why one should stop a test early. They like to call this a game of “exploration versus exploitation.” They claim that the statistical way of running testing is too focused on exploration; they claim that there is “lost opportunity” because statistical testing does not “exploit” interim results.
They further claim that the multi-armed bandit algorithms solve this problem by optimally balancing exploration and exploitation (don't shoot me, I am only the messenger). In this setting, they allow the allocation of treatment in the A/B test to change continuously in response to interim results. Those cells with higher interim response rates will be allocated more future testing units while those cells with lower interim response rates will be allocated fewer testing units. The allocation of units to treatment continuously shifts throughout the test.
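For readers who want to see the mechanics, below is a rough sketch of one common bandit strategy, Thompson sampling. I do not know which algorithm Facebook actually runs, so treat this purely as an illustration of the idea; the function name, the uniform Beta(1, 1) prior, and the example counts are my assumptions.

```python
import random


def thompson_allocation(impressions, conversions, draws=10_000, seed=0):
    """Estimate each cell's share of future traffic under Thompson sampling.

    Each cell's conversion rate gets a Beta posterior (uniform prior assumed).
    A cell's share is the fraction of posterior draws in which it looks best.
    """
    rng = random.Random(seed)
    wins = [0] * len(impressions)
    for _ in range(draws):
        samples = [
            rng.betavariate(1 + c, 1 + n - c)  # one plausible rate per cell
            for n, c in zip(impressions, conversions)
        ]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical interim data: four cells, 500 impressions each,
# and a single conversion in cell A.
shares = thompson_allocation([500, 500, 500, 500], [1, 0, 0, 0])
print(shares)
```

In this toy setup, one conversion is enough to hand cell A well above the 25% share that an even split would give it.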
When this paradigm is put into practice, it keeps running into all sorts of problems. One reality is that 80 to 90 percent of all test ideas make no difference, meaning the test version B on average performs just as well as test version A. There is nothing to “exploit.” Any attempted exploitation amounts to swimming in the noise.
In practice, many tests using this automated algorithm produce absurd results. As AdEspresso pointed out, the algorithm is overly aggressive in shifting impressions to the current “winner.” For my own test, which has very low impressions, it is simply absurd for it to start changing allocation proportions after one or two days. These shifts are driven by single-digit conversions off a small base of impressions. And it then swims around in the noise. Because of such aimless and wasteful “exploitation,” it would have taken me much, much longer to collect enough samples on the other images to definitively make a call!
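To put a number on how noisy single-digit conversions are, here is a small sketch that computes a Wilson score interval for a conversion rate. The counts in the example are invented for illustration and are not the figures from my test; the function name and the 95% confidence level are likewise my own choices.

```python
import math
from statistics import NormalDist


def wilson_interval(conversions, impressions, confidence=0.95):
    """Wilson score interval for a conversion rate (returns lower, upper)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = conversions / impressions
    denom = 1 + z ** 2 / impressions
    center = (p_hat + z ** 2 / (2 * impressions)) / denom
    half = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / impressions
            + z ** 2 / (4 * impressions ** 2)
        )
        / denom
    )
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical interim counts: 2 conversions vs 0, on 400 impressions each.
print(wilson_interval(2, 400))
print(wilson_interval(0, 400))
```

With counts this small, the two intervals overlap substantially, so the data cannot yet separate the “winner” from a cell with no conversions at all.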
AdEspresso and others recommend a workaround. Instead of putting the four test images into one campaign, they recommend setting up four campaigns, each with one image, and splitting the advertising budget equally among these campaigns.
Since there is only one image in each campaign, you have effectively turned off the algorithm. When you split the budget equally, each campaign will get similar numbers of impressions.
However, this workaround is also flawed. If you can spot what the issue is, say so in the comments!