A/B testing rests on deep statistical foundations, and today we discuss a common pitfall: Simpson's paradox.
Simpson's paradox, described by the British statistician E. H. Simpson in 1951, is the phenomenon in which two groups of data each satisfy some property when examined separately, yet lead to the opposite conclusion once they are combined.
Take a simple example of Simpson's paradox. A university has two colleges, a business school and a law school. Women in both colleges complained that "the acceptance rate for men is higher than for women" and alleged gender discrimination. Yet when the university computed its overall admission statistics, it found that the overall female admission rate was far higher than the male admission rate!
In the business school, the male admission rate of 75% was higher than the female rate of 49%; in the law school, the male rate of 10% was higher than the female rate of 5%. In total, however, the male admission rate was only 21%, half the female rate of 42%.
Why is the male admission rate higher in each college, yet lower than the female rate once the colleges are combined? Mainly because the gender ratios of applicants to the two colleges differ greatly; we analyze the underlying statistics in detail below.
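The reversal comes from weighting. A minimal sketch, using hypothetical applicant counts (assumed for illustration, not given in the article) that reproduce the quoted rates: most men applied to the low-acceptance law school, while most women applied to the high-acceptance business school.

```python
# Hypothetical (admitted, applied) counts that reproduce the rates in the text.
# These counts are illustrative assumptions, not data from the article.
admissions = {
    "men":   {"business": (60, 80),  "law": (40, 400)},
    "women": {"business": (49, 100), "law": (1, 20)},
}

for gender, colleges in admissions.items():
    for college, (admitted, applied) in colleges.items():
        print(f"{gender:5s} {college:8s}: {admitted / applied:.0%}")
    total_admitted = sum(a for a, _ in colleges.values())
    total_applied = sum(n for _, n in colleges.values())
    print(f"{gender:5s} overall : {total_admitted / total_applied:.0%}")
```

Men win each per-college comparison (75% vs 49%, 10% vs 5%) yet lose overall (21% vs 42%), because the overall rate is a weighted average dominated by whichever college each group mostly applied to.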
This counter-intuitive phenomenon is often overlooked in daily life; after all, it is just a statistical curiosity that rarely affects our actions. But business decision-makers running scientific A/B tests who do not understand Simpson's paradox may design experiments incorrectly and interpret their results blindly, harming their decisions.
Let us use a real medical A/B test to illustrate the problem: a comparison of two surgical treatments for kidney stones.
For large stones and for small stones taken separately, treatment A appears better than treatment B. Yet in aggregate, treatment B appears better than treatment A.
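The results table referenced here did not survive, so the figures below are an assumption: they are the widely published numbers from the 1986 kidney-stone study (Charig et al.) that this example is usually drawn from, shown as (successes, patients) per stratum.

```python
# (successes, patients) per stone size; figures are the commonly cited
# ones from Charig et al. (1986), assumed here since the original table
# is missing from the article.
results = {
    "treatment A": {"small stones": (81, 87),   "large stones": (192, 263)},
    "treatment B": {"small stones": (234, 270), "large stones": (55, 80)},
}

for treatment, strata in results.items():
    for stratum, (ok, n) in strata.items():
        print(f"{treatment} / {stratum}: {ok / n:.0%}")
    ok_total = sum(ok for ok, _ in strata.values())
    n_total = sum(n for _, n in strata.values())
    print(f"{treatment} / overall     : {ok_total / n_total:.0%}")
```

Treatment A wins within each stratum (93% vs 87% on small stones, 73% vs 69% on large), yet loses overall (78% vs 83%), because A was given far more of the hard large-stone cases.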
The conclusions of this A/B test are deeply problematic: neither the segmented results nor the aggregate results let us truly judge which treatment is better.
So where is the problem? The two experimental groups of patient records were poorly chosen and not representative. The doctors running the experiment made the two groups dissimilar: apparently feeling that more severe cases were better suited to treatment A and milder cases to treatment B, they subconsciously biased the supposedly random allocation, so group A ended up with more large-stone cases and group B with more small-stone cases.
The more important point is that the factor most likely to affect a patient's recovery rate is not the choice of treatment but the severity of the illness! In other words, treatment A looks worse than treatment B mainly because patients in group A had more severe cases, not because they received treatment A.
Therefore, this failed A/B test's problem lies in unscientific traffic segmentation: the split ignored an important hidden factor, the severity of the disease. In a correct implementation, the proportion of severely ill patients should be the same in both groups.
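One standard way to guarantee equal proportions of a known hidden factor is stratified randomization: shuffle and split within each severity stratum, rather than across the whole population. A minimal sketch (function and field names are illustrative assumptions):

```python
import random

def stratified_split(patients, stratum_key, seed=0):
    """Randomly split patients 50/50 *within* each stratum, so both
    groups end up with the same proportion of each severity level.
    Illustrative sketch; names are assumptions, not from the article."""
    rng = random.Random(seed)
    strata = {}
    for p in patients:
        strata.setdefault(p[stratum_key], []).append(p)
    group_a, group_b = [], []
    for members in strata.values():
        rng.shuffle(members)          # randomize order within the stratum
        half = len(members) // 2
        group_a.extend(members[:half])
        group_b.extend(members[half:])
    return group_a, group_b

# 100 synthetic patients, roughly a third with large stones
patients = [{"id": i, "severity": "large" if i % 3 == 0 else "small"}
            for i in range(100)]
a, b = stratified_split(patients, "severity")
print(sum(p["severity"] == "large" for p in a),
      sum(p["severity"] == "large" for p in b))   # equal large-stone counts
```

With the confounder balanced by construction, any difference in recovery rate between the groups can be attributed to the treatment rather than to case severity.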
Because many people overlook Simpson's paradox, some exploit it to game comparisons. For example, suppose two players are compared by total win rate over 100 games. The trickster challenges a master for 20 games and wins 1, then challenges a weak player for the other 80 games and wins 40, for a win rate of 41%. The honest player challenges the master for 80 games and wins 8, then wins all 20 remaining games against the weak player, for a win rate of only 28%, far below 41%. But look closely at the opponents, and the latter player is clearly the stronger one.
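The arithmetic behind the trick is the same weighted-average effect as before, and can be checked directly:

```python
# (wins, games) against the master, then against the weak player,
# as quoted in the text.
trickster = [(1, 20), (40, 80)]
honest    = [(8, 80), (20, 20)]

def overall_win_rate(record):
    return sum(w for w, _ in record) / sum(g for _, g in record)

print(f"trickster: {overall_win_rate(trickster):.0%}")  # 41%
print(f"honest:    {overall_win_rate(honest):.0%}")     # 28%
```

The trickster's 41% is inflated simply by playing most games against the weak opponent; the headline rate says nothing about strength without looking at the opponent mix.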
Connecting these examples of Simpson's paradox to everyday Internet product operations, a very common misjudgment looks like this: take 1% of users for a test, find that the trial version's purchase rate is higher than the control's, declare the trial version better, and release it. In fact, the pilot group may simply have happened to include an unusual share of users who love to buy. Releasing the trial version broadly may instead degrade the user experience and even reduce retention and revenue.
So how can we avoid the pitfalls that Simpson's paradox creates in the design, implementation, and analysis of A/B tests?
The most important point: to obtain scientific, credible A/B test results, traffic must be split correctly, ensuring that the experimental and control groups have consistent user characteristics and are representative of the overall user population. Solving this problem has been a core focus of AppAdhoc's A/B testing cloud service.
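One simple sanity check on a split is to compare the share of a key user segment (for example, heavy buyers) between the two groups with a two-proportion z-test. A minimal sketch; the counts below are invented for illustration:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for the difference between two proportions, used here
    to check whether a user segment is equally represented in both groups.
    Illustrative sketch; the figures below are assumptions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# e.g. 300 heavy buyers out of 5000 users in group A,
#      240 heavy buyers out of 5000 users in group B
z = two_proportion_z(300, 5000, 240, 5000)
print(f"z = {z:.2f}")   # |z| > 1.96 suggests the split is unbalanced at the 5% level
```

Running such balance checks on every characteristic that might influence the metric (buying propensity, platform, region, illness severity in the medical case) is what keeps a lurking confounder from producing a Simpson's-paradox reversal.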
This article is by Wang Ye, founder and CEO of Yell Technology.
Reprinted with authorization from the Yell Technology blog.