Author: Weimin, Jason Wang
Summary
Online controlled A/B testing is a common practice for companies like Microsoft, Amazon, Google and Yahoo! to evaluate the effectiveness of feature improvements. This strategy is also widely used in eBay search science, merchandising, shipping and other domains to infer the causal relationship between algorithm changes and financial gain. As the name implies, users are split into two equal-size groups: one group is assigned to version A, usually the existing algorithm (called the control group), and the other is exposed to version B, the new algorithm (called the treatment group), while all other variables are kept identical. A feature is launched if the new algorithm significantly increases the mean of Gross Merchandise Bought (GMB).
However, at eBay there is a small number of users who shop for high-end products, and a very large number of users who purchase low-priced products or make no purchase at all. This long-tailed distribution of GMB increases the magnitude of noise when detecting the treatment effect of an A/B test. To improve test sensitivity, variance reduction techniques such as the capped mean have previously been applied to the estimation of mean GMB. This paper introduces another variance reduction technique, called post-stratification, to further improve test sensitivity.
Post-stratification is inspired by stratification in sampling theory. Stratified sampling outperforms simple random sampling when units from the same stratum are similar to each other with regard to the measurement of interest. Because users arrive over time in live-traffic experimentation, it is impossible to sample users from pre-formed strata, but the sensitivity of hypothesis testing still benefits from stratification after data collection. This is called post-stratification adjustment. To adjust the mean of GMB during the experimentation period, users' GMB from the pre-experimentation period is collected and used to bucket users. The underlying assumption is that users' purchase behavior is predictable given their historical behavior; for example, a frequent high-end purchaser before the experiment period is also likely to be a heavy buyer during the experiment period. In the implementation, to further improve the magnitude of variance reduction, GMB lift is decomposed into a combination of GMB-per-purchaser lift and the lift in the percentage of purchasers among all users, so that purchasers can be modeled separately. The reason is straightforward: it is difficult to track non-purchasers if they do not sign in.
The post-stratified GMB metrics were rolled out on EP, the central experimentation platform, in November 2014. Since then, more experiments have gone from insignificant to significant and more new algorithms have become launchable. In sum, the post-stratification adjusted metrics are a valuable improvement: they save experimentation resources, speed up the pace of testing and support launching more profitable new features on eBay.
Post-stratification variance reduction technique
In this section, we show theoretically that post-stratification adjustment reduces the variance. Let $Y$ denote the target metric, GMB in our case; $n$ is the sample size; $X$ is the auxiliary variable, which is known; $T$ and $C$ denote the treatment group and the control group respectively; and $\Delta$ represents the treatment effect. Before the rollout of the post-stratified metric, a t-test based on sample means was used to test the significance of the treatment effect. We call this estimator the Sample Average Treatment Effect (SATE): $\hat{\Delta}_{\mathrm{SATE}} = \bar{Y}_T - \bar{Y}_C$.
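As a minimal sketch of the unadjusted test described above, the following computes the difference in sample means and a t statistic. The function name, the toy GMB values, and the choice of the unequal-variance (Welch) standard error are assumptions made for illustration; the article does not say which t-test variant was used.

```python
import math
from statistics import mean, variance

def sate_t_test(y_treatment, y_control):
    """Difference in sample means (SATE), its standard error, and the t statistic."""
    n_t, n_c = len(y_treatment), len(y_control)
    sate = mean(y_treatment) - mean(y_control)
    # Standard error of the difference of two independent sample means
    # (unequal-variance form; an assumption, not taken from the article).
    se = math.sqrt(variance(y_treatment) / n_t + variance(y_control) / n_c)
    return sate, se, sate / se

# Toy, made-up per-user GMB values for illustration only.
treatment = [0.0, 0.0, 12.0, 30.0, 0.0, 8.0]
control = [0.0, 0.0, 10.0, 25.0, 0.0, 5.0]
sate, se, t_stat = sate_t_test(treatment, control)
```

With long-tailed GMB the standard error in the denominator is large, which is why the adjusted estimators in the rest of this section aim to shrink it.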
· Post Stratification Adjustment
The variance of the sample mean splits into within-strata variation and between-strata variation, and stratification removes the between-strata component. The more homogeneous the target is within groups and the more heterogeneous it is between groups, the greater the variance reduction that stratification can achieve.
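One standard way to write this decomposition is sketched below. The notation is an assumption introduced here, not recovered from the original: $K$ strata, $w_k$ the share of users in stratum $k$, $\mu_k$ and $\sigma_k^2$ the mean and variance of $Y$ within stratum $k$, $\mu$ the overall mean, and $\bar{Y}_k$ the sample mean within stratum $k$.

```latex
\begin{aligned}
\mathrm{Var}(\bar{Y})
  &= \frac{1}{n}\sum_{k=1}^{K} w_k \sigma_k^2
   + \frac{1}{n}\sum_{k=1}^{K} w_k (\mu_k - \mu)^2, \\
\hat{Y}_{\mathrm{strat}}
  &= \sum_{k=1}^{K} w_k \bar{Y}_k, \qquad
\mathrm{Var}(\hat{Y}_{\mathrm{strat}})
   = \frac{1}{n}\sum_{k=1}^{K} w_k \sigma_k^2 + O\!\left(\frac{1}{n^2}\right).
\end{aligned}
```

The first term is the within-strata part and the second is the between-strata part. The post-stratified estimator keeps only the first, up to a remainder that is negligible for large $n$.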
· Regression Adjustment
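With regression adjustment, the variance of the adjusted mean is $\mathrm{Var}(\bar{Y})(1 - \rho^2)$, where $\rho$ is the correlation between $X$ and $Y$ within each treatment/control group. The higher the correlation between $X$ and $Y$, the greater the variance reduction that can be achieved. One point is worth noting: although the unadjusted sample mean is unbiased, the regression-adjusted estimator is biased, with a bias of order $1/n$. When the sample size is large, the adjusted estimator is empirically unbiased. In Deng's paper [1], $\theta$ is estimated by pooling the treatment and control groups together; it is also the coefficient of $X$ when fitting a regression of $Y$ on $X$ using all units in both groups, that is, $\theta = \mathrm{Cov}(Y, X) / \mathrm{Var}(X)$.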
In Lin's simulation study [3], the results from using a pooled θ and from using a different θ for treatment and control do not differ significantly, but when treatment and control have unbalanced sample sizes, using a different θ for each group is more accurate.
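The pooled and per-group choices of θ can be compared in a short sketch. The function names and toy data are hypothetical, and the adjustment form (subtracting θ times the deviation of each group's covariate mean from the overall covariate mean) is one common way to write regression adjustment, assumed here rather than taken from the article.

```python
from statistics import mean

def ols_slope(x, y):
    """Slope of the least-squares fit of y on x, i.e. cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def adjusted_effect(x_t, y_t, x_c, y_c, pooled=True):
    """Regression-adjusted difference in means using pre-experiment covariate x."""
    if pooled:
        # One theta fitted on all units from both groups.
        theta_t = theta_c = ols_slope(x_t + x_c, y_t + y_c)
    else:
        # A separate theta for each group.
        theta_t, theta_c = ols_slope(x_t, y_t), ols_slope(x_c, y_c)
    x_bar = mean(x_t + x_c)  # overall covariate mean, used as the estimate of E[X]
    adj_t = mean(y_t) - theta_t * (mean(x_t) - x_bar)
    adj_c = mean(y_c) - theta_c * (mean(x_c) - x_bar)
    return adj_t - adj_c

# Toy data: y = 2x + 1 in treatment and y = 2x in control, so the true effect is 1,
# but the control group happens to have larger pre-experiment values.
x_t, y_t = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
x_c, y_c = [2.0, 3.0, 4.0, 5.0], [4.0, 6.0, 8.0, 10.0]
effect_separate = adjusted_effect(x_t, y_t, x_c, y_c, pooled=False)
effect_pooled = adjusted_effect(x_t, y_t, x_c, y_c, pooled=True)
raw_difference = mean(y_t) - mean(y_c)
```

In this toy case the raw difference has the wrong sign because of the covariate imbalance, and both adjusted estimates move it back toward the true effect.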
Post-stratification adjusted GMB
Covariate Selection
To apply post-stratification effectively, the choice of variables used to form the strata is critical. According to both Deng's paper [1] and Miratrix's paper [2], the higher the correlation between the covariates and the variable of interest, the greater the variance reduction. The covariates should also be independent of the treatment to avoid bias. For example, users' in-experiment purchases are highly correlated with in-experiment clicks, but in-experiment clicks cannot be used to group users, because in-experiment clicks are themselves affected by the treatment. In practice, there are various covariates that are correlated with GMB but independent of treatment allocation, such as geographic and demographic information, user segmentation, and preferences derived from historical purchases. Based on our experiments, users' in-experiment (in-expt) purchases are most strongly correlated with their pre-experiment (pre-expt) purchases, which is in line with Deng's conclusion from Microsoft's A/B testing trials. Of course, strata can be formed from more than one covariate. Multiple covariates, or a combination of covariates, work better than a single covariate.
2-lift model
Based on an empirical study by David Goldberg [4], GMB lift can be decomposed into a sum of two terms: participation rate lift (the lift in the fraction of GUIDs who make a purchase) and GMB-per-purchaser lift. Effective stratification requires modeling, and we can do this more effectively by developing a separate model for each term, thus improving variance reduction.
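The arithmetic behind the decomposition can be illustrated with a small sketch. It relies only on the identity that mean GMB per user equals the participation rate times mean GMB per purchaser, so the two relative lifts compound multiplicatively and add on a log scale (or approximately add when both are small), which is presumably the sense of "sum" above. Function names and data are made up for illustration.

```python
def participation_and_spend(gmb_per_user):
    """Return (fraction of users who purchased, mean GMB among purchasers)."""
    purchasers = [g for g in gmb_per_user if g > 0]
    return len(purchasers) / len(gmb_per_user), sum(purchasers) / len(purchasers)

def two_lift(gmb_t, gmb_c):
    """Relative lifts in participation rate, GMB per purchaser, and GMB per user."""
    rate_t, spend_t = participation_and_spend(gmb_t)
    rate_c, spend_c = participation_and_spend(gmb_c)
    rate_lift = rate_t / rate_c - 1
    spend_lift = spend_t / spend_c - 1
    # Mean GMB per user = participation rate * GMB per purchaser,
    # so the two relative lifts compound.
    total_lift = (1 + rate_lift) * (1 + spend_lift) - 1
    return rate_lift, spend_lift, total_lift

# Toy per-user GMB (0 means no purchase); values are illustrative only.
gmb_c = [0, 0, 0, 10, 30]
gmb_t = [0, 0, 12, 20, 34]
rate_lift, spend_lift, total_lift = two_lift(gmb_t, gmb_c)
direct_lift = (sum(gmb_t) / len(gmb_t)) / (sum(gmb_c) / len(gmb_c)) - 1
```

The compounded lift matches the lift computed directly from mean GMB per user, so estimating the two components separately loses nothing in the point estimate.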
Post Stratification Implementation
Stratification works best in combination with some control on outliers. In other words, when post-stratification adjustment is applied to outlier-processed data, the variance reduction is much larger than when it is applied to raw data directly. A simple solution is to cap outliers at the 99.9th percentile of GMB among purchasers. The fraction of purchasers and the GMB per purchaser are then modeled separately.
Fraction of purchasers: A binary indicator of whether a user purchased is usually highly correlated with how frequently the user was active on eBay before the experimentation period, in addition to the amount purchased. Therefore, pre-expt active days and pre-expt GMB are combined to form the strata for the post-stratification adjustment of the treatment effect on the fraction of purchasers.
GMB per purchaser: Purchase amount is usually correlated with a user's historical purchase amount. For purchasers, pre-expt GMB is used to adjust in-expt GMB by regression. To be specific, the overall treatment effect on average GMB per purchaser is estimated using F, the fitted regression model.
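The capping step and the post-stratified fraction of purchasers can be sketched as follows. This is an illustration under stated assumptions, not the production implementation: the percentile uses linear interpolation, the strata labels stand in for hypothetical buckets built from pre-expt active days and pre-expt GMB, and stratum shares are estimated by pooling treatment and control.

```python
from collections import defaultdict

def cap_outliers(gmb, pct=99.9):
    """Cap GMB at the pct-th percentile of purchasers' (positive) GMB."""
    purchasers = sorted(g for g in gmb if g > 0)
    if not purchasers:
        return list(gmb)
    # Percentile with linear interpolation between neighbouring order statistics.
    pos = pct / 100 * (len(purchasers) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(purchasers) - 1)
    cap = purchasers[lo] + (pos - lo) * (purchasers[hi] - purchasers[lo])
    return [min(g, cap) for g in gmb]

def stratum_shares(*groups):
    """Share of each stratum among all users, pooled over the given groups."""
    counts = defaultdict(int)
    for strata in groups:
        for s in strata:
            counts[s] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def post_stratified_mean(values, strata, shares):
    """Weighted average of per-stratum means (assumes no stratum is empty)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, s in zip(values, strata):
        sums[s] += v
        counts[s] += 1
    return sum(w * sums[s] / counts[s] for s, w in shares.items())

# Toy data: 1 = purchased during the experiment, 0 = did not.
purchased_t = [1, 1, 0, 0, 0, 1]
strata_t = ["heavy", "heavy", "light", "light", "light", "light"]
purchased_c = [1, 0, 0, 0]
strata_c = ["heavy", "heavy", "light", "light"]

shares = stratum_shares(strata_t, strata_c)
rate_effect = (post_stratified_mean(purchased_t, strata_t, shares)
               - post_stratified_mean(purchased_c, strata_c, shares))
capped = cap_outliers([0, 10, 20, 30, 40, 1000], pct=75)
```

Because both groups are reweighted to the same stratum shares, a chance imbalance in the mix of heavy and light users between treatment and control no longer leaks into the estimated effect.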
Impact on decision making
Post-stratified metrics were rolled out on EP, the central experimentation platform, in November 2014. Through March, the standard error of the GMB lift estimate was reduced by ~5% for 96% of a total of 228 experiments. With the increase in test sensitivity, the GMB lifts of 15 more experiments went from insignificant to significant, which implies that more new algorithms are launchable. In sum, the post-stratification adjusted metrics are a valuable improvement: they save experimentation resources, speed up the pace of testing and support launching more profitable new features on eBay.
Discussion
Since the launch of post-stratification adjusted GMB, we have found a delta between the unadjusted GMB lift (SATE) and the adjusted GMB lift for a few experiments. For example, in Figure 2, the distributions of the GMB lift estimators using SATE (red density) and post-stratification (green density) are compared. The variance of the adjusted GMB lift is smaller than that of the unadjusted GMB lift (the green density is narrower and taller than the red density), as expected. However, a lift delta does exist in experiments 4254 and 4406 (the red dashed line does not overlap the green dashed line). What might be the reason?
Figure 1: Variance reduction – density comparison.
Post-stratification adjustment is valid when the covariate $X$ is independent of the treatment. Pre-experimentation features are selected because the treatment has not yet been introduced in the pre-experiment period, so the expected value of $X$ should be the same in the treatment and control groups, that is, $E[X_T] = E[X_C]$. If instead $E[X_T] \neq E[X_C]$, then the adjusted metrics would be biased.
In Figure 3, the distributions of the bootstrapped mean of pre-expt GMB for the test and control groups are compared. The red and green dashed lines are estimates of the expected value of pre-expt GMB for the treatment and control groups respectively. It is clear that for the 2 tests in experiment 4303, which show no lift delta, the expectations of pre-expt GMB are equal for test and control. For experiments 4254 and 4406, which do show a lift delta, the bootstrapped means differ between the treatment and control groups.
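A check of this kind can be sketched with a simple bootstrap of the difference in pre-expt means. The function name, the number of resamples, and the use of a 95% percentile interval are assumptions for illustration; the article only states that bootstrapped means were compared.

```python
import random
from statistics import mean

def pre_period_gap_interval(pre_t, pre_c, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for mean(pre_t) - mean(pre_c)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(pre_t, k=len(pre_t))) - mean(rng.choices(pre_c, k=len(pre_c)))
        for _ in range(n_boot)
    )
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return lo, hi

# Degenerate toy data so the outcome is easy to verify by hand.
balanced = pre_period_gap_interval([10.0] * 50, [10.0] * 50)
imbalanced = pre_period_gap_interval([20.0] * 50, [10.0] * 50)
```

An interval that excludes zero suggests the pre-expt covariate is not balanced between the groups, in which case the adjusted and unadjusted lifts can be expected to disagree.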
Further improvement
As mentioned above, when using regression to adjust the treatment effect, the fitted model F could be extended to a more sophisticated model, such as multiple-variable linear regression, random forest, gradient boosted trees and so on. Our expectation is that, just as using a single covariate in a linear model is better than using it to construct strata, using multiple covariates would perform better than a single covariate, and a non-linear model would perform even better. So a more sophisticated model to predict in-expt GMB using pre-expt features would be a potential improvement in variance reduction using post-stratification adjustment.
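One model-agnostic way to plug in such a predictor is sketched below: subtract the model's prediction from each user's in-expt GMB and compare mean residuals between groups. This form is an assumption, not the article's formula, and it presumes the predictor uses only pre-expt features and is fitted without reference to treatment assignment. The `predict` function and its coefficients are hypothetical stand-ins for any fitted model.

```python
from statistics import mean

def model_adjusted_effect(x_t, y_t, x_c, y_c, predict):
    """Difference in mean residuals y - predict(x) between treatment and control."""
    resid_t = [y - predict(x) for x, y in zip(x_t, y_t)]
    resid_c = [y - predict(x) for x, y in zip(x_c, y_c)]
    return mean(resid_t) - mean(resid_c)

def predict(features):
    """Hypothetical predictor of in-expt GMB from (pre-expt GMB, pre-expt active days)."""
    pre_gmb, active_days = features
    return 0.8 * pre_gmb + 1.5 * active_days

# Toy data: each x is (pre-expt GMB, pre-expt active days), y is in-expt GMB.
x_t = [(10.0, 2), (50.0, 6), (0.0, 1)]
y_t = [13.0, 51.0, 3.5]
x_c = [(20.0, 4), (40.0, 2), (5.0, 0)]
y_c = [21.0, 34.0, 3.0]
effect = model_adjusted_effect(x_t, y_t, x_c, y_c, predict)
```

Any model with the same calling convention, linear or not, could be swapped in for `predict`; the better it explains in-expt GMB, the smaller the residual variance.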
References
[1] Alex Deng, Ya Xu and Ron Kohavi. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. To appear in WSDM.
[2] Luke W. Miratrix, Jasjeet S. Sekhon and Bin Yu. Adjusting Treatment Effect Estimates by Post-Stratification in Randomized Experiments. Journal of the Royal Statistical Society, Series B. August 10, 2012.
[3] Winston Lin. Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman's Critique. July 26, 2012.
[4] David Goldberg. A Two-Model Transfer Function. eBay internal technical report.
[5] David Goldberg. Stratification and the Deng et al. Paper. eBay internal technical report.
A/B Test Sensitivity Improvement by Using Post-Stratification