Company XYZ is a worldwide e-commerce site with localized versions of the site.
A data scientist at XYZ noticed that Spain-based users have a much higher conversion rate than users in any other Spanish-speaking country. She therefore went to talk to the international team in charge of Spain and LatAm to see if they had any ideas about why that was happening.
The Spain and LatAm country manager suggested that one reason could be translation: all Spanish-speaking countries saw the same translation of the site, which was written by a Spaniard. They agreed to run a test in which each country would get its own translation written by a local. That is, Argentinian users would see a translation written by an Argentinian, Mexican users one written by a Mexican, and so on. Nothing would change for users from Spain.
After running the test, however, they were surprised to find that the result was negative: the non-localized translation appeared to be doing better!
You are asked to: 1. Confirm that the test is actually negative, i.e., that the old version of the site with a single translation across Spain and LatAm performs better. 2. Explain why that might be happening. Are the localized translations really worse?
#Load libraries and read data
library(dplyr)
library(ggplot2)
library(rpart)
user = read.csv("user_table.csv")
test = read.csv("test_table.csv")
#Check if user is unique by user id
length(user$user_id)==length(unique(user$user_id))
## [1] TRUE
#Check if test is unique by user id
length(test$user_id)==length(unique(test$user_id))
## [1] TRUE
#The two tables do not match: the test table has 454 users that are missing from the user table.
identical(test$user_id,user$user_id)
## [1] FALSE
length(user$user_id)-length(test$user_id)
## [1] -454
#Merge user and test tables into one. all.x = TRUE keeps every row of the
#user table; the 454 users that appear only in the test table are dropped.
df=merge(user, test, by = "user_id", all.x = TRUE)
#Format the date
df$date=as.Date(df$date)
summary(df)
## user_id sex age country
## Min. : 1 F:188382 Min. :18.00 Mexico :128484
## 1st Qu.: 249819 M:264485 1st Qu.:22.00 Colombia : 54060
## Median : 500019 Median :26.00 Spain : 51782
## Mean : 499945 Mean :27.13 Argentina: 46733
## 3rd Qu.: 749543 3rd Qu.:31.00 Peru : 33666
## Max. :1000000 Max. :70.00 Venezuela: 32054
## (Other) :106088
## date source device browser_language
## Min. :2015-11-30 Ads :181693 Mobile:201551 EN : 63079
## 1st Qu.:2015-12-01 Direct: 90738 Web :251316 ES :377160
## Median :2015-12-03 SEO :180436 Other: 12628
## Mean :2015-12-02
## 3rd Qu.:2015-12-04
## Max. :2015-12-04
##
## ads_channel browser conversion test
## Bing : 13670 Android_App:154977 Min. :0.00000 Min. :0.0000
## Facebook: 68358 Chrome :101822 1st Qu.:0.00000 1st Qu.:0.0000
## Google : 68113 FireFox : 40721 Median :0.00000 Median :0.0000
## Other : 4143 IE : 61656 Mean :0.04956 Mean :0.4765
## Yahoo : 27409 Iphone_App : 46574 3rd Qu.:0.00000 3rd Qu.:1.0000
## NA's :271174 Opera : 6084 Max. :1.00000 Max. :1.0000
## Safari : 41033
#Confirm that Spain has a higher conversion rate, using control-group users only
ConversionByCountry=df%>%
group_by(country)%>%
summarise(conversion=mean(conversion[test==0])
)%>%
arrange(desc(conversion))
head(ConversionByCountry)
## Source: local data frame [6 x 2]
##
## country conversion
## (fctr) (dbl)
## 1 Spain 0.07971882
## 2 El Salvador 0.05355404
## 3 Nicaragua 0.05264697
## 4 Costa Rica 0.05225564
## 5 Colombia 0.05208949
## 6 Honduras 0.05090576
#Exclude Spain, since nothing changed for Spanish users in the test
control_test=subset(df, country!="Spain")
#Welch two-sample t-test comparing the conversion rates of the test and control groups
t.test(control_test$conversion[control_test$test==1],control_test$conversion[control_test$test==0])
##
## Welch Two Sample t-test
##
## data: control_test$conversion[control_test$test == 1] and control_test$conversion[control_test$test == 0]
## t = -7.3539, df = 385260, p-value = 1.929e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.006181421 -0.003579837
## sample estimates:
## mean of x mean of y
## 0.04341116 0.04829179
The t-test shows a significant difference between the two groups: the test-group conversion rate is 0.0434, while the control-group rate is 0.0483, about 10% higher in relative terms. The localized translations appear to do worse than the single Spanish translation.
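Since conversion is a binary outcome, the same comparison can also be run as a two-sample proportion test, which should agree closely with the Welch t-test above. A quick sketch against the same `control_test` data frame:

```r
# Conversions (x) and sample sizes (n) by group; names are "0" (control), "1" (test)
x = tapply(control_test$conversion, control_test$test, sum)
n = tapply(control_test$conversion, control_test$test, length)
# Two-sample proportion test, test group first to match the t-test above
prop.test(x = x[c("1", "0")], n = n[c("1", "0")])
```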
There are a few possible explanations for a surprising A/B test result.
First, if we do not have enough data, the results will fluctuate from day to day. We therefore plot the ratio of test to control conversion by day to check the variance:
data_test_by_day=control_test%>%
group_by(date)%>%
summarize(test_vs_control=mean(conversion[test==1])/mean(conversion[test==0]))
ggplot(data=data_test_by_day,aes(x=date, y=test_vs_control))+
geom_line()+ylab("test/control")+geom_hline(yintercept=1,linetype=2,color="blue")
From the plot, test is consistently worse than control on every day. That suggests we do have enough data, and that the problem is more likely a bias in the experiment setup.
Now it’s time to find out whether the test is biased. In an ideal world, the split between test and control should be the same within every segment. One way to check is to build a decision tree where the features are the user dimensions and the outcome variable is whether the user is in test or control. If the tree splits, it means that for certain values of a variable a user is more likely to end up in test or control, which should be impossible under proper randomization. So if the randomization worked, the tree should not split at all (or at least should not separate the two classes well).
#Predict test assignment from user dimensions; any split indicates biased randomization
#(note: the rpart.control argument is maxdepth, not max_depth, which would be silently ignored)
tree=rpart(test~., control_test[,-8],
           control=rpart.control(minbucket=nrow(control_test)/100, maxdepth=2))
tree
## n= 401085
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 401085 99692.820 0.5379757
## 2) country=Bolivia,Chile,Colombia,Costa Rica,Ecuador,El Salvador,Guatemala,Honduras,Mexico,Nicaragua,Panama,Paraguay,Peru,Venezuela 350218 87553.970 0.4987693 *
## 3) country=Argentina,Uruguay 50867 7894.097 0.8079108 *
The randomization looks fine for the countries on the first branch of the split: their mean of the test indicator is about 0.499, i.e., a roughly 50/50 split between test and control. But in Argentina and Uruguay about 80% of users are in the test group and only 20% in control.
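The imbalance the tree found can be confirmed directly by tabulating the share of users assigned to test in each country; if the randomization had worked, `prop_test` should be close to 0.5 everywhere, while Argentina and Uruguay should stand out near 0.8. A sketch using the same data frame and dplyr pipeline style as above:

```r
# Share of users assigned to the test group, by country
assignment = control_test %>%
  group_by(country) %>%
  summarise(prop_test = mean(test), n = n()) %>%
  arrange(desc(prop_test))
head(assignment)
```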
Now check the test-vs-control conversion rates separately for each country:
data_test_by_country=control_test%>%
group_by(country)%>%
summarize(p_value=t.test(conversion[test==1],conversion[test==0])$p.value,
conversion_test=t.test(conversion[test==1],conversion[test==0])$estimate[1],
conversion_control=t.test(conversion[test==1],conversion[test==0])$estimate[2]
)%>%
arrange(p_value)
data_test_by_country
## Source: local data frame [16 x 4]
##
## country p_value conversion_test conversion_control
## (fctr) (dbl) (dbl) (dbl)
## 1 Mexico 0.1655437 0.05118631 0.04949462
## 2 El Salvador 0.2481267 0.04794689 0.05355404
## 3 Chile 0.3028476 0.05129502 0.04810718
## 4 Argentina 0.3351465 0.01372502 0.01507054
## 5 Colombia 0.4237191 0.05057096 0.05208949
## 6 Honduras 0.4714629 0.04753981 0.05090576
## 7 Guatemala 0.5721072 0.04864721 0.05064288
## 8 Venezuela 0.5737015 0.04897831 0.05034367
## 9 Costa Rica 0.6878764 0.05473764 0.05225564
## 10 Panama 0.7053268 0.04937028 0.04679552
## 11 Bolivia 0.7188852 0.04790097 0.04936937
## 12 Peru 0.7719530 0.05060427 0.04991404
## 13 Nicaragua 0.7804004 0.05417676 0.05264697
## 14 Uruguay 0.8797640 0.01290670 0.01204819
## 15 Paraguay 0.8836965 0.04922910 0.04849315
## 16 Ecuador 0.9615117 0.04898842 0.04915381
After controlling for country, the test is clearly non-significant in every market. That is not a great success given that the goal was to improve conversion, but at least the localized translations did not make things worse. The overall negative result was an artifact of the biased randomization: Argentina and Uruguay have much lower baseline conversion rates (roughly 1.2–1.5% vs. about 5% elsewhere) and were heavily over-represented in the test group, which dragged the overall test average down (a case of Simpson's paradox).
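The mechanism is easy to reproduce with a toy example (hypothetical numbers, chosen only to mimic the pattern above): within each market the test converts exactly as well as the control, yet the pooled test average looks worse because the low-converting market is over-represented in test.

```r
# Toy illustration of Simpson's paradox (all numbers hypothetical).
# Two markets with the same conversion rate in test and control:
conv   = c(high = 0.05, low = 0.015)   # per-market conversion rate
w_ctrl = c(high = 0.90, low = 0.10)    # control group: mostly the high-converting market
w_test = c(high = 0.60, low = 0.40)    # test group: low market over-represented
pooled_ctrl = sum(conv * w_ctrl)       # 0.0465
pooled_test = sum(conv * w_test)       # 0.0360
pooled_test < pooled_ctrl              # TRUE: test looks worse overall despite no real effect
```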