# Multiple testing
## Multiple hypothesis testing problem
> **Note:** This guide uses Polars as an example data backend. To reproduce the example code, install Polars in addition to tea-tasting:
>
> `pip install polars`
The multiple hypothesis testing problem arises when there is more than one success metric or more than one treatment variant in an A/B test.
tea-tasting provides the following methods for multiple testing correction:
- False discovery rate (FDR) controlling procedures:
    - Benjamini-Hochberg procedure, assuming non-negative correlation between hypotheses.
    - Benjamini-Yekutieli procedure, assuming arbitrary dependence between hypotheses.
- Family-wise error rate (FWER) controlling procedures:
    - Hochberg's step-up procedure, assuming non-negative correlation between hypotheses.
    - Holm's step-down procedure, assuming arbitrary dependence between hypotheses.
As an example, consider an experiment with three variants, a control and two treatments:
>>> import polars as pl
>>> import tea_tasting as tt
>>> data = pl.concat((
... tt.make_users_data(
... seed=42,
... orders_uplift=0.10,
... revenue_uplift=0.15,
... return_type="polars",
... ),
... tt.make_users_data(
... seed=21,
... orders_uplift=0.15,
... revenue_uplift=0.20,
... return_type="polars",
... )
... .filter(pl.col("variant").eq(1))
... .with_columns(variant=pl.lit(2, pl.Int64)),
... ))
>>> print(data)
shape: (6_046, 5)
┌──────┬─────────┬──────────┬────────┬─────────┐
│ user ┆ variant ┆ sessions ┆ orders ┆ revenue │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 │
╞══════╪═════════╪══════════╪════════╪═════════╡
│ 0 ┆ 1 ┆ 2 ┆ 1 ┆ 9.58 │
│ 1 ┆ 0 ┆ 2 ┆ 1 ┆ 6.43 │
│ 2 ┆ 1 ┆ 2 ┆ 1 ┆ 8.3 │
│ 3 ┆ 1 ┆ 2 ┆ 1 ┆ 16.65 │
│ 4 ┆ 0 ┆ 1 ┆ 1 ┆ 7.14 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 3989 ┆ 2 ┆ 4 ┆ 4 ┆ 34.93 │
│ 3991 ┆ 2 ┆ 1 ┆ 0 ┆ 0.0 │
│ 3992 ┆ 2 ┆ 3 ┆ 3 ┆ 27.96 │
│ 3994 ┆ 2 ┆ 2 ┆ 1 ┆ 17.22 │
│ 3998 ┆ 2 ┆ 3 ┆ 0 ┆ 0.0 │
└──────┴─────────┴──────────┴────────┴─────────┘
Let's calculate the experiment results:
>>> experiment = tt.Experiment(
... sessions_per_user=tt.Mean("sessions"),
... orders_per_session=tt.RatioOfMeans("orders", "sessions"),
... orders_per_user=tt.Mean("orders"),
... revenue_per_user=tt.Mean("revenue"),
... )
>>> results = experiment.analyze(data, control=0, all_variants=True)
>>> print(results)
variants metric control treatment rel_effect_size rel_effect_size_ci pvalue
(0, 1) sessions_per_user 2.00 1.98 -0.66% [-3.7%, 2.5%] 0.674
(0, 1) orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
(0, 1) orders_per_user 0.530 0.573 8.0% [-2.0%, 19%] 0.118
(0, 1) revenue_per_user 5.24 5.99 14% [2.1%, 28%] 0.0211
(0, 2) sessions_per_user 2.00 2.02 0.98% [-2.1%, 4.1%] 0.532
(0, 2) orders_per_session 0.266 0.295 11% [1.2%, 22%] 0.0273
(0, 2) orders_per_user 0.530 0.594 12% [1.7%, 23%] 0.0213
(0, 2) revenue_per_user 5.24 6.25 19% [6.6%, 33%] 0.00218
Suppose only two of the metrics, `orders_per_user` and `revenue_per_user`, are considered success metrics, while the other two, `sessions_per_user` and `orders_per_session`, are second-order diagnostic metrics.
>>> metrics = {"orders_per_user", "revenue_per_user"}
With two treatment variants and two success metrics, there are four hypotheses in total, which increases the probability of false positives (also called "false discoveries"). It's recommended to adjust the p-values or the significance level (alpha) in this case. Let's explore the correction methods provided by tea-tasting.
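To get a sense of the scale of the problem, here is a quick back-of-the-envelope calculation. Assuming, purely for illustration, four independent tests at significance level 0.05, the probability of at least one false positive is already about 18.5%:

>>> alpha, n_hypotheses = 0.05, 4
>>> round(1 - (1 - alpha) ** n_hypotheses, 3)
0.185

The four hypotheses in this experiment are not independent, but the calculation illustrates why some correction is needed.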
## False discovery rate
False discovery rate (FDR) is the expected proportion of false discoveries among all discoveries (rejections of the null hypothesis). To control the FDR, use the `adjust_fdr` method:
>>> adjusted_results_fdr = tt.adjust_fdr(results, metrics)
>>> print(adjusted_results_fdr)
comparison metric control treatment rel_effect_size pvalue pvalue_adj
(0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.118
(0, 1) revenue_per_user 5.24 5.99 14% 0.0211 0.0284
(0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0284
(0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.00872
The method adjusts the p-values and saves them as `pvalue_adj`. Compare these values to the desired significance level alpha to determine if the null hypotheses can be rejected.

The method also adjusts the significance level alpha and saves it as `alpha_adj`. Compare the non-adjusted p-values (`pvalue`) to `alpha_adj` to determine if the null hypotheses can be rejected:
>>> print(adjusted_results_fdr.to_string(keys=(
... "comparison",
... "metric",
... "control",
... "treatment",
... "rel_effect_size",
... "pvalue",
... "alpha_adj",
... )))
comparison metric control treatment rel_effect_size pvalue alpha_adj
(0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.0500
(0, 1) revenue_per_user 5.24 5.99 14% 0.0211 0.0375
(0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0375
(0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.0375
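This is the standard Benjamini-Hochberg step-up adjustment. As an illustrative cross-check (tea-tasting does not require or use statsmodels), the same adjusted p-values can be reproduced with statsmodels, assuming it is installed, by feeding it the rounded p-values from the table above:

>>> from statsmodels.stats.multitest import multipletests
>>> pvalues = [0.118, 0.0211, 0.0213, 0.00218]  # p-values from the table above
>>> _, pvalues_adj, _, _ = multipletests(pvalues, alpha=0.05, method="fdr_bh")
>>> [round(float(p), 4) for p in pvalues_adj]
[0.118, 0.0284, 0.0284, 0.0087]

The values match the `pvalue_adj` column up to rounding of the inputs.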
By default, tea-tasting assumes non-negative correlation between hypotheses and performs the Benjamini-Hochberg procedure. To perform the Benjamini-Yekutieli procedure, which assumes arbitrary dependence between hypotheses, set the `arbitrary_dependence` parameter to `True`:
>>> print(tt.adjust_fdr(results, metrics, arbitrary_dependence=True))
comparison metric control treatment rel_effect_size pvalue pvalue_adj
(0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.245
(0, 1) revenue_per_user 5.24 5.99 14% 0.0211 0.0592
(0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0592
(0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.0182
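The Benjamini-Yekutieli adjustment is the Benjamini-Hochberg adjustment multiplied by the harmonic sum 1 + 1/2 + … + 1/m, which is what makes it more conservative. Applying this factor to the four Benjamini-Hochberg adjusted p-values from above reproduces the `pvalue_adj` column up to rounding:

>>> m = 4
>>> c_m = sum(1 / i for i in range(1, m + 1))  # ≈ 2.083 for m = 4
>>> [round(p * c_m, 4) for p in (0.118, 0.0284, 0.0284, 0.00872)]
[0.2458, 0.0592, 0.0592, 0.0182]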
## Family-wise error rate
Family-wise error rate (FWER) is the probability of making at least one type I error. To control the FWER, use the `adjust_fwer` method:
>>> print(tt.adjust_fwer(results, metrics))
comparison metric control treatment rel_effect_size pvalue pvalue_adj
(0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.118
(0, 1) revenue_per_user 5.24 5.99 14% 0.0211 0.0422
(0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0422
(0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.00869
By default, tea-tasting assumes non-negative correlation between hypotheses and performs Hochberg's step-up procedure with the Šidák correction, which is slightly more powerful than the Bonferroni correction.
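The difference between the two corrections is easiest to see in their single-step form: for m hypotheses at significance level alpha, Bonferroni compares each p-value to alpha/m, while Šidák uses the slightly larger threshold 1 − (1 − alpha)^(1/m):

>>> alpha, m = 0.05, 4
>>> round(alpha / m, 5)  # Bonferroni threshold
0.0125
>>> round(1 - (1 - alpha) ** (1 / m), 5)  # Šidák threshold, slightly larger
0.01274

Hochberg's procedure applies these thresholds rank by rank in a step-up manner rather than in a single step.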
To perform Holm's step-down procedure, which assumes arbitrary dependence between hypotheses, set the `arbitrary_dependence` parameter to `True`. In this case, it's recommended to use the Bonferroni correction, since the Šidák correction assumes non-negative correlation between hypotheses:
>>> print(tt.adjust_fwer(
... results,
... metrics,
... arbitrary_dependence=True,
... method="bonferroni",
... ))
comparison metric control treatment rel_effect_size pvalue pvalue_adj
(0, 1) orders_per_user 0.530 0.573 8.0% 0.118 0.118
(0, 1) revenue_per_user 5.24 5.99 14% 0.0211 0.0634
(0, 2) orders_per_user 0.530 0.594 12% 0.0213 0.0634
(0, 2) revenue_per_user 5.24 6.25 19% 0.00218 0.00872
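For reference, here is a minimal pure-Python sketch of the Holm-Bonferroni adjustment (the helper name is ours, not part of tea-tasting). Fed the rounded p-values from the table, it reproduces the `pvalue_adj` column up to rounding; the table shows 0.0634 rather than 0.0633 because tea-tasting works with the unrounded p-values:

>>> def holm_bonferroni(pvalues):
...     # Sort ascending, multiply the p-value at rank i (0-based) by (m - i),
...     # and enforce monotonicity with a running maximum (step-down).
...     m = len(pvalues)
...     order = sorted(range(m), key=lambda i: pvalues[i])
...     adjusted, running_max = [0.0] * m, 0.0
...     for rank, i in enumerate(order):
...         running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
...         adjusted[i] = running_max
...     return adjusted
>>> [round(p, 4) for p in holm_bonferroni([0.118, 0.0211, 0.0213, 0.00218])]
[0.118, 0.0633, 0.0633, 0.0087]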
## Other inputs
In the examples above, the methods `adjust_fdr` and `adjust_fwer` received results from a single experiment with more than two variants. They can also accept results from multiple experiments with two variants each:
>>> data1 = tt.make_users_data(seed=42, orders_uplift=0.10, revenue_uplift=0.15)
>>> data2 = tt.make_users_data(seed=21, orders_uplift=0.15, revenue_uplift=0.20)
>>> result1 = experiment.analyze(data1)
>>> result2 = experiment.analyze(data2)
>>> print(tt.adjust_fdr(
... {"Experiment 1": result1, "Experiment 2": result2},
... metrics,
... ))
comparison metric control treatment rel_effect_size pvalue pvalue_adj
Experiment 1 orders_per_user 0.530 0.573 8.0% 0.118 0.118
Experiment 1 revenue_per_user 5.24 5.99 14% 0.0211 0.0282
Experiment 2 orders_per_user 0.514 0.594 16% 0.00427 0.00853
Experiment 2 revenue_per_user 5.10 6.25 22% 6.27e-04 0.00251
The methods `adjust_fdr` and `adjust_fwer` can also accept the result of a single experiment with two variants:
>>> print(tt.adjust_fwer(result2, metrics))
comparison metric control treatment rel_effect_size pvalue pvalue_adj
- orders_per_user 0.514 0.594 16% 0.00427 0.00427
- revenue_per_user 5.10 6.25 22% 6.27e-04 0.00125