User guide

Installation

pip install tea-tasting

Basic usage

Begin with this simple example to understand the basic functionality:

import tea_tasting as tt


data = tt.make_users_data(seed=42)

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions"),
    orders_per_user=tt.Mean("orders"),
    revenue_per_user=tt.Mean("revenue"),
)

result = experiment.analyze(data)
print(result)
#>             metric control treatment rel_effect_size rel_effect_size_ci pvalue
#>  sessions_per_user    2.00      1.98          -0.66%      [-3.7%, 2.5%]  0.674
#> orders_per_session   0.266     0.289            8.8%      [-0.89%, 19%] 0.0762
#>    orders_per_user   0.530     0.573            8.0%       [-2.0%, 19%]  0.118
#>   revenue_per_user    5.24      5.73            9.3%       [-2.4%, 22%]  0.123

In the following sections, each step of this process is explained in detail.

Input data

The make_users_data function creates synthetic data for demonstration purposes. This data mimics what you might encounter in an A/B test for an online store. Each row represents an individual user, with the following columns:

  • user: The unique identifier for each user.
  • variant: The specific variant (e.g., 0 or 1) assigned to each user in the A/B test.
  • sessions: The total number of the user's sessions.
  • orders: The total number of the user's orders.
  • revenue: The total revenue generated by the user.

tea-tasting can process data in the form of either a Pandas DataFrame or an Ibis Table. Ibis is a Python package that serves as a DataFrame API to various data backends. It supports 20+ backends, including BigQuery, ClickHouse, DuckDB, Polars, PostgreSQL, Snowflake, and Spark. You can write an SQL query, wrap it as an Ibis Table, and pass it to tea-tasting.

Many statistical tests, such as the Student's t-test or the Z-test, require only aggregated data for analysis. For these tests, tea-tasting retrieves only aggregated statistics like mean and variance instead of downloading all detailed data. See more details in the guide on data backends.
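
To illustrate why aggregated statistics are enough, here is a minimal sketch of a two-sample Z-test computed only from each variant's count, mean, and sample variance. The Aggregates class, the z_test_from_aggregates function, and the numbers are hypothetical and not part of tea-tasting; the library's own computation may differ.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class Aggregates:
    """Per-variant summary statistics (hypothetical helper, not a tea-tasting class)."""

    count: int
    mean: float
    var: float  # sample variance


def z_test_from_aggregates(control: Aggregates, treatment: Aggregates):
    """Return (effect_size, two-sided p-value) without touching row-level data."""
    effect_size = treatment.mean - control.mean
    # Unpooled standard error of the difference between two means.
    std_err = math.sqrt(control.var / control.count + treatment.var / treatment.count)
    statistic = effect_size / std_err
    pvalue = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return effect_size, pvalue


# Made-up aggregates, for illustration only.
control = Aggregates(count=2000, mean=0.50, var=1.2)
treatment = Aggregates(count=2000, mean=0.58, var=1.3)
effect_size, pvalue = z_test_from_aggregates(control, treatment)
print(f"effect_size={effect_size:.3f}, pvalue={pvalue:.4f}")
```

Three numbers per variant are all the test needs, so a data backend can compute them and return a handful of values instead of every row.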

tea-tasting assumes that:

  • Data is grouped by randomization units, such as individual users.
  • There is a column indicating the variant of the A/B test (typically labeled as A, B, etc.).
  • All necessary columns for metric calculations (like the number of orders, revenue, etc.) are included in the table.
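
If your raw data has one row per session or event, you would aggregate it to one row per randomization unit first. The following is a minimal standard-library sketch of that grouping step; the session-level layout and values are made up for illustration, and in practice you would more likely do this in SQL or with a DataFrame library.

```python
from collections import defaultdict

# Hypothetical session-level rows: (user, variant, orders, revenue).
session_rows = [
    (1, 0, 0, 0.0),
    (1, 0, 1, 9.5),
    (2, 1, 2, 20.0),
    (3, 1, 0, 0.0),
    (3, 1, 1, 7.25),
]

# One output row per user, with the columns described above.
users = defaultdict(
    lambda: {"variant": None, "sessions": 0, "orders": 0, "revenue": 0.0}
)
for user, variant, orders, revenue in session_rows:
    row = users[user]
    row["variant"] = variant  # a user belongs to exactly one variant
    row["sessions"] += 1
    row["orders"] += orders
    row["revenue"] += revenue

for user, row in sorted(users.items()):
    print(user, row)
```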

A/B test definition

The Experiment class defines parameters of an A/B test: metrics and a variant column name. There are two ways to define metrics:

  • Using keyword parameters, with metric names as parameter names and metric definitions as parameter values, as in the example above.
  • Using the first argument, metrics, which accepts a dictionary with metric names as keys and metric definitions as values.

By default, tea-tasting assumes that the A/B test variant is stored in a column named "variant". You can change it using the variant parameter of the Experiment class.

Example usage:

experiment = tt.Experiment(
    {
        "sessions per user": tt.Mean("sessions"),
        "orders per session": tt.RatioOfMeans("orders", "sessions"),
        "orders per user": tt.Mean("orders"),
        "revenue per user": tt.Mean("revenue"),
    },
    variant="variant",
)

Metrics

Metrics are instances of metric classes that define how metrics are calculated. The calculations include the effect size, confidence interval, p-value, and other statistics.

Use the Mean class to compare averages between variants of an A/B test, for example, the average number of orders per user, where the user is the randomization unit of the experiment. Specify the column containing the metric values using the first parameter, value.

Use the RatioOfMeans class to compare ratios of averages between variants of an A/B test, for example, the average number of orders divided by the average number of sessions. Specify the columns containing the numerator and denominator values using the numer and denom parameters.
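
A ratio of means is not the same as the mean of per-user ratios. The sketch below contrasts the two on made-up data and adds a delta-method standard error for the ratio of means. The delta method is a common approximation assumed here for illustration, and ratio_std_err is a hypothetical helper, not tea-tasting's implementation.

```python
import math
from statistics import mean

# Made-up per-user data: position i in each list belongs to the same user.
orders = [0, 1, 2, 0, 3]
sessions = [1, 2, 4, 1, 2]

# Ratio of means: a single ratio of the two column averages.
ratio_of_means = mean(orders) / mean(sessions)

# Mean of ratios: a different quantity, shown only for contrast.
mean_of_ratios = mean(o / s for o, s in zip(orders, sessions))


def ratio_std_err(numer, denom):
    """Delta-method standard error of mean(numer) / mean(denom)."""
    n = len(numer)
    mu_n, mu_d = mean(numer), mean(denom)
    var_n = sum((x - mu_n) ** 2 for x in numer) / (n - 1)
    var_d = sum((y - mu_d) ** 2 for y in denom) / (n - 1)
    cov = sum((x - mu_n) * (y - mu_d) for x, y in zip(numer, denom)) / (n - 1)
    var_ratio = (
        var_n / mu_d**2
        - 2 * mu_n * cov / mu_d**3
        + mu_n**2 * var_d / mu_d**4
    ) / n
    return math.sqrt(var_ratio)


print(ratio_of_means, mean_of_ratios, ratio_std_err(orders, sessions))
```

The covariance term matters because the numerator and denominator come from the same users and are usually correlated.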

Use the following parameters of Mean and RatioOfMeans to customize the analysis:

  • alternative: Alternative hypothesis. The following options are available:
    • "two-sided" (default): the means are unequal.
    • "greater": the mean in the treatment variant is greater than the mean in the control variant.
    • "less": the mean in the treatment variant is less than the mean in the control variant.
  • confidence_level: Confidence level of the confidence interval. Default is 0.95.
  • equal_var: Defines whether equal variance is assumed. If True, pooled variance is used for the calculation of the standard error of the difference between two means. Default is False.
  • use_t: Defines whether to use the Student's t-distribution (True) or the Normal distribution (False). Default is True.

Example usage:

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions", alternative="greater"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions", confidence_level=0.9),
    orders_per_user=tt.Mean("orders", equal_var=True),
    revenue_per_user=tt.Mean("revenue", use_t=False),
)
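
To make the alternative and equal_var options concrete, here is a minimal sketch of what they change in a textbook two-sample test. It uses the Normal distribution (the use_t=False case) so that it needs only the standard library. The std_err and pvalue functions are hypothetical helpers, not tea-tasting functions.

```python
import math
from statistics import NormalDist


def std_err(var_c, n_c, var_t, n_t, equal_var=False):
    """Standard error of the difference between two means."""
    if equal_var:
        # Pooled variance: both variants are assumed to share one variance.
        pooled = ((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2)
        return math.sqrt(pooled * (1 / n_c + 1 / n_t))
    # Unpooled: each variant keeps its own variance.
    return math.sqrt(var_c / n_c + var_t / n_t)


def pvalue(statistic, alternative="two-sided"):
    """P-value of a standardized effect size under the Normal distribution."""
    cdf = NormalDist().cdf(statistic)
    if alternative == "greater":
        return 1 - cdf
    if alternative == "less":
        return cdf
    if alternative == "two-sided":
        return 2 * min(cdf, 1 - cdf)
    raise ValueError(f"Unknown alternative: {alternative!r}")


# Made-up aggregates with unequal group sizes, where pooling makes a difference.
print(std_err(1.0, 100, 4.0, 400, equal_var=False))
print(std_err(1.0, 100, 4.0, 400, equal_var=True))
print(pvalue(1.5, "two-sided"), pvalue(1.5, "greater"), pvalue(1.5, "less"))
```

With equal group sizes the pooled and unpooled standard errors coincide; they diverge when both the sizes and the variances differ.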

Look for other supported metrics in the Metrics reference.

You can change the default values of alternative, confidence_level, equal_var, and use_t using the global settings.

Analyzing and retrieving experiment results

After defining an experiment and metrics, you can analyze the experiment data using the analyze method of the Experiment class. This method takes data as input and returns an ExperimentResult object with the experiment result.

result = experiment.analyze(data)

By default, tea-tasting assumes that the variant with the lowest ID is the control. Change the default behavior using the control parameter:

result = experiment.analyze(data, control=0)

ExperimentResult is a mapping. Get a metric's analysis result using the metric name as a key.

print(result["orders_per_user"])
#> MeanResult(control=0.5304003954522986, treatment=0.5730905412240769,
#> effect_size=0.04269014577177832, effect_size_ci_lower=-0.010800201598205564,
#> effect_size_ci_upper=0.0961804931417622, rel_effect_size=0.08048664016431273,
#> rel_effect_size_ci_lower=-0.019515294044062048,
#> rel_effect_size_ci_upper=0.19068800612788883, pvalue=0.11773177998716244,
#> statistic=1.5647028839586694)

The fields in a result depend on the metric. For Mean and RatioOfMeans, the fields include:

  • metric: Metric name.
  • control: Mean or ratio of means in the control variant.
  • treatment: Mean or ratio of means in the treatment variant.
  • effect_size: Absolute effect size. Difference between two means.
  • effect_size_ci_lower: Lower bound of the absolute effect size confidence interval.
  • effect_size_ci_upper: Upper bound of the absolute effect size confidence interval.
  • rel_effect_size: Relative effect size. Difference between two means, divided by the control mean.
  • rel_effect_size_ci_lower: Lower bound of the relative effect size confidence interval.
  • rel_effect_size_ci_upper: Upper bound of the relative effect size confidence interval.
  • pvalue: P-value.
  • statistic: Statistic (standardized effect size).
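
As a quick check of how these fields relate, the snippet below recomputes effect_size and rel_effect_size from the control and treatment values in the MeanResult output shown above. It is plain arithmetic on the printed numbers, not a call into tea-tasting.

```python
import math

# Values copied from the MeanResult output for orders_per_user.
control = 0.5304003954522986
treatment = 0.5730905412240769

# Absolute effect size: difference between the two means.
effect_size = treatment - control

# Relative effect size: the difference divided by the control mean.
rel_effect_size = effect_size / control

print(f"effect_size={effect_size:.4f}, rel_effect_size={rel_effect_size:.2%}")
```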

ExperimentResult provides the following methods to serialize and view the experiment result:

  • to_dicts: Convert the result to a sequence of dictionaries.
  • to_pandas: Convert the result to a Pandas DataFrame.
  • to_pretty: Convert the result to a Pandas DataFrame with formatted values (as strings).
  • to_string: Convert the result to a string.
  • to_html: Convert the result to HTML.

print(result) is the same as print(result.to_string()).

print(result)
#>             metric control treatment rel_effect_size rel_effect_size_ci pvalue
#>  sessions_per_user    2.00      1.98          -0.66%      [-3.7%, 2.5%]  0.674
#> orders_per_session   0.266     0.289            8.8%      [-0.89%, 19%] 0.0762
#>    orders_per_user   0.530     0.573            8.0%       [-2.0%, 19%]  0.118
#>   revenue_per_user    5.24      5.73            9.3%       [-2.4%, 22%]  0.123

By default, the to_pretty, to_string, and to_html methods return a predefined list of attributes. You can customize this list with the names parameter:

print(result.to_string(names=(
    "control",
    "treatment",
    "effect_size",
    "effect_size_ci",
)))
#>             metric control treatment effect_size     effect_size_ci
#>  sessions_per_user    2.00      1.98     -0.0132  [-0.0750, 0.0485]
#> orders_per_session   0.266     0.289      0.0233 [-0.00246, 0.0491]
#>    orders_per_user   0.530     0.573      0.0427  [-0.0108, 0.0962]
#>   revenue_per_user    5.24      5.73       0.489     [-0.133, 1.11]

In Jupyter and IPython, a line consisting only of result is displayed as a rendered HTML table.

More features

Variance reduction with CUPED/CUPAC

tea-tasting supports variance reduction with CUPED/CUPAC, within both Mean and RatioOfMeans classes.

Example usage:

import tea_tasting as tt


data = tt.make_users_data(seed=42, covariates=True)

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions", "sessions_covariate"),
    orders_per_session=tt.RatioOfMeans(
        numer="orders",
        denom="sessions",
        numer_covariate="orders_covariate",
        denom_covariate="sessions_covariate",
    ),
    orders_per_user=tt.Mean("orders", "orders_covariate"),
    revenue_per_user=tt.Mean("revenue", "revenue_covariate"),
)

result = experiment.analyze(data)
print(result)
#>             metric control treatment rel_effect_size rel_effect_size_ci  pvalue
#>  sessions_per_user    2.00      1.98          -0.68%      [-3.2%, 1.9%]   0.603
#> orders_per_session   0.262     0.293             12%        [4.2%, 21%] 0.00229
#>    orders_per_user   0.523     0.581             11%        [2.9%, 20%] 0.00733
#>   revenue_per_user    5.12      5.85             14%        [3.8%, 26%] 0.00675

Set the covariates parameter of the make_users_data function to True to add the following columns with pre-experimental data:

  • sessions_covariate: Number of sessions before the experiment.
  • orders_covariate: Number of orders before the experiment.
  • revenue_covariate: Revenue before the experiment.

Define the metrics' covariates:

  • In Mean, specify the covariate using the covariate parameter.
  • In RatioOfMeans, specify the covariates for the numerator and denominator using the numer_covariate and denom_covariate parameters, respectively.
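
For intuition about why a covariate helps, here is a minimal sketch of the textbook CUPED adjustment on simulated data: the metric is shifted by theta times the centered covariate, which keeps its mean and lowers its variance when the two are correlated. The cuped_adjust function and the data are hypothetical, and tea-tasting's actual estimator (for example, how it estimates theta across variants) may differ.

```python
import random
from statistics import mean, variance

random.seed(0)

# Simulated data: the pre-experiment covariate is correlated with the metric.
covariate = [random.gauss(10, 2) for _ in range(1000)]
metric = [x + random.gauss(0, 1) for x in covariate]


def cuped_adjust(metric, covariate):
    """Return the metric adjusted by theta * (covariate - mean(covariate))."""
    mean_y, mean_x = mean(metric), mean(covariate)
    cov_xy = sum(
        (y - mean_y) * (x - mean_x) for y, x in zip(metric, covariate)
    ) / (len(metric) - 1)
    theta = cov_xy / variance(covariate)  # variance-minimizing coefficient
    return [y - theta * (x - mean_x) for y, x in zip(metric, covariate)]


adjusted = cuped_adjust(metric, covariate)
print(f"variance before: {variance(metric):.2f}, after: {variance(adjusted):.2f}")
```

A lower variance means narrower confidence intervals for the same sample size, which is consistent with the tighter intervals in the output above.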

Sample ratio mismatch check

The SampleRatio class in tea-tasting detects mismatches in the sample ratios of different variants of an A/B test.

Example usage:

import tea_tasting as tt


experiment = tt.Experiment(
    sample_ratio=tt.SampleRatio(),
)

data = tt.make_users_data(seed=42)
result = experiment.analyze(data)
print(result.to_string(("control", "treatment", "pvalue")))
#>       metric control treatment pvalue
#> sample_ratio    2023      1977  0.477

By default, SampleRatio expects an equal number of observations across all variants. To specify a different ratio, use the ratio parameter. It accepts two types of values:

  • Ratio of the number of observations in treatment relative to control, as a positive number. Example: SampleRatio(0.5).
  • A dictionary with variants as keys and expected ratios as values. Example: SampleRatio({"A": 2, "B": 1}).

The method parameter determines the statistical test to apply:

  • "auto": Apply exact binomial test if the total number of observations is less than 1000, or normal approximation otherwise.
  • "binom": Apply exact binomial test.
  • "norm": Apply normal approximation of the binomial distribution.
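
The following sketch shows the two tests for the default case of an equal expected split. The sample_ratio_pvalue function is a hypothetical helper, and the continuity correction in the normal approximation is an assumption of this sketch, not a documented detail of tea-tasting. For the counts in the example above (2023 and 1977), it yields a p-value of approximately 0.477.

```python
import math
from statistics import NormalDist


def sample_ratio_pvalue(n_control, n_treatment, method="auto"):
    """Two-sided test that both variants are equally likely (expected share 0.5)."""
    n = n_control + n_treatment
    if method == "auto":
        method = "binom" if n < 1000 else "norm"
    k = max(n_control, n_treatment)
    if method == "binom":
        # Exact test: Binomial(n, 0.5) is symmetric, so double the upper tail.
        tail = sum(math.comb(n, i) for i in range(k, n + 1))
        return min(1.0, 2 * tail / 2**n)
    if method == "norm":
        # Normal approximation with a continuity correction (assumption).
        z = max(0.0, k - n / 2 - 0.5) / math.sqrt(n / 4)
        return 2 * (1 - NormalDist().cdf(z))
    raise ValueError(f"Unknown method: {method!r}")


print(sample_ratio_pvalue(2023, 1977, method="norm"))
print(sample_ratio_pvalue(2023, 1977, method="binom"))
```

A small p-value suggests that the observed split is unlikely under the expected ratio, which usually points to a problem in assignment or logging rather than a treatment effect.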

The result of the sample ratio mismatch check includes the following attributes:

  • metric: Metric name.
  • control: Number of observations in control.
  • treatment: Number of observations in treatment.
  • pvalue: P-value.

Global settings

In tea-tasting, you can change defaults for the following parameters:

  • alternative: Alternative hypothesis.
  • confidence_level: Confidence level of the confidence interval.
  • equal_var: If False, assume unequal population variances in calculation of the standard deviation and the number of degrees of freedom. Otherwise, assume equal population variance and calculate pooled standard deviation.
  • n_resamples: The number of resamples performed to form the bootstrap distribution of a statistic.
  • use_t: If True, use Student's t-distribution in p-value and confidence interval calculations. Otherwise, use Normal distribution.
  • And more.

Use get_config with the option name as a parameter to get a global option value:

import tea_tasting as tt


tt.get_config("equal_var")
#> False

Use get_config without parameters to get a dictionary of global options:

global_config = tt.get_config()

Use set_config to set a global option value:

tt.set_config(equal_var=True, use_t=False)

experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions"),
    orders_per_user=tt.Mean("orders"),
    revenue_per_user=tt.Mean("revenue"),
)

experiment.metrics["orders_per_user"]
#> Mean(value='orders', covariate=None, alternative='two-sided',
#> confidence_level=0.95, equal_var=True, use_t=False)

Use config_context to temporarily set a global option value within a context:

with tt.config_context(equal_var=True, use_t=False):
    experiment = tt.Experiment(
        sessions_per_user=tt.Mean("sessions"),
        orders_per_session=tt.RatioOfMeans("orders", "sessions"),
        orders_per_user=tt.Mean("orders"),
        revenue_per_user=tt.Mean("revenue"),
    )

experiment.metrics["orders_per_user"]
#> Mean(value='orders', covariate=None, alternative='two-sided',
#> confidence_level=0.95, equal_var=True, use_t=False)
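
Note that the metric above keeps equal_var=True and use_t=False even after the with block ends, which suggests that defaults are read when a metric is created, not when the experiment is analyzed. The toy model below imitates that behavior with the standard library. All names in it (toy_config, toy_get_config, toy_config_context, ToyMean) are hypothetical stand-ins, not tea-tasting's code.

```python
from contextlib import contextmanager

# Toy global defaults (stand-in for the library's global settings).
toy_config = {"equal_var": False, "use_t": True}


def toy_get_config(name):
    return toy_config[name]


@contextmanager
def toy_config_context(**options):
    """Temporarily override global defaults, restoring them on exit."""
    previous = dict(toy_config)
    toy_config.update(options)
    try:
        yield
    finally:
        toy_config.clear()
        toy_config.update(previous)


class ToyMean:
    def __init__(self, value, equal_var=None):
        self.value = value
        # The default is read once, at construction time.
        self.equal_var = (
            toy_get_config("equal_var") if equal_var is None else equal_var
        )


with toy_config_context(equal_var=True):
    inside = ToyMean("orders")
outside = ToyMean("orders")

print(inside.equal_var, outside.equal_var, toy_get_config("equal_var"))
```

In this model, a metric created inside the context keeps the overridden value, while the global default is back to its previous value afterwards.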

More than two variants

In tea-tasting, it's possible to analyze experiments with more than two variants. However, the variants will be compared in pairs through two-sample statistical tests.

How variant pairs are determined:

  • Default control variant: When the control parameter of the analyze method is set to None, tea-tasting automatically compares each variant pair. The variant with the lowest ID in each pair is a control.
  • Specified control variant: If a specific variant is set as control, it is then compared against each of the other variants.

The result of the analysis is a dictionary of ExperimentResult objects with tuples (control, treatment) as keys.
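
The pairing rules described above can be sketched as follows. The variant_pairs function is a hypothetical helper that only reproduces the described rules for building (control, treatment) keys; it is not part of tea-tasting.

```python
from itertools import combinations


def variant_pairs(variants, control=None):
    """Return (control, treatment) pairs following the rules described above."""
    ordered = sorted(variants)
    if control is None:
        # Every pair is compared; the lower ID in each pair is the control.
        return list(combinations(ordered, 2))
    # A fixed control is compared against each of the other variants.
    return [(control, variant) for variant in ordered if variant != control]


print(variant_pairs([0, 1, 2]))
print(variant_pairs([0, 1, 2], control=0))
```

With three variants, leaving the control unset produces three comparisons, while fixing the control produces two.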

Keep in mind that tea-tasting does not adjust for multiple comparisons. When dealing with multiple variant pairs, additional steps may be necessary to account for this, depending on your analysis needs.
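
One simple option is the Bonferroni correction, which multiplies each p-value by the number of comparisons. The sketch below applies it to made-up p-values keyed by (control, treatment) pairs. The bonferroni function is a hypothetical helper, and whether this conservative correction suits your analysis is a judgment call.

```python
def bonferroni(pvalues):
    """Return Bonferroni-adjusted p-values, capped at 1."""
    n_comparisons = len(pvalues)
    return {
        pair: min(1.0, pvalue * n_comparisons)
        for pair, pvalue in pvalues.items()
    }


# Made-up p-values for one metric across three variant pairs.
raw_pvalues = {(0, 1): 0.01, (0, 2): 0.04, (1, 2): 0.5}
adjusted_pvalues = bonferroni(raw_pvalues)
print(adjusted_pvalues)
```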