Marketing Experiments in 2025: A Guide for Marketers

Why Marketing Experiments Matter

In today’s digital marketing landscape, businesses face a paradox: they have more data than ever, yet measuring the true impact of marketing efforts remains a challenge. Traditional analytics often mislead marketers into believing correlation implies causation, leading to ineffective budget allocation and suboptimal marketing strategies. Marketing experiments provide a solution—offering a rigorous way to determine what truly drives business growth.

The Problem with Traditional Measurement

Historically, marketers have relied on last-click attribution and multi-touch attribution (MTA) to gauge campaign effectiveness. However, these approaches have limitations:
  • Last-click attribution gives undue credit to the final touchpoint, ignoring prior brand exposure and touchpoints.
  • MTA attempts to track user paths but struggles with privacy restrictions like cookie deprecation and walled gardens (Facebook, Google).

Alternatively, MMM (marketing mix modelling) can be used to measure effectiveness. However, despite great progress in MMM automation, it is still a complex undertaking best suited to larger brands, and it will typically require multiple calibrations before it provides reliable results.

Incrementality Measurement: Understanding True Impact, Lift vs Correlation

In digital marketing, one of the biggest challenges is understanding whether a campaign is truly driving additional conversions or if those conversions would have happened regardless. Incrementality measurement helps solve this problem by determining the true lift of a marketing effort beyond what would have naturally occurred. 

In other words, incrementality is the additional value generated by a marketing activity. Unlike correlation, which shows a relationship between variables, incrementality measures causation—whether an action (e.g., running an ad) directly influenced an outcome (e.g., a purchase).

For example, if a brand runs a paid search campaign and sees a spike in sales, it may assume the campaign caused the increase. However, without a proper control group, it is impossible to tell whether those customers would have purchased anyway. Incrementality testing solves this by comparing a test group (exposed to the campaign) with a control group (not exposed) and measuring the difference.

The Problem with Last-Click Attribution & Traditional Attribution Analytics

Many marketers still rely only on attribution models. While sophisticated data-driven attribution models are useful, they can still struggle to identify the true incremental impact of marketing activities. Add to that the difficulty of obtaining relevant user-level data for upper-funnel activities (e.g., user-level data for YouTube ad views), and the fact that offline activities (such as TV advertising) and offline sales tend to be ignored by attribution models entirely.

The Business Case for Experiments

Unlike traditional methods, experiments enable marketers to determine incrementality—the actual lift a campaign provides beyond organic performance. Experiments answer crucial questions like: Did this campaign drive conversions that would not have happened anyway? Does increasing spend on a channel actually pay for itself?

Companies that embrace continuous experimentation see 30–45% better ad performance compared to those that don’t (see the Google research cited in the Best Practices section below).

…but Experimentation Adoption Is Still Low – Your Chance to Get Ahead of the Market

Despite its benefits, many firms still underutilize experiments. Research by Meta indicates that fewer than 25% of advertisers run controlled tests regularly. Stated reasons and barriers include:

  • Fear of lost revenue: Withholding ads from a holdout group can feel risky – in practice this can be mitigated by good, thoughtful test design that minimizes the revenue at risk.
  • Lack of internal expertise: Running rigorous tests requires statistical know-how – this is true to some extent, and proper test design and execution are crucial. However, there are already tools that significantly simplify the process for marketers.
  • Organizational inertia: Some companies still operate on outdated measurement models (or even prefer that marketing effectiveness not be measured at all) and/or are so risk-averse that innovating or embracing a test-and-learn approach is generally hard for them. In that case, experiments are just one of many victims. The truth is that this is hard to overcome and needs to be rectified from the top down.

On the other hand, firms that successfully integrate experimentation into their marketing strategy achieve better ad performance, higher conversion rates, stronger customer retention, and better ad efficiency (see the Best Practices section below).

Fundamentals of Marketing Experiments

Introduction

Marketing experiments are essential for understanding causal relationships—helping marketers distinguish between correlation and true impact. Unlike traditional measurement techniques, experiments provide scientific validation for marketing decisions. In the following paragraphs we lay out the fundamental principles of designing and running effective marketing experiments, ensuring actionable insights and reliable results.

What is a Marketing Experiment?

A marketing experiment is a structured approach to testing a hypothesis by isolating specific variables and measuring their direct impact. At its core, an experiment consists of:

  • A test group exposed to the marketing treatment (e.g., an ad campaign).
  • A control group that does not receive the treatment, allowing for comparison.
  • A success metric that quantifies the effect of the treatment (e.g., conversion rate, revenue lift).

By comparing these groups, marketers can determine whether the intervention (e.g., a new channel, a new campaign, or a change in budget) drove real, incremental change.

Reliable and rigorous experiments

Marketing experiments should follow this process to ensure rigor and reliability:

  1. Formulate a Hypothesis – Define a clear, testable statement (e.g., “Increasing spend on Facebook ads by X will lead to a 10% increase in conversions”).
  2. Select Variables – Identify independent (the marketing action – a change in budget, a new campaign, a new channel, turning off a channel, etc.) and dependent (the performance metric – revenue, new app installs, new customers, etc.) variables.
  3. Randomize and Isolate – Ensure that groups are statistically equivalent to avoid bias.
  4. Execute – Run the test as designed, keeping everything else as unchanged as possible.
  5. Analyze Results – Use statistical techniques to determine if the effect is significant.
  6. Act on conclusions – Apply learnings to optimize future campaigns.

Example 1

A fashion retailer runs a Facebook campaign and wants to measure its impact. They divide their audience randomly into two groups—one sees the ad (test group), and the other doesn’t (control group). After four weeks, they compare conversion rates. If the test group significantly outperforms the control, they can attribute the lift to the campaign.
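To make this concrete, here is a minimal sketch (with made-up conversion counts) of how the comparison in Example 1 could be evaluated using a two-proportion z-test; a real analysis would plug in your own numbers and your pre-agreed test design.

```python
# Minimal sketch: evaluating a user-split test like Example 1.
# The conversion counts below are hypothetical placeholders.
from statsmodels.stats.proportion import proportions_ztest

conversions = [1_320, 1_150]    # conversions in test group, control group
visitors    = [50_000, 50_000]  # users randomly assigned to each group

# One-sided test: is the test group's conversion rate larger than the control's?
z_stat, p_value = proportions_ztest(conversions, visitors, alternative="larger")

test_rate, control_rate = (c / n for c, n in zip(conversions, visitors))
lift = (test_rate - control_rate) / control_rate

print(f"test CR = {test_rate:.2%}, control CR = {control_rate:.2%}")
print(f"relative lift = {lift:.1%}, p-value = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the lift is unlikely to be pure chance.
```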

Example 2

A fintech app company wants to understand whether their video ads drive new acquisitions. They split US states into two groups: in one group they run the ads, in the other they don’t. They then measure the uplift in the states where the campaign ran.

Key Statistical Concepts for Experimentation

Understanding core statistical principles is crucial for designing sound experiments:

  • Causality vs. Correlation – Just because two events occur together doesn’t mean one caused the other. Other methods like attribution modelling or media mix modelling rely heavily on measuring correlations (and then use various sophisticated techniques to mitigate the risk of mixing the two up). Experiments, on the other hand, measure causation by design.
  • Randomization – Ensures fair comparison between test and control groups.
  • Statistical Significance – Determines whether an observed effect is likely due to chance.
  • Confidence Intervals – Provides a range in which the true effect likely falls.
  • Power Analysis – Ensures the sample size is large enough to detect meaningful differences – this is typically done during the test design phase and the outputs include recommended test budget and test duration.

1) Causality vs. Correlation

Just because two things happen together (correlation) doesn’t mean one caused the other (causation). A classic example is ice cream sales and drowning incidents – both tend to increase in the summer, but one doesn’t cause the other. They’re both influenced by a third factor: warm weather.

  • Correlation: Measures the degree to which two variables move together. It can be positive (both increase or decrease together), negative (one increases while the other decreases), or non-existent. Correlation does not prove cause and effect.
  • Causation: Indicates that one event directly leads to another. Experiments are designed to isolate and test causal relationships. By manipulating one variable (the independent variable, like a new website design) and measuring its effect on another (the dependent variable, like conversion rate), while controlling for other factors, we can establish causation. Attribution modeling and media mix modeling often rely on correlations, but they employ statistical techniques (and sometimes assumptions) to infer causality, not directly measure it like experiments do. They are much more susceptible to confounding variables.

2) Randomization

This is the cornerstone of a well-designed experiment. Randomly assigning participants or units to either the test group (exposed to the change) or the control group (not exposed) helps ensure that any differences observed between the groups are likely due to the change being tested, and not pre-existing differences between the groups. Randomization balances out potential confounding variables (known and unknown) across the groups, making the comparison as fair as possible. Different randomization techniques exist (simple, stratified, block), each with its own advantages depending on the experimental setup.
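As an illustration, here is a minimal sketch of simple vs. stratified random assignment; the units and the “size” strata are hypothetical placeholders, and real tooling typically adds balance checks on historical data.

```python
# Minimal sketch: simple vs. stratified random assignment of units to test/control.
import random

random.seed(42)  # fixed seed so the assignment is reproducible and auditable

units = [
    {"unit": "region_A", "size": "large"}, {"unit": "region_B", "size": "large"},
    {"unit": "region_C", "size": "small"}, {"unit": "region_D", "size": "small"},
    {"unit": "region_E", "size": "small"}, {"unit": "region_F", "size": "large"},
]

# Simple randomization: shuffle all units and split the list in half.
shuffled = random.sample(units, len(units))
simple_test, simple_control = shuffled[:len(units) // 2], shuffled[len(units) // 2:]

# Stratified randomization: split within each stratum so both groups
# end up with a similar mix of large and small regions.
test, control = [], []
for stratum in ("large", "small"):
    members = [u["unit"] for u in units if u["size"] == stratum]
    random.shuffle(members)
    half = len(members) // 2
    test.extend(members[:half])
    control.extend(members[half:])

print("stratified test group:", test)
print("stratified control group:", control)
```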

3) Statistical Significance

This helps us determine if the observed difference between the test and control groups is likely a real effect or simply due to random chance. It’s typically expressed as a p-value. A low p-value (e.g., less than 0.05) indicates that the observed result is statistically significant, meaning it’s unlikely to have occurred by chance alone. It does not tell you the size or practical importance of the effect. Statistical significance is often misinterpreted. It’s crucial to remember that it doesn’t guarantee a practically meaningful result.

4) Confidence Intervals

A confidence interval provides a range of values within which the true population effect is likely to fall, with a certain level of confidence (e.g., 95%). For example, a 95% confidence interval for a conversion rate lift might be [2%, 5%]. This means we are 95% confident that the true lift in conversion rate due to the change is somewhere between 2% and 5%. Confidence intervals provide more information than just a p-value by giving a sense of the magnitude of the effect and its uncertainty. A wider interval indicates more uncertainty about the true effect.
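For illustration, here is a minimal sketch of a 95% confidence interval for the lift (difference in conversion rates) using a simple normal approximation; the counts are hypothetical.

```python
# Minimal sketch: 95% confidence interval for the difference between two
# conversion rates, using a normal approximation. Counts are hypothetical.
from math import sqrt

test_conv, test_n = 1_320, 50_000
ctrl_conv, ctrl_n = 1_150, 50_000

p_test, p_ctrl = test_conv / test_n, ctrl_conv / ctrl_n
diff = p_test - p_ctrl

# Standard error of the difference between two independent proportions.
se = sqrt(p_test * (1 - p_test) / test_n + p_ctrl * (1 - p_ctrl) / ctrl_n)
z = 1.96  # ~95% confidence under the normal approximation

lower, upper = diff - z * se, diff + z * se
print(f"lift = {diff:.3%}, 95% CI = [{lower:.3%}, {upper:.3%}]")
# An interval that excludes zero points to a real effect; its width shows the uncertainty.
```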

5) Power Analysis

This is a crucial step before running an experiment. It determines the minimum sample size needed to detect a statistically significant effect of a given size, with a desired probability of detecting it (power). Key inputs to a power analysis include:

  • Desired effect size: The minimum difference between the groups that you want to be able to detect. Smaller effect sizes require larger sample sizes.
  • Significance level (alpha): The probability of finding a statistically significant result when there is no real effect (Type I error). Usually set at 0.05.
  • Power (1 – beta): The probability of finding a statistically significant result when a real effect of the specified size exists. Often set at 0.80 or higher.
  • Variability: The amount of variation in the outcome variable. Higher variability requires larger sample sizes.

The output of a power analysis helps determine the required sample size, which translates to test duration and budget. It prevents running underpowered experiments that are unlikely to detect meaningful effects, even if they exist, and also helps avoid running overly long and expensive experiments when a smaller sample size would suffice.
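As an illustration, here is a minimal power-analysis sketch using statsmodels; the baseline conversion rate, the lift you want to detect, alpha, and power are assumptions you would replace with your own inputs.

```python
# Minimal sketch: sample size per group needed to detect a given lift in conversion rate.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_cr = 0.025                # expected control conversion rate (assumption)
expected_cr = baseline_cr * 1.10   # the +10% relative lift you want to be able to detect

effect_size = proportion_effectsize(expected_cr, baseline_cr)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # significance level (Type I error rate)
    power=0.80,   # probability of detecting the effect if it is real
    ratio=1.0,    # equally sized test and control groups
)
print(f"required sample size per group: {n_per_group:,.0f} users")
# Dividing this by the expected daily traffic per group gives a rough test duration,
# which in turn translates into a test budget.
```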

As a business user you don’t need rigorous, comprehensive knowledge of these topics – a basic, intuitive understanding of the key concepts is enough. On the other hand, those who actually design and analyze the experiments should be well versed in the relevant statistical methods and experiment design – otherwise you risk drawing incorrect conclusions.

Types of Marketing Experiments

Different experiments suit different marketing goals and use cases. Common types include:

1) One way of categorizing is by how many variants are compared in the test:

I. A/B Testing

Compares two versions of an asset (ad, creative, etc.), offer type, or budget level to measure performance differences. There are multiple ways to construct the test/control groups – e.g., based on users or based on geographical locations.

II. Multivariate Testing

Tests multiple variables simultaneously to identify the best combination.

2) Another way to classify experiments is to look at how test/control groups are created:

III. User-based Experiment

Test and control groups are based on individual users. This method is often employed by in-platform testing solutions (Google, Meta), which rely on having access to individual user data and identification (e.g., being able to recognize the same real user across multiple devices). For certain cases they are the best solution; for many others they are unusable in practice.

IV. Geo-based Experiments

 A method of constructing test and control groups by using different geo locations. For illustration you can imagine running a campaign in certain US states and not running it in others – this would be a geo-experiment. Geo-experiments are probably the most frequently used type of experiment in measuring marketing effectiveness.

V. Time-based Experiments

Here there is no “real” control group – sometimes it is not possible to have one (e.g. when a media type only allows nation-wide targeting – such as TV in many countries). In this case there are ways to construct “artificial control groups” using some clever statistics that approximate what would happen without the intervention (campaign). This type of test is the least reliable but sometimes there is no other option.
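To illustrate the idea, here is a deliberately simplified sketch that fits a linear trend to pre-campaign revenue and uses the extrapolated trend as the “artificial control”; dedicated tools (e.g., Google’s CausalImpact) build such counterfactuals far more rigorously, so treat this only as an intuition aid.

```python
# Minimal sketch of a time-based test: use a pre-campaign trend as an artificial baseline.
# The weekly revenue figures are made up for illustration.
import numpy as np

pre_revenue  = np.array([100, 102, 105, 103, 108, 110, 112, 111])  # weeks before the campaign
post_revenue = np.array([125, 128, 130, 127])                      # weeks during the campaign

# Fit a linear trend to the pre-period and extrapolate it over the campaign weeks.
weeks_pre = np.arange(len(pre_revenue))
slope, intercept = np.polyfit(weeks_pre, pre_revenue, deg=1)
weeks_post = np.arange(len(pre_revenue), len(pre_revenue) + len(post_revenue))
baseline = intercept + slope * weeks_post

incremental = post_revenue - baseline
print(f"estimated incremental revenue vs. the artificial baseline: {incremental.sum():.1f}")
# Without a real control group this estimate is fragile: seasonality, promotions or
# market shifts during the campaign weeks all get counted as "incremental".
```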


How Incrementality Testing & Experiments Fits into Modern Measurement Frameworks

Sophisticated marketers use all three major methods of marketing measurement:

MMM

  • To get a holistic view of all major marketing efforts and other demand drivers (pricing, promotions, competition, etc.)
  • MMMs are used for planning, budget allocation optimization, and overall marketing effectiveness reporting.
  • Traditional MMMs were updated 1–4x per year, but modern MMM solutions make it easy to refresh the model and results weekly. Today, MMMs can already provide up-to-date results reflecting recent performance changes.

Incrementality & Experiments

  • Are indispensable to validate and calibrate MMM and MTA results - in this way they complement MMM and MTA
  • Are the only way to measure true incrementality and causal impact of marketing activities
  • Even if you have MMM in place, experiments are great for measuring specific interventions (free delivery promotion, change in pricing etc)

MTA / Attribution Model

  • To get tactical and granular insights for digital campaigns
  • MTAs are best used for daily operations in digital marketing

However, you don’t have to start with all three at once. In our experience:

  • If you are a smaller brand (spending <30k USD monthly on media) start using MTA + occasionally experiments for large changes or major campaigns
  • If you are a mid-sized advertiser (30-100k monthly spend) start using experiments systematically
  • Once you get to 100-150k+ USD of monthly media spend, start thinking about MMM

Geo-experiments - the workhorse of marketing tests

Let’s take a look at the most common type of experiment you will encounter – geo experiments:

  • These tests do not split individual users into test and control groups but instead work on a less granular level – they split geo-locations.
  • These are usable for platforms that allow some form of location targeting – most digital advertising platforms already have this capability, even though the granularity of targeting may differ.
  • In geo-based experiments we split the outcome metric (e.g., total revenue) by location (regions, cities, zip codes) and from these locations we form geo-based test and control groups. In the test locations we then perform the change (e.g., turn on the newly tested channel, or reduce spend on an existing one by a certain amount) and try to keep all other parameters unchanged – i.e., the tested change should be the only difference between the test and control groups. Selecting the test and control groups is part of the test design process – generally you want test and control to be as similar as possible. Modern approaches often use so-called synthetic control methods, which build an artificial control unit by combining control locations with various weights, producing a “control” that is more similar to the tested locations than any single control location would be. This selection process is usually handled by dedicated statistical tooling.
  • After the test period we essentially compare the test and control outcome metrics – e.g., if the newly tested channel brings incremental revenue, we should see an uplift in the tested locations vs. the control locations (the frameworks and the sketch below help with this process).
  • Geo experiments are the primary method for establishing channel incrementality / incremental ROI – they are both rigorous and privacy-safe.

Geo experiments are great for testing the effect of a change / intervention in an advertising platform on your total results, eg:

  • Change in spend level on total revenue (online or even offline revenue)
  • Turning on a new type of campaign and its effect on customer acquisition

In geo experiments the dependent variable is not biased by any attribution – instead you measure the impact on the total value. You are measuring the impact of, for example, TikTok ads on your total (real) revenue, not on what TikTok thinks the revenue is (its platform attribution) or what GA4 thinks the revenue is. This independence from any attribution model is a key advantage of this test type.

You need to be able to observe the dependent variable (e.g., total revenue, total new customers) at the geographical granularity of your test. So if you are preparing a region-based test and your dependent variable is total revenue, you must be able to get total revenue by region – both historical data (used for selecting the test and control locations properly) and data for the test evaluation. This is often easy in principle, but beware of nuances in geolocation reliability. Although geolocation in modern platforms appears to be quite precise, there are underlying issues that may need to be addressed – some technical (e.g., ISP IP allocation practices, mobile carriers routing traffic through major cities regardless of the user’s actual location, dynamic IP addressing), others behavioural, such as intra-day migration (people commuting across locations during the day). These generally require careful test design. There are several good open-source frameworks that can help you design and evaluate geo-tests, such as Meta’s GeoLift or Google’s CausalImpact.
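For intuition, here is a minimal sketch of a naive difference-in-differences evaluation of a geo test on hypothetical region-level revenue; the frameworks mentioned above replace this with matched or synthetic controls and proper statistical inference.

```python
# Minimal sketch: naive difference-in-differences evaluation of a geo test.
# Region-level revenue figures are hypothetical placeholders.
import pandas as pd

data = pd.DataFrame({
    "region": ["north", "south", "east", "west"],
    "group":  ["test", "test", "control", "control"],
    "pre_revenue":  [420_000, 380_000, 410_000, 395_000],   # before the change
    "test_revenue": [455_000, 410_000, 425_000, 405_000],   # during the test period
})

growth = (
    data.assign(growth=lambda d: d["test_revenue"] / d["pre_revenue"] - 1)
        .groupby("group")["growth"]
        .mean()
)
incremental_lift = growth["test"] - growth["control"]
print(f"test growth {growth['test']:.1%}, control growth {growth['control']:.1%}")
print(f"estimated incremental lift from the tested change: {incremental_lift:.1%}")
```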

An end-to-end real-world example might look like this:

1) You, as the Head of Digital Advertising, want to know the true ROAS of your TikTok ads, as you see very conflicting numbers in TikTok reports and Google Analytics 4. Your hypothesis could be formulated as: given our current spend level, the true TikTok Ads ROAS is at least 3.

2) After discussing with your analyst you have settled on the following high level test/geo-experiment design:

    • You will select a group of regions where you will turn TikTok Ads off for some time, and you will measure how much total revenue in those regions decreases in order to quantify the incremental value of TT Ads.
    • Your constraints for the test: you are not willing to risk much revenue, so regions representing at most 15% of your revenue can be tested (i.e., have TT Ads turned off). And as TT is still a smaller advertising channel for you, you are not really risking 15% of total revenue. If TT Ads drove, say, 10% of your revenue (right now you don’t know – that’s why you need the test – but you can still make an educated guess about what is realistic), you would be risking 10% of the revenue in regions representing 15% of your total revenue, i.e., your revenue at risk would be about 1.5% for a few weeks. That is acceptable for you (see the quick calculation below).
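The revenue-at-risk reasoning above, spelled out (both percentages are the example’s own assumptions):

```python
# Revenue-at-risk arithmetic from the test design above (example assumptions).
share_of_revenue_in_test_regions = 0.15  # regions selected for the TT Ads holdout
assumed_tiktok_revenue_share     = 0.10  # educated guess of TikTok's contribution

revenue_at_risk = share_of_revenue_in_test_regions * assumed_tiktok_revenue_share
print(f"worst-case revenue at risk during the test: {revenue_at_risk:.1%}")  # 1.5%
```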

3) You check with relevant people in the company to see if there is anything that would render some regions unsuitable or could contaminate the test (eg a local PR event, local promo with a retailer etc.).

4) Your analyst will now prepare a so-called power analysis – essentially, the analyst prepares various test scenarios with various parameters: which regions should be in test/control, which should be completely excluded from the test, how long the test should run, what the likelihood of achieving statistically significant results is, etc.

5) After going through the variants with your analyst, you select the final test design – the test will run for 4 weeks in 7 regions.


6) You inform relevant people in your company about the upcoming test to make sure that there are no surprises and your team prepares the test execution.

7) The test is running.

8) Your analyst prepares the test evaluation: you get a report showing the lost revenue and the calculated ROAS of TT Ads, confidence intervals of the results, visualizations, etc., and you discuss it with your analyst.
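For illustration, here is a minimal sketch of how the report’s headline numbers could be derived, assuming the analysis has already produced an estimate of the revenue lost in the holdout regions and its confidence interval; all figures are hypothetical.

```python
# Minimal sketch: turning the measured revenue loss in holdout regions into an
# incremental ROAS estimate. All figures are hypothetical placeholders.
lost_revenue_estimate = 180_000             # revenue decline attributed to pausing TT Ads
lost_revenue_ci       = (150_000, 240_000)  # 95% confidence interval from the analysis
paused_spend          = 45_000              # TT Ads spend saved in the test regions/weeks

incremental_roas = lost_revenue_estimate / paused_spend
roas_ci = tuple(bound / paused_spend for bound in lost_revenue_ci)

print(f"incremental ROAS estimate: {incremental_roas:.1f}")
print(f"95% CI: [{roas_ci[0]:.1f}, {roas_ci[1]:.1f}]")
# If the whole interval sits above your break-even ROAS (here, the hypothesized 3),
# the result supports keeping or scaling the channel.
```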

9) Based on the result, you decide that it makes sense to continue with TT Ads. In fact, the ROAS seems higher than you expected, so for now you keep the pre-test spend levels – but you already want to test another hypothesis: what happens if we increase spend on TT Ads by 30%?

Experiments Best Practices

Always-On Experimentation Culture

Top-performing advertisers run continuous experiments to optimize their strategies over time. Google’s research suggests that advertisers conducting at least 15 experiments per year see a 30% increase in ad performance in the same year and a 45% boost the following year (https://services.google.com/fh/files/misc/2022_experiment_with_google_ads_playbook.pdf).

Leading companies consistently outperform their competitors because they embrace a test-and-learn mindset. Instead of making decisions based on assumptions, they leverage rigorous experimentation to guide their marketing investments. Studies show that companies running frequent experiments achieve higher conversion rates, stronger customer retention, and better ad efficiency.

Calibrating MMM and MTA with Experiment Data

Marketers should use incrementality tests to validate and refine their MMM and MTA models. This approach ensures that data-driven decisions are grounded in real-world causality rather than flawed attribution assumptions.

Pay attention to good execution and have realistic expectations

Do try to design and execute tests properly, make sure you have selected the right test and control groups and try to minimize factors that could contaminate the testing process (we understand this is easier said than done in practice – the reality tends to be messy). At the same time be prepared that despite your best efforts some tests will yield no definitive conclusions – don’t be discouraged by it. You should judge the testing program on its overall results eg after a year, not after a single test. Industry research and studies consistently show that companies that commit to consistent experimentation tend to outperform those that do not.

Don’t be afraid to test big

Try to test big and potentially highly impactful changes. Just focusing on lots of “small things” may feel more comfortable and safer but in our experience mostly leads to failure – the organisation spends a lot of time and effort on testing things that don’t really have any business impact.

Integrate testing into decision making

A common issue: a test is designed, executed, evaluated…and nothing happens – the test result remains a one-off insight with no real impact. There can be multiple reasons for this: organizational inertia (the test shows we should turn off X, but in our management reporting based on last-click data X looks great, and this is what the “insert Sr Exec role” is used to…), internal politics, a risk-averse company culture, the HiPPO effect (deferring to the highest-paid person’s opinion), etc. This is a common pain and the moment where the company and its leadership show whether they actually value a data-driven culture or just want “analytics” to confirm their preexisting assumptions and biases.

Recommended reading and resources
