In today’s digital marketing landscape, businesses face a paradox: they have more data than ever, yet measuring the true impact of marketing efforts remains a challenge. Traditional analytics often mislead marketers into believing correlation implies causation, leading to ineffective budget allocation and suboptimal marketing strategies. Marketing experiments provide a solution—offering a rigorous way to determine what truly drives business growth.
In digital marketing, one of the biggest challenges is understanding whether a campaign is truly driving additional conversions or if those conversions would have happened regardless. Incrementality measurement helps solve this problem by determining the true lift of a marketing effort beyond what would have naturally occurred.
In other words, incrementality is the additional value generated by a marketing activity. Unlike correlation, which shows a relationship between variables, incrementality measures causation—whether an action (e.g., running an ad) directly influenced an outcome (e.g., a purchase).
For example, if a brand runs a paid search campaign and sees a spike in sales, it may assume the campaign caused the increase. However, without a proper control group, it is impossible to tell whether those customers would have purchased anyway. Incrementality testing solves this by comparing a test group (exposed to the campaign) with a control group (not exposed) and measuring the difference.
Many marketers still rely only on attribution models. While sophisticated data-driven attribution models are useful, they can still struggle to identify the true incremental impact of marketing activities – not to mention the difficulty of obtaining relevant user-level data for upper-funnel activities (e.g., user-level data for YouTube ad views), or the offline activities (TV advertising) and offline sales that tend to be ignored by attribution models entirely.
Unlike traditional methods, experiments enable marketers to determine incrementality—the actual lift a campaign provides beyond organic performance. Experiments answer crucial questions like:
Companies that embrace continuous experimentation see 30-45% better ad performance compared to those that don’t (see the Google research cited later in this article).
Despite its benefits, many firms still underutilize experiments. Research by Meta indicates that fewer than 25% of advertisers run controlled tests regularly. Stated reasons and barriers include:
On the other hand, firms that successfully integrate experimentation into their marketing strategy achieve:
Marketing experiments are essential for understanding causal relationships—helping marketers distinguish between correlation and true impact. Unlike traditional measurement techniques, experiments provide scientific validation for marketing decisions. In the following paragraphs we lay out the fundamental principles of designing and running effective marketing experiments, ensuring actionable insights and reliable results.
A marketing experiment is a structured approach to testing a hypothesis by isolating specific variables and measuring their direct impact. At its core, an experiment consists of a hypothesis, a test group that is exposed to the change, a control group that is not, and a success metric against which the two groups are compared.
By comparing these groups, marketers can determine whether the intervention (e.g., a new channel, a new campaign, or a change in budget) drove real, incremental change.
Marketing experiments should follow this process to ensure rigor and reliability:
A fashion retailer runs a Facebook campaign and wants to measure its impact. They divide their audience randomly into two groups—one sees the ad (test group), and the other doesn’t (control group). After four weeks, they compare conversion rates. If the test group significantly outperforms the control, they can attribute the lift to the campaign.
A fintech app company wants to understand whether their video ads drive new acquisitions. They split US states into 2 groups - in one group they run the ads, in the other they don’t. Then they can measure the uplift in the states where the campaign ran.
Understanding core statistical principles is crucial for designing sound experiments:
Just because two things happen together (correlation) doesn’t mean one caused the other (causation). A classic example is ice cream sales and drowning incidents – both tend to increase in the summer, but one doesn’t cause the other. They’re both influenced by a third factor: warm weather.
This is the cornerstone of a well-designed experiment. Randomly assigning participants or units to either the test group (exposed to the change) or the control group (not exposed) helps ensure that any differences observed between the groups are likely due to the change being tested, and not pre-existing differences between the groups. Randomization balances out potential confounding variables (known and unknown) across the groups, making the comparison as fair as possible. Different randomization techniques exist (simple, stratified, block), each with its own advantages depending on the experimental setup.
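To make this concrete, here is a minimal sketch of how random assignment might be implemented in practice. It is illustrative only – the unit names, strata, and 50/50 split are assumptions, not a prescribed setup – and it shows both simple and stratified randomization:

```python
import random

def simple_randomize(units, seed=42):
    """Assign each unit to 'test' or 'control' with equal probability."""
    rng = random.Random(seed)
    return {unit: rng.choice(["test", "control"]) for unit in units}

def stratified_randomize(units_by_stratum, seed=42):
    """Within each stratum (e.g. customer segment or region size),
    shuffle units and split them 50/50 so the groups stay balanced."""
    rng = random.Random(seed)
    assignment = {}
    for stratum, units in units_by_stratum.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for unit in shuffled[:half]:
            assignment[unit] = "test"
        for unit in shuffled[half:]:
            assignment[unit] = "control"
    return assignment

# Example: stratify customers by spend tier so both groups contain
# a similar mix of low- and high-value customers.
customers = {
    "high_spend": ["c1", "c2", "c3", "c4"],
    "low_spend": ["c5", "c6", "c7", "c8", "c9", "c10"],
}
print(stratified_randomize(customers))
```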
This helps us determine if the observed difference between the test and control groups is likely a real effect or simply due to random chance. It’s typically expressed as a p-value. A low p-value (e.g., less than 0.05) indicates that the observed result is statistically significant, meaning it’s unlikely to have occurred by chance alone. Statistical significance is often misinterpreted: it does not tell you the size or practical importance of the effect, so a statistically significant result is not necessarily a practically meaningful one.
A confidence interval provides a range of values within which the true population effect is likely to fall, with a certain level of confidence (e.g., 95%). For example, a 95% confidence interval for a conversion rate lift might be [2%, 5%]. This means we are 95% confident that the true lift in conversion rate due to the change is somewhere between 2% and 5%. Confidence intervals provide more information than just a p-value by giving a sense of the magnitude of the effect and its uncertainty. A wider interval indicates more uncertainty about the true effect.
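As an illustration, the sketch below computes both a p-value and a 95% confidence interval for the difference in conversion rates between a test and a control group. The numbers are made up, and this is a standard two-proportion z-test, not any particular vendor’s methodology:

```python
import math
from scipy.stats import norm

# Hypothetical results: conversions / users in each group
test_conv, test_n = 1_300, 50_000        # 2.6% conversion rate
control_conv, control_n = 1_150, 50_000  # 2.3% conversion rate

p_test = test_conv / test_n
p_control = control_conv / control_n
lift = p_test - p_control

# Two-proportion z-test (pooled standard error under the null hypothesis)
p_pooled = (test_conv + control_conv) / (test_n + control_n)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / test_n + 1 / control_n))
z = lift / se_pooled
p_value = 2 * (1 - norm.cdf(abs(z)))

# 95% confidence interval for the difference (unpooled standard error)
se = math.sqrt(p_test * (1 - p_test) / test_n + p_control * (1 - p_control) / control_n)
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se

print(f"absolute lift: {lift:.4f}, p-value: {p_value:.4f}")
print(f"95% CI for the lift: [{ci_low:.4f}, {ci_high:.4f}]")
```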
This is a crucial step before running an experiment. It determines the minimum sample size needed to detect an effect of a given size at a chosen significance level and with a desired probability of detecting it (statistical power). Key inputs to a power analysis include the baseline value of your metric (e.g., the current conversion rate), the minimum detectable effect you care about, the significance level (alpha), and the desired power.
The output of a power analysis helps determine the required sample size, which translates to test duration and budget. It prevents running underpowered experiments that are unlikely to detect meaningful effects, even if they exist, and also helps avoid running overly long and expensive experiments when a smaller sample size would suffice.
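To illustrate, here is a minimal sample-size calculation for a conversion-rate test using the standard closed-form approximation for comparing two proportions. The baseline rate and minimum detectable effect are made-up inputs:

```python
import math
from scipy.stats import norm

def required_sample_size(baseline_rate, minimum_detectable_effect,
                         alpha=0.05, power=0.80):
    """Approximate sample size per group to detect an absolute lift of
    `minimum_detectable_effect` over `baseline_rate` with a two-sided z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Hypothetical inputs: 2.3% baseline conversion rate, we care about a +0.3pp lift
n_per_group = required_sample_size(0.023, 0.003)
print(f"users needed per group: {n_per_group:,}")
```

The smaller the effect you want to detect, the larger the required sample – which is exactly why underpowered tests so often end inconclusively.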
As a business user you don’t need rigorous, comprehensive knowledge of these topics – a basic intuitive understanding of the key concepts is enough. Those who actually design and analyze the experiments, however, should be well-versed in the relevant statistical methods and experiment design – otherwise you risk drawing incorrect conclusions.
Different experiments suit different marketing goals and use cases. Common types include:
A/B testing compares two versions of an asset (an ad, a creative, etc.), an offer type, or budget levels to measure performance differences. There are multiple ways to construct the test and control groups – e.g., based on users or on geographical locations.
Multivariate testing tests multiple variables simultaneously to identify the best-performing combination.
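For instance, a full-factorial design simply crosses every variant of every variable; the quick sketch below enumerates the resulting test cells (the variables and variants are hypothetical):

```python
from itertools import product

# Hypothetical variables and their variants
headlines = ["Free shipping", "20% off"]
images = ["lifestyle", "product-only"]
ctas = ["Shop now", "Learn more"]

# A full-factorial multivariate test exposes each combination to a separate cell
cells = list(product(headlines, images, ctas))
for i, (headline, image, cta) in enumerate(cells, start=1):
    print(f"cell {i}: headline={headline!r}, image={image!r}, cta={cta!r}")
print(f"{len(cells)} cells in total -> the sample size must cover all of them")
```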
In user-based experiments, test and control groups are made up of individual users. This method is often employed by in-platform testing solutions (Google, Meta), which rely on having access to individual user data and identification (e.g., being able to recognize the same user across multiple devices). For certain cases they are the best solution; for many others they are unusable in practice.
Geo-experiments construct test and control groups from different geographical locations. For illustration, imagine running a campaign in certain US states and not running it in others – this would be a geo-experiment. Geo-experiments are probably the most frequently used type of experiment for measuring marketing effectiveness.
Finally, there are tests with no “real” control group – sometimes it is simply not possible to have one (e.g., when a media type only allows nationwide targeting, such as TV in many countries). In this case there are ways to construct “artificial control groups” using statistical methods that approximate what would have happened without the intervention (campaign). This type of test is the least reliable, but sometimes there is no other option.
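To give a flavour of the “artificial control group” idea, here is a heavily simplified sketch: fit the relationship between the target market and unaffected reference series on the pre-campaign period, project it into the campaign period as a counterfactual, and read the lift as the gap. This is an illustrative approximation with simulated data, not a full synthetic-control or Bayesian structural time-series implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical daily revenue: 60 pre-campaign days + 30 campaign days
pre_days, post_days = 60, 30
reference = rng.normal(100, 5, pre_days + post_days)        # unaffected reference series
target = 2.0 * reference + rng.normal(0, 5, pre_days + post_days)
target[pre_days:] += 25                                      # simulated true campaign lift

# Fit the pre-period relationship between the reference and target series
X_pre = reference[:pre_days].reshape(-1, 1)
model = LinearRegression().fit(X_pre, target[:pre_days])

# Project the counterfactual: what would have happened without the campaign
X_post = reference[pre_days:].reshape(-1, 1)
counterfactual = model.predict(X_post)

lift = target[pre_days:] - counterfactual
print(f"estimated average daily lift: {lift.mean():.1f}")
print(f"estimated total incremental revenue: {lift.sum():.0f}")
```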
To recap, incrementality is the additional value generated by a marketing activity. Unlike correlation, which merely shows a relationship between variables (and which attribution models are usually heavily based on), incrementality measures causation—whether an action (e.g., running an ad) directly influenced an outcome (e.g., a purchase).
For example, if a brand runs a paid search campaign and Google Ads or Google Analytics 4 attribute some sales to this campaign, it may assume the campaign caused the sales. However, without a proper control group, it is impossible to tell whether those customers would have purchased anyway. Incrementality testing solves this by comparing a test group (exposed to the campaign) with a control group (not exposed) and measuring the difference.
Sophisticated marketers use all three major methods of marketing measurement: attribution (MTA), marketing mix modeling (MMM), and incrementality experiments.
However, in our experience you don’t have to start with all three at once.
Let’s take a look at the most common type of experiment you will encounter – geo experiments:
Geo experiments are great for testing the effect of a change or intervention in an advertising platform on your total results – e.g., launching a new channel, pausing an existing campaign, or changing budget levels.
In geo experiments the dependent variable is not biased by any attribution – instead you measure the impact on the total value. You are measuring the impact of, say, TikTok ads on your total (real) revenue, not on what TikTok thinks the revenue is (its platform attribution) or what GA4 thinks it is. This independence from any attribution is a key advantage of this test type.
You need to be able to observe the dependent variable (e.g., total revenue, total new customers) at the geographical level – at the granularity of your test. So if you are preparing a region-based test and your dependent variable is total revenue, you must be able to get total revenue by region – both as historical data (used for properly selecting the test and control locations) and later for the test evaluation.

This is often easy in principle, but beware of nuances with geolocation reliability. While geolocation in modern platforms appears to be quite precise, there are underlying issues that may need to be addressed – both technical (e.g., ISP IP-allocation practices, mobile carriers routing traffic through major cities regardless of the user’s actual location, dynamic IP addressing) and behavioral, such as intra-day migration (people moving across locations – commuting to work etc.). These generally require careful test design. There are also several good open-source frameworks that can help you design and evaluate your geo-tests.
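Whatever tooling you end up using, the input for geo-test design and evaluation is typically a simple panel of the dependent variable by date and location. A hypothetical example of the expected shape (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical daily revenue panel: one row per (date, region)
panel = pd.DataFrame({
    "date":    pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-02", "2024-03-02"]),
    "region":  ["region_a", "region_b", "region_a", "region_b"],
    "revenue": [41_200, 118_500, 39_800, 121_300],
})

# Sanity checks before designing the test: full date coverage per region,
# no missing values, enough pre-period history for matching test/control geos
print(panel.pivot(index="date", columns="region", values="revenue"))
print("missing values:", panel["revenue"].isna().sum())
```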
1) You, as the Head of Digital Advertising, want to know the true ROAS of your TikTok ads, as you see very conflicting numbers in TikTok reports and Google Analytics 4. Your hypothesis could be formulated as: given our current spend level, the true TikTok ads ROAS is at least 3.
2) After discussing with your analyst, you settle on the following high-level geo-experiment design: switch off TikTok ads in a set of test regions while keeping them running in the control regions, then measure the impact on total revenue.
3) You check with relevant people in the company to see if there is anything that would render some regions unsuitable or could contaminate the test (eg a local PR event, local promo with a retailer etc.).
4) Your analyst now prepares a so-called power analysis – essentially a set of test scenarios with various parameters: which regions should be in test and control, which should be excluded from the test entirely, how long the test should run, what the likelihood of achieving statistically significant results is, etc.
5) After going through the variants with your analyst, you select the final test design – the test will run for 4 weeks in 7 regions.
6) You inform relevant people in your company about the upcoming test to make sure that there are no surprises and your team prepares the test execution.
7) The test is running.
8) Your analyst prepares the test evaluation: you get a report showing the lost revenue and the implied ROAS of TT Ads, confidence intervals for the results, visualizations etc., and you discuss it with your analyst (a simplified sketch of such an evaluation follows after this walkthrough).
9) Based on the result, you decide it makes sense to continue with TT Ads – in fact the ROAS seems higher than you expected – so for now you keep the pre-test spend levels, but you already want to test another hypothesis: what happens if we increase spend on TT Ads by 30%?
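To make the evaluation step more tangible, here is a heavily simplified sketch of how the “lost revenue” and implied ROAS from such a holdout geo-test might be estimated. All numbers, the pre-test ratio, and the bootstrap interval are hypothetical and simulated; a real evaluation would use a proper geo-experiment framework with matched markets and more careful uncertainty estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
days = 28  # the 4-week test window

# Hypothetical daily revenue during the test (totals across each group of regions)
control_revenue = rng.normal(200_000, 10_000, days)   # control regions: ads kept running
test_revenue = rng.normal(95_000, 6_000, days)        # test regions: ads switched off

# Pre-test ratio between the two groups of regions, estimated from historical data
# (in reality this comes from the pre-period data used in the power analysis)
pre_test_ratio = 0.50

# Counterfactual for the test regions: what the control regions imply they
# would have earned had the ads kept running
counterfactual = control_revenue * pre_test_ratio
daily_lost = counterfactual - test_revenue
lost_revenue = daily_lost.sum()

# Implied ROAS of the switched-off ads: incremental revenue per unit of spend
tiktok_spend_saved = 40_000
roas = lost_revenue / tiktok_spend_saved

# Rough 95% interval by bootstrapping the daily differences
boot = [rng.choice(daily_lost, size=days, replace=True).sum() for _ in range(5_000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"estimated lost revenue over {days} days: {lost_revenue:,.0f}")
print(f"implied ROAS: {roas:.2f} (95% CI roughly {ci_low / tiktok_spend_saved:.2f}"
      f" to {ci_high / tiktok_spend_saved:.2f})")
```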
Top-performing advertisers run continuous experiments to optimize their strategies over time. Google’s research suggests that advertisers conducting at least 15 experiments per year see a 30% increase in ad performance in the same year and a 45% boost the following year (https://services.google.com/fh/files/misc/2022_experiment_with_google_ads_playbook.pdf).
Leading companies consistently outperform their competitors because they embrace a test-and-learn mindset. Instead of making decisions based on assumptions, they leverage rigorous experimentation to guide their marketing investments. Studies show that companies running frequent experiments achieve higher conversion rates, stronger customer retention, and better ad efficiency.
Marketers should use incrementality tests to validate and refine their MMM and MTA models. This approach ensures that data-driven decisions are grounded in real-world causality rather than flawed attribution assumptions.
Do try to design and execute tests properly: make sure you have selected the right test and control groups, and try to minimize factors that could contaminate the testing process (we understand this is easier said than done – reality tends to be messy). At the same time, be prepared that despite your best efforts some tests will yield no definitive conclusions – don’t be discouraged by this. Judge the testing program on its overall results, e.g., after a year, not after a single test. Industry research consistently shows that companies that commit to ongoing experimentation tend to outperform those that do not.
Try to test big and potentially highly impactful changes. Just focusing on lots of “small things” may feel more comfortable and safer but in our experience mostly leads to failure – the organisation spends a lot of time and effort on testing things that don’t really have any business impact.
A common issue: a test is designed, executed, evaluated…and nothing happens – the result remains a one-off insight with no real impact. There can be multiple reasons for this: organizational inertia (the test shows we should turn off X, but in our management reporting based on last-click data X looks great, and that is what the “insert Sr Exec role” is used to…), internal politics, a risk-averse company culture, the HiPPO effect, and so on. This is a common pain point and the moment when a company and its leadership show whether they actually value a data-driven culture or just want “analytics” to confirm their preexisting assumptions and biases.