**Gaining insights from data**

**Media decisions are increasingly made on the basis of data. Econometric modelling is one way to identify the drivers for greater effectiveness and efficiency. However, in practice, these projects pose challenges for companies.**

Today, a company has more data at its disposal than ever before. Every purchase is recorded by scanner checkouts, and every online contact with an advertising medium is measured. Further data describe prices and marketing activities such as promotions and advertising, while still other data capture the development of the industry, the economy or seasonal factors. All of this can contain crucial information that makes it possible to draw conclusions about consumer behaviour, to simulate how that behaviour responds when underlying conditions change, or to forecast sales.

Econometric modelling is an approach that allows influences from both the online and the offline world to be taken into account. It is a collective term covering several, usually statistical, procedures. The one used most often is multiple regression, in which a model includes more than one independent variable. In this way, cause-effect relationships can be represented and quantified, and the value of the dependent variable can be predicted. Econometric modelling can thus determine the effect of the most important marketing and media factors, which makes it possible to answer many questions of practical relevance to companies, for example:

- Which factors influence sales?
- What effect does the price have?
- In which media should advertising take place?
- What advertising pressure is required?
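To make the idea of multiple regression concrete, here is a minimal sketch using NumPy's least-squares solver. The variable names (`price`, `tv_grps`, `promo`), the synthetic data and the "true" effects are invented for illustration; they are assumptions, not figures from a real project.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

# Synthetic weekly data for three years (156 weeks); all numbers are invented.
rng = np.random.default_rng(0)
weeks = 156
price = rng.uniform(1.8, 2.2, weeks)     # average shelf price
tv_grps = rng.uniform(0, 100, weeks)     # TV advertising pressure
promo = rng.integers(0, 2, weeks)        # 1 if a promotion ran that week
sales = 1000 - 300 * price + 2.0 * tv_grps + 150 * promo + rng.normal(0, 20, weeks)

X = np.column_stack([price, tv_grps, promo])
intercept, coefs = fit_ols(X, sales)
for name, coef in zip(["price", "tv_grps", "promo"], coefs):
    print(f"{name:8s} {coef:8.2f}")
```

Each estimated coefficient quantifies how sales change when that factor changes by one unit while the others are held constant, which is the kind of answer the questions above ask for.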

**How is such a data model created?**

In contrast to analyses in the purely digital field, econometric modelling typically does not use data at the level of individual people or purchases, but aggregated data observed over time, usually over several years depending on the product and data availability. In practice, three years of weekly data are often used, which corresponds to 156 data points (weeks). For a sales model, data must be gathered from different sources: market data such as unit sales, turnover, price and information on distribution and promotions; data on media activity; seasonal data such as school holidays, public holidays and weather; and, depending on the case, additional data on events such as new product launches, relaunches or even delivery problems.

As far as available, the data should cover both the product itself and its relevant competitors, so that competitive effects can also be represented (e.g. the effect of a competitor’s advertising on the sales of the product under consideration). To capture the effect of the individual factors realistically, some variables must be transformed. Media variables, for example, need a temporal transformation. A price promotion has an immediate effect: if the product is bought because of the price reduction, it is bought at the time of the reduction and not later. Advertising has no such direct temporal assignment, because a purchase triggered by advertising can still take place days or weeks after the advertising contact. To take this carry-over into account, each media variable is split into several variables with different time courses of effect. Further transformations are needed to model diminishing returns (doubling the advertising contacts usually yields less than double the sales effect). Since competitor data are ideally added as well, a typical project examines more than 2,000 variables.
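The two transformations described here can be sketched as follows. The geometric decay form of the carry-over and the `x / (x + half_saturation)` curve are common choices assumed for illustration; the article does not specify which functional forms are used, and the decay rates and parameter values are hypothetical.

```python
import numpy as np

def adstock(x, decay):
    """Geometric carry-over: each week retains `decay` of the previous week's effect."""
    out = np.zeros(len(x), dtype=float)
    carry = 0.0
    for t, value in enumerate(x):
        carry = value + decay * carry
        out[t] = carry
    return out

def saturate(x, half_saturation):
    """Diminishing returns: rises quickly at first, then flattens towards 1."""
    x = np.asarray(x, dtype=float)
    return x / (x + half_saturation)

# A single advertising burst of 100 contacts followed by three silent weeks.
grps = np.array([100.0, 0.0, 0.0, 0.0])
print(adstock(grps, decay=0.5))  # the burst fades: 100, 50, 25, 12.5

# One media variable becomes several candidates with different time courses.
variants = {decay: adstock(grps, decay) for decay in (0.3, 0.6, 0.9)}

# Doubling the contacts yields less than double the transformed effect.
print(saturate(100, 100), saturate(200, 100))
```

Generating several decay variants per media channel is one reason the number of candidate variables grows so quickly.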

**Data: Variety of possible combinations**

One might initially think of simply using all available data to build a model. Unfortunately, this is not possible. For statistical reasons, the number of explanatory variables cannot exceed the number of data points, so with three years of weekly data a maximum of 156 variables would be possible. But even that maximum is far too many for a robust model: although the measured fit improves with every additional variable, problems almost inevitably arise. Each individual variable has to be statistically significant, i.e. its contribution has to be reliably different from zero; otherwise its effect, and that of the other variables, can be wrongly estimated. With three years of weekly data, the number of simultaneously significant variables is in practice limited to about 20.
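The following sketch illustrates why a rising fit is not evidence of a better model. It adds pure noise variables to a regression on synthetic data and reports R² each time. The data, the seed and the variable names are assumptions made for this illustration.

```python
import numpy as np

def r_squared(X, y):
    """R² of an ordinary least-squares fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n_weeks = 156
driver = rng.normal(size=n_weeks)                 # the only real influence
sales = 3.0 * driver + rng.normal(size=n_weeks)
noise = rng.normal(size=(n_weeks, 150))           # unrelated to sales by construction

fits = {}
for k in (0, 20, 80, 150):
    X = np.column_stack([driver, noise[:, :k]])
    fits[k] = r_squared(X, sales)
    print(f"{k:3d} noise variables -> R² = {fits[k]:.3f}")
```

R² never falls when a variable is added, so with nearly as many variables as weeks the model reproduces the data almost perfectly even though most of its inputs are meaningless.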

In practical terms, the question is how to select, from this large pool, the 20 most important variables: ones that form a model with high statistical explanatory power but are also plausible in practice. With 2,000 variables, there are about 4×10^47 ways (roughly 400 quattuordecillion) to select 20 of them, and the models with fewer than 20 variables come on top of that. No computer today can run through all possible combinations in a reasonable amount of time.
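The order of magnitude can be checked with the binomial coefficient from Python's standard library. The evaluation rate of one billion models per second is a deliberately generous assumption for illustration only.

```python
import math

n_candidates = 2000
model_size = 20

# Number of ways to choose exactly 20 variables out of 2,000.
n_models = math.comb(n_candidates, model_size)
print(f"models with exactly {model_size} variables: {n_models:.2e}")

# Models with 1 to 20 variables taken together.
n_models_up_to = sum(math.comb(n_candidates, k) for k in range(1, model_size + 1))
print(f"models with up to {model_size} variables: {n_models_up_to:.2e}")

# Hypothetical machine that evaluates one billion models per second.
seconds_per_year = 365 * 24 * 3600
years = n_models / 1e9 / seconds_per_year
print(f"years needed at 1e9 models per second: {years:.1e}")
```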

**Verification of the model**

In practical use, a method is therefore needed that arrives at a reasonable selection of variables without testing every variant, and whose result is both statistically sound and plausible in practice. There is no standard way to achieve this. A tried and tested approach is to fix the most important factors (e.g. price, season) on the basis of hypotheses and then enrich the model step by step with further variables until no noticeable improvement can be achieved. This procedure is not a one-way street, however: here too, it is advisable to test as many variants as possible.
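The step-by-step enrichment can be sketched as a greedy forward selection. This is one possible reading of the approach, not a description of any provider's method: the use of adjusted R² as the improvement measure, the `min_gain` threshold and all data are assumptions. The sketch only adds variables; the removal and plausibility checks that the text calls for are left out.

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R² of an OLS fit with intercept; penalises extra variables."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    p = design.shape[1] - 1
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_select(candidates, y, core, min_gain=0.005, max_vars=20):
    """Start from hypothesis-based `core` variables (non-empty), then add greedily."""
    selected = list(core)
    score = adjusted_r2(np.column_stack([candidates[v] for v in selected]), y)
    while len(selected) < max_vars:
        best_name, best_score = None, score
        for name in candidates:
            if name in selected:
                continue
            X = np.column_stack([candidates[v] for v in selected + [name]])
            trial = adjusted_r2(X, y)
            if trial > best_score:
                best_name, best_score = name, trial
        if best_name is None or best_score - score < min_gain:
            break  # no further noticeable improvement
        selected.append(best_name)
        score = best_score
    return selected, score

# Synthetic example: sales depend on price, season and TV; the rest is noise.
rng = np.random.default_rng(2)
n = 156
candidates = {
    "price": rng.uniform(1.8, 2.2, n),
    "season": np.sin(2 * np.pi * np.arange(n) / 52),
    "tv": rng.uniform(0, 100, n),
}
for i in range(10):
    candidates[f"noise_{i}"] = rng.normal(size=n)
y = (1000 - 300 * candidates["price"] + 80 * candidates["season"]
     + 2.0 * candidates["tv"] + rng.normal(0, 20, n))

selected, score = forward_select(candidates, y, core=["price", "season"])
print(selected, round(score, 3))
```

A greedy search like this evaluates only a tiny fraction of all combinations, which is why rerunning it with different starting hypotheses and thresholds remains advisable.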

Previously significant influences often have to be removed again when other influences that matter more in terms of content are included in the model. To arrive at models that make sense in terms of content, the direction and strength of the influences should be checked for plausibility; experience plays an important role here. The hardest check of a model, however, is to let it make a forecast, first for the past and then in ongoing operation on the basis of real data. To check forecasting power on past data, part of the time series is cut out, the model is re-estimated on the remaining shortened series, and its forecasts for the cut-out period are compared with the actual values (holdout test). In ongoing operation, the model makes forecasts from the newly available values of its variables, and these forecasts are compared with the real values. Forecast and real values can be expected to diverge increasingly after some time. That is the right time to update the model.
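A holdout test of this kind can be sketched in a few lines. The choice of the last 26 weeks as the holdout period, the mean absolute percentage error (MAPE) as the comparison measure and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """OLS with intercept; returns the full coefficient vector (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

def mape(actual, forecast):
    """Mean absolute percentage error between actual and forecast values."""
    return float(np.mean(np.abs((actual - forecast) / actual)))

# Synthetic weekly data; all numbers are invented.
rng = np.random.default_rng(3)
n_weeks, holdout = 156, 26
price = rng.uniform(1.8, 2.2, n_weeks)
tv_grps = rng.uniform(0, 100, n_weeks)
sales = 1000 - 300 * price + 2.0 * tv_grps + rng.normal(0, 20, n_weeks)
X = np.column_stack([price, tv_grps])

# Re-estimate the model on the shortened series, then forecast the cut-out weeks.
beta = fit_ols(X[:-holdout], sales[:-holdout])
forecast = predict(beta, X[-holdout:])
error = mape(sales[-holdout:], forecast)
print(f"holdout MAPE: {error:.1%}")
```

The same `predict` and `mape` steps can be repeated in ongoing operation as new weeks arrive; a steadily growing error would be the signal to update the model.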

**Practical application possibilities**

The model has many possible applications, which fall into two areas: retrospective description and forecasting. In the retrospective view, it is possible, for example, to determine what share each of the identified factors has in the development of sales (sales decomposition). The effect of individual factors can also be quantified, for example the price elasticity. For media, the main interest is in how long the effect lasts, which advertising pressure is optimal and what contribution to sales past media activity has made (ROI). The second important area of application is simulation and forecasting. It is possible, for example, to simulate the effect of a price adjustment or of a change in the media mix and media strategy. Based on the model, the optimal media strategy and media mix for a given budget can be determined or, conversely, the budget required to reach certain sales targets. Finally, sales forecasts can be made. Today, many companies in different industries use such models to simulate the effects of different marketing strategies and to support strategic decisions with data.
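As a small worked example of both uses, the sketch below simulates a price increase with a linear model and derives the price elasticity at the starting point. The intercept, the coefficients and the scenario values are hypothetical, and the point-elasticity formula (coefficient × price / sales) applies to a linear model only.

```python
def predict_sales(intercept, coefs, scenario):
    """Linear model: intercept plus coefficient times value for each variable."""
    return intercept + sum(coefs[name] * scenario[name] for name in coefs)

def point_elasticity(coef, price, sales):
    """Percentage change in sales per one percent change in price, at one point."""
    return coef * price / sales

# Hypothetical model results, not taken from a real project.
intercept = 1000.0
coefs = {"price": -300.0, "tv_grps": 2.0}

base = {"price": 2.0, "tv_grps": 50.0}
higher_price = {**base, "price": 2.2}   # simulated price adjustment

base_sales = predict_sales(intercept, coefs, base)
new_sales = predict_sales(intercept, coefs, higher_price)
elasticity = point_elasticity(coefs["price"], base["price"], base_sales)

print(round(base_sales, 1), round(new_sales, 1), round(elasticity, 2))
# 500.0 440.0 -1.2
```

In this invented scenario a 10% price increase lowers weekly sales from 500 to 440 units, consistent with an elasticity of -1.2.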

**Statistical and methodological standards**

Various statistical indicators exist for the procedures behind the modelling; they allow, for example, the quality of the model and the statistical reliability of the variables used to be assessed. There is, however, no uniform methodological standard for how further central results are derived from the model. Even if two service providers use the same statistical model, their practical results may differ. Sales decomposition is one example for which no standard exists. The problem is how to deal with negative influences, such as price. The simplest way is to multiply the coefficients obtained through modelling by the respective variable values, which produces positive and negative shares for coefficients with positive and negative signs. Another way is to start from a reference point for each variable. The value that results from these reference points forms the baseline (comparable to the sales that would still be achieved under the most unfavourable conditions), and only the deviations of the variables from their reference points enter the sales decomposition. With this approach all shares become positive, even those of variables that originally had a negative effect. Both approaches are defensible, but they lead to different proportions and thus to different conclusions. Other central results of the model concern the ROI or the exact effect of the communication variables. Describing the different approaches for calculating these effects would, however, go beyond the scope of this article.
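The difference between the two decomposition approaches can be made concrete for a single week. The coefficients, the variable values and the choice of reference points (the highest observed price and zero advertising, i.e. the least favourable values) are hypothetical assumptions for this sketch.

```python
def decompose_raw(intercept, coefs, values):
    """Approach 1: contribution = coefficient x variable value (signs preserved)."""
    parts = {name: coefs[name] * values[name] for name in coefs}
    parts["baseline"] = intercept
    return parts

def decompose_reference(intercept, coefs, values, reference):
    """Approach 2: contribution = coefficient x (value - reference value)."""
    parts = {name: coefs[name] * (values[name] - reference[name]) for name in coefs}
    parts["baseline"] = intercept + sum(coefs[n] * reference[n] for n in coefs)
    return parts

# Hypothetical model results for one week.
intercept = 1000.0
coefs = {"price": -300.0, "tv_grps": 2.0}
values = {"price": 2.0, "tv_grps": 50.0}

# Assumed reference points: the least favourable value of each variable.
reference = {"price": 2.2, "tv_grps": 0.0}

raw = decompose_raw(intercept, coefs, values)
ref = decompose_reference(intercept, coefs, values, reference)

for label, parts in (("raw", raw), ("reference", ref)):
    total = sum(parts.values())
    shares = {name: round(part / total, 2) for name, part in parts.items()}
    print(label, round(total, 1), shares)
```

Both decompositions add up to the same predicted sales of 500 units, yet price appears as a large negative share in the first and as a small positive share in the second, which is why the method behind a reported decomposition needs to be disclosed.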

**Engaging service providers**

The data model should help reduce the complexity of the real world and support marketing decisions in a data-based way. However, to minimise the risk of making a wrong decision, it makes sense to take a closer look at the respective model. Since there are no uniform standards regarding the presentation of the methods, it is all the more important that the respective service provider discloses the statistical key figures of the model, but also the way in which the results were arrived at, to the client. This is the only way to verify the results and to be able to classify and interpret them correctly.