What data do you need for a relevant sales forecast?

by Rupert Schiessl

#machine learning #data

In order to generate models capable of accurately predicting future sales, machine learning must rely on multiple variables. Here we provide an overview of the different types of variables used, as well as the types of data from which they can be derived.

  1. Types of variables used
  2. The most frequently used data
  3. Scalable databases

Types of variables used

Endogenous vs. exogenous

So-called "endogenous" variables are directly related to the flow to be predicted: the price of a product, the location of a point of sale, the number of salespeople, etc. The organization making the predictions generally knows these elements and holds them in its own databases. The main challenge with endogenous factors is to extract them in a targeted manner. Even with machine learning approaches, which can estimate the relative importance of a variable for a given objective, business knowledge and the right selection of endogenous variables remain the main key to good accuracy.

Exogenous variables are data external to the forecasting system, for example the weather, road traffic or the competitive density around a point of sale. These data, if they exist, are not found in the company's databases. They must therefore be sourced elsewhere and integrated into the predictive models, sometimes in a second phase if they are too complicated to obtain. Here again, relevance can be a problem: some of these data genuinely improve the prediction, while others mislead the modeling and only generate "noise".
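
As a rough illustration of that integration step, the sketch below attaches external weather readings to internal sales records by date. All names and figures (`sales`, `weather`, `add_exogenous`, the field names) are hypothetical and only meant to show the idea; a real pipeline would read from the company's database and from an external data provider.

```python
from datetime import date

# Hypothetical endogenous records, as they might come from the company's own database.
sales = [
    {"day": date(2021, 3, 1), "store": "A", "price": 2.5, "units": 120},
    {"day": date(2021, 3, 2), "store": "A", "price": 2.5, "units": 95},
    {"day": date(2021, 3, 3), "store": "A", "price": 2.0, "units": 140},
]

# Hypothetical exogenous data from an external source, keyed by day.
# Note that 3 March is missing: external data is often incomplete.
weather = {
    date(2021, 3, 1): {"temp_c": 12.0, "rain_mm": 0.0},
    date(2021, 3, 2): {"temp_c": 8.5, "rain_mm": 6.2},
}


def add_exogenous(rows, external, fields):
    """Left-join external values onto rows by day; missing days get None."""
    enriched = []
    for row in rows:
        extra = external.get(row["day"], {})
        enriched.append({**row, **{field: extra.get(field) for field in fields}})
    return enriched


features = add_exogenous(sales, weather, ["temp_c", "rain_mm"])
```

Keeping every sales row and filling gaps with `None` makes missing external data visible, so the decision to impute it or drop it is made explicitly rather than silently.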



Known vs. unknown

Some of the variables that affect the flow to be predicted are known to the company. These can be identified and, if available, used in the modeling (e.g. product price, promotion mechanism, time of purchase). The flows to be predicted are also affected by unknown factors, which nevertheless influence the accuracy of the final forecast that the model delivers. For a given purchase, for example, a large part of the customer's behavior depends on their psychology or their mood that day.

Controlled vs. uncontrolled

Of all the data collected, only a small part of the variables is under the company's control. For example, a sales outlet controls the price of its articles, their placement on the shelves and the qualification of its sales force. Other data are imposed on the forecaster, who cannot influence any of them: the weather, local events, household purchasing power or legislation.

In an ideal world, it would obviously be desirable to know and control every factor that affects a forecast: the more factors are unknown or uncontrolled, the higher the variance, and the wider the gap between forecast and actual flows.

The most frequently used data

Some data is essential for making a sales prediction. It notably includes:

      • Seasonality (season, day of the week, day of the month and the related impact of end-of-month pay, school vacations, sales periods, events, etc.)
      • Product (category, brand, packaging, etc.)
      • Price (selling price, discounts, historical price evolution, promotional mechanism, etc.)
      • Sales force (remuneration, qualification, number, etc.)
      • Point of sale (location, size, assortment, local events, etc.)
      • Competition (nature, density, overlapping offers, etc.)
      • Channel (physical store, web, click-and-collect drive-through, home delivery, parcel pickup point)
      • Promotions (advertising budget, merchandising, social networks, catalogs, etc.)
      • Weather
      • Macroeconomic indicators (exchange rate, average wage, inflation rate, stock market prices, etc.)
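
Several of the seasonality variables in the list above can be derived from the sale date alone. The sketch below is one possible way to do it with the Python standard library; the function name `calendar_features` and the choice of fields are assumptions made for illustration, not a prescribed feature set.

```python
import calendar
from datetime import date


def calendar_features(day):
    """Derive simple seasonality features from a date."""
    last_day = calendar.monthrange(day.year, day.month)[1]  # number of days in the month
    return {
        "day_of_week": day.weekday(),             # 0 = Monday ... 6 = Sunday
        "is_weekend": day.weekday() >= 5,
        "month": day.month,
        "day_of_month": day.day,
        "days_to_month_end": last_day - day.day,  # proxy for the end-of-month pay effect
    }


example = calendar_features(date(2021, 2, 26))
```

Features such as school vacations, sales periods or local events cannot be computed this way; they would need to come from a calendar maintained by the business or from an external source.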

Note that other variables may appear temporarily in particular contexts. The health crisis linked to COVID-19, for example, showed that new variables had to be incorporated into the models:

      • Regional lifting of lockdown
      • Virus circulation and level of contamination
      • Opening of schools and school restaurants
      • Travel restrictions
      • Opening of the borders
      • Availability of masks
      • New habits (e.g. the trend of homemade bread)

Scalable databases

In conclusion, it should be noted that the importance of data may change over time. A forecasting system is never static; it must regularly take new data into account. Typically, such data appears when a new store opens, when a product is launched or simply as new sales are recorded. Other data may become available when open data is published (e.g. information on train use released by the SNCF or RATP).

It is also important to note that specific variables can help the algorithms understand what is particular about an entire period. Here again, one example is a binary indicator variable that flags a special economic situation, such as a pandemic like COVID-19. In the same way, some variables initially considered as noise may gain importance over time, such as the trend toward organic or "Made in France" products.
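
A binary indicator of this kind can be sketched as follows. The period name and the date range in `SPECIAL_PERIODS` are placeholders chosen for illustration; actual dates depend on the country and region concerned and would need to be checked.

```python
from datetime import date

# Placeholder periods for illustration only; each name maps to a list of
# (start, end) date ranges, both bounds inclusive.
SPECIAL_PERIODS = {
    "lockdown": [(date(2020, 3, 17), date(2020, 5, 10))],
}


def period_flags(day, periods=SPECIAL_PERIODS):
    """Return a 0/1 flag per named period telling whether `day` falls inside it."""
    return {
        name: int(any(start <= day <= end for start, end in ranges))
        for name, ranges in periods.items()
    }


during = period_flags(date(2020, 4, 1))
after = period_flags(date(2020, 6, 1))
```

Each flag becomes one extra column in the training data, which lets the model treat the flagged period as a distinct regime instead of interpreting it as ordinary seasonality.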

Finally, be careful to distinguish between correlation, causality and coincidence. A correlation does not imply a causal link, and a causal link does not always show up as a simple correlation. And if you have a few minutes to spare, here are some of the most striking examples of spurious correlations: https://www.tylervigen.com/spu...
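
To make the point concrete, here is a small sketch with made-up numbers. Two series that both happen to grow over time show a high correlation on their raw levels, but the link disappears (here it even reverses) once the week-to-week changes are compared instead. The series and the names `weekly_sales` and `unrelated_metric` are invented for the example, and this check is only a quick sanity test, not a proof of causality or of its absence.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def differences(xs):
    """Change from each value to the next, which removes a shared linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


# Made-up figures: both series simply trend upward over six weeks.
weekly_sales = [100, 104, 109, 111, 118, 121]
unrelated_metric = [20, 23, 24, 28, 29, 33]

level_r = pearson(weekly_sales, unrelated_metric)
change_r = pearson(differences(weekly_sales), differences(unrelated_metric))
```

With these figures `level_r` is strongly positive while `change_r` is negative, which suggests the apparent relationship comes from the shared upward trend rather than from one series driving the other.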

Want to know more about using machine learning for sales forecasting? Ask for a demo of our platform now.
