- Each time series is different
- Time independent data
- So, how do I pick the right model?
Each time series is different
Unfortunately, when you are confronted with real-world data, you quickly realize that no two datasets are alike, especially when you are dealing with time series.
This implies that there is no single recipe for building accurate models that give precise predictions.
However, there are a few characteristics that can easily be identified and exploited to help you converge faster to the right model.
When people think of time-series, they usually take for granted that these are smooth, continuous data. Let’s face the truth: life is not that simple.
Take for instance data collected on wind turbines. Some turbines do not record data when there is no wind, to save power. And, as is often the case with wind turbines, they are located in sparsely populated areas where the internet connection is unstable. Connection losses are common, leading to gaps in the data.
Non-continuous data cannot be handled with the same kind of models as continuous data: ARIMA, SARIMA, or exponential smoothing might not work. Gradient boosting methods like CatBoost are good alternatives in these cases.
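As a minimal sketch of this idea (using scikit-learn's GradientBoostingRegressor as a stand-in for CatBoost, on invented synthetic wind data), a tree-based model needs no evenly spaced time index: the timestamp is simply turned into plain features, so rows missing from the series can just be dropped.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic hourly series with gaps: the turbine records nothing ~30% of the time.
t = np.arange(1000)
y = 5 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size)
mask = rng.random(t.size) > 0.3
t_obs, y_obs = t[mask], y[mask]

# Tree models need no continuous index: encode time as plain features.
X = np.column_stack([t_obs % 24, t_obs // 24])   # hour-of-day, day index

model = GradientBoostingRegressor(n_estimators=200)
model.fit(X, y_obs)

# Predict over the full timeline, gaps included.
X_full = np.column_stack([t % 24, t // 24])
y_hat = model.predict(X_full)
```

The same feature-encoding approach carries over directly to CatBoost, which additionally handles categorical features natively.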
Time independent data
Another aspect to consider when dealing with time-series is that it is generally required to add exogenous, time-independent data in the model to achieve a high level of precision.
The question is then how to integrate this time-independent data into a time-based model. The short answer is that it is not always necessary to rely on a time-based model to achieve a good level of precision. In these cases, good preprocessing to extract time-related features, combined with regression algorithms like SVM, can give good results.
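A sketch of that preprocessing step, under invented assumptions (hourly data driven by a cyclic daily pattern plus an exogenous temperature variable), using scikit-learn's SVR: the timestamp is converted into ordinary regression features, and the exogenous column is simply appended, so the model itself is no longer time-based.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# Hypothetical hourly target driven by hour-of-day plus an exogenous,
# time-independent variable (e.g. outdoor temperature).
hours = np.arange(500)
temp = rng.normal(15, 5, hours.size)
y = 10 + 2 * np.sin(2 * np.pi * hours / 24) + 0.5 * temp

# Extract time-related features (cyclic encoding of the hour),
# then append the exogenous column like any other regressor input.
X = np.column_stack([
    np.sin(2 * np.pi * hours / 24),
    np.cos(2 * np.pi * hours / 24),
    temp,
])

model = make_pipeline(StandardScaler(), SVR(C=10.0))
model.fit(X, y)
```

The sine/cosine pair encodes the hour without the artificial discontinuity between 23:00 and 00:00 that a raw hour number would introduce.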
Another obvious point to account for is the amount of data you are dealing with. You won’t be able to train LSTM neural networks with only a few hundred data points. Neural Networks need much more data.
Additionally, you’ll have to consider the stationarity of your data: are the mean and the variance of the quantity to predict changing with time? If so, it might be necessary to preprocess your data, usually with time differencing, to make it stationary.
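A minimal numpy sketch of this differencing step, on an invented trending series: the raw series has a drifting mean (not stationary), while its first difference does not.

```python
import numpy as np

rng = np.random.default_rng(2)

# A trending series: its mean drifts upward, so it is not stationary.
t = np.arange(400)
y = 0.05 * t + rng.normal(0, 1, t.size)

# Crude stationarity check: compare the mean over each half of the series.
def half_means(series):
    h = series.size // 2
    return series[:h].mean(), series[h:].mean()

m1, m2 = half_means(y)    # clearly different: the trend dominates
dy = np.diff(y)           # first differencing removes the linear trend
d1, d2 = half_means(dy)   # now both halves share roughly the same mean
```

For a proper test rather than this eyeball comparison, a unit-root test such as the augmented Dickey-Fuller test (available in statsmodels) is the usual tool.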
So, how do I pick the right model?
You can use the simple diagram below. It is based on our experience with the subject and summarizes how to select the right model depending on the nature of your data:
This is a good starting point for getting a decent model. Going further and reaching higher levels of precision will require a much more complex approach: mixing various models trained on previously built clusters, or using large neural networks architected to automatically identify these clusters and engineer the best features.
By combining all these steps into a single, global optimization problem, you will no longer have to choose a model at all.