Introduction to Sales forecasting Engine

Time-series Sales forecasting is one of the most important topics in every business, helping to process data taken over a long period of time. Stock exchange, logistics, retail are classic industries where the ability to build predictive models becomes a crucial differentiator in a highly-competitive business environment. In our article we’ll try to explain how time series analysis and sales forecasting methods can be used for typical business tasks: finding hidden patterns, detecting trends in sales over the years, and predicting sales in the future.

Background Analysis

There are many time series forecasting models out there – from well-known and thoroughly studied, such as Naïve and Exponential Smoothing to the new and hot ones, like Facebook Prophet and LSTM models.

In the Naïve model, the forecasts for every horizon correspond to the last observed value and assume that the stochastic model generating the time series is a random walk. An extension of the Naïve model is called SNaïve (Seasonal Naïve), and this one assumes that the time series has a seasonal component.

Facebook Prophet is open-source software released by Facebook’s Core Data Science team. This model assumes that time series can be decomposed by the trend, seasonality, holiday, or error components.

LSTM stands for Long-Short Term Memories which come from the Neural Network domain. They are mostly used with unstructured data (e.g. audio, text, video), but may benefit from the transfer of learning techniques when applied to standard time series.

In this article, we would like to share with you our experience of using the SARIMA model. It works well for the car sales because of the sales seasonality and relative simplicity of this domain.

General Data Description 

The dataset contains the number of units and total sales of new motor vehicles, grouped by the vehicle type and the manufacturing location in Canada from 1946 to 2019, with data sampled on a monthly basis.

It is accessible here: Data Set

FILE = './data_input/20100001.csv' # pass to the downloaded csv file
df = pd.read_csv(FILE, quoting=csv.QUOTE_ALL, parse_dates=['REF_DATE'])

The initial data structure looks in the following way: 

print(df.columns)
Index(['REF_DATE', 'GEO', 'DGUID', 'Vehicle type', 'Origin of manufacture',
       'Sales', 'Seasonal adjustment', 'UOM', 'UOM_ID', 'SCALAR_FACTOR',
       'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE', 'STATUS', 'SYMBOL',
       'TERMINATED', 'DECIMALS'],
      dtype='object')

There are 18 columns: 1 DateTime format, 5 numeric, 13 categorical.

print(df.dtypes)

REF_DATE                 datetime64[ns]
GEO                              object
DGUID                            object
Vehicle type                     object
Origin of manufacture            object
Sales                            object
Seasonal adjustment              object
UOM                              object
UOM_ID                            int64
SCALAR_FACTOR                    object
SCALAR_ID                         int64
VECTOR                           object
COORDINATE                       object
VALUE                           float64
STATUS                           object
SYMBOL                          float64
TERMINATED                       object
DECIMALS                          int64
dtype: object
The raw version of our dataset consists of 148’890 observations, covering the period between 1946 and 2019.

print(df.shape)
(148890, 18)

After filtering the data by: Country, Vehicle type, Manufacturing Location, Seasonal Adjustment, and then dropping all columns, except for the two: REF_DATE and VALUE, we receive the final version of our dataset. It contains 888 rows, 1 numerical column and an index in the DateTime format:

condition_filter_1 = df['GEO'] == "Canada"
condition_filter_2 = df['Vehicle type'] == "Total, new motor vehicles"
condition_filter_3 = df['Origin of manufacture'] == "Total, country of manufacture"
condition_filter_4 = df['Sales'] == "Units"
condition_filter_5 = df['Seasonal adjustment'] == "Unadjusted"

df = df.loc[condition_filter_1 & 
                condition_filter_2 & 
                condition_filter_3 & 
                condition_filter_4 &
                Condition_filter_5]

df.shape
(888, 1)
...
df.head()
REF_DATE     VALUE
1946-01-01 2756.0
1946-02-01 3763.0
1946-03-01 7200.0
1946-04-01 9645.0
1946-05-01 12032.0

Our time series is univariate, which means that:

  • We are considering only the relationship between the y-axis value and the x-axis time points. 
  • We don’t take into account any outside factors that may be affecting the time series.

Data Analysis

Time Domain Analysis

By means of a simple pandas command, we got the overall statistics for new cars sales in Canada between 1946 and 2019:

print(df.describe())
             VALUE

count        888.0
mean         96643.0
std          47435.0
min          2756.0
25%          57266.0
50%          97259.0
75%          131516.0
max          220436.0

We quickly plot the data out with the matplotlib package, where y-axis stands for Sales, x-axis stands for years and then we’re able to make some conclusions:

  • There is an upward trend over the years
New cars Sales in Canada 1946-2019 (in units)
Fig. 1. New cars Sales in Canada 1946-2019 (in units)
  • Mostly, there is a slight upward trend within any single year.
  • June and August show peaks in sales.
  • November and December are the low seasons.
New cars Sales in Canada 2013-2019 (in units)
New cars Sales in Canada 2013-2019

Stationarity Testing Using the Augmented Dickey-Fuller Unit Root Test

This is a statistical test that we run to determine if the time series is stationary or not.
The null hypothesis of the ADF test is that the time series is non-stationary. In other words, we have two options:

  1. P-value is more than the significance level (0.05), which means that there is no sufficient statistical evidence to reject the null hypothesis.
  2. here is sufficient statistical evidence to reject the null hypothesis, when p-value of the test is less than the significance level (0.05).

In our case, p-value = 0.6814 (which is more than 0.05), which means that we failed to reject the non-stationarity hypothesis.

Frequency (Seasonality) Analysis

We performed the seasonality analysis in order to find out the prominent frequencies (periodicity) in the frequency domain using the Fourier Transform for full-time series and the alternating component of the series. 

Here is the Energy Spectral Density (ESD) plot for new cars sales in Canada:

def get_esd(ts: pd.core.series.Series, fs: float) -> tuple:

    T = 1 / fs
    N = len(ts)
    F = 1 / (N * T)
    b = (np.abs(T * np.fft.fft(ts))) ** 2
    c = b[0:N//2+1]
    esd = c * 2
    f = np.array(range(int(N/2+1)))
    f = f * F
    return (esd, f) 
    timeseries_2 = timeseries - timeseries.mean()
    esd_2, f = get_esd(timeseries_2, 1.0)
    plt.figure(3)
    plt.plot(f, esd_2)
    plt.title('Energy Spectral Density')
    plt.xlabel('Frequency (1/month)')
    plt.ylabel('ESD (sales**2 * month / (1/month))')
    plt.grid()

Energy Spectral Density

The maximum seasonality effects are at frequencies:
1, 2, 74 (1/month)

Periodicity:
96, 48, 1 (months)

Statistical Properties Analysis

Then, we conducted a statistical analysis to understand the probability distribution of data as well as to prove the normality of distributions. The histogram and the qq-plot of data examples are shown below.

Histogram of sales values

QQ-plot for New cars sales in Canada

Normality Testing Using the D’Agostino and Pearson’s Test

This test helped us understand how far our data distribution is from the normal one by computing the p-value in hypothesis testing:

  • null hypothesis – data are from the normal distribution
  • alternative hypothesis – data are not from the normal distribution (p-value < 0.05)
from scipy.stats import normaltest
ALPHA = .05
stat, p = normaltest(timeseries)
print('p-value = {0}'.format(p))
if p > ALPHA:                         # null hypothesis: x comes from a normal distribution
   print('Sample comes from NORMAL distribution')
else:                   
    print('Sample doesn’t come from NORMAL distribution')

p-value = 4.1868e-18
Sample doesn’t come from NORMAL distribution

Judging by the result of our p-value, we could draw the following conclusion – we have some strong enough evidence that the data distribution is not normal.

Method and Model Description

There is an upward trend, seasonality and some noise, so we can use the python package called  <statsmodels> to perform a decomposition of this time series. 

The decomposition of a time series is a statistical task that deconstructs the time series into several components, each representing one of the underlying pattern categories.

Our goal is to build a mathematical model to be able to forecast the future. 

Next, we decomposed the observed data (SEASONAL, TREND, and NOISE) with the following code:

from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(timeseries, model='additive')
result.plot()

As a result, we have  split the data into 4 components:

Decomposition of time series: trend, seasonality, and noise.

From the plot above two things come to light: 

  • the seasonal component of the data 
  • the obvious upward trend in the data

SARIMA is actually a bunch of simpler models combined into one complex model that can represent the time series by means of exhibiting its non-stationary properties and seasonality.

In theory, we have to find a set of coefficients p, d, q, P, D, and Q, which give us the lowest Akaike information criterion (AIC).

Moreover, for most years, the trend is rather linear while the seasonality and trend components are constant over time – so we choose an ADDITIVE model.

The first way of selecting this set of parameters is to take a look at the Autocorrelation, Partial Autocorrelation plots (ACF, PACF).

First, we have an autoregressive model AR(p). This is basically a regression of the time series onto itself. Here, we assume that the current value depends on its previous values with some lag. It takes a parameter p which represents the maximum lag. To find it, we look at the partial autocorrelation plot and identify the lag after which most lags are not significant.

In the example below, p would be 1.

Partial Autocorrelation Plot (PACF)

Autocorrelation Plot (ACF

After that, we add the order of integration I(d). The parameter d represents the number of differences required to make the series stationary. By means of differencing our time series up to the 2nd order, we see that the 1st order gave us stationary data:

timeseries_diff_1 = timeseries.diff().dropna()
timeseries_diff_2 = timeseries_diff_1.dropna()
p_value = adfuller(timeseries)[1].round(4)
print('Original Series: p_value = {0}'.format(p_value))
print("Data are STATIONARY \n" if p_value < .05 else "Data are NOT-STATIONARY \n")
p_value = adfuller(timeseries_diff_1)[1].round(4)
print('1st Order Differencing: p_value = {0}'.format(p_value))
print("Data are STATIONARY \n" if p_value < .05 else "Data are NOT-STATIONARY \n")
p_value = adfuller(timeseries_diff_2)[1].round(4)
print('2st Order Differencing: p_value = {0}'.format(p_value))
print("Data are STATIONARY \n" if p_value < .05 else "Data are NOT-STATIONARY \n")

Results

Finally, we add the component: seasonality S(P, D, Q, s), where s is simply the length of a season. This component requires the parameters P and Q which are the same as p and q, but used for the seasonal component. Finally, D is the order of seasonal integration representing the number of differences required to remove seasonality from the series.

Combining these all, we get the SARIMA(p, d, q)(P, D, Q, s), model.

The main takeaway is: before modeling with SARIMA, we must apply transformations to our time series to remove seasonality and any non-stationary behaviors.

Another method (less time-consuming) is to perform a grid search over multiple values of p, d, q, P, D, and Q using some sort of performance criteria. By default, the AIC gives us a numerical value of the quality of each model, relative to each of the other ones. To be precise – the lower our AIC is, the better the model.

The pyramid-arima contains an auto_arima function that allows setting a range of p, d, q, P, D, and Q values and then fitting the models for all possible combinations. To get a summary of the best model we can use the following line of code:

# Fit your model
model_arima = pm.auto_arima(train,  max_p=4, max_q=4,
                            seasonal=True, m=12, trend='c')
model_arima.summary()

Sarimax results

Training 

The dataset was split into training and test sets, and then we trained the model by using the training data

def train_test_split(ts: pd.core.series.Series, test_size: int) -> tuple:
    train_size = len(ts) - test_size
    return (ts[:train_size], ts[train_size:])
# splitting the data into train / test 
train, test = train_test_split(ts=timeseries, test_size=TEST_PERIOD*12)
model = sm.tsa.statespace.SARIMAX(train,
                                order=(1, 1, 0),
                                seasonal_order=(2, 0, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = model.fit()
print(results.summary())

Semirax

The best model parameters (SARIMAX(1, 1, 0)x(2, 0, 1, 12)) give us the lowest AIC:
AIC = 14957.224. 

Prediction and Evaluation

After our model has been fitted to the training data, we can try to forecast the future.

To get predictions about the future sales we can use the get_prediction() method of the model

pred = results.get_prediction(start=len(timeseries)-len(test), end=len(timeseries)-1, dynamic=True, full_results=True)

pred_ci = pred.conf_int(alpha=0.05)
pred_ci_int = pred_ci.astype(int)

forecasts = pred.predicted_mean
forecasts_int = forecasts.astype(int)

test_int = test.astype(int)

error = round((np.abs(test_int - forecasts_int) / test_int) * 100, 1)

pred_ci_int.loc[pred_ci_int['lower VALUE'] <= 0, 'lower VALUE'] = 0
pred_ci_int.loc[pred_ci_int['upper VALUE'] <= 0, 'upper VALUE'] = 0

result = pd.concat([test_int, forecasts_int, error, pred_ci_int], axis=1)

result.rename(columns={'VALUE': 'Ground Truth', 
                       0: 'Forecast', 1: "Relative Error %",
                       'lower VALUE': 'Confidence lower', 
                       'upper VALUE': 'Confidence upper'}, 
                      inplace=True)

Here we plot our forecasted values to view how well our predictions match up with the test set with the real data:

 New cars sales - train & test (on the test period)

We can also just compare this to the entire data set to see the general context of our prediction

New cars sales - train & test

Let’s use the prediction to forecast the nearest 12 months (in our case – the whole year 2020):

forecast_months = 12   # number of forecast months
fcast = results.predict(len(timeseries),
                                      len(timeseries)+forecast_months,
                                      typ='levels').rename('SARIMAX Forecast')

SARIMAX Forecast for 12 months

This is how the 12 months of 2020 are forecasted to look like:

fcast.astype('int')

Semirax forecast

In the forecasted data we see the same patterns as in the previous years:

  • The worst months for sales are the first and the last months of the year, which totally makes sense in real life (Who thinks about buying cars before and after Christmas, right?)
  • There is an upward trend with May being the first peak. Then, the trend goes down, with July being the bottom point.
  • The second peak is in August, and the next downward trend culminates in December.

The last 20% of our data was used as a test set, on which we tested our model 

by standard metrics: MAE, MSE, RMSE, MPE.

Results:

def mean_percentage_error(y_true, y_pred):
    return 100*np.mean((y_true - y_pred) / y_true)

print('MAE: \t{0:.1f}'.format(mean_absolute_error(test, forecasts)))
print('MSE: \t{0:.1f}'.format(mean_squared_error(test, forecasts)))
print('RMSE: \t{0:.1f}'.format(np.sqrt(mean_squared_error(test, forecasts))))
print('MAPE: \t{0:3.4f} %'.format(mean_percentage_error(test, forecasts)))

result

percentage_error = (100*(test - forecasts) / test).round(4)

forecast_df = pd.concat([round(test,0), round(forecasts,0), percentage_error], axis=1)
forecast_df.columns = ['Ground_truth','Prediction', 'Percentage_error']
forecast_df = forecast_df.loc['2019-01-01':].reset_index().rename(columns={'index': 'Month'})

print(forecast_df)
print('\n Maximum Percentage_error: ', np.abs(forecast_df['Percentage_error']).max(), '%')

Maxium Parcentage_error

Discussion

The overall purpose of the study was to prove that it’s possible to efficiently forecast car sales using a simple statistical model. During our research we were able to prove that the SARIMA-based approach has acceptable outcomes. Such models can be easily implemented with various statistical software and their computational complexity is acceptable. Also, the approach has well-studied statistical properties. An important advantage of the described method is the ability to adapt to new conditions and time series properties by changing the coefficients. The Jupyter notebooks we used in our experiments are open-source and can be downloaded from the GitHub repository. [Sarima_new_cars_sales]

Limitations

Our only concern about the experiments is the well-known disadvantage of the SARIMA model. It can only extract linear relationships within the time series data. Predictions generated by the SARIMA model may not be suitable for complex nonlinear cases. It does not efficiently extract the full relationship hidden in the data. Another limitation is that the SARIMA model requires a large amount of data to generate accurate predictions.

Conclusions

The accuracy of the SARIMA predictive model in the form of MPE for car sales forecast has been obtained. It has been proved that the percentage error is not greater than 19.32 % for each of the 12 months ahead. Obviously, the accuracy of the model is high enough and the model can be used as a baseline for developing better models. The method is well suited for use in different business domains.

Acknowledgments

I want to thank my colleagues Andy Bosyi, Mykola Kozlenko, Alex Simkiv, Volodymyr Sendetskyi for their discussions and helpful tips as well as the entire Mindcraft.ai (MindCraft.ai: Data Science Development company | ML & AI Company) team for their constant support. Please share your own stories with Sales Forecasting tasks. 

References: