**Introduction to the Sales Forecasting Engine**

Time-series sales forecasting is one of the most important topics in business analytics: it turns data collected over long periods into actionable predictions. Stock exchanges, logistics, and retail are classic industries where the ability to build predictive models becomes a crucial differentiator in a highly competitive environment. In this article we explain how time-series analysis and sales-forecasting methods can be applied to typical business tasks: finding hidden patterns, detecting trends in sales over the years, and predicting future sales.
## Background Analysis

There are many time-series forecasting models out there, from well-known and thoroughly studied ones, such as Naïve and Exponential Smoothing, to the new and hot ones, like Facebook Prophet and LSTM models. In the Naïve model, the forecast for every horizon equals the last observed value, which assumes that the stochastic process generating the time series is a random walk. An extension of the Naïve model is SNaïve (Seasonal Naïve), which assumes that the time series has a seasonal component. Facebook Prophet is open-source software released by Facebook’s Core Data Science team; it assumes that a time series can be decomposed into trend, seasonality, holiday, and error components. LSTM stands for Long Short-Term Memory, a model family from the neural-network domain. LSTMs are mostly used with unstructured data (e.g. audio, text, video), but may benefit from transfer-learning techniques when applied to standard time series. In this article, we would like to share our experience of using the SARIMA model. It works well for car sales because of the sales seasonality and the relative simplicity of this domain.

## General Data Description

The dataset contains the number of units and total sales of new motor vehicles sold in Canada from 1946 to 2019, grouped by vehicle type and manufacturing location, with data sampled monthly. It is accessible here:

**Data Set**

```
import csv

import pandas as pd

FILE = './data_input/20100001.csv'  # path to the downloaded CSV file
df = pd.read_csv(FILE, quoting=csv.QUOTE_ALL, parse_dates=['REF_DATE'])
```

The initial data structure looks as follows:
```
print(df.columns)
```

```
Index(['REF_DATE', 'GEO', 'DGUID', 'Vehicle type', 'Origin of manufacture',
       'Sales', 'Seasonal adjustment', 'UOM', 'UOM_ID', 'SCALAR_FACTOR',
       'SCALAR_ID', 'VECTOR', 'COORDINATE', 'VALUE', 'STATUS', 'SYMBOL',
       'TERMINATED', 'DECIMALS'],
      dtype='object')
```

There are 18 columns: 1 in DateTime format, 5 numeric, and 12 categorical.
```
print(df.dtypes)
```

```
REF_DATE                 datetime64[ns]
GEO                              object
DGUID                            object
Vehicle type                     object
Origin of manufacture            object
Sales                            object
Seasonal adjustment              object
UOM                              object
UOM_ID                            int64
SCALAR_FACTOR                    object
SCALAR_ID                         int64
VECTOR                           object
COORDINATE                       object
VALUE                           float64
STATUS                           object
SYMBOL                          float64
TERMINATED                       object
DECIMALS                          int64
dtype: object
```

The raw version of our dataset consists of 148,890 observations, covering the period between 1946 and 2019.

```
print(df.shape)
```

```
(148890, 18)
```

After filtering the data by country, vehicle type, manufacturing location, sales measure, and seasonal adjustment, and then dropping all columns except REF_DATE and VALUE, we get the final version of our dataset. It contains 888 rows: one numerical column and an index in DateTime format:
```
condition_filter_1 = df['GEO'] == "Canada"
condition_filter_2 = df['Vehicle type'] == "Total, new motor vehicles"
condition_filter_3 = df['Origin of manufacture'] == "Total, country of manufacture"
condition_filter_4 = df['Sales'] == "Units"
condition_filter_5 = df['Seasonal adjustment'] == "Unadjusted"
df = df.loc[condition_filter_1 &
            condition_filter_2 &
            condition_filter_3 &
            condition_filter_4 &
            condition_filter_5]
…
df.shape
```

```
(888, 1)
```

```
df.head()
```

```
REF_DATE       VALUE
1946-01-01    2756.0
1946-02-01    3763.0
1946-03-01    7200.0
1946-04-01    9645.0
1946-05-01   12032.0
```

Our time series is univariate, which means that:
- We consider only the relationship between the y-axis values and the x-axis time points.
- We do not take into account any outside factors that may affect the time series.

**Data Analysis**

### Time Domain Analysis

By means of a simple pandas command, we got the overall statistics for new car sales in Canada between 1946 and 2019:

```
print(df.describe())
```

```
              VALUE
count         888.0
mean        96643.0
std         47435.0
min          2756.0
25%         57266.0
50%         97259.0
75%        131516.0
max        220436.0
```

We quickly plot the data with the matplotlib package, where the y-axis stands for sales and the x-axis for years, which lets us draw some conclusions:
- There is an upward trend over the years.
- Within any single year there is mostly a slight upward trend.
- June and August show peaks in sales.
- November and December are the low seasons.
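The quick plot described above can be sketched with matplotlib (a minimal sketch; the `timeseries` series below is a synthetic stand-in, since in the article it holds the filtered monthly sales):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# synthetic stand-in for the filtered monthly sales series used in the article
timeseries = pd.Series(np.linspace(3000, 200000, 888),
                       index=pd.date_range('1946-01-01', periods=888, freq='MS'),
                       name='VALUE')

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(timeseries.index, timeseries.values)
ax.set_xlabel('Year')
ax.set_ylabel('Sales (units)')
ax.set_title('New motor vehicle sales in Canada')
ax.grid(True)
fig.savefig('sales_overview.png')
```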

## Stationarity Testing Using the Augmented Dickey-Fuller Unit Root Test

This is a statistical test that we run to determine whether the time series is stationary. The null hypothesis of the ADF test is that the time series is non-stationary. In other words, there are two possible outcomes:
- The p-value is greater than the significance level (0.05): there is not sufficient statistical evidence to reject the null hypothesis.
- The p-value is less than the significance level (0.05): there is sufficient statistical evidence to reject the null hypothesis.

## Frequency (Seasonality) Analysis

We performed the seasonality analysis in order to find the prominent frequencies (periodicity) in the frequency domain, using the Fourier transform of the full time series and of its mean-removed (alternating) component. Here is the Energy Spectral Density (ESD) plot for new car sales in Canada:

```
def get_esd(ts: pd.Series, fs: float) -> tuple:
    T = 1 / fs            # sampling period
    N = len(ts)
    F = 1 / (N * T)       # frequency resolution
    b = (np.abs(T * np.fft.fft(ts))) ** 2
    c = b[0:N//2 + 1]     # keep the non-negative frequencies
    esd = c * 2           # fold in the energy of the negative frequencies
    f = np.arange(N//2 + 1) * F
    return (esd, f)

timeseries_2 = timeseries - timeseries.mean()  # remove the DC component
esd_2, f = get_esd(timeseries_2, 1.0)

plt.figure(3)
plt.plot(f, esd_2)
plt.title('Energy Spectral Density')
plt.xlabel('Frequency (1/month)')
plt.ylabel('ESD (sales**2 * month / (1/month))')
plt.grid()
```

The maximum seasonality effects are at the frequencies 1, 2, and 74 (1/month), corresponding to periodicities of 96, 48, and 1 months.

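Since the frequency axis is in 1/month, each peak’s periodicity is simply the reciprocal of its frequency. A sketch of reading periods off the ESD array (the peak bins below are hypothetical stand-ins, not the article’s actual spectrum):

```python
import numpy as np

N = 888                                 # number of monthly samples
f = np.arange(N // 2 + 1) / N           # frequency grid in 1/month (fs = 1)

esd = np.zeros_like(f)
esd[[9, 18, 74]] = [3.0, 2.0, 1.0]      # hypothetical stand-in spectral peaks

top_bins = np.argsort(esd)[-3:][::-1]   # three strongest bins, strongest first
for k in top_bins:
    print('bin {0}: f = {1:.4f} 1/month -> period = {2:.1f} months'
          .format(k, f[k], 1.0 / f[k]))
```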
## Statistical Properties Analysis

Then, we conducted a statistical analysis to understand the probability distribution of the data and to test the normality of the distribution. The histogram and the QQ-plot of the data are shown below.

## Normality Testing Using the D’Agostino and Pearson’s Test

This test helped us understand how far our data distribution is from the normal one by computing the p-value in a hypothesis test:
- null hypothesis: the data come from a normal distribution
- alternative hypothesis: the data do not come from a normal distribution (p-value < 0.05)

```
from scipy.stats import normaltest
ALPHA = .05
stat, p = normaltest(timeseries)
print('p-value = {0}'.format(p))
if p > ALPHA:  # null hypothesis: x comes from a normal distribution
    print('Sample comes from NORMAL distribution')
else:
    print('Sample does not come from NORMAL distribution')
```

```
p-value = 4.1868e-18
Sample does not come from NORMAL distribution
```

Judging by the p-value, we can draw the following conclusion: there is strong evidence that the data distribution is not normal.

**Method and Model Description**

There is an upward trend, seasonality, and some noise, so we can use the Python package `statsmodels` to perform a decomposition of this time series.
The decomposition of a time series is a statistical task that deconstructs the time series into several components, each representing one of the underlying pattern categories.
Our goal is to build a mathematical model to be able to forecast the future.
Next, we decomposed the observed data into trend, seasonal, and residual (noise) components with the following code:
```
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(timeseries, model='additive')
result.plot()
```

As a result, we have split the data into four components: observed, trend, seasonal, and residual. From the plot, two things come to light:
- the seasonal component of the data
- the obvious upward trend in the data

```
from statsmodels.tsa.stattools import adfuller

timeseries_diff_1 = timeseries.diff().dropna()
timeseries_diff_2 = timeseries_diff_1.diff().dropna()

p_value = adfuller(timeseries)[1].round(4)
print('Original Series: p_value = {0}'.format(p_value))
print("Data are STATIONARY \n" if p_value < .05 else "Data are NOT-STATIONARY \n")

p_value = adfuller(timeseries_diff_1)[1].round(4)
print('1st Order Differencing: p_value = {0}'.format(p_value))
print("Data are STATIONARY \n" if p_value < .05 else "Data are NOT-STATIONARY \n")

p_value = adfuller(timeseries_diff_2)[1].round(4)
print('2nd Order Differencing: p_value = {0}'.format(p_value))
print("Data are STATIONARY \n" if p_value < .05 else "Data are NOT-STATIONARY \n")
```

Finally, we add the seasonal component S(P, D, Q, s), where s is simply the length of a season. This component requires the parameters P and Q, which play the same roles as p and q but for the seasonal part, while D is the order of seasonal integration, i.e. the number of differences required to remove seasonality from the series.
Combining all of these, we get the SARIMA(p, d, q)(P, D, Q, s) model.
The main takeaway is: before modeling with SARIMA, we must apply transformations to our time series to remove seasonality and any non-stationary behaviors.
Another, less time-consuming, method is to perform a grid search over multiple values of p, d, q, P, D, and Q using some performance criterion. By default, the AIC gives a numerical measure of the quality of each model relative to the others; the lower the AIC, the better the model.
The pmdarima package (formerly pyramid-arima) contains an auto_arima function that allows setting a range of p, d, q, P, D, and Q values and then fits models for all possible combinations. To get a summary of the best model, we can use the following code:
```
import pmdarima as pm

# Fit the model
model_arima = pm.auto_arima(train, max_p=4, max_q=4,
                            seasonal=True, m=12, trend='c')
model_arima.summary()
```

**Training**

The dataset was split into training and test sets, and then we trained the model on the training data:
```
def train_test_split(ts: pd.Series, test_size: int) -> tuple:
    train_size = len(ts) - test_size
    return (ts[:train_size], ts[train_size:])

# splitting the data into train / test (TEST_PERIOD is the number of held-out years)
train, test = train_test_split(ts=timeseries, test_size=TEST_PERIOD*12)

model = sm.tsa.statespace.SARIMAX(train,
                                  order=(1, 1, 0),
                                  seasonal_order=(2, 0, 1, 12),
                                  enforce_stationarity=False,
                                  enforce_invertibility=False)
results = model.fit()
print(results.summary())
```

The best model parameters, SARIMAX(1, 1, 0)x(2, 0, 1, 12), give us the lowest AIC: 14957.224.
**Prediction and Evaluation**

After the model has been fitted to the training data, we can forecast the future.
To get predictions of future sales, we use the get_prediction() method of the fitted model:
```
pred = results.get_prediction(start=len(timeseries) - len(test),
                              end=len(timeseries) - 1,
                              dynamic=True)
pred_ci = pred.conf_int(alpha=0.05)
pred_ci_int = pred_ci.astype(int)

forecasts = pred.predicted_mean
forecasts_int = forecasts.astype(int)
test_int = test.astype(int)
error = round((np.abs(test_int - forecasts_int) / test_int) * 100, 1)

# clip negative confidence bounds to zero (sales cannot be negative)
pred_ci_int.loc[pred_ci_int['lower VALUE'] <= 0, 'lower VALUE'] = 0
pred_ci_int.loc[pred_ci_int['upper VALUE'] <= 0, 'upper VALUE'] = 0

result = pd.concat([test_int, forecasts_int, error, pred_ci_int], axis=1)
result.rename(columns={'VALUE': 'Ground Truth',
                       0: 'Forecast', 1: 'Relative Error %',
                       'lower VALUE': 'Confidence lower',
                       'upper VALUE': 'Confidence upper'},
              inplace=True)
```

Here we plot our forecasted values to see how well the predictions match the real data in the test set.
We can also compare this against the entire dataset to see the general context of our prediction.
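A sketch of such a comparison plot, overlaying the forecast on the test data with a shaded confidence band (the series below are hypothetical stand-ins for `test`, the predicted mean, and the confidence interval from the snippets in this article):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# hypothetical stand-ins for the test set, the forecast, and the confidence band
idx = pd.date_range('2018-01-01', periods=24, freq='MS')
test = pd.Series(np.linspace(100000, 150000, 24), index=idx)
forecasts = test * 1.02
lower, upper = forecasts * 0.9, forecasts * 1.1

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(test.index, test, label='Ground truth')
ax.plot(forecasts.index, forecasts, label='SARIMAX forecast')
ax.fill_between(forecasts.index, lower, upper, alpha=0.2,
                label='95% confidence interval')
ax.set_xlabel('Year')
ax.set_ylabel('Sales (units)')
ax.legend()
fig.savefig('forecast_vs_test.png')
```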
Let’s use the model to forecast the next 12 months (in our case, the whole year 2020):
```
forecast_months = 12  # number of months to forecast
fcast = results.predict(len(timeseries),
                        len(timeseries) + forecast_months - 1,
                        typ='levels').rename('SARIMAX Forecast')
```

This is how the 12 months of 2020 are forecast to look:
```
fcast.astype('int')
```

In the forecasted data we see the same patterns as in the previous years:
- The worst months for sales are the first and the last months of the year, which makes sense in real life (who thinks about buying a car right before or after Christmas?).
- There is an upward trend with May being the first peak. Then, the trend goes down, with July being the bottom point.
- The second peak is in August, and the next downward trend culminates in December.

**Results:**

```
from sklearn.metrics import mean_absolute_error, mean_squared_error

def mean_absolute_percentage_error(y_true, y_pred):
    # MAPE takes the absolute error, otherwise positive and negative errors cancel out
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

print('MAE: \t{0:.1f}'.format(mean_absolute_error(test, forecasts)))
print('MSE: \t{0:.1f}'.format(mean_squared_error(test, forecasts)))
print('RMSE: \t{0:.1f}'.format(np.sqrt(mean_squared_error(test, forecasts))))
print('MAPE: \t{0:3.4f} %'.format(mean_absolute_percentage_error(test, forecasts)))
```

```
percentage_error = (100*(test - forecasts) / test).round(4)
forecast_df = pd.concat([round(test,0), round(forecasts,0), percentage_error], axis=1)
forecast_df.columns = ['Ground_truth','Prediction', 'Percentage_error']
forecast_df = forecast_df.loc['2019-01-01':].reset_index().rename(columns={'index': 'Month'})
print(forecast_df)
print('\n Maximum Percentage_error: ', np.abs(forecast_df['Percentage_error']).max(), '%')
```