Manish Poddar
28 min read · Apr 25, 2021

Building a forecasting model using Vector Autoregression (VAR)

In this blog post, I describe how to build a forecasting model using Vector Autoregression (VAR).

Vector autoregression (VAR) is a statistical model used to capture the relationship between multiple quantities as they change over time. VAR is a type of stochastic process model. VAR models generalize the single-variable (univariate) autoregressive model by allowing for multivariate time series. VAR models are often used in economics and the natural sciences. (from Wikipedia)
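
In its reduced form, a VAR(p) model for a k-dimensional vector y_t is y_t = c + A_1·y_{t-1} + A_2·y_{t-2} + … + A_p·y_{t-p} + ε_t, where c is a vector of intercepts, each A_i is a k×k coefficient matrix and ε_t is a white-noise error vector. In other words, every variable is regressed on p lags of itself and of every other variable in the system.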

Problem Statement

For this demonstration, I have taken a Kaggle problem of Rossmann store sales (https://www.kaggle.com/c/rossmann-store-sales/overview).

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

Since this is an initial effort, we limit the scope of forecasting to stores 1, 3, 8, 9, 13, 25, 29, 31 and 46.

Here, I will perform time-series analysis for the stores mentioned above and build models for stores 1 and 46.

We will explore the VAR model-building approach for each store individually.

The procedure to build a VAR model involves the following steps:

  • Analyze the time series characteristics
  • Test for causation amongst the time series
  • Test for stationarity
  • Transform the series to make it stationary, if needed
  • Find optimal order (p)
  • Prepare training and test datasets
  • Train the model
  • Rollback the transformations, if any.
  • Evaluate the model using a test set
  • Forecast to future
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import Statsmodels
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic
import statsmodels.api as sm

1. Import the datasets

types = {'CompetitionOpenSinceYear': np.dtype(int), 'CompetitionOpenSinceMonth': np.dtype(int),
         'StateHoliday': np.dtype(str), 'Promo2SinceWeek': np.dtype(int),
         'SchoolHoliday': np.dtype(float), 'PromoInterval': np.dtype(str)}
train = pd.read_csv("dataset/train.csv", parse_dates=[2], dtype=types)
store = pd.read_csv("dataset/store.csv")
# see head values for train
train.head()
[Table: first five rows of train]
# see head values for store
store.head(5)
[Table: first five rows of store]

2. Performing time series analysis

# merge data for analysis
train_store = pd.merge(train, store, on='Store')
train_store['Open'] = train_store['Open'].apply(lambda x: 0 if np.isnan(x) else x) #get rid of NaN values if store is 'closed'
#a better way to get more details, timeline view of medians of each day of the week-
day = train_store[(train_store['Open']!=0)] # Looking on open store
sales_day = day.groupby('DayOfWeek')['Sales'].median() # Grouping by day of week for sales
cust_day = day.groupby('DayOfWeek')['Customers'].median() # Grouping by day of week to see customers
fig, (axis1) = plt.subplots(1,1, sharex=True, figsize=(10,5))
# plot median sales
ax1 = sales_day.plot(legend=True, ax=axis1, marker='o',title="Fig: Median sales of each day of the week")
ax1.set_xticks(sales_day.index)
tmp = ax1.set_xticklabels(sales_day.index.tolist(), rotation=90)
# overlay customer data
cust_day.plot(legend=True, ax=axis1, marker='x', secondary_y=True)
[Figure: Median sales and customers for each day of the week]

Conclusion

  • Clearly, we can see high sales and high customer counts at the beginning and end of the week
  • Saturday has the lowest sales
  • Sunday draws the most customers
# For more insights, let's split the Date into Month, Year, MonthYear and day-of-month columns
def date_change(data):
    data['Month'] = data['Date'].apply(lambda x: int(str(x)[5:7]))
    data['Year'] = data['Date'].apply(lambda x: int(str(x)[:4]))
    data['MonthYear'] = data['Date'].apply(lambda x: (str(x)[:7]))
    data['date_int'] = data['Date'].apply(lambda x: (str(x)[8:10]))
    return data
train_store = date_change(train_store)
import calendar
# select all stores that were open
subs = train_store[train_store['Open']!=0]
# groupby Year and Month
selected_sales = subs.groupby(['Year', 'Month'])['Sales'].mean()
selected_cust = subs.groupby(['Year', 'Month'])['Customers'].mean()
# plot
fig, (axis1) = plt.subplots(1,1, figsize=(10,7))
selected_sales.unstack().T.plot(ax=axis1)
tmp = axis1.set_title("Fig: Yearly Sales")
tmp = axis1.set_ylabel("Sales")
tmp = axis1.set_xticks(range(0,13))
tmp = axis1.set_xticklabels(calendar.month_abbr)
[Figure: Yearly Sales]

Conclusion

  • Clearly, we can see that sales increase toward the end of the year
  • All three years in the data set follow quite similar trends
# median sales
median_sales = train_store.groupby('MonthYear')['Sales'].median()
pct_median_change = train_store.groupby('MonthYear')['Sales'].median().pct_change()
# median customers
median_cust = train_store.groupby('MonthYear')['Customers'].median()
pct_median_custchange = train_store.groupby('MonthYear')['Customers'].median().pct_change()

fig, (axis1, axis2) = plt.subplots(2, 1, sharex=True, figsize=(15,15))
# plot median sales
ax1 = median_sales.plot(legend=True, ax=axis1, marker='o',title="Fig : Monthly changes in Sales/Customers")
ax1.set_xticks(range(len(median_sales)))
ax1.set_xticklabels(median_sales.index.tolist(), rotation=90)
# plot pct change
ax2 = pct_median_change.plot(legend=True, ax=axis2, marker='o',rot=90, title="Fig : Percent Change")
# overlay customer data
median_cust.plot(legend=True, ax=axis1, marker='x', secondary_y=True)
pct_median_custchange.plot(legend=True, ax=axis2, marker='x', rot=90, secondary_y=True)
[Figure: Monthly changes in Sales/Customers and Percent Change]

Conclusion

  • Sales and customer counts are directly proportional
  • Sales drop in January
  • December sees the highest sales

Plotting sales at store level

def plot_sales_store(store_id):
    # timeline view of the medians for each day of the week, for a single store
    store_df = train_store[(train_store['Open']!=0)&(train_store['Store']==store_id)] # Looking at open days only
    sales_day = store_df.groupby('DayOfWeek')['Sales'].median() # Grouping by day of week for sales
    cust_day = store_df.groupby('DayOfWeek')['Customers'].median() # Grouping by day of week for customers
    fig, (axis1) = plt.subplots(1,1, sharex=True, figsize=(10,5))
    # plot median sales
    ax1 = sales_day.plot(legend=True, ax=axis1, marker='o', title="Fig. Median sales of each day of the week for store : "+str(store_id))
    ax1.set_xticks(sales_day.index)
    tmp = ax1.set_xticklabels(sales_day.index.tolist(), rotation=90)
    # overlay customer data
    cust_day.plot(legend=True, ax=axis1, marker='x', secondary_y=True)

stores_to_plot = [1,3,8,9,13,25,29,31,46]
for i in stores_to_plot:
    plot_sales_store(i)
[Figures: Median daily sales and customers for stores 1, 3, 8, 9, 13, 25, 29, 31 and 46]

Conclusion: Clearly, we can see that the sales distribution differs from store to store, so we will need to build a separate model for each store.

3. Preprocessing Data sets

# take the store as open if no value is given-
train_store.loc[train_store.Open.isnull(), 'Open'] = 1
# remove the remaining NaNs-
train_store.fillna(0, inplace=True)
# Label encode some features
mappings = {'0':0, 'a':1, 'b':2, 'c':3, 'd':4}
train_store.StoreType.replace(mappings, inplace=True)
train_store.Assortment.replace(mappings, inplace=True)
train_store.StateHoliday.replace(mappings, inplace=True)
# Calculate competition open time in months
train_store['CompetitionOpen'] = 12 * (train_store.Year - train_store.CompetitionOpenSinceYear) + (train_store.Month - train_store.CompetitionOpenSinceMonth)
# Promo open time in months (isocalendar().week replaces the deprecated weekofyear)
train_store['PromoOpen'] = 12 * (train_store.Year - train_store.Promo2SinceYear) + (train_store.Date.dt.isocalendar().week.astype(int) - train_store.Promo2SinceWeek) / 4.0
train_store['PromoOpen'] = train_store.PromoOpen.apply(lambda x: x if x > 0 else 0)
train_store.loc[train_store.Promo2SinceYear == 0, 'PromoOpen'] = 0

Keeping only the stores for which we will build models

train_store = train_store[train_store.Store.isin([1, 3, 8, 9, 13, 25, 29, 31, 46])] # Filtering required data

4. Outlier treatment

# Plotting Sales outliers for the stores we are going to model (stores 1, 3, 8, 9, 13, 25, 29, 31 and 46)
def plot_sales(store_id):
    fig = plt.subplots(figsize=(12, 2))
    ax = sns.boxplot(x=train_store[train_store.Store==store_id]['Sales'], whis=1.5)

for val in [1,3,8,9,13,25,29,31,46]:
    plot_sales(val)
[Figures: Sales box plots for stores 1, 3, 8, 9, 13, 25, 29, 31 and 46]
# Plotting Customers outliers for the stores we are going to model (stores 1, 3, 8, 9, 13, 25, 29, 31 and 46)
def plot_customers(store_id):
    fig = plt.subplots(figsize=(12, 2))
    ax = sns.boxplot(x=train_store[train_store.Store==store_id]['Customers'], whis=1.5)

for val in [1,3,8,9,13,25,29,31,46]:
    plot_customers(val)
[Figures: Customers box plots for stores 1, 3, 8, 9, 13, 25, 29, 31 and 46]
# Capping Customers at the 1st/99th percentiles for a given store
def cap_99_percentils_customers(store_id):
    mask = train_store.Store == store_id
    low, high = train_store.loc[mask, 'Customers'].quantile([0.01, 0.99]).values
    train_store.loc[mask, 'Customers'] = train_store.loc[mask, 'Customers'].clip(lower=low, upper=high)

# Capping Sales at the 1st/99th percentiles for a given store
def cap_99_percentils_sales(store_id):
    mask = train_store.Store == store_id
    low, high = train_store.loc[mask, 'Sales'].quantile([0.01, 0.99]).values
    train_store.loc[mask, 'Sales'] = train_store.loc[mask, 'Sales'].clip(lower=low, upper=high)

for val in [1,3,8,9,13,25,29,31,46]:
    cap_99_percentils_sales(val)     # capping values for sales
    cap_99_percentils_customers(val) # capping values for customers

# Plotting Sales outliers after capping
for val in [1,3,8,9,13,25,29,31,46]:
    plot_sales(val)
[Figures: Sales box plots after capping]
# Plotting Customers outliers after capping
for val in [1,3,8,9,13,25,29,31,46]:
    plot_customers(val)
[Figures: Customers box plots after capping]

Conclusion

  • After capping the outliers, our data set is ready for model building

5. Model Building

5.1.0 Building model for store 1

# Filtering out the data set for store 1
# Setting Date as the index column
train_store_1 = train_store[train_store.Store == 1].copy()
train_store_1.sort_values(by='Date', ascending=True, inplace=True)
train_store_1 = train_store_1.set_index('Date')
train_store_1.head()
[Table: first five rows of train_store_1]
# Checking the shape of data to see if data is missing
train_store_1.shape
(942, 23)

Conclusion

  • There are 942 days between 2013-01-01 and 2015-07-31, so no data is missing
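
As a quick sanity check (a one-liner using the pandas already imported as pd):

print(len(pd.date_range('2013-01-01', '2015-07-31', freq='D')))  # 942 calendar days, inclusive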

5.1.1 Testing Causation using Granger’s Causality Test

The basis behind Vector AutoRegression is that each of the time series in the system influences each other. That is, you can predict the series with past values of itself along with other series in the system.

Using Granger’s Causality Test, it’s possible to test this relationship before even building the model.

So what does Granger’s Causality really test?

Granger’s causality tests the null hypothesis that the coefficients of the past values in the regression equation are zero.

In simpler terms, the null hypothesis is that the past values of one time series (X) do not cause the other series (Y). So, if the p-value obtained from the test is less than the significance level of 0.05, you can safely reject the null hypothesis.
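
As a minimal illustration before building the full matrix (a sketch on just two of the columns; maxlag=7 here is an arbitrary choice), the raw statsmodels test can be called directly:

from statsmodels.tsa.stattools import grangercausalitytests

# Does Promo Granger-cause Sales? The second column is tested as a predictor of the first.
res = grangercausalitytests(train_store_1[['Sales', 'Promo']], maxlag=7, verbose=False)
print(round(res[1][0]['ssr_chi2test'][1], 4))  # p-value of the ssr chi2 test at lag 1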

The below code implements the Granger’s Causality test for all possible combinations of the time series in a given data frame and stores the p-values of each combination in the output matrix.

train_store_1.drop(['Store'], axis=1, inplace=True)      # already filtered for store 1
train_store_1.drop(['StoreType'], axis=1, inplace=True)  # constant for a particular store
train_store_1.drop(['Assortment'], axis=1, inplace=True) # constant for a particular store
# These don't vary with time, so they could instead be treated as exogenous variables in a VARMAX model
train_store_1.drop(['CompetitionDistance','CompetitionOpenSinceMonth','CompetitionOpenSinceYear','MonthYear','Promo2','Promo2SinceYear','PromoInterval','PromoOpen','Promo2SinceWeek'], axis=1, inplace=True)
train_store_1.columns
Index(['DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday',
       'SchoolHoliday', 'Month', 'Year', 'date_int', 'CompetitionOpen'],
      dtype='object')
from statsmodels.tsa.stattools import grangercausalitytests
maxlag = 12
test = 'ssr_chi2test'

def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):
    """Check Granger Causality of all possible combinations of the time series.
    The rows are the response variables, the columns are the predictors. The values in
    the table are the p-values. A p-value less than the significance level (0.05) means
    the null hypothesis -- that the coefficients of the corresponding past values are
    zero, i.e. X does not cause Y -- can be rejected.

    data      : pandas dataframe containing the time series variables
    variables : list containing the names of the time series variables.
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1], 4) for i in range(maxlag)]
            if verbose:
                print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

grangers_causation_matrix(train_store_1, variables=train_store_1.columns)
[Table: Granger causality p-value matrix for store 1]
train_store_1.drop(['Month','Year','date_int','CompetitionOpen'], axis=1, inplace=True) # dropping columns based on the causality results
train_store_1.head()
[Table: first five rows of train_store_1 after dropping the non-causal columns]

5.1.2 Cointegration Test

The cointegration test helps to establish the presence of a statistically significant connection between two or more time series.

But, what does Cointegration mean?

To understand that, you first need to know what the ‘order of integration’ (d) is.

Order of integration(d) is nothing but the number of differences required to make a non-stationary time series stationary.
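
For intuition, here is a minimal sketch (assuming a pandas Series and the adfuller import from the top of the post) that estimates d for a single series by differencing until the ADF test rejects the unit-root null:

def order_of_integration(series, signif=0.05, max_d=3):
    # keep differencing until the ADF p-value drops below the significance level
    for d in range(max_d + 1):
        if adfuller(series.dropna(), autolag='AIC')[1] <= signif:
            return d
        series = series.diff()
    return max_d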

Now, when you have two or more time series, and there exists a linear combination of them that has an order of integration (d) less than that of the individual series, then the collection of series is said to be cointegrated.

When two or more time series are cointegrated, it means they have a long-run, statistically significant relationship.

This is the basic premise on which Vector Autoregression (VAR) models are based. So, it’s fairly common to implement the cointegration test before starting to build VAR models.

Alright, so how do we run this test?

Søren Johansen, in his 1991 paper, devised a procedure to implement the cointegration test.

from statsmodels.tsa.vector_ar.vecm import coint_johansen

def cointegration_test(df, alpha=0.05):
    """Perform Johansen's Cointegration Test and report a summary."""
    out = coint_johansen(df, -1, 5)
    d = {'0.90': 0, '0.95': 1, '0.99': 2}
    traces = out.lr1
    cvts = out.cvt[:, d[str(1-alpha)]]
    def adjust(val, length=6):
        return str(val).ljust(length)

    # Summary
    print('Name :: Test Stat > C(95%) => Signif \n', '--'*20)
    for col, trace, cvt in zip(df.columns, traces, cvts):
        print(adjust(col), ':: ', adjust(round(trace, 2), 9), ">", adjust(cvt, 8), ' => ', trace > cvt)

cointegration_test(train_store_1)
Name :: Test Stat > C(95%) => Signif
----------------------------------------
DayOfWeek :: 1450.79 > 111.7797 => True
Sales :: 640.86 > 83.9383 => True
Customers :: 397.66 > 60.0627 => True
Open :: 170.33 > 40.1749 => True
Promo :: 83.99 > 24.2761 => True
StateHoliday :: 40.15 > 12.3212 => True
SchoolHoliday :: 0.05 > 4.1296 => False

5.1.3 Split the series into training and test data and standardize Customers and Sales

Standardizing customer and sales data

from sklearn import preprocessing
# Columns to standardize
names = ['Sales','Customers']
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit the scaler on Sales and Customers and replace them with their standardized values
scaled_df = pd.DataFrame(scaler.fit_transform(train_store_1[names]), columns=names, index=train_store_1.index)
train_store_1.Sales = scaled_df.Sales
train_store_1.Customers = scaled_df.Customers
train_store_1.head()
[Table: first five rows of train_store_1 with Sales and Customers standardized]

Finding the date from which to split the data into train and test sets

from datetime import datetime, timedelta
start_index = train_store_1.index.min()
end_index = train_store_1.index.max()
print(start_index) # Start index
print(end_index) # End index
print(train_store_1.index.max() - timedelta(days=42)) # the last 42 days are held out for forecasting
2013-01-01 00:00:00
2015-07-31 00:00:00
2015-06-19 00:00:00

Splitting data into train and test set

df_train, df_test = train_store_1.loc['2013-01-01':'2015-06-19'], train_store_1.loc['2015-06-20':]

# Check size
print(df_train.shape) # (900, 7)
print(df_test.shape) # (42, 7)
(900, 7)
(42, 7)

5.1.4 Check for Stationarity and Make the Time Series Stationary

Since the VAR model requires the time series you want to forecast to be stationary, it is customary to check all the time series in the system for stationarity.

Just to refresh, a stationary time series is one whose characteristics, like mean and variance, do not change over time.
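
To build intuition, here is a toy sketch on simulated data (not part of the Rossmann pipeline) contrasting a stationary series with a random walk:

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)    # stationary by construction
random_walk = np.cumsum(white_noise)  # has a unit root: non-stationary

print(adfuller(white_noise, autolag='AIC')[1])  # small p-value -> reject the unit-root null
print(adfuller(random_walk, autolag='AIC')[1])  # typically a large p-value -> cannot reject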

Let's use the ADF test for the same

def adfuller_test(series, signif=0.05, name='', verbose=False):
    """Perform the ADF test for stationarity of the given series and print a report."""
    r = adfuller(series, autolag='AIC')
    output = {'test_statistic': round(r[0], 4), 'pvalue': round(r[1], 4), 'n_lags': round(r[2], 4), 'n_obs': r[3]}
    p_value = output['pvalue']
    def adjust(val, length=6):
        return str(val).ljust(length)

    # Print Summary
    print(f' Augmented Dickey-Fuller Test on "{name}"', "\n ", '-'*47)
    print(f' Null Hypothesis: Data has unit root. Non-Stationary.')
    print(f' Significance Level = {signif}')
    print(f' Test Statistic = {output["test_statistic"]}')
    print(f' No. Lags Chosen = {output["n_lags"]}')

    for key, val in r[4].items():
        print(f' Critical value {adjust(key)} = {round(val, 3)}')

    if p_value <= signif:
        print(f" => P-Value = {p_value}. Rejecting Null Hypothesis.")
        print(f" => Series is Stationary.")
    else:
        print(f" => P-Value = {p_value}. Weak evidence to reject the Null Hypothesis.")
        print(f" => Series is Non-Stationary.")

# ADF Test on each column
for name, column in df_train.iteritems():
    adfuller_test(column, name=column.name)
    print('\n')
Augmented Dickey-Fuller Test on "DayOfWeek"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -287422830162108.06
No. Lags Chosen = 7
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.568
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Sales"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -4.2531
No. Lags Chosen = 21
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0005. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Customers"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -3.6017
No. Lags Chosen = 21
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0057. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Open"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -5.7249
No. Lags Chosen = 21
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Promo"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -8.7921
No. Lags Chosen = 21
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "StateHoliday"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -14.1637
No. Lags Chosen = 3
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.568
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "SchoolHoliday"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -5.4211
No. Lags Chosen = 12
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Conclusion

  • Here we can see that all the columns used are stationary
  • So we can proceed with this data set

5.1.5 How to Select the Order (P) of the VAR model

To select the right order of the VAR model, we iteratively fit increasing orders of the VAR model and pick the order that gives a model with the least AIC.

Though the usual practice is to look at the AIC, you can also check other best fit comparison estimates of BIC, FPE, and HQIC.

model = VAR(df_train)
for i in [1,2,3,4,5,6,7,8,9]:
    result = model.fit(i)
    print('Lag Order =', i)
    print('AIC : ', result.aic)
    print('BIC : ', result.bic)
    print('FPE : ', result.fpe)
    print('HQIC: ', result.hqic, '\n')
Lag Order = 1
AIC : 6.478281342619523
BIC : 6.777360152330769
FPE : 650.8535343606709
HQIC: 6.592537697350863

Lag Order = 2
AIC : 5.837179212729006
BIC : 6.398446314256546
FPE : 342.81843602637866
HQIC: 6.051610171083002

Lag Order = 3
AIC : 5.023721840952075
BIC : 5.8476400157091035
FPE : 151.98635009225518
HQIC: 5.338514926446016

Lag Order = 4
AIC : 4.0603131745217445
BIC : 5.147346549336377
FPE : 58.00165239125097
HQIC: 4.475656469676899

Lag Order = 5
AIC : 1.885875402073503
BIC : 3.2364894545933502
FPE : 6.59412695453832
HQIC: 2.4019575506910904

Lag Order = 6
AIC : -49.26685529477507
BIC : -47.65219373065065
FPE : 4.0170056049621183e-22
HQIC: -48.64984508532618

Lag Order = 7
AIC : -53.584712126042945
BIC : -51.705534854702776
FPE : 5.355643143425765e-24
HQIC: -52.86658408252669

Lag Order = 8
AIC : -46.87757071756594
BIC : -44.73340817619887
FPE : 4.3838920141864546e-21
HQIC: -46.058134498567696

Lag Order = 9
AIC : -48.67645510602825
BIC : -46.266836359106904
FPE : 7.258344596034241e-22
HQIC: -47.75551979963143

We can also use the select_order method and inspect its summary to find the optimal lag.

x = model.select_order(maxlags=50)
x.summary()
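
If you prefer to read the winners programmatically rather than scanning the summary table, the returned LagOrderResults object also exposes them (a small sketch; I believe recent statsmodels versions provide this attribute):

print(x.selected_orders)  # e.g. {'aic': ..., 'bic': ..., 'hqic': ..., 'fpe': ...}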

Conclusion

  • From the above summary report, lag = 17 gives the best metrics, so we build the model with lag = 17
model_fitted = model.fit(17)
model_fitted.summary()


Summary of Regression Results
==================================
Model: VAR
Method: OLS
Date: Mon, 12, Apr, 2021
Time: 10:47:22
--------------------------------------------------------------------
No. of Equations: 7.00000 BIC: -51.5468
Nobs: 883.000 HQIC: -54.3573
Log likelihood: 16836.4 FPE: 4.38993e-25
AIC: -56.0972 Det(Omega_mle): 1.79919e-25
--------------------------------------------------------------------
# Get the lag order
lag_order = model_fitted.k_ar
print(lag_order)

# Input data for forecasting
forecast_input = df_train.values[-lag_order:]
forecast_input
17


array([[3.000e+00, 5.809e+03, 6.160e+02, 1.000e+00, 1.000e+00, 0.000e+00,
0.000e+00],
[4.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.000e+00, 1.000e+00,
0.000e+00],
[5.000e+00, 5.384e+03, 6.090e+02, 1.000e+00, 1.000e+00, 0.000e+00,
0.000e+00],
[6.000e+00, 4.183e+03, 4.600e+02, 1.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[7.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[1.000e+00, 4.071e+03, 5.020e+02, 1.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[2.000e+00, 4.102e+03, 4.850e+02, 1.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[3.000e+00, 3.591e+03, 4.530e+02, 1.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[4.000e+00, 3.627e+03, 4.420e+02, 1.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[5.000e+00, 3.695e+03, 4.220e+02, 1.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[6.000e+00, 4.256e+03, 5.020e+02, 1.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[7.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
0.000e+00],
[1.000e+00, 5.518e+03, 5.860e+02, 1.000e+00, 1.000e+00, 0.000e+00,
0.000e+00],
[2.000e+00, 4.852e+03, 5.030e+02, 1.000e+00, 1.000e+00, 0.000e+00,
0.000e+00],
[3.000e+00, 4.000e+03, 4.760e+02, 1.000e+00, 1.000e+00, 0.000e+00,
0.000e+00],
[4.000e+00, 4.645e+03, 4.980e+02, 1.000e+00, 1.000e+00, 0.000e+00,
0.000e+00],
[5.000e+00, 4.202e+03, 4.870e+02, 1.000e+00, 1.000e+00, 0.000e+00,
0.000e+00]])

5.1.6 Model evaluation

Forecasting with the results obtained

# Forecast
fc = model_fitted.forecast(y=forecast_input, steps=42)
df_forecast = pd.DataFrame(fc, index=train_store_1.index[-42:], columns=train_store_1.columns)
df_forecast
[Table: 42-day forecast for store 1]
fig, axes = plt.subplots(nrows=int(len(train_store_1.columns)/2), ncols=2, dpi=150, figsize=(10,10))
for i, (col, ax) in enumerate(zip(train_store_1.columns, axes.flatten())):
    df_forecast[col].plot(legend=True, ax=ax).autoscale(axis='x', tight=True)
    df_test[col][-42:].plot(legend=True, ax=ax)
    ax.set_title(col + ": Forecast vs Actuals")
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    ax.spines["top"].set_alpha(0)
    ax.tick_params(labelsize=6)

plt.tight_layout();
[Figures: Forecast vs Actuals for each series, store 1]

Conclusion

  • Clearly, we can see that we are able to forecast sales quite accurately
  • The number of customers coming to the store is also forecast quite accurately

Let’s see other metrics for our predictions

from statsmodels.tsa.stattools import acf
def forecast_accuracy(forecast, actual):
    mape = np.mean(np.abs(forecast - actual)/np.abs(actual))  # MAPE
    me = np.mean(forecast - actual)                           # ME
    mae = np.mean(np.abs(forecast - actual))                  # MAE
    mpe = np.mean((forecast - actual)/actual)                 # MPE
    rmse = np.mean((forecast - actual)**2)**.5                # RMSE
    corr = np.corrcoef(forecast, actual)[0,1]                 # corr
    mins = np.amin(np.hstack([forecast[:,None], actual[:,None]]), axis=1)
    maxs = np.amax(np.hstack([forecast[:,None], actual[:,None]]), axis=1)
    minmax = 1 - np.mean(mins/maxs)                           # minmax
    return({'mape':mape, 'me':me, 'mae': mae,
            'mpe': mpe, 'rmse':rmse, 'corr':corr, 'minmax':minmax})

print('Forecast Accuracy of: Sales')
def adjust(val, length=6): return str(val).ljust(length)
accuracy_prod = forecast_accuracy(df_forecast['Sales'].values, df_test.Sales)
for k, v in accuracy_prod.items():
    print(adjust(k), ': ', round(v,4))
Forecast Accuracy of: Sales
mape : inf
me : -21.6444
mae : 396.8509
mpe : nan
rmse : 504.838
corr : 0.9541
minmax : inf
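
The inf values in the run above come from days when the store is closed and actual sales are zero, which blows up the percentage-based metrics. One hedged workaround (a sketch reusing the forecast_accuracy and adjust helpers above) is to evaluate only on days when the store was open:

open_mask = df_test.Open.values > 0  # keep only days when the store was open
accuracy_open = forecast_accuracy(df_forecast['Sales'].values[open_mask],
                                  df_test.Sales.values[open_mask])
for k, v in accuracy_open.items():
    print(adjust(k), ': ', round(v, 4))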

5.2.0 Building model for store 46

# Filtering out the data set for store 46
# Setting Date as the index column
train_store_46 = train_store[train_store.Store == 46].copy()
train_store_46.sort_values(by='Date', ascending=True, inplace=True)
train_store_46 = train_store_46.set_index('Date')
train_store_46.head()
[Table: first five rows of train_store_46]
# Checking the shape of the data to see if anything is missing
train_store_46.shape
(758, 23)

5.2.1 Handling missing values

It seems that we have missing data for store 46, so let's impute the missing values.

train_store_46['Sales'].plot(figsize=(12, 4))
plt.legend(loc='best')
plt.title('Sales')
plt.show(block=False)
[Figure: Sales for store 46]

Clearly, from the graph we can see that we have about 6 months of missing data, from 2014-07-01 to 2014-12-31, and that the corresponding dates are missing from the index as well.

So we first create the missing date index with null values and then impute those values.

# Generating the missing date index
start = pd.Timestamp('2014-07-01 00:00:00') # start date
end = pd.Timestamp('2014-12-31 00:00:00') # end date
null_df = pd.DataFrame({'DayOfWeek': np.nan,'Sales': np.nan,'Customers':np.nan,'Open':np.nan,'Promo':np.nan,'StateHoliday':np.nan,'SchoolHoliday':np.nan}, index=pd.date_range(start, end, freq='D'))
train_store_46 = pd.concat([train_store_46.loc[:'2014-07-01'], null_df, train_store_46.loc['2015-01-01':]]) # inserting the missing index
# Checking the series after inserting the missing index
train_store_46['Sales'].plot(figsize=(12, 4))
plt.legend(loc='best')
plt.title('Sales')
plt.show(block=False)
[Figure: Sales for store 46 with the missing dates inserted as NaN]

Imputing the values

Imputing Sales values

train_store_46.Sales.fillna(train_store_46.Sales.mean(), inplace=True)
# Checking how the missing values were imputed
train_store_46['Sales'].plot(figsize=(12, 4))
plt.legend(loc='best')
plt.title('Sales')
plt.show(block=False)
[Figure: Sales for store 46 after mean imputation]
# Imputing Customers
train_store_46.Customers.fillna(train_store_46.Customers.mean(), inplace=True)
# Imputing Open values
train_store_46.Open.fillna(train_store_46.Open.mode()[0], inplace=True)
# Imputing Promo values
train_store_46.Promo.fillna(train_store_46.Promo.mode()[0], inplace=True)
# Imputing state holiday values
train_store_46.StateHoliday.fillna(train_store_46.StateHoliday.mode()[0], inplace=True)
# Imputing school holiday values
train_store_46.SchoolHoliday.fillna(train_store_46.SchoolHoliday.mode()[0], inplace=True)
# Imputing day of week values
train_store_46.DayOfWeek.fillna(train_store_46.DayOfWeek.mode()[0], inplace=True)

5.2.2 Testing Causation using Granger’s Causality Test

The basis behind Vector AutoRegression is that each of the time series in the system influences each other. That is, you can predict the series with past values of itself along with other series in the system.

Using Granger’s Causality Test, it’s possible to test this relationship before even building the model.

So what does Granger’s Causality really test?

Granger’s causality tests the null hypothesis that the coefficients of past values in the regression equation are zero.

In simpler terms, the null hypothesis is that the past values of one time series (X) do not cause the other series (Y). So, if the p-value obtained from the test is less than the significance level of 0.05, you can safely reject the null hypothesis.

The below code implements the Granger’s Causality test for all possible combinations of the time series in a given data frame and stores the p-values of each combination in the output matrix.

train_store_46.drop(['Store'], axis=1, inplace=True)      # already filtered for store 46
train_store_46.drop(['StoreType'], axis=1, inplace=True)  # constant for a particular store
train_store_46.drop(['Assortment'], axis=1, inplace=True) # constant for a particular store
# These don't vary with time, so they could instead be treated as exogenous variables in a VARMAX model
train_store_46.drop(['CompetitionDistance','CompetitionOpenSinceMonth','CompetitionOpenSinceYear','MonthYear','Promo2','Promo2SinceYear','PromoInterval','PromoOpen','Promo2SinceWeek'],axis=1,inplace=True)
from statsmodels.tsa.stattools import grangercausalitytests
maxlag = 12
# remove NaNs-
train_store_46.fillna(0, inplace=True)
test = 'ssr_chi2test'

def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):
    """Check Granger Causality of all possible combinations of the time series.
    The rows are the response variables, the columns are the predictors. The values in
    the table are the p-values. A p-value less than the significance level (0.05) means
    the null hypothesis -- that the coefficients of the corresponding past values are
    zero, i.e. X does not cause Y -- can be rejected.

    data      : pandas dataframe containing the time series variables
    variables : list containing the names of the time series variables.
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1], 4) for i in range(maxlag)]
            if verbose:
                print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

grangers_causation_matrix(train_store_46, variables=train_store_46.columns)
[Table: Granger causality p-value matrix for store 46]
train_store_46.drop(['Month','Year','date_int','CompetitionOpen'], axis=1, inplace=True) # p-values are greater than 0.05
grangers_causation_matrix(train_store_46, variables=train_store_46.columns) # checking the p-values after dropping the columns
[Table: Granger causality p-value matrix after dropping the non-causal columns]

5.2.3 Cointegration Test

The cointegration test helps to establish the presence of a statistically significant connection between two or more time series.

But, what does Cointegration mean?

To understand that, you first need to know what the ‘order of integration’ (d) is.

Order of integration(d) is nothing but the number of differences required to make a non-stationary time series stationary.

Now, when you have two or more time series, and there exists a linear combination of them that has an order of integration (d) less than that of the individual series, then the collection of series is said to be cointegrated.

When two or more time series are cointegrated, it means they have a long run, statistically significant relationship.

This is the basic premise on which Vector Autoregression (VAR) models are based. So, it’s fairly common to implement the cointegration test before starting to build VAR models.

Alright, so how do we run this test?

Søren Johansen, in his 1991 paper, devised a procedure to implement the cointegration test.

from statsmodels.tsa.vector_ar.vecm import coint_johansen

def cointegration_test(df, alpha=0.05):
    """Perform Johansen's Cointegration Test and report a summary."""
    out = coint_johansen(df, -1, 5)
    d = {'0.90': 0, '0.95': 1, '0.99': 2}
    traces = out.lr1
    cvts = out.cvt[:, d[str(1-alpha)]]
    def adjust(val, length=6):
        return str(val).ljust(length)

    # Summary
    print('Name :: Test Stat > C(95%) => Signif \n', '--'*20)
    for col, trace, cvt in zip(df.columns, traces, cvts):
        print(adjust(col), ':: ', adjust(round(trace, 2), 9), ">", adjust(cvt, 8), ' => ', trace > cvt)

cointegration_test(train_store_46)
Name :: Test Stat > C(95%) => Signif
----------------------------------------
DayOfWeek :: 1333.84 > 111.7797 => True
Sales :: 621.47 > 83.9383 => True
Customers :: 362.73 > 60.0627 => True
Open :: 132.4 > 40.1749 => True
Promo :: 54.5 > 24.2761 => True
StateHoliday :: 3.89 > 12.3212 => False
SchoolHoliday :: 0.11 > 4.1296 => False

5.2.4 Split the series into training and test data and standardize Customers and Sales

Standardizing customer and sales data

from sklearn import preprocessing
# Columns to standardize
names = ['Sales','Customers']
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit the scaler on Sales and Customers and replace them with their standardized values
scaled_df = pd.DataFrame(scaler.fit_transform(train_store_46[names]), columns=names, index=train_store_46.index)
train_store_46.Sales = scaled_df.Sales
train_store_46.Customers = scaled_df.Customers
train_store_46.head()
[Table: first five rows of train_store_46 with Sales and Customers standardized]

Finding the date from which to split the data into train and test sets

from datetime import datetime, timedelta
start_index = train_store_46.index.min()
end_index = train_store_46.index.max()
print(start_index) # Start index
print(end_index) # End index
print(train_store_46.index.max() - timedelta(days=42)) # the last 42 days are held out for forecasting
2013-01-01 00:00:00
2015-07-31 00:00:00
2015-06-19 00:00:00

Splitting data into train and test set

df_train, df_test = train_store_46.loc['2013-01-01':'2015-06-19'], train_store_46.loc['2015-06-20':]

# Check size
print(df_train.shape) # (900, 7)
print(df_test.shape) # (42, 7)
(900, 7)
(42, 7)

5.2.5 Check for Stationarity and Make the Time Series Stationary

Since the VAR model requires the time series you want to forecast to be stationary, it is customary to check all the time series in the system for stationarity.

Just to refresh, a stationary time series is one whose characteristics, like mean and variance, do not change over time.

Let's use the ADF test for the same

def adfuller_test(series, signif=0.05, name='', verbose=False):
    """Perform the ADF test for stationarity of the given series and print a report."""
    r = adfuller(series, autolag='AIC')
    output = {'test_statistic': round(r[0], 4), 'pvalue': round(r[1], 4), 'n_lags': round(r[2], 4), 'n_obs': r[3]}
    p_value = output['pvalue']
    def adjust(val, length=6):
        return str(val).ljust(length)

    # Print Summary
    print(f' Augmented Dickey-Fuller Test on "{name}"', "\n ", '-'*47)
    print(f' Null Hypothesis: Data has unit root. Non-Stationary.')
    print(f' Significance Level = {signif}')
    print(f' Test Statistic = {output["test_statistic"]}')
    print(f' No. Lags Chosen = {output["n_lags"]}')

    for key, val in r[4].items():
        print(f' Critical value {adjust(key)} = {round(val, 3)}')

    if p_value <= signif:
        print(f" => P-Value = {p_value}. Rejecting Null Hypothesis.")
        print(f" => Series is Stationary.")
    else:
        print(f" => P-Value = {p_value}. Weak evidence to reject the Null Hypothesis.")
        print(f" => Series is Non-Stationary.")

# ADF Test on each column
for name, column in df_train.iteritems():
    adfuller_test(column, name=column.name)
    print('\n')
Augmented Dickey-Fuller Test on "DayOfWeek"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -1.7257
No. Lags Chosen = 16
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.4179. Weak evidence to reject the Null Hypothesis.
=> Series is Non-Stationary.


Augmented Dickey-Fuller Test on "Sales"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -5.8957
No. Lags Chosen = 20
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Customers"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -1.5412
No. Lags Chosen = 20
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.513. Weak evidence to reject the Null Hypothesis.
=> Series is Non-Stationary.


Augmented Dickey-Fuller Test on "Open"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -1.5718
No. Lags Chosen = 20
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.4977. Weak evidence to reject the Null Hypothesis.
=> Series is Non-Stationary.


Augmented Dickey-Fuller Test on "Promo"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -3.0526
No. Lags Chosen = 21
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0303. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "StateHoliday"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -13.7885
No. Lags Chosen = 3
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.568
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "SchoolHoliday"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -7.2939
No. Lags Chosen = 12
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Conclusion

Clearly, we can see from the above results that the following columns are not stationary:

  • DayOfWeek
  • Customers
  • Open

Working on making the above series stationary

# 1st difference
df_differenced = df_train.diff().dropna()
# ADF Test on each column of the 1st-differenced dataframe
for name, column in df_differenced.iteritems():
    adfuller_test(column, name=column.name)
    print('\n')
Augmented Dickey-Fuller Test on "DayOfWeek"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -8.0782
No. Lags Chosen = 15
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Sales"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -14.7638
No. Lags Chosen = 19
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Customers"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -9.6245
No. Lags Chosen = 19
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Open"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -8.9815
No. Lags Chosen = 19
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "Promo"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -10.3626
No. Lags Chosen = 21
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "StateHoliday"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -14.8724
No. Lags Chosen = 11
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.


Augmented Dickey-Fuller Test on "SchoolHoliday"
-----------------------------------------------
Null Hypothesis: Data has unit root. Non-Stationary.
Significance Level = 0.05
Test Statistic = -10.3797
No. Lags Chosen = 17
Critical value 1% = -3.438
Critical value 5% = -2.865
Critical value 10% = -2.569
=> P-Value = 0.0. Rejecting Null Hypothesis.
=> Series is Stationary.

Conclusion

  • We can see that after first-order differencing, all the series are stationary
  • Now we can proceed with this data set

5.2.6 How to Select the Order (P) of the VAR model

To select the right order of the VAR model, we iteratively fit increasing orders of the VAR model and pick the order that gives a model with the least AIC.

Though the usual practice is to look at the AIC, you can also check other best fit comparison estimates of BIC, FPE, and HQIC.

model = VAR(df_differenced)
for i in [1,2,3,4,5,6,7,8,9]:
    result = model.fit(i)
    print('Lag Order =', i)
    print('AIC : ', result.aic)
    print('BIC : ', result.bic)
    print('FPE : ', result.fpe)
    print('HQIC: ', result.hqic, '\n')
Lag Order = 1
AIC : 9.109819662826853
BIC : 9.409162116974874
FPE : 9043.693676615094
HQIC: 9.22418284061565

Lag Order = 2
AIC : 8.088501294529548
BIC : 8.65026368640934
FPE : 3256.874006536901
HQIC: 8.303132943729961

Lag Order = 3
AIC : 7.598803323087045
BIC : 8.423449331567111
FPE : 1995.9440548408095
HQIC: 7.9138913401013005

Lag Order = 4
AIC : 6.531762270987129
BIC : 7.619756924405895
FPE : 686.7162145511694
HQIC: 6.947495112929074

Lag Order = 5
AIC : 4.475519649993162
BIC : 5.827329331585696
FPE : 87.8670316116561
HQIC: 4.992086336973632

Lag Order = 6
AIC : 2.504938582041416
BIC : 4.121031035393965
FPE : 12.249196366893853
HQIC: 3.122528699465392

Lag Order = 7
AIC : 1.8197855375197989
BIC : 3.700629872052314
FPE : 6.175618235608594
HQIC: 2.538589238395451

Lag Order = 8
AIC : 1.7206067032204424
BIC : 3.8666733996972695
FPE : 5.594765938672929
HQIC: 2.5408147104801135

Lag Order = 9
AIC : 1.6317344199792654
BIC : 4.043495336048177
FPE : 5.121642213442159
HQIC: 2.5535380288124863

We can also use the select_order method and inspect its summary to find the optimal lag.

x = model.select_order(maxlags=50)
x.summary()

Conclusion

  • At lag = 13 we get the best result, so we build the model with lag = 13
model_fitted = model.fit(13)
model_fitted.summary()
Summary of Regression Results
==================================
Model: VAR
Method: OLS
Date: Mon, 12, Apr, 2021
Time: 10:53:07
--------------------------------------------------------------------
No. of Equations: 7.00000 BIC: 4.63725
Nobs: 886.000 HQIC: 2.48807
Log likelihood: -8669.23 FPE: 3.20022
AIC: 1.15796 Det(Omega_mle): 1.60267
--------------------------------------------------------------------
# Get the lag order
lag_order = model_fitted.k_ar
print(lag_order)

# Input data for forecasting
forecast_input = df_differenced.values[-lag_order:]
forecast_input
13


array([[ 1.000e+00, -4.256e+03, -6.020e+02, -1.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[-6.000e+00, 4.388e+03, 5.680e+02, 1.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, -4.770e+02, -4.100e+01, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, 2.770e+02, -1.000e+01, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, 3.000e+00, 2.400e+01, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, 1.040e+02, 2.600e+01, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, 5.380e+02, -7.000e+00, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, -4.833e+03, -5.600e+02, -1.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[-6.000e+00, 8.239e+03, 8.280e+02, 1.000e+00, 1.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, -1.587e+03, -1.450e+02, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, -2.990e+02, -1.000e+01, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, -5.660e+02, -8.000e+00, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00],
[ 1.000e+00, 6.860e+02, 4.500e+01, 0.000e+00, 0.000e+00,
0.000e+00, 0.000e+00]])

5.2.7 Model evaluation

Forecasting with the results obtained

# Forecast
fc = model_fitted.forecast(y=forecast_input, steps=42)
df_forecast = pd.DataFrame(fc, index=train_store_46.index[-42:], columns=train_store_46.columns)
df_forecast
[Table: 42-day forecast for store 46 on the differenced scale]
# Invert the transformation to get the real forecast
def invert_transformation(df_train, df_forecast, second_diff=False):
    """Revert the differencing to bring the forecast back to the original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    for col in columns:
        # Roll back 2nd Diff
        if second_diff:
            df_fc[str(col)] = (df_train[col].iloc[-1] - df_train[col].iloc[-2]) + df_fc[str(col)].cumsum()
        # Roll back 1st Diff
        df_fc[str(col)] = df_train[col].iloc[-1] + df_fc[str(col)].cumsum()
    return df_fc

df_results = invert_transformation(df_train, df_forecast, second_diff=False)

# Plotting to see the result obtained
fig, axes = plt.subplots(nrows=int(len(train_store_46.columns)/2), ncols=2, dpi=150, figsize=(10,10))
for i, (col, ax) in enumerate(zip(train_store_46.columns, axes.flatten())):
    df_results[col].plot(legend=True, ax=ax).autoscale(axis='x', tight=True)
    df_test[col][-42:].plot(legend=True, ax=ax)
    ax.set_title(col + ": Forecast vs Actuals")
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    ax.spines["top"].set_alpha(0)
    ax.tick_params(labelsize=6)

plt.tight_layout();
[Figures: Forecast vs Actuals for each series, store 46]

Conclusion

  • Clearly, we can see that we are able to forecast sales quite accurately
  • The number of customers coming to the store is also forecast quite accurately

Let’s see other metrics for our predictions
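
Mirroring the store 1 evaluation, here is a sketch that reuses the forecast_accuracy and adjust helpers defined in section 5.1.6, applied to the inverted forecast df_results (the resulting numbers will depend on your run):

print('Forecast Accuracy of: Sales')
accuracy_prod = forecast_accuracy(df_results['Sales'].values, df_test.Sales.values)
for k, v in accuracy_prod.items():
    print(adjust(k), ': ', round(v, 4))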
