Rossmann store sales forecast

Introduction

Dirk Rossmann GmbH is Germany's second-largest drug store chain (after dm-drogerie markt), with over 4000 stores in Europe.

You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the “Sales” column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

This data was obtained from Kaggle here.

For folks who want to repeat this use case, the data is stored as a direct download here.

The first few rows of the raw CSV data look like this:

"Store","DayOfWeek","Date","Sales","Customers","Open","Promo","StateHoliday","SchoolHoliday"
1,5,2015-07-31,5263,555,1,1,"0","1"
2,5,2015-07-31,6064,625,1,1,"0","1"
3,5,2015-07-31,8314,821,1,1,"0","1"
4,5,2015-07-31,13995,1498,1,1,"0","1"
5,5,2015-07-31,4822,559,1,1,"0","1"
6,5,2015-07-31,5651,589,1,1,"0","1"
7,5,2015-07-31,15344,1414,1,1,"0","1"
8,5,2015-07-31,8492,833,1,1,"0","1"
9,5,2015-07-31,8565,687,1,1,"0","1"

Assumptions

For this PoC, we will simplify the dataset by dropping a few columns. Many of the following assumptions may be inaccurate, but this is a PoC, and we want to exercise the provided code samples end-to-end. Let us try to forecast the Sales column for each store, keyed by Store ID and using Promo as a feature. First, we will drop:

  • DayOfWeek, assuming most weekdays have similar sales until the weekend
  • Customers, since this is the number of customers for the day, a value we will not have access to until the day is over
  • Open, assuming the stores are open except during holidays, which another column already takes care of
  • SchoolHoliday, since to prove this out we will be considering other columns with dynamic values, like Promo
  • StateHoliday, since the unique values in the column are [0, a, b, c] (see the quick check after this list)
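
If you want to verify that last assumption yourself, a quick pandas check does it (a sketch; rossman-sales.csv is the raw file we work with in the next step):

import pandas as pd

# List the distinct StateHoliday values in the raw data
df = pd.read_csv('rossman-sales.csv', dtype={'StateHoliday': str})
print(df['StateHoliday'].unique())  # expected: ['0' 'a' 'b' 'c'] (order may vary)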

Step 1 - Preprocessing

Step 1.1 - Drop and rearrange columns

From the code on this page on Select, drop or extract Columns, copy the following line:

awk -F "|" '{ print $1 $3 $5 }' Folder/in*csv > outfile.csv

and adapt it for our use case:

awk -F"," '{ {gsub(/\"/,"")}; print $3","$1","$4","$7}' rossman-sales.csv > outfile.csv

Now, the data in outfile.csv looks like this:

Date,Store,Sales,Promo
2015-07-31,1,5263,1
2015-07-31,2,6064,1
2015-07-31,3,8314,1
2015-07-31,4,13995,1
2015-07-31,5,4822,1
2015-07-31,6,5651,1
2015-07-31,7,15344,1
2015-07-31,8,8492,1
2015-07-31,9,8565,1

Note that we also used the awk find-and-replace from the same link to replace all double quotes (") with an empty string (basically, remove all double quotes).
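
If you would rather stay in Python, an equivalent sketch with pandas (which parses and strips the quotes for us) looks like this:

import pandas as pd

# Keep only Date, Store, Sales and Promo, in that order, and write them unquoted
df = pd.read_csv('rossman-sales.csv')
df[['Date', 'Store', 'Sales', 'Promo']].to_csv('outfile.csv', index=False)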

Step 1.2 - Prepare data for DeepAR

From this link, you can create a new Python file with the following contents:

import pandas as pd
import jsonlines
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Read the file produced in Step 1.1; Date becomes the (sorted) index
series = pd.read_csv('outfile.csv', parse_dates=[0], index_col=0)
series.sort_index(inplace=True)

target_column = 'Sales'
group_column = 'Store'

# Label-encode every column except the target (here: Store and Promo)
for col in series.columns:
    if col != target_column:
        series[col] = le.fit_transform(series[col])

# Split the data into one DataFrame per store
if series[group_column].nunique() == 1:
    a = [series]
else:
    a = [v for k, v in series.groupby(group_column)]

out = []

# Build one JSON record per store
for i in range(len(a)):
    dynamic_feat = []
    cat = []
    for col in a[i].columns:
        if col == target_column:
            target = a[i][col].values.tolist()
            start = str(a[i].index[0])
        else:
            if a[i][col].nunique() >= 2:  # if 2 or more values, add as dynamic feature
                dynamic_feat.append(a[i][col].values.astype(float).tolist())
            elif a[i][col].nunique() == 1:  # if 1 value, add as category
                cat.append(int(a[i][col].iloc[0]))
    out.append({'start': start, 'target': target, 'cat': cat, 'dynamic_feat': dynamic_feat})

with jsonlines.open('train-data.jsonl', mode='w') as writer:
    writer.write_all(out)

The only things we changed are the name of the input file ('outfile.csv') and the names of the target and group columns.

Save it as dataprep.py, and run python dataprep.py

After running this file, you should get an output file called train-data.jsonl which looks like:

{"start": "2013-01-01 00:00:00", "target": [0, 5530, 4327, 4486, 4997, 0, 7176, 5580, 5471, 4892, 4881, 4952, 0, 4717, 3900, 4008, 4044, 4127, 5182, 0, 5394, 5720, 5578, 5195, 5586, 5598, 0, 4055, 3725, 4601, 4709, 5633, 5970, 0, 7032, 6049, 6140, 5499, 5681, 5370, 0, 4409, 4015, 4252, 4241, 4809, 6154, 0, 6407, 5386, 5660, 5261, 5000, 5237, 0, 4038, 3794, 4558, 4676, 4611, 5350, 0, 7675, 6300, 5973, 5637, 5853, 5578, 0, 4949, 3853, 4341, 5108, 4925, 5003, 0, 7072, 6563, 5598, 5179, 5506, 5603, 0, 6729, 6686, 6660, 7285, 0, 7132, 0, 0, 5484, 4625, 4293, 4390, 5075, 0, 6046, 5514, 4903, 4366, .......

Step 2 - Training

From this link on training with DeepAR, we first upload our train-data.jsonl file to S3, and make a note of this path. You can also click the file on the S3 console, and hit the Copy Path button.

Since this is daily data, change the ‘1H’ frequency in the hyperparameters to ‘1D’.

Again from this link on training with DeepAR, copy all the code cells into a Python file, or use a local Jupyter notebook (make sure you have the right permissions and have run aws configure); you can also run the cells in a SageMaker notebook.

In this example, we use a SageMaker notebook and add the cells from that link.
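
As a rough sketch of what those cells amount to (assuming SageMaker Python SDK v2; the S3 key prefix, instance type, and hyperparameter values below are placeholders you should adjust to match the linked notebook):

import sagemaker
from sagemaker import image_uris

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Upload the training file to the session's default bucket and note the S3 URI
train_path = session.upload_data('train-data.jsonl', key_prefix='deepar/train')

# Built-in DeepAR container image for the current region
image = image_uris.retrieve('forecasting-deepar', session.boto_region_name)

estimator = sagemaker.estimator.Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    sagemaker_session=session,
)

# Daily data, so time_freq is '1D' rather than the '1H' in the linked example;
# the remaining values are placeholders, not tuned settings
estimator.set_hyperparameters(
    time_freq='1D',
    context_length='30',
    prediction_length='7',
    epochs='20',
)

estimator.fit({'train': train_path})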

At the end of training, you will see the following output:

.
.
.
2020-03-12 21:42:57 Uploading - Uploading generated training model
2020-03-12 21:42:57 Completed - Training job completed
Training seconds: 156
Billable seconds: 156

Step 3 - Deploy model

From this link on deploying a model trained with DeepAR, do the following:

predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

To forecast, follow the instructions in the same link: the endpoint accepts a JSON request containing the recent history for a store and returns the forecast.
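
As a rough sketch of such a request (this follows the documented DeepAR JSON inference format; the history, Promo, and cat values below are made up, and dynamic_feat must cover the supplied history plus the prediction_length chosen at training time):

import json
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

request = {
    'instances': [{
        'start': '2015-07-25 00:00:00',                        # first day of the supplied history
        'target': [5263, 6064, 8314, 4822, 5651, 5000, 5200],  # 7 days of Sales (illustrative)
        'cat': [0],                                            # label-encoded Store ID (illustrative)
        'dynamic_feat': [[1, 1, 1, 0, 0, 1, 1,                 # Promo for the 7 history days...
                          1, 1, 0, 0, 0, 1, 1]],               # ...plus the 7 forecast days
    }],
    'configuration': {
        'num_samples': 100,
        'output_types': ['quantiles'],
        'quantiles': ['0.1', '0.5', '0.9'],
    },
}

forecast = predictor.predict(request)
print(forecast)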

Done!

