# About¶

**Entry:** Data Science society Datathlon 2018
**Case:** SAP Case
**Dataset:** available here
**Authors:** Hristo Piyankov (hpiyankov@gmail.com)
**Notes:** not all calculations and graphs are carried out in Python, due to time constraints

# Business understanding¶

The goal of the study is to understand the drivers behind sales uplift in relation to the company's own pricing strategies, promotions and competitors, and to identify possible action points for optimisation of sales and revenue.

# Data understanding¶

## Sample¶

A sample from the provided dataset, with basic statistics, is shown below.

```
import pandas as pd
```

```
path = "/home/hpiyankov/Documents/Datathlon/Datathlon 2018/"
df = pd.read_excel(path+"data.xlsx")
```

```
df.info()
```

```
df.head()
```

## Observations¶

The data consists of the company's price and sales volume figures on a weekly level, promotion information, plus pricing information for 7 competitors. The data is rather scarce at only 146 observations.

# Data preparation¶

## Basic modifications¶

Needed data modifications:

- Competitor data is coded as 0 for missing values. These need to be recoded to nulls so that proper aggregations can be done.
- Type of promotion is to be recoded into dummy variables. We have 5 promotion types plus weeks with no promotion, thus 5 dummy variables are created: PROMOTION_A...PROMOTION_E.
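A minimal sketch of both steps, using hypothetical column names and toy values (the real dataset's columns may differ):

```python
import numpy as np
import pandas as pd

# Hypothetical columns mirroring the dataset's structure
df = pd.DataFrame({
    "COMPETITOR_1": [3.2, 0, 2.9],   # 0 codes a missing competitor price
    "PROMOTION": ["A", None, "E"],   # None = week with no promotion
})

# Recode 0 to NaN so aggregations (min, mean) skip missing prices
comp_cols = [c for c in df.columns if c.startswith("COMPETITOR")]
df[comp_cols] = df[comp_cols].replace(0, np.nan)

# One dummy per promotion type; no-promotion weeks get all zeros
df = pd.concat([df, pd.get_dummies(df["PROMOTION"], prefix="PROMOTION")], axis=1)
```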

## Outliers and normalization¶

```
from IPython.display import Image
from IPython.core.display import HTML
Image(filename = path+"BoxPlt.png", width=800, height=600)
```

The data is not significantly dispersed and the averages of the boxplots are well within 3-4x of each other. There are a few outliers, but they are not too far off the averages. Based on this, neither normalization nor outlier handling is really necessary.
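As a quick cross-check of the boxplot reading, the standard 1.5×IQR whisker rule can be applied to any of the series (toy numbers below, not the actual data):

```python
import pandas as pd

# Toy sales series; one value sits far from the bulk of the data
sales = pd.Series([400, 420, 410, 900, 430, 395, 405])

# Flag points outside the 1.5 * IQR whiskers, as in the boxplot
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outlier(s) out of {len(sales)} observations")
```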

## Sales fluctuations (looking for seasonality)¶

```
Image(filename = path+"Volume.png", width=1000, height=600)
```

**Legend:**

- Promotion: (pink) periods when any promotion was active.
- Volume: (blue) sales volume during the period.
- Volume per price: (orange) (sales volume) x (price for the period).
- Double weighted: (gray) (sales volume) x (price for the period) x MIN(competitor price for the period).

**Interpretation:**

- Price does not appear to be a significant factor when it comes to sales volumes. The sales and the price-weighted sales are almost identical.
- The competitors' minimal price, or the lack thereof, (surprisingly) also does not seem to play a major role in the sales volume. One notable exception is weeks 88-121, where there is a notable inversion in the trend of competitor prices: it does lead to a sales increase, which however is barely noticeable.
- The major factor driving sales fluctuations appears to be the promotions. The positive takeaway here is that they do not seem to be overwhelmingly price-related (or at least only to a very low extent).
- Further analysis of the sales fluctuations needs to split the dataset between periods of promotions and non-promotions.

**Correlations:**

| | Sales | Price | Min Competitor price |
|---|---|---|---|
| Price | -0.71 | | |
| Min Competitor price | -0.06 | 0.16 | |
| Has promotion | 0.53 | -0.63 | 0.03 |

**Interpretation:**

- It is safe to say that sales fluctuations are tightly correlated with the price, which in turn is mainly driven by the promotions.
- Competitors do not appear to be a significant factor. Their price is not correlated with the sales at all, and has barely any correlation with our price. Last but not least, they do not appear to react in any way to our promotions.
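A correlation table of this kind can be produced directly with pandas; a sketch with made-up values and assumed column names:

```python
import pandas as pd

# Made-up weekly values; column names are assumptions, not the dataset's own
df = pd.DataFrame({
    "SALES":         [500, 480, 900, 870, 460],
    "PRICE":         [10.0, 10.0, 8.0, 8.0, 10.0],
    "COMP_MIN":      [9.5, 9.4, 9.6, 9.5, 9.4],
    "HAS_PROMOTION": [0, 0, 1, 1, 0],
})

# Pairwise Pearson correlations, rounded as in the table above
corr = df.corr().round(2)
print(corr)
```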

```
Image(filename = path+"SalesByType.png", width=1000, height=600)
```

**Interpretation:**

- Blue dots = Sales volume during no promotions; Orange dots = Sales volume during promotions.
- The frequent switches between promotion and non-promotion periods make it hard to judge whether we can talk about seasonality or not. Overall, even if seasonality is present, it is very hard to detect and thus of no significant consequence.
- An interesting observation is the sharp jump and subsequent sharp drop after each promotion start. This is worth examining further.

```
Image(filename = path+"SalesByPromotionWeek.png", width=1000, height=600)
```

**Interpretation:**

- Blue dots = Sales volume during the first week of a promotion; Orange dots = Sales volume during the second week of a promotion; Gray dots = Sales volume during week 3+ of a promotion.
- When we further break down the promotions by running week, it quickly becomes apparent that the first week of the promotion is almost always the "star", followed by a bottom in week 2. There is some fluctuation in week 3, but again it is rather on the down side.
- The next question here is: does it make sense to run a promotion past week 1?

```
Image(filename = path+"SalesByPromotionWeek2.png", width=1000, height=600)
```

**Interpretation:**

- Horizontal axis = Running promotion or not; Bar colors = consecutive week of the promotion; Bar height = average sales volume.
- It quickly becomes apparent that week 1 is always the strongest week in any promotion.
- Sales in weeks 3+ for promotion types A and B are almost indistinguishable from having no promotion at all.
- Promotion types C, D and E still see some uplift in weeks 3+.
- All this raises the question: does it financially make sense to run promotions past week 1?
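The week-by-week comparison behind this question can be sketched as a simple groupby (hypothetical columns and toy values):

```python
import pandas as pd

# Toy weekly records; column names are assumptions
weeks = pd.DataFrame({
    "PROMOTION":      ["A", "A", "A", "B", "B", None, None],
    "PROMOTION_WEEK": [1, 2, 3, 1, 2, 0, 0],
    "SALES":          [900, 500, 420, 800, 450, 400, 410],
})

# Average sales per promotion type and running week
avg = weeks.groupby(["PROMOTION", "PROMOTION_WEEK"])["SALES"].mean()

# Baseline: average sales with no promotion running
baseline = weeks.loc[weeks["PROMOTION"].isna(), "SALES"].mean()
uplift = avg - baseline
```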

```
Image(filename = path+"PriceWeightProm.png", width=1000, height=600)
```

**Interpretation:**

- Horizontal axis = Running promotion or not; Bar colors = consecutive week of the promotion; Bar height = average sales volume; Yellow line = average sales volume during non-promotion periods.
- The answer to the above question is: if there is any additional cost to running a promotion apart from the price reduction, it generally does not make sense to run it past week 1.
- Even if there is no additional cost, any form of sales usually has implicit costs, like cost of goods, cost of FTEs, etc. Any reduction in prices generally drives the margins thin, and unless there is a good enough incentive (a lot more sales) it does not make sense to reduce the prices.
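The margin argument can be made concrete with toy numbers (purely illustrative, not from the dataset):

```python
# Toy unit economics: a 20% price cut can wipe out the margin entirely
price, cost, volume = 10.0, 8.0, 100      # margin per unit: 2.0
base_profit = (price - cost) * volume     # profit without a promotion

promo_price = price * 0.8                 # 20% promotional discount
promo_profit = (promo_price - cost) * volume

# Here no amount of extra volume can recover the lost profit,
# since the per-unit margin has dropped to zero
print(base_profit, promo_profit)
```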

**Final observations**

- The majority of the variable engineering efforts need to be concentrated around our own price fluctuations and promotions.
- Competitor activity is rather insignificant, so the variables created which are related to it can be minimal.
- The target variable to be optimized needs to reflect both the sales and the price. Given that the price is driven by promotions, an open topic remains: what is the cost of those promotions? Is it only the reduction in the price? Are there any advertising costs associated with it? We need to know this because the effects wear off rapidly past week 1 of a promotion; in case there is any running cost associated with promotions past week 1, it might make sense not to keep them.

## Variable engineering¶

Based on the analysis and observations above, below is the final list of variables used for the modelling part.

| Variable | Type | Description | Reasoning | Imputation of missing values |
|---|---|---|---|---|
| WSales | Target, continuous | Sales x Price | Our target variable needs to reflect the cost of running a promotion | None |
| PROMOTION_A...PROMOTION_E | Binary | 5 dummy variables for all of the campaigns + nulls | | None |
| HAS_PROMOTION | Binary | Flag if the customer is in any promotion | | None |
| TIME_SINCE_LAST_PROM | Continuous | Number of days between the current promotion and when the previous promotion ended | To track if spacing between promotions plays a role | Fixed value: -1 |
| LAST_PROM_DUR | Continuous | Duration of the last promotion | Monitor the impact of previous promotion duration on current sales | Fixed value: 0 |
| PROMOTION_WEEK | Continuous | Consecutive week in which the promotion is running | | None |
| PROMOTION_MAV5 | Continuous | Moving average of HAS_PROMOTION for the last 5 weeks | Measure the customer's saturation with promotions | Constant: 0, only applicable to first 4 weeks |
| PROMOTION_MAV10 | Continuous | Moving average of HAS_PROMOTION for the last 10 weeks | Measure the customer's saturation with promotions | Constant: 0, only applicable to first 4 weeks, then MAV 5-9 |
| PRICE | Continuous | Original price variable | | None |
| PRICE_CHANGE | Continuous | Current price / previous period price - 1 | One major effect of promotions is the change in price; we want to capture the immediate change here | Constant: 0, only applicable to first week |
| PRICE_SLOPE5 | Continuous | Slope of the price for the last 5 periods | The majority of campaigns run for an average of 3 days. We want to capture the effects of having multiple campaigns close to one another, as opposed to newer campaigns. | Constant: 0, only applicable to first 4 weeks |
| PRICE_SLOPE10 | Continuous | Slope of the price for the last 10 periods | Same as above, just a longer period | Constant: 0, only applicable to first 4 weeks, then MAV 5-9 |
| HAS_COMPETITOR | Binary | Flag if currently any competitor has an active price | | |
| COMPETITOR_MIN | Continuous | MIN(Competitor 1...7 price) | For price-sensitive customers | 5-period forward moving average |
| COMPETITOR_AVG | Continuous | AVG(Competitor 1...7 price) | Median would make more sense but data is too scarce | 5-period forward moving average |
| PRICE_TO_COMP_MIN | Continuous | Our price / MIN(Competitor price) - 1 | Monitor spread to competitors | None |
| PRICE_TO_COMP_AVG | Continuous | Our price / AVG(Competitor price) - 1 | Monitor spread to competitors | None |
| COMP_PRICE_CHANGE_MIN | Continuous | Current competitor min price / previous period competitor min price - 1 | How do competitor discounts affect us | Constant: 0, only applicable to first week |
| COMP_PRICE_CHANGE_AVG | Continuous | Current competitor avg price / previous period competitor avg price - 1 | How do competitor discounts affect us | Constant: 0, only applicable to first week |
| COMP_SLOPE5_MIN | Continuous | Slope of the minimum competitor price for the last 5 periods | Monitor the rate of the competitor price change | Constant: 0, only applicable to first 4 weeks |
| COMP_SLOPE5_AVG | Continuous | Slope of the average competitor price for the last 5 periods | Monitor the rate of the competitor price change | Constant: 0, only applicable to first 4 weeks |
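A sketch of how a few of these variables could be derived with pandas, using a made-up weekly series (the imputation rules here are simplified relative to the table):

```python
import numpy as np
import pandas as pd

# Made-up weekly series standing in for the real dataset
df = pd.DataFrame({
    "PRICE":         [10.0, 10.0, 8.0, 8.0, 10.0, 10.0],
    "HAS_PROMOTION": [0, 0, 1, 1, 0, 0],
})

# PRICE_CHANGE: relative change vs the previous week (0 imputed for week 1)
df["PRICE_CHANGE"] = (df["PRICE"] / df["PRICE"].shift(1) - 1).fillna(0)

# PROMOTION_MAV5: moving average of the promotion flag over the last 5 weeks
df["PROMOTION_MAV5"] = df["HAS_PROMOTION"].rolling(5, min_periods=1).mean()

# PRICE_SLOPE5: slope of a linear fit over the trailing 5 price points
def slope(window):
    return np.polyfit(np.arange(len(window)), window, 1)[0]

df["PRICE_SLOPE5"] = df["PRICE"].rolling(5, min_periods=2).apply(slope, raw=True)
```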

## Final dataset¶

```
df2 = pd.read_excel(path+"final.xlsx")
```

```
df2.info()
```

```
df2.tail(10)
```

```
print(df2.describe())
```

From this summary we can see that HAS_COMPETITOR is 1 for all observations, so we go ahead and drop it.

```
df3 = df2.drop("HAS_COMPETITOR",axis=1)
```

```
df3.shape
```

# Modeling¶

## Overview¶

Our main goal in the modeling part is not necessarily to obtain the "perfect model" for predicting the sales volume, but rather to understand how the different variables interact with each other, in order to best structure future promotions and pricing.

To do so, we need to use models which offer full transparency of the modeling process, as opposed to "black box" model approaches.

**Good candidates:**

Linear regression and its "derivatives" (GLM, Lasso, least squares, polynomial, forward/backward/stepwise)

Decision trees

Gradient boosting
*(Note: Association analysis (market basket) is an interesting candidate, however the data processing for it is substantially different from all other models)*

**"Maybe" candidates:**

Random forest

**Bad candidates:**

Neural networks

Support vector machines

Ensemble models

## Pre-processing¶

We might want to remove highly correlated variables up-front, to simplify later processing.

```
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(14, 11))
corr = df3.corr()
# np.bool is removed in recent NumPy; the built-in bool works here
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
```

It appears we do have some highly correlated variables, so let's look in detail (using a cutoff of 0.75 absolute correlation).

```
corr["Index"] = corr.index
melted = corr.melt(id_vars=["Index"])
melted[(melted["Index"] != melted["variable"]) & (abs(melted["value"]) > 0.75)]
```

COMPETITOR_MIN appears to be quite similar to COMPETITOR_AVG, so here we will drop all variables related to COMPETITOR_AVG (under the assumption that COMPETITOR_MIN makes more business sense).

Also, HAS_PROMOTION is highly correlated with PROMOTION_WEEK (as it should be). We will keep only PROMOTION_WEEK, as it carries more information.

```
df4 = df3.drop(["HAS_PROMOTION","PROMOTION_MAV10","PRICE_TO_COMP_AVG","COMPETITOR_AVG","COMP_PRICE_CHANGE_AVG","COMP_SLOPE5_AVG"],axis=1)
```

```
df4.tail(10)
```

## Training and validation sets¶

Since the dataset is quite small and we don't intend to do any hyperparameter tuning or manual pruning of decision trees, we will generate only train and validation sets (no cross-validation), holding out 20% for validation.

Although it could be controversial, we will append the data to itself multiple times. The reason for this is that with a small dataset, the train/test split can generate huge variance. We will mitigate this by artificially multiplying the data.

```
from sklearn.model_selection import train_test_split
```

```
# Duplicate the dataset twice over (4x the original rows);
# DataFrame.append is deprecated, so use pd.concat instead
for k in range(2):
    df4 = pd.concat([df4, df4])
df4.shape
```

```
target = df4["Wprice"]
features = df4.drop("Wprice",axis=1)
```

```
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
```

## Models¶

### Regression tree¶

```
import pydotplus
from io import StringIO  # sklearn.externals.six was removed in recent scikit-learn
from sklearn import tree
from sklearn.tree import export_graphviz
```

**Comments:** Since our data is quite small, let's go ahead and experiment with the tree settings.

```
max_score = 0
splits = 0
leafs = 0
# Grid search over the tree's minimum split / leaf sizes
for split in range(30, 500, 20):
    for leaf in range(20, 500, 10):
        clf = tree.DecisionTreeRegressor(min_samples_split=split, min_samples_leaf=leaf,
                                         random_state=0, max_depth=4)
        clf = clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        if score > max_score:
            max_score = score
            splits = split
            leafs = leaf
print(f'Best R^2 = {max_score} was achieved at {splits} minimum samples per split and {leafs} minimum samples per leaf')
```

```
clf = tree.DecisionTreeRegressor(min_samples_split=splits, min_samples_leaf=leafs,
random_state=0,max_depth=4)
clf = clf.fit(X_train,y_train)
```

```
# Visualize the fitted tree
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                feature_names=X_train.columns,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png(), width=1000, height=600)
```

**Comments:** The majority of the predictors focus on either the price change, the slope of the price change, or the time since the last promotion.

### Least Squares regression¶

```
import statsmodels.api as sm
from scipy import stats
```

```
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())
```

**Comments:** Let's see if we can drop some of the variables with high p-values in order to obtain a better result.

```
X3 = X2.copy(deep=True)
# Backward elimination: repeatedly drop the least significant variable
while True:
    loop_est = sm.OLS(list(y_train), X3)
    result = loop_est.fit()
    pvals = result.pvalues.sort_values(ascending=False)
    location = 0
    if pvals.index[0] == "const":
        location = 1
    if pvals.iloc[location] < 0.05:
        break
    X3 = X3.drop(pvals.index[location], axis=1)
```

```
est_2 = sm.OLS(y_train, X3)
est2_2 = est_2.fit()
print(est2_2.summary())
```

**Comments:** All the variables are now highly significant, and the overall prediction quality is pretty much the same. It is worth noting that, again, the price change is a strong predictor.

Let's go ahead and try to improve the R^2 by adding polynomial terms, to see if there is an additional meaningful result.

### Polynomial regression¶

```
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
```

```
# create a linear regressor
lin_regressor = LinearRegression()
# pass the order of the polynomial here
poly = PolynomialFeatures(2)
# expand the training features for the linear regression
X_transform = poly.fit_transform(X_train)
# transform (not fit_transform) the test set with the already-fitted expander
X_tst = poly.transform(X_test)
# fit the linear regressor
lin_regressor.fit(X_transform, y_train)
# get the predictions
y_preds = lin_regressor.predict(X_tst)
score = lin_regressor.score(X_transform, y_train)
print(score)
Score2 = lin_regressor.score(X_tst, y_test)
print(Score2)
```

**Comments:** Obviously the train set is currently significantly overfit: we get an R^2 = 1, which is clearly an issue. Let's try to fix that by going stepwise.

```
# Build readable names for the polynomial features ("tuple" shadowed a builtin before)
target_feature_names = ['x'.join(['{}^{}'.format(pair[0], pair[1]) for pair in pairs if pair[1] != 0])
                        for pairs in [zip(X_train.columns, p) for p in poly.powers_]]
output_df = pd.DataFrame(X_transform, columns=target_feature_names)
```

```
no_blank = output_df.drop("",axis=1)
Xtr2 = sm.add_constant(no_blank)
Xtr2.head()
```

```
blank_df = Xtr2["const"]
iterable = Xtr2.drop("const", axis=1)
lastR2 = 0
col_to_add = ""
# Forward stepwise selection: in each pass, add the single term that
# improves R^2 the most, then prune any term whose p-value exceeds 0.05
for m in range(10):
    for k in iterable.columns:
        test_df = pd.concat([blank_df, iterable[k]], axis=1)
        est = sm.OLS(list(y_train), test_df)
        est2 = est.fit()
        if est2.rsquared > lastR2:
            lastR2 = est2.rsquared
            col_to_add = k
        test_df = test_df.drop(k, axis=1)
    blank_df = pd.concat([blank_df, iterable[col_to_add]], axis=1)
    loop_est = sm.OLS(list(y_train), blank_df)
    result = loop_est.fit()
    pvals = result.pvalues.sort_values(ascending=False)
    location = 0
    if pvals.index[0] == "const":
        location = 1
    if pvals.iloc[location] > 0.05:
        blank_df = blank_df.drop(pvals.index[location], axis=1)
est = sm.OLS(list(y_train), blank_df)
est2 = est.fit()
print(est2.summary())
```

**Comments:** We end up with a (relatively) small subset of significant parameters, in a model having an acceptable R^2 value. However, with the polynomial terms we are not sure that all the variables have comparable ranges, so let's go ahead and fix that.

```
descr = blank_df.describe().transpose()
descr["range"] = descr["max"] - descr["min"]  # note: the column key is the string "min"
mean_range = np.mean(descr["range"])
weights = mean_range / descr["range"]
coeffs = pd.concat([result.params, weights], axis=1)
final_weights = (coeffs[0] * coeffs["range"]).sort_values()
print(final_weights)
```

**Comments:** The results of the polynomial stepwise regression are debatable and might need further, in-depth analysis. They suggest that:

- Any price change is a good one.
- Promotion D's success is mainly related to the actions of our competitors.
- Promotions A and E play well when there was no other promotion in the recent past.

# Results interpretation¶

**Summary:** we have explored several approaches to determining the major drivers and effects of the promotional campaigns carried out in the dataset. Exploratory analysis, simple decision trees and various regressions all suggest that promotional activities with short duration and steep price changes are what drives the sales.

# Results application¶

A strategy for sales, based on the results, suggests:
