SAP Case Team Mentors:
- Agamemnon Baltagiannis
SAP Case Team:
- Abderrahim Khalifa | Morocco
- Andrei Deusteanu ([email protected]) | Romania
- Julian Borisov ([email protected]) | Bulgaria
- Sergey Vichev ([email protected]) | Bulgaria
- Slavy Olov ([email protected]) | Bulgaria
Used tools:
- R, Python, KNIME, MS Excel
Business Understanding
Business Objectives Definition:
- Sector: FMCG
- Subject: Price optimisation and promotional effectiveness
- Goal: analyze the impact of price reduction and promotions on volume of sales by also taking into account competitor’s prices.
- Objectives:
- Identify the volume uplift drivers;
- Measure the promotional effectiveness;
- Calculate the cannibalisation effect from the main competitors of the target product.
- Expected output: produce insights and present them in a meaningful way that will help a potential customer make decisions on price strategy or future promotions.
Project Plan
- Each team member worked in different tools to understand and explore the data, and based on findings one or 2 tools are to be selected in order to answer the business objectives.
Data Understanding
Data description:
- Initial Dataset: 3 years of weekly data is provided containing 11 features:
- (1) Week (1-146)
- (2) Volume of Sales of targeted product
- (3) Actual Price of targeted product
- (4-10) Prices of 7 competitors (Competitor1_Price, Competitor2_Price, etc.)
- (11) Type of promotion of the targeted product (n/a, A, B, C,etc.)
Data Exploration
- In MS Excel:
Through a basic visualisation of Volume of Sale and Actual Price a dependence may be established between the price decrease and volume uplift. In the same time matters if the week of promotion is second or third consecutive in a row.
Another visualisation of the initial dataset shows that presence of seasonality effect might be rational to analyse in the future model.
- In Python:
library(DataExplorer)
plot_bar(Data)
round(100*table(Data$TYPE_OF_PROMOTION)/length(Data$TYPE_OF_PROMOTION),1)
##
## A B C D E
## 42.5 8.9 9.6 21.2 12.3 5.5
The most frequent promotion type is C. However 42.5% of the data is not associated with a promotion
#week number is not relevant to be looked at in a histogram
plot_histogram(Data[,2:11])
It is possible that price of competitors 5, 6, 7 is likely irrelevant as there are not so many cases of them appearing
plot_boxplot(Data[,c("VOLUME_OF_SALES","TYPE_OF_PROMOTION"),with=F],by="TYPE_OF_PROMOTION")
It looks like having a promotion has on average a higher volume of sales. Let’s check this more specifically.
Data[,hasPromotion:=ifelse(TYPE_OF_PROMOTION=="","0","1")]
plot_boxplot(Data[,c("VOLUME_OF_SALES","hasPromotion"),with=F],by="hasPromotion")
o make sure the hypothesis holds and the volume sales is actually higher when a promotion occurs we’ll do statistical hypothesis testing. Looking at the distributions of the 2 groups which are not normal and are independent of each other (each instance, in this case week appears in only 1 group) is the Mann-Whitney U test (also called the Wilcoxon rank sum test) ((Sheskin) Hanbook of Parametric and Nonparametric Statistical Procedures [3rd ed], p142)
x<-wilcox.test(Data$VOLUME_OF_SALES~Data$hasPromotion)
x$p.valuewilcox.test(Data$VOLUME_OF_SALES~Data$hasPromotion)
1.825476e-12
Given that the p-value is lower than 0.05 we can say that the difference between the 2 groups (with or without promotion) is statiscally significant. As we have seen from the graphs promotions tend to yield higher sales on average.
competitors<-c()
for(i in 1:7){
competitors<-c(competitors,paste("COMPETITOR",i,"_PRICE",sep=""))
}
Let’s have a look at the relationship between the promotion types and the prices of the competitors
plot_boxplot(Data[,c(competitors,"TYPE_OF_PROMOTION"),with=F],by="TYPE_OF_PROMOTION")
It seems that there is some relationship between the promotion type and the price of competitor 1. For the rest it does not appear to be much of a relationship. Let’s look at it in more detail.plot_boxplot(Data[,c("COMPETITOR1_PRICE","hasPromotion"),with=F],by="hasPromotion")
It looks that when this business has a promotion competitor 1 also tends to have a lower price as well. Let’s use the same statistical test as above to check if the difference is actually significant
wilcox.test(Data$COMPETITOR1_PRICE~Data$hasPromotion)$p.value
9.853181e-08
Yes there is a statiscally significant difference. It seems that on average when we have a promotion the price of competitor 1 is 5% lower.
For further exploration let’s check the relation between sales and the rest of the numerical variables.
pairs(data=Data,~VOLUME_OF_SALES+ACTUAL_PRICE+COMPETITOR1_PRICE,main="Scatter Plot Matrix",upper.panel=NULL)
It seems that there is a strong inverse relationship between the actual price and the volume of sales - an increase in the actual price is correlated with a decrease in sales. The relationship between the volume of sales and price of competitor 1 is moderate and negative - an increase in the price of competitor 1 is correlated with a decrease in sales (WEEEEIRD??). There is a rather strong relationship between the actual price and price of competitor 1
Data Preparation
In Excel
Features / variables were extracted out of the existing data. Mainly, these are changes in volumes and prices over time, between periods and promotions. These variables are used to create the model.
Variable Name | Description |
Week | original input |
VOLUME_OF_SALES | original input |
ACTUAL_PRICE | original input |
COMPETITOR1_PRICE | original input |
COMPETITOR2_PRICE | original input |
COMPETITOR3_PRICE | original input |
COMPETITOR4_PRICE | original input |
COMPETITOR5_PRICE | original input |
COMPETITOR6_PRICE | original input |
COMPETITOR7_PRICE | original input |
TYPE_OF_PROMOTION | converted and reordered |
change vol | difference of sales volume between last 2 weeks (x2 – x1) |
change vol% | sales volume change divided by previous week’s sales volume (change vol2 / x1) |
change price | difference of price and previous week’s price (p2 – p1) |
change price % | price divided by previous week’s price (p2 / p1) |
diff1 | diff b/w competitor’s price and actual price, if there is value |
diff2 | diff b/w competitor’s price and actual price, if there is value |
diff3 | diff b/w competitor’s price and actual price, if there is value |
diff4 | diff b/w competitor’s price and actual price, if there is value |
diff5 | diff b/w competitor’s price and actual price, if there is value |
diff6 | diff b/w competitor’s price and actual price, if there is value |
diff7 | diff b/w competitor’s price and actual price, if there is value |
week of year | week of the year period, 1-52 |
year | consecutive yearly period |
x-avg | diff b/w volume and the mean volume |
abs dev | absolute diff b/w volume and the mean volume |
chng | flag if x-avg > 40000 |
chng med | flag if volume deviation is higher than half the median (x-avg > median/2) |
year week prom | consecutive year if promotion is on |
prom week | consecutive week of promotion |
promotion flag | unused |
period1-2 | period 2 is when the average weeks b/w promotions are more often |
avg promo chng | average change vol% for current promo and week |
avg price chng | average change price % for current promo and week |
input chng | sales volume divided by previous week’s sales volume (x2 / x1) |
calc chng | avg promo factor times the volume |
avg promo factor | average effect of the promotion and week of promotion |
Modeling
Modelling Technique
We extracted many different variables from the original input, so that we could observe how dynamical changes in resulting data would effect on the sales volume. Those variable showed various importance for changing volume sales. Having them constructed, we moved on with a multiple linear regression, so that we could explore which the significant factors were and at what level would they affect the volumes.
A simple data preparation was done after that by ordering correctly the categories of the few categorical variables, so that the reference category would be a meaningful one.
After that we started testing the analysis output with different variables, until an optimal level of error was achieved.
The final factor selection includes:
ACTUAL_PRICE
COMPETITOR2_PRICE
COMPETITOR3_PRICE
COMPETITOR4_PRICE
COMPETITOR5_PRICE
COMPETITOR6_PRICE
COMPETITOR7_PRICE
change vol
diff2
diff3
diff4
diff5
diff6
diff7
week of year
year
chng=1.0
year week prom
prom week
period1-2=2
avg promo factor
TYPE_OF_PROMOTION (to number)=2
TYPE_OF_PROMOTION (to number)=3
TYPE_OF_PROMOTION (to number)=4
TYPE_OF_PROMOTION (to number)=5
TYPE_OF_PROMOTION (to number)=6
The results are:
The fitting of the predicted values is here: (line graph)
All that was achieved in a simple KNIME workflow:
The coefficients are as follow:
Evaluation
- Volume uplift drivers
According to the model and analysis made average price of the target product and type of promotion are the main volume uplift drivers.
- Measure the promotional effectiveness
According to analysis made promotion C has the biggest effect, according both to the p-value of the Multiple Linear Regression (LMR) Model obtained and the below observation:
- Cannibalisation effect
In marketing strategy, cannibalization refers to a reduction in sales volume, sales revenue, or market share of one product as a result of the introduction of a new product by the same producer.
According to the p-values in the LMR Model we can justify which Competitors have impact on the sales volumes of the targeted product i.e. cannibalization effect – being Competitor 4 with the strongest such effect.
5 thoughts on “The SAP Case using KNIME and Multiple Linear Regression Method”
Hi Team,
Very good data understanding. Let’s take this forward to answer some of the specific questions as part of the this case study.
-Nish
Thank you Nish, we are now working on describing the model. Got some good results there too.
drceenish, thank you! We really wanted to make sure we understood the business data correctly. That’s a solid basis for everything after that.
Hi team Honeybadger,
Very well performed EDA!
Nice plots! You have captured the essence of the problem!
Rsqr of more than 0.9 is excellent!
Move it forward and make it to the final goal.
— zenpanik
Great great work team! The most difficult part of this challenge is to understand the data, create new features and rerun the predictive models till you achieve a good accuracy.
As you may mentioned if you run a predictive model with the initial dataset you will get an extremely low modelling accuracy.
I will vote based on the below criteria:
1. business understanding
2. feature engineering
3. modelling accuracy
4. insights & final results
Your approach is on the right path and you managed to create the new features that will help you identify the volume uplift drivers and increase the accuracy of the model. This is a great accuracy for this specific dataset! This comes as a result of your great business understanding of the case and the dataset as well