Team solutions

The SAP Case using KNIME and Multiple Linear Regression Method

3
votes

SAP Case Team Mentors: 

  • Agamemnon Baltagiannis

SAP Case Team:

  • Abderrahim Khalifa                                                                | Morocco
  • Andrei Deusteanu ([email protected]) | Romania
  • Julian Borisov ([email protected])               | Bulgaria
  • Sergey Vichev ([email protected])                        | Bulgaria
  • Slavy Olov ([email protected]                                              | Bulgaria

Used tools: 

  • R, Python, KNIME, MS Excel

Business Understanding

Business Objectives Definition:

  • Sector: FMCG
  • Subject: Price optimisation and promotional effectiveness
  • Goal: analyze the impact of price reduction and promotions on volume of sales by also taking into account competitor’s prices.
  • Objectives:
    • Identify the volume uplift drivers;
    • Measure the promotional effectiveness;
    • Calculate the cannibalisation effect from the main competitors of the target product.
  • Expected output: produce insights and present them in a meaningful way that will help a potential customer make decisions on price strategy or future promotions.

Project Plan

  • Each team member worked in different tools to understand and explore the data, and based on findings one or 2 tools are to be selected in order to answer the business objectives.

Data Understanding

Data description:

  • Initial Dataset: 3 years of weekly data is provided containing 11 features:
    • (1) Week (1-146)
    • (2) Volume of Sales of targeted product
    • (3) Actual Price of targeted product
    • (4-10) Prices of 7 competitors (Competitor1_Price, Competitor2_Price, etc.)
    • (11) Type of promotion of the targeted product (n/a, A, B, C,etc.)

Data Exploration

  • In MS Excel:

Through a basic visualisation of Volume of Sale and Actual Price a dependence may be established between the price decrease and volume uplift. In the same time matters if the week of promotion is second or third consecutive in a row.

Another visualisation of the initial dataset shows that presence of seasonality effect might be rational to analyse in the future model.

  • In Python:
library(DataExplorer)
plot_bar(Data) 
round(100*table(Data$TYPE_OF_PROMOTION)/length(Data$TYPE_OF_PROMOTION),1)

## 
##         A    B    C    D    E 
## 42.5  8.9  9.6 21.2 12.3  5.5 The most frequent promotion type is C. However 42.5% of the data is not associated with a promotion
#week number is not relevant to be looked at in a histogram
plot_histogram(Data[,2:11]) 


 It is possible that price of competitors 5, 6, 7 is likely irrelevant as there are not so many cases of them appearing
plot_boxplot(Data[,c("VOLUME_OF_SALES","TYPE_OF_PROMOTION"),with=F],by="TYPE_OF_PROMOTION")


It looks like having a promotion has on average a higher volume of sales. Let’s check this more specifically.
Data[,hasPromotion:=ifelse(TYPE_OF_PROMOTION=="","0","1")]
plot_boxplot(Data[,c("VOLUME_OF_SALES","hasPromotion"),with=F],by="hasPromotion")




o make sure the hypothesis holds and the volume sales is actually higher when a promotion occurs we’ll do statistical hypothesis testing. Looking at the distributions of the 2 groups which are not normal and are independent of each other (each instance, in this case week appears in only 1 group) is the Mann-Whitney U test (also called the Wilcoxon rank sum test) ((Sheskin) Hanbook of Parametric and Nonparametric Statistical Procedures [3rd ed], p142)
x<-wilcox.test(Data$VOLUME_OF_SALES~Data$hasPromotion)
x$p.valuewilcox.test(Data$VOLUME_OF_SALES~Data$hasPromotion)
1.825476e-12

Given that the p-value is lower than 0.05 we can say that the difference between the 2 groups (with or without promotion) is statiscally significant. As we have seen from the graphs promotions tend to yield higher sales on average.
competitors<-c()
for(i in 1:7){
  competitors<-c(competitors,paste("COMPETITOR",i,"_PRICE",sep=""))
}

Let’s have a look at the relationship between the promotion types and the prices of the competitors

plot_boxplot(Data[,c(competitors,"TYPE_OF_PROMOTION"),with=F],by="TYPE_OF_PROMOTION")

 

It seems that there is some relationship between the promotion type and the price of competitor 1. For the rest it does not appear to be much of a relationship. Let’s look at it in more detail.plot_boxplot(Data[,c("COMPETITOR1_PRICE","hasPromotion"),with=F],by="hasPromotion")

 

It looks that when this business has a promotion competitor 1 also tends to have a lower price as well. Let’s use the same statistical test as above to check if the difference is actually significant

wilcox.test(Data$COMPETITOR1_PRICE~Data$hasPromotion)$p.value

 9.853181e-08

Yes there is a statiscally significant difference. It seems that on average when we have a promotion the price of competitor 1 is 5% lower.

For further exploration let’s check the relation between sales and the rest of the numerical variables.

 

pairs(data=Data,~VOLUME_OF_SALES+ACTUAL_PRICE+COMPETITOR1_PRICE,main="Scatter Plot Matrix",upper.panel=NULL)  It seems that there is a strong inverse relationship between the actual price and the volume of sales - an increase in the actual price is correlated with a decrease in sales. The relationship between the volume of sales and price of competitor 1 is moderate and negative - an increase in the price of competitor 1 is correlated with a decrease in sales (WEEEEIRD??). There is a rather strong relationship between the actual price and price of competitor 1

 

Data Preparation

In Excel

Features / variables were extracted out of the existing data. Mainly, these are changes in volumes and prices over time, between periods and promotions. These variables are used to create the model.

 Variable Name Description
Week original input
VOLUME_OF_SALES original input
ACTUAL_PRICE original input
COMPETITOR1_PRICE original input
COMPETITOR2_PRICE original input
COMPETITOR3_PRICE original input
COMPETITOR4_PRICE original input
COMPETITOR5_PRICE original input
COMPETITOR6_PRICE original input
COMPETITOR7_PRICE original input
TYPE_OF_PROMOTION converted and reordered
change vol difference of sales volume between last 2 weeks (x2 – x1)
change vol% sales volume change divided by previous week’s sales volume (change vol2 / x1)
change price difference of price and previous week’s price (p2 – p1)
change price % price divided by previous week’s price (p2 / p1)
diff1 diff b/w competitor’s price and actual price, if there is value
diff2 diff b/w competitor’s price and actual price, if there is value
diff3 diff b/w competitor’s price and actual price, if there is value
diff4 diff b/w competitor’s price and actual price, if there is value
diff5 diff b/w competitor’s price and actual price, if there is value
diff6 diff b/w competitor’s price and actual price, if there is value
diff7 diff b/w competitor’s price and actual price, if there is value
week of year week of the year period, 1-52
year consecutive yearly period
x-avg diff b/w volume and the mean volume
abs dev absolute diff b/w volume and the mean volume
chng flag if x-avg > 40000
chng med flag if volume deviation is higher than half the median (x-avg > median/2)
year week prom consecutive year if promotion is on
prom week consecutive week of promotion
promotion flag unused
period1-2 period 2 is when the average weeks b/w promotions are more often
avg promo chng average change vol% for current promo and week
avg price chng average change price % for current promo and week
input chng sales volume  divided by previous week’s sales volume (x2 / x1)
calc chng avg promo factor times the volume
avg promo factor average effect of the promotion and week of promotion

 

Modeling

Modelling Technique

We extracted many different variables from the original input, so that we could observe how dynamical changes in resulting data would effect on the sales volume. Those variable showed various importance for changing volume sales. Having them constructed, we moved on with a multiple linear regression, so that we could explore which the significant factors were and at what level would they affect the volumes.

A simple data preparation was done after that by ordering correctly the categories of the few categorical variables, so that the reference category would be a meaningful one.

After that we started testing the analysis output with different variables, until an optimal level of error was achieved.

The final factor selection includes:

ACTUAL_PRICE
COMPETITOR2_PRICE
COMPETITOR3_PRICE
COMPETITOR4_PRICE
COMPETITOR5_PRICE
COMPETITOR6_PRICE
COMPETITOR7_PRICE
change vol
diff2
diff3
diff4
diff5
diff6
diff7
week of year
year
chng=1.0
year week prom
prom week
period1-2=2
avg promo factor
TYPE_OF_PROMOTION (to number)=2
TYPE_OF_PROMOTION (to number)=3
TYPE_OF_PROMOTION (to number)=4
TYPE_OF_PROMOTION (to number)=5
TYPE_OF_PROMOTION (to number)=6

The results are:

The fitting of the predicted values is here: (line graph)

All that was achieved in a simple KNIME workflow:

The coefficients are as follow:

Evaluation

  • Volume uplift drivers

According to the model and analysis made average price of the target product and type of promotion are the main volume uplift drivers.

  • Measure the promotional effectiveness

According to analysis made promotion C has the biggest effect, according both to the p-value of the Multiple Linear Regression (LMR) Model  obtained and the below observation:

  • Cannibalisation effect

In marketing strategy, cannibalization refers to a reduction in sales volume, sales revenue, or market share of one product as a result of the introduction of a new product by the same producer.

According to the p-values in the LMR Model we can justify which Competitors have impact on the sales volumes of the targeted product i.e. cannibalization effect – being Competitor 4 with the strongest such effect.

Share this

5 thoughts on “The SAP Case using KNIME and Multiple Linear Regression Method

  1. 0
    votes

    Hi team Honeybadger,
    Very well performed EDA!
    Nice plots! You have captured the essence of the problem!
    Rsqr of more than 0.9 is excellent!
    Move it forward and make it to the final goal.
    — zenpanik

  2. 2
    votes

    Great great work team! The most difficult part of this challenge is to understand the data, create new features and rerun the predictive models till you achieve a good accuracy.
    As you may mentioned if you run a predictive model with the initial dataset you will get an extremely low modelling accuracy.

    I will vote based on the below criteria:
    1. business understanding
    2. feature engineering
    3. modelling accuracy
    4. insights & final results

    Your approach is on the right path and you managed to create the new features that will help you identify the volume uplift drivers and increase the accuracy of the model. This is a great accuracy for this specific dataset! This comes as a result of your great business understanding of the case and the dataset as well

Leave a Reply