# Datathon NSI Solution – The curious case of ‘Household Budget Survey (HBS)’

The National Statistical Institute of Bulgaria (NSI) conducts an annual Household Budget Survey (HBS) to obtain reliable, scientifically founded data on the income, expenditure, consumption and other elements of the population's living standard, as well as the changes that have occurred over the years. To reduce the cost of carrying out the survey, NSI is considering changing its periodicity from yearly to once every five years. We are therefore building a model that predicts household expenditure for the next four years using a linear regression model and time series. The algorithms we rely on are linear regression and the autoregressive integrated moving average (ARIMA) model. So let's not waste any time and get on with it!

### The Curious Case of the ‘Household Budget Survey’

#### STEP I : Understanding the data.

The Household Budget Survey is a sample survey based on a two-stage cluster design. The general population from which the sample is drawn comprises all households in the country; institutional households are not covered. The unit of observation is each randomly chosen ordinary household, irrespective of the number of members and their material and personal status. Since 2010 the sample size has been 3,060 households per quarter, split into three sub-samples of 1,020 households each. Each sub-sample is monitored for one month of each quarter, following a rotation sample model, so each household in the sample participates in the survey for four months during a twelve-month period. The data provided covers 2010-2017.

Datasets provided :

• Average total expenditure per capita by COICOP group.
• Monetary income by source of income.
• Population by five-year age group and sex.
• Annual average wages and salaries of employees under labor contract.
• Population and labor force counts (employed and unemployed).

The following R packages were used:

• dplyr
• ggplot2
• psych
• tidyr
• plotly
• forecast
• tseries
• lubridate

The algorithms used are:

Approach 1:

• Linear regression: explains the relationship between one dependent variable and one or more independent variables.
• Time series linear model: fits a linear model with multiple independent variables to time series data.
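To make the linear regression idea concrete, here is a minimal sketch in plain Python (the analysis itself relied on the R packages listed above). It fits a one-predictor least-squares line. The function name `fit_line` and the income and expenditure figures are invented for illustration and are not NSI data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variance terms (unnormalised; the 1/n factors cancel).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical per-capita income and expenditure (not survey data).
income = [1000.0, 1100.0, 1200.0, 1300.0, 1400.0]
expenditure = [900.0, 980.0, 1060.0, 1140.0, 1220.0]

intercept, slope = fit_line(income, expenditure)
print(f"expenditure = {intercept:.1f} + {slope:.2f} * income")
```

The slope is the estimated change in expenditure per unit of income, which is the kind of relationship the exploratory analysis looks for.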

Approach 2:

• Time series linear model: fits a linear model with multiple independent variables to time series data.

Approach 3:

• Time series: a sequence of numerical data points in successive order. A time series tracks the movement of the chosen data points, such as expenditure, over a specified period of time, with data points recorded at regular intervals.

#### STEP III: Exploratory Data Analysis (EDA)

We used linear regression to find significant relationships between expenditure and other parameters such as income, age, sex, population and pension. We also used various charts to dig deeper into the data.

Some main observations are as follows:

• A person's income strongly affects expenditure.
• The number of persons not in the labor force affects expenditure significantly.
• Expenditure does not depend strongly on unemployment benefits.
• People spend twice as much money on food and non-alcoholic beverages as they do on housing.
• On average, citizens earn almost 60% more than they did in 2010, yet expenditure has not increased proportionally.
• There is no significant rise in earnings unless you are an employee under labor contract.

We used three different approaches to predict household expenditure.

Approach I: Using time series linear model

The provided data is quarterly, but the predictions are expected on a yearly basis. Hence the first step is to transform the data into years.
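The quarter-to-year transformation can be sketched as follows. The article does not say how the quarters were combined, so this example assumes a sum by default and offers a mean as an alternative (which suits averages or rates better). The function name `quarterly_to_yearly` and the sample records are hypothetical.

```python
from collections import defaultdict


def quarterly_to_yearly(records, how="sum"):
    """Aggregate (year, quarter, value) records to one value per year.

    how="sum" adds the four quarters; how="mean" averages them.
    """
    by_year = defaultdict(list)
    for year, _quarter, value in records:
        by_year[year].append(value)

    yearly = {}
    for year, values in sorted(by_year.items()):
        if len(values) != 4:
            raise ValueError(f"year {year} has {len(values)} quarters, expected 4")
        total = sum(values)
        yearly[year] = total if how == "sum" else total / 4
    return yearly


# Made-up quarterly values, for illustration only.
records = [
    (2010, 1, 10.0), (2010, 2, 11.0), (2010, 3, 12.0), (2010, 4, 13.0),
    (2011, 1, 20.0), (2011, 2, 20.0), (2011, 3, 20.0), (2011, 4, 20.0),
]
yearly_sum = quarterly_to_yearly(records)
yearly_mean = quarterly_to_yearly(records, how="mean")
print(yearly_sum, yearly_mean)
```

Raising an error on incomplete years avoids silently comparing a partial year with a full one.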

1: Collected all the independent columns that strongly affect total expenditure (Consumer Expenditure, Non-Consumer Expenditure) into one dataset.

2: Divided the data into train and test sets.

3: Trained a time series linear model on the significant columns identified, in order to predict total expenditure.

4: Then used the test data to produce forecasts.

[Note: Given values for the significant independent columns, we can predict the expenditure for future years.]
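The steps above can be sketched in plain Python as a multiple linear regression with a 2010-2015 train set and a 2016-2017 test set. This is an illustration under stated assumptions, not the code behind the article: the two predictors, their values, the noise-free synthetic target and the helper names `solve`, `fit_ols` and `predict` are all hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, rows):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]


# Hypothetical yearly predictors in thousand BGN (not NSI figures).
years = list(range(2010, 2018))
wages = [4.0, 4.2, 4.3, 4.6, 4.8, 5.1, 5.3, 5.7]
pensions = [1.50, 1.52, 1.60, 1.61, 1.70, 1.72, 1.80, 1.85]
# Synthetic, noise-free target so the example is easy to check by hand.
expenditure = [0.5 + 0.5 * w + 0.2 * p for w, p in zip(wages, pensions)]

features = list(zip(wages, pensions))
train_idx = [i for i, yr in enumerate(years) if yr <= 2015]
test_idx = [i for i, yr in enumerate(years) if yr >= 2016]

coefs = fit_ols([features[i] for i in train_idx],
                [expenditure[i] for i in train_idx])
test_pred = predict(coefs, [features[i] for i in test_idx])
test_true = [expenditure[i] for i in test_idx]

# Mean absolute error on the held-out years: predicted vs. actual.
mae = sum(abs(p - t) for p, t in zip(test_pred, test_true)) / len(test_true)
print("coefficients:", coefs, "test MAE:", mae)
```

With real survey data the target would contain noise, so the held-out error would be the number to report alongside the prediction graphs.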

#### Code and graphs for prediction on a yearly basis

Train Dataset: 2010-2015

Test Dataset: 2016-2017

Predicted Features:

• Total Expenditure
• Consumer Expenditure
• Non-Consumer Expenditure

Significant Features:

• Average pension of one pensioner-BGN
• Wages and salaries
• Self-employment income
• Pensions
• Other social benefits
• Regular transfers from other households
• Persons not in the labour force
• Annual Average wages and salaries of the employees under labour contract

#### Approach 2: Prediction using a time series linear model on a quarterly basis

Below are the graphs of our predictions on the train and test data (train data: 2010-2015; test data: 2016-2017), together with our observations.

1. The following graph represents total expenditure against yearly quarters.
a. Q1 (Quarter 1) of every year shows lower expenditure than Q4 (Quarter 4), which indicates seasonality.
b. This is because Christmas and New Year's Eve fall in Q4, when people celebrate on a larger scale.

Train Dataset: 2010-2015

Test Dataset: 2016-2017

Predicted Features:

• Total Expenditure
• Consumer Expenditure
• Non-Consumer Expenditure
• Food and non-alcoholic beverages
• Housing, water, electricity, gas and other fuels
• Taxes and social insurance contributions

Significant Features:

• Average pension of one pensioner-BGN
• Wages and salaries
• Self-employment income
• Pensions
• Other social benefits
• Regular transfers from other households

2. The following graph represents total consumer expenditure against yearly quarters.
a. It shows a consistent pattern: expenditure is always higher in the fourth quarter of the year than in the first three quarters.
b. Expenditure in the third quarter is comparatively high, which we attribute to Sofia Restaurant Week in September, since expenditure on food and non-alcoholic beverages is higher than on alcoholic beverages.

3. The following graph represents total non-consumer expenditure against yearly quarters.
a. Non-consumer expenditure includes items such as taxes and social insurance contributions, which grow linearly and proportionally.
b. It also shows that expenditure on taxes and social insurance has not increased much.

4. The following graph represents total expenditure on food and non-alcoholic beverages against yearly quarters.
a. Expenditure on food and non-alcoholic beverages increases gradually from Q1 (Quarter 1) to Q4 (Quarter 4).
b. Expenditure on alcoholic beverages is lower than on food and non-alcoholic beverages.

6. The following graph represents total taxes expenditure against yearly quarters.
a. The graph shows a linear trend: the government increases taxes every year, which also contributes to non-consumer expenditure.
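The quarterly pattern described in these observations (Q4 above Q1) can be checked with simple seasonal indices: the mean of each quarter divided by the overall mean. This is a basic diagnostic, not the model used in the article, and it does not separate trend from seasonality. The function name `seasonal_indices` and the quarterly values are invented for illustration.

```python
def seasonal_indices(values, period=4):
    """Mean of each season divided by the overall mean (1.0 = average level)."""
    if len(values) % period != 0:
        raise ValueError("series must cover whole years")
    overall = sum(values) / len(values)
    indices = {}
    for q in range(period):
        season = values[q::period]  # every value belonging to quarter q + 1
        indices[q + 1] = (sum(season) / len(season)) / overall
    return indices


# Two made-up years of quarterly expenditure, Q1..Q4 each (not survey data).
quarterly = [100.0, 105.0, 110.0, 125.0, 104.0, 109.0, 114.0, 129.0]

idx = seasonal_indices(quarterly)
for quarter, value in idx.items():
    print(f"Q{quarter}: {value:.3f}")
```

An index above 1.0 marks a quarter that runs above the average level; in this toy series Q4 is the highest and Q1 the lowest.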

Approach 3: Prediction using AUTO ARIMA.

We used auto-ARIMA to forecast household expenditure for the next four years, based on the historical household data from 2010 onwards.
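As a rough illustration of what an ARIMA forecast does, the sketch below hard-codes one simple order, ARIMA(1,1,0): it differences the series once, fits an AR(1) model with intercept to the differences by least squares, and then accumulates the forecast differences. This is a deliberately simplified stand-in. An auto-ARIMA procedure searches over candidate orders and normally estimates parameters by maximum likelihood, which this code does not do. The function names and the eight yearly values are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of series[t] = c + phi * series[t - 1]."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_arima_110(history, steps):
    """Forecast `steps` values ahead with a fixed ARIMA(1,1,0) structure."""
    # d = 1: model the year-on-year changes instead of the levels.
    diffs = [b - a for a, b in zip(history, history[1:])]
    c, phi = fit_ar1(diffs)

    level = history[-1]
    last_diff = diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # next predicted change
        level += last_diff               # integrate back to the level
        forecasts.append(level)
    return forecasts


# Eight synthetic yearly values standing in for 2010-2017 (not NSI data).
history = [100.0, 110.0, 117.0, 122.5, 127.25, 131.625, 135.8125, 139.90625]

next_four = forecast_arima_110(history, steps=4)
print(next_four)
```

Forecasting four steps ahead mirrors the task here: one surveyed year would be followed by four predicted ones.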

#### STEP V: Conclusion

We recommend approaches one and two, both based on the time series linear model, which predicts expenditure from the few significant factors listed in the approach descriptions above.

Thank you. 🙂

#### 7 thoughts on “Datathon NSI Solution – The curious case of ‘Household Budget Survey(HBS)’”

1. laura says:

In the case of the linear regression model that you described: can you make it more clear, which were the observations in the train set, which were the observations in the test set? Which were the features and which were the predicted variables?
Can you also comment on the main difference between using a classical linear Regression and the ARIMA model? Which is more appropriate?

2. svetro says:

Thank you for working on the NSI case!
You are saying that the TS linear model is better, but for example in the “food and non-alcoholic expenditures” graph a loss of seasonality can be seen, which can be crucial when calculating consumer price indices. It would be better if you provided predicted vs expected values for some kind of error estimation. Otherwise the article is readable, friendly and shows dedication and understanding of a certain level of the subject.

1. omki says:

We did consider seasonality in the data and also observed it.
However, we tried to incorporate as many variables as possible in the model to get as good a fit as possible. As time was short, we could not deal with the seasonality effectively. We will work on this further to get a better model.
We looked at the R-squared value, which seemed good when comparing the original and predicted values. Because of time constraints we omitted presenting it, although those values were plotted in the graph. We will update our article further.

3. apoorvakesarwani says:

EDA: We used linear regression model to identify the most significant factors on which the change in expenditure is dependent.
Model: We used time series linear regression model to predict the household expenditure based on the above identified significant factors. Made model on both yearly basis and quarterly basis and predicted.
Used Auto Arima as another approach to make predictions.
Details of Training data, test data, predicted variables and features have been mentioned in the article now.
In Linear model we identify the linearity between the variables and in ARIMA model we use historical data to predict the future values.
