Predicting Household Budgets
Authors: SoRd1, Jack, pr0faka, Kolio
Team: Pigeons
Statistics is the painful elaboration of the obvious.
Hello everyone :) We all hope that you had a great time during the Datathon, because we did.
We are working on the case from NSI - to predict the household expenditures per group for the years in which the survey is not conducted.
The project objective is to produce a working decision model which can accurately predict future household expenditure by group, based on the data gathered in previous years. The benefits of this are optimizing both the timing and the costs of carrying out the survey. The downside of the approach is the possibility of making inaccurate predictions. To mitigate this, we base the model on as much historical data as possible and also account for additional factors like population growth/decline, inflation, employment, etc.
We have data for the following:
- Total expenditure average per capita by group (COICOP) and by quarters – 2010-2017
- Monetary income by source of income – 2010-2017
- Population by five-year age group and sex – 2010-2017
- Population, labour force, employed, unemployed and persons not in the labour force, aged 15 years and over, by level of education – 2010-2017 (thousands)
- Annual average wages and salaries of the employees under labour contract
The data is a bit messy, so we will need some preprocessing to get it into a shape we can work with.
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import itertools
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tsa.arima_model import ARIMA
from sklearn import linear_model, kernel_ridge, svm, neighbors, tree, ensemble
from sklearn.preprocessing import MinMaxScaler
path = os.getcwd()
delim = '/'
df_year = pd.DataFrame()
df_quarter = pd.DataFrame()
years = [i for i in range(2010,2018)]
######################## load, merge and explore the data
#extracting data that might be interesting/useful from the first excel file
df = pd.read_excel(path + delim + 'SRZ_GBS i obrazov_LFS.XLSX')
df.columns = ['timeframe','education', 'population', 'labour_force', 'employed', 'unemployed', 'not_in_labour_force']
df = df.drop(df.index[[0,1]])
df['timeframe'] = [i for i in range(2010,2018) for j in range(7)]
df_tmp = df[df['education'].str.contains('Total')]
df_tmp = df_tmp.set_index(df_tmp['timeframe'])
df_year = df_tmp[['population','labour_force']]
df_year['percent_employed'] = df_tmp['employed'] / df_tmp['labour_force'] * 100
df_year = df_year.set_index(df_tmp['timeframe'])
df_tmp = df[df['education'].str.contains('Higher')].set_index(df_tmp['timeframe'])
df_year['unempl_higher_edu'] = df_tmp['unemployed'] / df_tmp['labour_force'] * 100
df_tmp = df[df['education'].str.contains('Lower')].set_index(df_tmp['timeframe'])
df_year['unempl_lower_edu'] = df_tmp['unemployed'] / df_tmp['labour_force'] * 100
#adding the data from the second sheet as well
df = pd.read_excel(path + delim + 'SRZ_GBS i obrazov_LFS.XLSX','STZ', header=2)
df.columns = years
df = df.dropna(axis = 0).T
df_year['avg_annual_income'] = df['Total']
#loading the data from the second excel file - it's merged into the first dataframe
df = pd.read_excel(path + delim + 'Pensions 2011-2017.xlsx')
df = df.dropna(axis='rows')[1:]
df.columns = ['timeframe', 'num_pensioners', 'num_pensions', 'avg_pension']
df_quarter = df[df['timeframe'].str.contains('average')]
df_quarter['timeframe'] = ['%d_q%d' % (x,y+1) for x in range(2011,2018) for y in range(4)]
df_quarter = df_quarter.set_index('timeframe')
df_tmp = df[df['timeframe'].str.contains('annual')]
df_tmp['timeframe'] = [i for i in range(2011,2018)]
df_tmp = df_tmp.set_index('timeframe')
df_year = df_year.join(df_tmp)
df = pd.read_excel(path + delim + 'nas2010_2017.xls')
df.columns = ['age'] + ['%d_%s' % (x,y) for x in range(2010,2018) for y in ['total','male','female']]
df = df.drop([0,1,2])
df = df.set_index('age').T
df = df[df.columns].astype(float)
df_tmp = df[df.index.str.contains('_total')]
df_tmp.rename(columns={'Total':'population_total'}, inplace=True)
df_tmp = df_tmp.set_index(pd.Index(years))
df_year = df_year.join(df_tmp)
df_year['male_female_ratio'] = np.array(df['Total'][df.index.str.contains('_male')]) / np.array(df['Total'][df.index.str.contains('_female')])
df_year = df_year[df_year.columns].astype(float)
df = pd.read_excel(path + delim + 'BG_HBS_Monetary income_Q_2010-2017.xlsx')
df = df.T
df.columns = ['year','quarter'] + [x.lower().replace(' ','_') for x in df.iloc[1].values[2:]]
df = df.drop(df.index[[0,1]])
df = df.set_index(pd.Index(['%d_q%d' % (x,y) for x in range(2010,2018) for y in range(1,5)]))
df = df.drop(['year', 'quarter'], axis = 1)
df_quarter = df.join(df_quarter)
def to_years(df_src, ix_vals):
    # sum the quarterly rows whose index contains the given year into one yearly row
    df_target = pd.DataFrame()
    for ix in ix_vals:
        df_target[ix] = df_src[df_src.index.str.contains(str(ix))].sum()
    return df_target.T
tmp_df = to_years(df,years)
df_year = pd.merge(df_year, tmp_df, left_index=True, right_index=True)
df_year.head()
Now that we have the data aggregated by year, we can make some plots to see if we can find anything interesting.
plt.plot(df_year.population,df_year.labour_force)
plt.title("Population vs labour force")
plt.xlabel('Population')
plt.ylabel('Labour force')
plt.show()
The relationship looks a bit odd at first, but overall the labour force grows together with the population.
plt.plot(df_year.index, df_year.percent_employed)
plt.title('Percentage employed by years')
plt.ylabel('Percentage employed')
plt.xlabel('Years')
plt.show()
We can see a growth trend, so that's a good sign.
plt.plot(df_year.index, df_year.wages_and_salaries/4)
plt.plot(df_year.index, df_year.pensions/4)
plt.title('Yearly average incomes')
plt.xlabel('Years')
plt.ylabel('BGN')
plt.legend(['Wages','Pensions'])
plt.show()
Initially we looked at some basic indicators like wages & salaries, pensions and employment. We discovered a positive correlation between the population numbers and the labour force, and an initial decline in employment between 2010 and 2013 followed by a steady increase since. We also noticed that both major sources of income – salaries and pensions – generally increase each year, with pensions always rising considerably less, which widens the gap between them.
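To put a number on the correlation claim above, a quick check can be run on the yearly frame (a minimal sketch; population and labour_force are the columns built earlier):
# Pearson correlation between population and labour force in the yearly data
print(df_year['population'].corr(df_year['labour_force']))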
But since the yearly data contains only a handful of observations, we will make our predictions on a quarterly basis.
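The difference in sample size is easy to verify:
# 8 yearly rows vs 32 quarterly rows
print(df_year.shape[0], df_quarter.shape[0])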
df_main = pd.read_excel(path + delim + 'BG_HBS_COICOP-4_Q_2010_2017.xlsx')
df_main = df_main.dropna(axis=0)
categories = [x for x in df_main.index.values if pd.notna(x) and len(x) == 2]
df_main = df_main.loc[categories]
df_main.columns = ['category']+['%d_q%d' % (x,y+1) for x in range(2010,2018) for y in range(4)]
df_main['category'] = df_main['category'].apply(lambda x: x.strip().lower().replace(" ","_"))
df_main.set_index('category', inplace= True)
# target: the total expenditure one quarter ahead (shift the quarterly totals back by one)
df_quarter['total_expenditure_1frame_ahead'] = df_main.sum().shift(-1)
df_main = df_main.T
df_quarter = pd.merge(df_quarter, df_main, left_index=True, right_index=True)
df_quarter = df_quarter[df_quarter.columns].astype(float)
df_quarter.head()
df_quarter.describe()
First we will try an ARIMA model for time-series forecasting. We use order=(0,1,0), i.e. first-order differencing with no AR or MA terms, which amounts to a random-walk forecast: each prediction is simply the last observed value, and it serves as a baseline.
df_for_arima = df_quarter[['food_and_non-alcoholic_beverages', 'alcoholic_beverages_and_tobacco',
                           'clothing_and_garments',
                           'housing,_water,_electricity,_gas_and_other_fuels',
                           'furnishings,_household_equipment_and_routine_maintenance_of_the_house',
                           'health', 'transport', 'communication', 'recreation_and_culture',
                           'education', 'restaurants_and_hotels',
                           'miscellaneous_goods_and_services']]
di = {
    '_q1': '-3',
    '_q2': '-6',
    '_q3': '-9',
    '_q4': '-12'
}
# convert the 'YYYY_qN' labels to end-of-quarter months and use a datetime index
new_index = [x[:4] + di[x[4:]] for x in df_for_arima.index]
df_for_arima.index = pd.to_datetime(new_index)
for col in df_for_arima:
    X = df_for_arima[col].values
    size = int(len(X) * 0.90)
    train, test = X[0:size], X[size:len(X)]
    history = [x for x in train]
    predictions = list()
    print(col)
    # walk-forward validation: refit on everything seen so far, forecast one step ahead
    for t in range(len(test)):
        model = ARIMA(history, order=(0,1,0))
        model_fit = model.fit(disp=0)
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        obs = test[t]
        history.append(obs)
        print('predicted={}, expected={}'.format(yhat, obs))
    print('')
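The loop above only prints predicted/expected pairs; to judge the forecasts with a single number, one could compute the RMSE of the walk-forward predictions. A minimal sketch using sklearn's mean_squared_error (as written it scores whichever category was processed last, so in practice it belongs at the end of the outer loop body):
from math import sqrt
from sklearn.metrics import mean_squared_error
# root mean squared error of the one-step-ahead forecasts for the current category
rmse = sqrt(mean_squared_error(test, predictions))
print('%s test RMSE: %.3f' % (col, rmse))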
And then we will try some of the standard ML models.
df_quarter = df_quarter.dropna()
# compute the split point after dropping rows with a missing target
train_test_ratio = int(df_quarter.shape[0] * 0.70)
features = df_quarter.drop('total_expenditure_1frame_ahead', axis=1)
labels = df_quarter['total_expenditure_1frame_ahead']
# scale every feature to [0, 1]; note the scaler is fit on the full set, before the split
features = pd.DataFrame(MinMaxScaler().fit_transform(features), index=features.index, columns=features.columns)
train_set = features[:train_test_ratio]
Y_train = labels[:train_test_ratio]
test_set = features[train_test_ratio:]
Y_test = labels[train_test_ratio:]
models = [
    linear_model.BayesianRidge(),
    linear_model.Lasso(alpha=0.01),
    linear_model.Ridge(alpha=0.9),
    kernel_ridge.KernelRidge(alpha=1.0),
    svm.SVR(),
    neighbors.KNeighborsRegressor(n_neighbors=2),
    tree.DecisionTreeRegressor(),
    linear_model.LinearRegression(),
    ensemble.GradientBoostingRegressor(),
]

def run_models(reg_list, x_train, y_train, x_test, y_test):
    # fit each regressor on the training set and report its R^2 score on the test set
    for reg in reg_list:
        reg.fit(x_train, y_train)
        print(reg.predict(x_test))
        print(type(reg).__name__, reg.score(x_test, y_test))
        print('\n')

run_models(models, train_set, Y_train, test_set, Y_test)
So what we see from evaluating our models is that none of them works properly. The next thing we will try is to perform a grid search to optimize the models' hyperparameters, together with feature selection. But this is everything we managed in the time we had. We hope you enjoyed it :D
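As an illustration of the grid search we have in mind, a minimal sketch for one of the models could look like this (the alpha grid is illustrative, not tuned; TimeSeriesSplit keeps the folds in time order, which matters for quarterly data):
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
# search over the regularization strength of the Ridge model
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(linear_model.Ridge(), param_grid, cv=TimeSeriesSplit(n_splits=3))
search.fit(train_set, Y_train)
print(search.best_params_, search.best_score_)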