%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from lightgbm import LGBMClassifier
import xgboost as xgb
from statsmodels.stats.outliers_influence import variance_inflation_factor
from itertools import cycle
import random as r
Read the data¶
We are using the financial distress data of companies. This is part of a Kaggle dataset which can be viewed in the following link: https://www.kaggle.com/shebrahimi/financial-distress
data = pd.read_csv('data/Financial Distress.csv')
data.shape
There are 3672 rows and 86 columns in the dataset. The names of the columns can be viewed below. Unfortunately, most of the columns do not have descriptive names. They contain variables of both financial and non-financial nature, and the lack of descriptive names hinders our analysis because we cannot apply business logic to them.
data.columns
data.info()
As the output shows, the columns are mostly float64, with a few int64 columns that likely represent categorical variables.
EDA (Exploratory Data Analysis)¶
We have a total of 422 companies in the dataset.
data['Company'].unique()
data['Time'].unique()
There are 14 periods in total, although not all companies are present until the end. Below is the histogram of the Financial Distress variable; there is clearly an outlier.
plt.figure(figsize=(18,10))
sns.distplot(data['Financial Distress'],kde=False, rug=True, color = 'blue');
plt.title('Financial Distress Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
data['Financial Distress'].describe()
The descriptive statistics, as well as the boxplot below, confirm this. It is possible that the outlier is a flawed value.
plt.figure(figsize=(18,10))
sns.boxplot(x=data['Financial Distress'], color = 'green')
plt.title('Financial Distress Boxplot')
plt.xlabel('Values')
plt.grid(True)
plt.show()
max(data['Financial Distress'])
plt.figure(figsize=(17,8))
sns.distplot(data[data['Financial Distress'] < 100]['Financial Distress'], kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram without outlier')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
After removing the outlier, the distribution looks more normal, although it is still slightly right-skewed. We will cut the column into 3 categories with equal frequencies; the precise bin edges are shown below.
pd.qcut(data['Financial Distress'], 3).value_counts()
data['status'] = np.where(data['Financial Distress'] < 0.295,0,np.where(data['Financial Distress'] < 1.021, 1,2))
data['status'].value_counts()
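For reference, the same three-way equal-frequency split could be produced directly with pd.qcut and integer labels; a minimal equivalent sketch (the counts should closely match the np.where-based status above, up to values exactly on the rounded thresholds):
# Equivalent binning with qcut labels (0, 1, 2).
status_qcut = pd.qcut(data['Financial Distress'], 3, labels=[0, 1, 2]).astype(int)
status_qcut.value_counts()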
Selecting data for modeling¶
This is panel data, so we have to deal with time. We select the last period of every company for modeling. A separate holdout dataset for testing will be produced from the first period of each company. We need to check whether the last period is representative of every period; if it is not, we are losing information in the process and need to consider an alternative approach.
data_one_period = data.drop_duplicates(subset=['Company'], keep='last')
data_one_period_test = data.drop_duplicates(subset=['Company'], keep='first')
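One quick, illustrative way to check whether the last period is representative is to compare the summary statistics of Financial Distress across the full data and the two one-period slices:
# Representativeness check: Financial Distress summary statistics for each slice side by side.
pd.DataFrame({
    'all periods': data['Financial Distress'].describe(),
    'last period': data_one_period['Financial Distress'].describe(),
    'first period': data_one_period_test['Financial Distress'].describe(),
})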
data_final_test = data_one_period.drop(['Financial Distress', 'Company'], axis=1)
data_final_test.head()
data_one_period.shape
The subset selected for modeling (data_one_period) has 422 rows and 87 columns. Below is the distribution of the status column derived from Financial Distress. Unlike the original split proposed by the Kaggle dataset's author, this one is far more balanced.
data_one_period['status'].value_counts()
data_one_period.head()
data_final = data_one_period.drop(['Financial Distress', 'Company'], axis=1)
The columns that are not useful will be dropped. Here is a preview. Below that, a heatmap of the full correlation matrix has been produced; it is very large and not very useful, so it is commented out.
data_final.head()
#plt.figure(figsize=(20,10))
#plt.title('Correlation Matrix')
#sns.heatmap(data_final.corr(), annot=True)
#plt.show()
Now we will use the correlation coefficients for variable selection. Below are the absolute correlations of the predictors against the status column, sorted in ascending order. We select the variables with an absolute correlation coefficient higher than 5% but lower than 70%.
abs(data_final.corr()['status']).sort_values(ascending=True)
variables = abs(data_final.corr()['status']).sort_values(ascending=True)
v1 = variables[0.05 < variables.values]
v2 = v1[v1.values < 0.7]
len(v2.index)
ind = v2.index
v2.index
y = data_final['status']
X = data_final[ind]
X.shape
We need to check for multicollinearity between the predictors. The VIF (Variance Inflation Factor) will be used, and a helper function has been written for the variable selection. For the curious reader, more on the topic can be found here: https://en.wikipedia.org/wiki/Variance_inflation_factor.
def calc_vif(X):
    # Compute the Variance Inflation Factor for every column of X.
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif
vif_df = calc_vif(X)
vif_df.shape
vif_df_without_corr = vif_df[vif_df['VIF'] < 10]
vif_df_without_corr.shape
We have chosen a VIF cutoff of 10; it could be lower. We end up with only 24 variables.
vif_df_without_corr['variables']
X = X[vif_df_without_corr['variables']]
X.shape
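For reference, an alternative to the single-pass filter above is iterative VIF elimination, which drops the highest-VIF variable and recomputes until every value falls below the threshold; a minimal sketch reusing the calc_vif helper (not applied here):
# Iterative VIF elimination sketch, assuming the calc_vif helper defined above.
def iterative_vif_selection(X_in, threshold=10.0):
    X_sel = X_in.copy()
    while True:
        vif = calc_vif(X_sel)
        worst = vif.loc[vif['VIF'].idxmax()]
        if worst['VIF'] < threshold:
            return X_sel
        # Drop the variable with the highest VIF and recompute.
        X_sel = X_sel.drop(columns=[worst['variables']])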
We produce a heatmap of the correlation matrix of the remaining variables. We can see that there is no strong multicollinearity left.
plt.figure(figsize=(20,10))
plt.title('Correlation Matrix of the selected predictors')
sns.heatmap(X.corr(), annot=True)
plt.show()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42)
Finally, we have produced a train/test split with a 30 percent test size. A random state of 42 has been used for reproducibility.
Modeling on the selected data¶
First, we tried scaling with the MinMaxScaler. This did not produce much improvement, so it has been dropped. The predictors are not that different from each other in scale.
#scaler = MinMaxScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)
We have written a function to evaluate the performance of a model: accuracy on both the training and test datasets, the root mean squared error, and the classification report based on the confusion matrix, with precision, recall, f1-score and support.
def evaluate(model, train_features, train_labels, test_features, test_labels):
    # Training accuracy, test RMSE, test accuracy and the full classification report.
    train_acc = model.score(train_features, train_labels)
    predictions = model.predict(test_features)
    rmse = np.sqrt(mean_squared_error(test_labels, predictions))
    accuracy = accuracy_score(test_labels, predictions)
    class_report = classification_report(test_labels, predictions)
    print('Model Performance')
    print('Training data accuracy: {:0.2f}'.format(train_acc))
    print('Root Mean Squared Error: {:0.2f}'.format(rmse))
    print('Test Accuracy: {:0.2f}.'.format(accuracy))
    print(class_report)
The first model we will use is logistic regression. The results are not that impressive.
clf_base = LogisticRegression(random_state=42).fit(X_train, y_train)
evaluate(clf_base, X_train, y_train, X_test, y_test)
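As a side note, with unscaled predictors the default solver can stop at its iteration limit before converging; a quick, illustrative check (not part of the original analysis) is to raise max_iter:
# Illustrative convergence check: the same model with a higher iteration limit.
clf_more_iter = LogisticRegression(random_state=42, max_iter=1000).fit(X_train, y_train)
evaluate(clf_more_iter, X_train, y_train, X_test, y_test)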
X_test_holdout = data_one_period_test[vif_df_without_corr['variables']]
X_test_holdout.shape
Now we will test it on the hold out sample.
plt.figure(figsize=(17,8))
sns.distplot(data_one_period_test[data_one_period_test['Financial Distress'] < 100]['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram testing holdout')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
data_one_period_test['status'] = np.where(data_one_period_test['Financial Distress'] < 0.295,0,
np.where(data_one_period_test['Financial Distress'] < 1.021, 1,2))
y_test_holdout = data_one_period_test['status']
print(y_test_holdout.shape)
print(X_test_holdout.shape)
The results are not that different; in this case they are actually slightly better.
clf_base = LogisticRegression(random_state=42).fit(X_train, y_train)
evaluate(clf_base, X_train, y_train, X_test_holdout, y_test_holdout)
Now we will create a Random Forest classifier. The training performance is significantly better than the test performance, which indicates overfitting. The holdout data further confirms this issue.
rf_base = RandomForestClassifier(random_state=42).fit(X_train, y_train)
evaluate(rf_base, X_train, y_train, X_test, y_test)
evaluate(rf_base, X_train, y_train, X_test_holdout, y_test_holdout)
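One way to tackle this overfitting, shown here as an untuned illustration rather than part of the original analysis, is to constrain the trees, for example by limiting depth and leaf size:
# Illustrative only: shallower trees and larger leaves trade training accuracy for lower variance.
rf_regularized = RandomForestClassifier(n_estimators=200, max_depth=5,
                                        min_samples_leaf=5, random_state=42).fit(X_train, y_train)
evaluate(rf_regularized, X_train, y_train, X_test, y_test)
evaluate(rf_regularized, X_train, y_train, X_test_holdout, y_test_holdout)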
Now it is time for the AdaBoost classifier. We still have overfitting but to a lesser extent.
ab_base = AdaBoostClassifier(random_state=42).fit(X_train, y_train)
evaluate(ab_base, X_train, y_train, X_test, y_test)
ab_base = AdaBoostClassifier(random_state=42).fit(X_train, y_train)
evaluate(ab_base, X_train, y_train, X_test_holdout, y_test_holdout)
We introduce the Support Vector Classifier. There are several kernels to choose from; they do not produce very high accuracy, so they will not be considered further. It should be noted that the linear kernel is much slower than the other kernels.
sv_base_rbf = SVC(kernel='rbf').fit(X_train, y_train)
evaluate(sv_base_rbf, X_train, y_train, X_test, y_test)
sv_base_sig = SVC(kernel='sigmoid').fit(X_train, y_train)
evaluate(sv_base_sig, X_train, y_train, X_test, y_test)
#sv_base_lin = SVC(kernel='linear').fit(X_train, y_train)
#evaluate(sv_base_lin, X_train, y_train, X_test, y_test)
This is the extreme gradient boosting classifier from the xgboost library, outside of sklearn. There is still an issue of overfitting.
xgb_model_base = xgb.XGBClassifier(objective="multi:softprob", random_state=42).fit(X_train, y_train)
evaluate(xgb_model_base, X_train, y_train, X_test, y_test)
evaluate(xgb_model_base, X_train, y_train, X_test_holdout, y_test_holdout)
The last model to consider is the Light Gradient Boosting classifier. There is still the issue of overfitting.
lgb_model = LGBMClassifier(random_state=42).fit(X_train, y_train)
evaluate(lgb_model, X_train, y_train, X_test, y_test)
evaluate(lgb_model, X_train, y_train, X_test_holdout, y_test_holdout)
Considering these findings, we can infer that there is a significant difference between the first and the last time period, so we must take another approach. We will take the mean of all the values for every company in the dataset.
Model building of average data¶
data_average = data.groupby("Company", as_index=False).mean()
data_average = data_average.drop(['Time'], axis=1)
data_average.shape
data_average.head()
We need to replace the averaged Time column with the total number of observed periods. The averaged Time column has already been dropped above, so we merge in the Time value from each company's last period, shown below.
time_to_merge = pd.DataFrame(data_one_period[['Company','Time']])
time_to_merge.head()
data_average_with_time = pd.merge(time_to_merge,data_average, on='Company')
data_average_with_time.head()
plt.figure(figsize=(17,8))
sns.distplot(data_average['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress of the data average Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
plt.figure(figsize=(17,8))
sns.distplot(data_average[data_average['Financial Distress'] < 10]['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram of data average without outliers')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
pd.qcut(data_average['Financial Distress'], 3).value_counts()
data_average['status'] = np.where(data_average['Financial Distress'] < 0.25,0,np.where(data_average['Financial Distress'] < 0.929, 1,2))
data_average = data_average.drop(['Financial Distress'], axis=1)
abs(data_average.corr()['status']).sort_values(ascending=True)
variables = abs(data_average.corr()['status']).sort_values(ascending=True)
v12 = variables[0.05 < variables.values]
v22 = v12[v12.values < 0.7]
len(v22.index)
ind1 = v22.index
v22.index
y1 = data_average['status']  # target and predictors now come from the averaged dataset
X1 = data_average[ind1].drop(columns=['Company'], errors='ignore')
vif_df1 = calc_vif(X1)
vif_df1.shape
We change the VIF threshold to 7 to remove the multicollinearity.
vif_df_without_corr1 = vif_df1[vif_df1['VIF'] < 7]
vif_df_without_corr1.shape
X1 = X1[vif_df_without_corr1['variables']]
X1.shape
plt.figure(figsize=(20,10))
plt.title('Correlation Matrix of the selected predictors for data average')
sns.heatmap(X1.corr(), annot=True)
plt.show()
X_train1, X_test1, y_train1, y_test1 = train_test_split(
X1, y1, test_size=0.3, random_state=42)
A pipeline has been created with polynomial features and the model. You can review the results below; the polynomial features seem to worsen the results, so we will not explore them further.
polynomial_features = PolynomialFeatures(degree=4,
include_bias=False)
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("logistic_regression", LogisticRegression(random_state=42))])
clf_avg = LogisticRegression(random_state=42).fit(X_train1, y_train1)
evaluate(clf_avg, X_train1, y_train1, X_test1, y_test1)
clf_avg_pipe = pipeline.fit(X_train1, y_train1)
evaluate(clf_avg_pipe, X_train1, y_train1, X_test1, y_test1)
pipeline_rf = Pipeline([("polynomial_features", polynomial_features),
                        ("random_forest", RandomForestClassifier(random_state=42))])
rf_avg = RandomForestClassifier(random_state=42).fit(X_train1, y_train1)
evaluate(rf_avg, X_train1, y_train1, X_test1, y_test1)
rf_avg_pipe = pipeline_rf.fit(X_train1, y_train1)
evaluate(rf_avg_pipe, X_train1, y_train1, X_test1, y_test1)
ab_avg = AdaBoostClassifier(random_state=42).fit(X_train1, y_train1)
evaluate(ab_avg, X_train1, y_train1, X_test1, y_test1)
sv_avg_rbf = SVC(kernel='rbf', random_state = 42).fit(X_train1, y_train1)
evaluate(sv_avg_rbf, X_train1, y_train1, X_test1, y_test1)
xgb_model_avg = xgb.XGBClassifier(objective="multi:softprob", random_state=42).fit(X_train1, y_train1)
evaluate(xgb_model_avg, X_train1, y_train1, X_test1, y_test1)
lgb_model_avg = LGBMClassifier(random_state=42).fit(X_train1, y_train1)
evaluate(lgb_model_avg, X_train1, y_train1, X_test1, y_test1)
pipeline_lgb = Pipeline([("polynomial_features", polynomial_features),
                         ("lgbm", LGBMClassifier(random_state=42))])
lgb_model_avg_pipe = pipeline_lgb.fit(X_train1, y_train1)
evaluate(lgb_model_avg_pipe, X_train1, y_train1, X_test1, y_test1)
Random Search with the best models¶
Grid search has been tried, but it takes too much time, so we will try the less computationally intensive random search with the Random Forest and the Light Gradient Boosting classifiers.
params = {'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
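For context on why a full grid search is slow here: the grid above contains 2 × 11 × 2 × 3 × 3 × 10 = 3,960 parameter combinations, i.e. 11,880 model fits under 3-fold cross-validation, whereas the random search below fits only 100 × 3 = 300 models.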
rf_random_avg = RandomizedSearchCV(estimator = RandomForestClassifier(random_state=42),param_distributions = params,
n_iter = 100,cv = 3,random_state=42, n_jobs = -1)
rf_avg_tuned = rf_random_avg.fit(X_train1, y_train1)
rf_avg_tuned.best_params_
rf_avg_final = RandomForestClassifier(random_state=42,n_estimators = 400,
min_samples_split = 2,
min_samples_leaf = 4,
max_features = 'sqrt',
max_depth = 70,
bootstrap =True ).fit(X_train1, y_train1)
evaluate(rf_avg_final, X_train1, y_train1, X_test1, y_test1)
The hyperparameters from PyCaret (see https://pycaret.org/ for more information) have been taken, but the results here are significantly lower. The results are not that good even after hyperparameter tuning. We are losing information when removing or averaging the different periods of the dataset. We will try to analyse the time series to see if there are any trends or seasonality.
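For reference, and assuming PyCaret's classification module, the hyperparameters could have been obtained along the following lines (a sketch, not run as part of this notebook):
# Hedged sketch of a PyCaret workflow; names follow PyCaret's documented classification API.
from pycaret.classification import setup, create_model, tune_model
avg_train = X_train1.copy()
avg_train['status'] = y_train1
setup(data=avg_train, target='status', session_id=42)
rf_pycaret = tune_model(create_model('rf'))
print(rf_pycaret.get_params())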
Time series¶
data.describe()
We can check the summary statistics of every column of the full dataset. Dimitar has stated that there are outlier values that significantly impact the financial distress variable. We will try to plot some of the values over time to check whether there are trends.
data[data['Company'] == 2]
data.columns[3:-1]
data.loc[data['Company'] == 2].iloc[:, 3]  # preview a single column (the first predictor) for company 2
plt.figure(figsize=(17,8))
company_2 = data.loc[data['Company'] == 2]
# Plot every predictor column (excluding Company, Time, Financial Distress and status) over time for company 2.
for col in data.columns[3:-1]:
    plt.plot(company_2['Time'], company_2[col], label=col)
plt.title('Line plot of Xs during the observed period.')
#plt.legend()
plt.xlabel('Time')
plt.ylabel('values')
plt.grid(True)
plt.show()
We can see how the predictors develop during the observed period. There are some changes, but each company needs to be investigated individually.
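As a starting point for that investigation, one could, for instance, compute the slope of a simple linear fit over time for each predictor of a given company as a rough trend indicator; a hypothetical sketch:
# Hypothetical helper: slope of a linear fit over Time for each selected predictor of one company.
def trend_slopes(df, company, cols):
    sub = df.loc[df['Company'] == company]
    return {col: np.polyfit(sub['Time'], sub[col], 1)[0] for col in cols}
trend_slopes(data, 2, data.columns[3:6])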
Conclusion¶
The financial distress variable was binned into 3 equal-frequency parts. Several approaches have been tried. The first one used only the last period. Many of the models overfitted significantly; regularization, such as tree pruning, needs to be implemented. There has been a discrepancy between the results from sklearn and PyCaret that needs to be investigated more thoroughly. A holdout dataset from a different period has been used to check the models, and the results are lower, meaning we are losing information when selecting only one period. So the average of all the periods has been used; with it the best models achieve an accuracy of around 65%, meaning that we are again losing information.
The trajectory of every variable needs further investigation to see if there are trends, seasonality or idiosyncrasies.
To conclude, the problem could also be approached as a regression problem, and binning might be applied to the predictors. It is also possible to view the problem as a survival analysis problem: https://en.wikipedia.org/wiki/Survival_analysis