import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from pycaret.classification import *
Read the data and data prep
We are using the financial distress data of companies. It is part of a Kaggle dataset that can be viewed at the following link: https://www.kaggle.com/shebrahimi/financial-distress The analysis below is a remake of the analysis in the previous notebook, only this time it is produced with the new Python library PyCaret (https://pycaret.org/), which is presented as requiring less code. For this reason we will not be as descriptive as in the previous notebook. The results will be compared to the results produced by sklearn.
data = pd.read_csv('data/Financial Distress.csv')
data.shape
data.columns
data.info()
data['Company'].unique()
data['Time'].unique()
plt.figure(figsize=(18,10))
sns.distplot(data['Financial Distress'],kde=False, rug=True, color = 'blue');
plt.title('Financial Distress Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
data['Financial Distress'].describe()
plt.figure(figsize=(18,10))
sns.boxplot(x=data['Financial Distress'], color = 'green')
plt.title('Financial Distress Boxplot')
plt.xlabel('Values')
plt.grid(True)
plt.show()
max(data['Financial Distress'])
plt.figure(figsize=(17,8))
sns.distplot(data[data['Financial Distress'] < 100]['Financial Distress'], kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram without outlier')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
# average each company's indicators across all of its time periods
data_average = data.groupby("Company", as_index=False).mean()
data_average = data_average.drop(['Time'], axis=1)
data_average.shape
data_average.head()
# keep the last available period of each company so its Time value can be merged back in
data_one_period = data.drop_duplicates(subset=['Company'], keep='last')
time_to_merge = pd.DataFrame(data_one_period[['Company','Time']])
time_to_merge.head()
data_average_with_time = pd.merge(time_to_merge,data_average, on='Company')
data_average_with_time.head()
plt.figure(figsize=(17,8))
sns.distplot(data_average['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress of the data average Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
plt.figure(figsize=(17,8))
sns.distplot(data_average[data_average['Financial Distress'] < 10]['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram of data average without outliers')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
pd.qcut(data_average['Financial Distress'], 3).value_counts()
# map Financial Distress into three classes, using cut-offs that roughly match the tertile edges from the qcut above
data_average['status'] = np.where(data_average['Financial Distress'] < 0.25,0,np.where(data_average['Financial Distress'] < 0.929, 1,2))
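As a quick sanity check, an equivalent labelling could also be obtained directly from pd.qcut; this is only a sketch, and it assumes the hard-coded cut-offs above are indeed the tertile edges reported by qcut.
# hypothetical alternative: let qcut assign the tertile labels directly
pd.qcut(data_average['Financial Distress'], 3, labels=[0, 1, 2]).astype(int).value_counts()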
data_average = data_average.drop(['Financial Distress', 'Company'], axis=1)
data_average.head()
Modeling
First we create a setup with the desired parameters. The report can be viewed below. The chosen methods are highlighted.
classification = setup(data_average, target = 'status', session_id = 42,
                       normalize = True,
                       transformation = True,
                       ignore_low_variance = True,
                       remove_multicollinearity = True, multicollinearity_threshold = 0.75,
                       train_size = 0.7,
                       silent = False)
We compare some of the models based on several indicators. AUC is not reported for multiclass classification. The highest result in each column is highlighted.
compare_models()
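The comparison table is sorted by Accuracy by default; if another metric matters more, it can be re-sorted. This is only a sketch and assumes the installed PyCaret version exposes the sort argument.
# hypothetical: rank the models by Recall instead of Accuracy
compare_models(sort = 'Recall')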
We take a deeper look at some of the models. Cross-validation uses 10 folds by default, and the mean and standard deviation of each metric are shown.
rf = create_model('rf')
tuned_rf = tune_model('rf')
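The number of folds can be changed through the fold argument if needed; a minimal sketch, not run here.
# hypothetical: train the random forest with 5-fold cross-validation instead of the default 10
rf_5fold = create_model('rf', fold = 5)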
The evaluate_model function lets us view some of the plots and results for the model after it has been tuned in the code above. Not all of the plots are available for this model.
evaluate_model(tuned_rf)
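A single plot can also be requested directly with plot_model instead of the interactive widget; a sketch, where 'confusion_matrix' is one of the available plot names.
# hypothetical: show only the confusion matrix for the tuned random forest
plot_model(tuned_rf, plot = 'confusion_matrix')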
gb = create_model('gbc')
tuned_gb = tune_model('gbc')
evaluate_model(tuned_gb)
dt = create_model('dt')
We will try bagging, boosting and blending of decision trees. You can read more on the topic here: https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9
bagged_dt = ensemble_model(dt)
print(bagged_dt)
boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators=20)
tuned_bagged_dt = tune_model('dt', ensemble=True, method='Bagging')
blend_soft = blend_models(method = 'soft')
cb = create_model('catboost')
egb = create_model('xgboost')
lgb = create_model('lightgbm')
tuned_lgb = tune_model('lightgbm')
Finally we blend several models using two different voting methods, soft and hard. You can read more about different ensemble methods here: https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/.
blend_specific_soft = blend_models(estimator_list = [rf, egb, lgb, gb], method = 'soft')
# catboost is left out because it is not supported by blend_models
blend_specific_hard = blend_models(estimator_list = [rf,egb, lgb, gb], method = 'hard')
We can preview the agreement and disagreement of the stacked models in a plot; this is not available for every model.
stack_soft = stack_models([rf,egb, lgb, gb, cb], plot=True)
stack_hard = stack_models([rf,egb, lgb, gb, cb], method='hard', plot=True)
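A specific meta model can also be passed to the stack; a sketch, assuming the installed PyCaret version exposes the meta_model argument.
# hypothetical: use a logistic regression as the meta model of the stack
lr = create_model('lr')
stack_lr = stack_models([rf, egb, lgb, gb, cb], meta_model = lr)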
Test the models on a holdout sample
Finally we test some of the models on the holdout sample. The predictions can be viewed below.
gb_holdout_pred = predict_model(gb)
gb_holdout_pred.head()
lgb_holdout_pred = predict_model(lgb)
lgb_holdout_pred.head()
tuned_gb_holdout_pred = predict_model(tuned_gb)
tuned_gb_holdout_pred.head()
tuned_rf_holdout_pred = predict_model(tuned_rf)
tuned_rf_holdout_pred.head()
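If one of these models were to be kept for later use, it could be finalized on the full dataset and saved to disk; a sketch, where the file name is arbitrary.
# hypothetical: refit the tuned random forest on the full data and persist it
final_rf = finalize_model(tuned_rf)
save_model(final_rf, 'tuned_rf_financial_distress')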
Conclusion
The models produced with the PyCaret library have better results than similar models produced with sklearn, and the differences between the training and testing scores are smaller. Some of the ensemble methods create better models, while others worsen the results. There are instances where the test accuracy is up to 15% higher with PyCaret. This discrepancy is suspicious and needs to be investigated further. Because sklearn is older and more widely used, we place more trust in its results. Further research should take a closer look at the methods applied by PyCaret and at its documentation to find the reasons behind the stark differences.
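For reference, below is a minimal sketch of the kind of sklearn baseline the comparison refers to. It assumes the same data_average frame with the 'status' target; the parameters are illustrative and not necessarily those used in the original sklearn notebook.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X = data_average.drop('status', axis=1)
y = data_average['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# scale the features and fit a random forest, roughly mirroring the PyCaret setup
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
pipe.fit(X_train, y_train)
print(accuracy_score(y_test, pipe.predict(X_test)))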