Exploring Financial Distress, Part 2


Authors

Team 4:

• Stephen Panev
• Marin St
• Dayana Hristova
• Dimitar Lyubchev

Libraries Used
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from pycaret.classification import *


Reading the data and data preparation

We are using the financial distress data of companies. This is part of a Kaggle dataset, which can be viewed at the following link: https://www.kaggle.com/shebrahimi/financial-distress The analysis below is a remake of the analysis in the previous notebook, only this time it is produced with the new Python library PyCaret (https://pycaret.org/), which is presented as requiring less code. For this reason we will not be as descriptive as in the previous notebook. The results will be compared to the results produced by sklearn.

In [2]:
data = pd.read_csv('data/Financial Distress.csv')

In [3]:
data.shape

Out[3]:
(3672, 86)
In [4]:
data.columns

Out[4]:
Index(['Company', 'Time', 'Financial Distress', 'x1', 'x2', 'x3', 'x4', 'x5',
'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16',
'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26',
'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36',
'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43', 'x44', 'x45', 'x46',
'x47', 'x48', 'x49', 'x50', 'x51', 'x52', 'x53', 'x54', 'x55', 'x56',
'x57', 'x58', 'x59', 'x60', 'x61', 'x62', 'x63', 'x64', 'x65', 'x66',
'x67', 'x68', 'x69', 'x70', 'x71', 'x72', 'x73', 'x74', 'x75', 'x76',
'x77', 'x78', 'x79', 'x80', 'x81', 'x82', 'x83'],
dtype='object')
In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3672 entries, 0 to 3671
Data columns (total 86 columns):
#   Column              Non-Null Count  Dtype
---  ------              --------------  -----
0   Company             3672 non-null   int64
1   Time                3672 non-null   int64
2   Financial Distress  3672 non-null   float64
3   x1                  3672 non-null   float64
4   x2                  3672 non-null   float64
5   x3                  3672 non-null   float64
6   x4                  3672 non-null   float64
7   x5                  3672 non-null   float64
8   x6                  3672 non-null   float64
9   x7                  3672 non-null   float64
10  x8                  3672 non-null   float64
11  x9                  3672 non-null   float64
12  x10                 3672 non-null   float64
13  x11                 3672 non-null   float64
14  x12                 3672 non-null   float64
15  x13                 3672 non-null   float64
16  x14                 3672 non-null   float64
17  x15                 3672 non-null   float64
18  x16                 3672 non-null   float64
19  x17                 3672 non-null   float64
20  x18                 3672 non-null   float64
21  x19                 3672 non-null   float64
22  x20                 3672 non-null   float64
23  x21                 3672 non-null   float64
24  x22                 3672 non-null   float64
25  x23                 3672 non-null   float64
26  x24                 3672 non-null   float64
27  x25                 3672 non-null   float64
28  x26                 3672 non-null   float64
29  x27                 3672 non-null   float64
30  x28                 3672 non-null   float64
31  x29                 3672 non-null   float64
32  x30                 3672 non-null   float64
33  x31                 3672 non-null   float64
34  x32                 3672 non-null   float64
35  x33                 3672 non-null   float64
36  x34                 3672 non-null   float64
37  x35                 3672 non-null   float64
38  x36                 3672 non-null   float64
39  x37                 3672 non-null   float64
40  x38                 3672 non-null   float64
41  x39                 3672 non-null   float64
42  x40                 3672 non-null   float64
43  x41                 3672 non-null   float64
44  x42                 3672 non-null   float64
45  x43                 3672 non-null   float64
46  x44                 3672 non-null   float64
47  x45                 3672 non-null   float64
48  x46                 3672 non-null   float64
49  x47                 3672 non-null   float64
50  x48                 3672 non-null   float64
51  x49                 3672 non-null   float64
52  x50                 3672 non-null   float64
53  x51                 3672 non-null   float64
54  x52                 3672 non-null   float64
55  x53                 3672 non-null   float64
56  x54                 3672 non-null   float64
57  x55                 3672 non-null   float64
58  x56                 3672 non-null   float64
59  x57                 3672 non-null   float64
60  x58                 3672 non-null   float64
61  x59                 3672 non-null   float64
62  x60                 3672 non-null   float64
63  x61                 3672 non-null   float64
64  x62                 3672 non-null   float64
65  x63                 3672 non-null   float64
66  x64                 3672 non-null   float64
67  x65                 3672 non-null   float64
68  x66                 3672 non-null   float64
69  x67                 3672 non-null   float64
70  x68                 3672 non-null   float64
71  x69                 3672 non-null   float64
72  x70                 3672 non-null   float64
73  x71                 3672 non-null   float64
74  x72                 3672 non-null   float64
75  x73                 3672 non-null   float64
76  x74                 3672 non-null   float64
77  x75                 3672 non-null   float64
78  x76                 3672 non-null   float64
79  x77                 3672 non-null   float64
80  x78                 3672 non-null   float64
81  x79                 3672 non-null   float64
82  x80                 3672 non-null   int64
83  x81                 3672 non-null   float64
84  x82                 3672 non-null   int64
85  x83                 3672 non-null   int64
dtypes: float64(81), int64(5)
memory usage: 2.4 MB

In [6]:
data['Company'].unique()

Out[6]:
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208,
209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247,
248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260,
261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273,
274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286,
287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325,
326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338,
339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351,
352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364,
365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377,
378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390,
391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403,
404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416,
417, 418, 419, 420, 421, 422])
In [7]:
data['Time'].unique()

Out[7]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
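The data is a panel: each row is one company in one period, with up to 14 periods per company. Whether the panel is balanced can be checked by counting periods per company; a minimal sketch on a hypothetical mini-panel (the column names match the dataset, the values do not):

```python
import pandas as pd

# Hypothetical mini-panel with the same Company/Time layout as the dataset
df = pd.DataFrame({'Company': [1, 1, 1, 2, 2, 3],
                   'Time':    [1, 2, 3, 1, 2, 1]})

# Number of observed periods per company; unequal counts mean an unbalanced panel
periods = df.groupby('Company')['Time'].count()
print(periods.tolist())  # [3, 2, 1]
is_balanced = periods.nunique() == 1
print(is_balanced)  # False
```

On the real data the same check would reveal whether every company is observed for all 14 periods.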
In [8]:
plt.figure(figsize=(18,10))
sns.distplot(data['Financial Distress'],kde=False, rug=True, color = 'blue');
plt.title('Financial Distress Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [9]:
data['Financial Distress'].describe()

Out[9]:
count    3672.000000
mean        1.040257
std         2.652227
min        -8.631700
25%         0.172275
50%         0.583805
75%         1.351750
max       128.400000
Name: Financial Distress, dtype: float64
In [10]:
plt.figure(figsize=(18,10))
sns.boxplot(x=data['Financial Distress'], color = 'green')
plt.title('Financial Distress Boxplot')
plt.xlabel('Values')
plt.grid(True)
plt.show()

In [11]:
max(data['Financial Distress'])

Out[11]:
128.4
In [12]:
plt.figure(figsize=(17,8))
sns.distplot(data[data['Financial Distress'] < 100]['Financial Distress'], kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram without outlier')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [13]:
data_average = data.groupby("Company", as_index=False).mean()
data_average = data_average.drop(['Time'], axis=1)
data_average.shape

Out[13]:
(422, 85)
In [14]:
data_average.head()

Out[14]:
Company Financial Distress x1 x2 x3 x4 x5 x6 x7 x8 ... x74 x75 x76 x77 x78 x79 x80 x81 x82 x83
0 1 -0.334323 1.179250 -0.011305 0.869128 0.940075 0.035843 0.126302 0.564090 -0.018738 ... 92.050750 33.5625 32.486500 16.791750 15.750000 1.500000 22.0 -0.177584 31.5 50.5
1 2 1.966056 1.539892 0.204816 0.628511 0.931229 0.302304 0.251645 1.068073 0.218296 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 29.0 1.895985 13.5 33.5
2 3 -1.659900 0.874400 -0.034676 0.793500 0.609520 -0.002632 -0.086847 0.506090 -0.056892 ... 85.437000 27.0700 26.102000 16.000000 16.000000 0.200000 25.0 -0.303170 8.0 37.0
3 4 0.839656 1.553275 0.138410 0.462178 0.759583 0.185367 0.168315 0.472444 0.171318 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 12.0 0.748936 34.5 50.5
4 5 1.969673 1.127500 0.107643 0.743549 0.449420 0.108686 0.089244 0.664047 0.249389 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 23.0 1.921633 18.5 43.5

5 rows × 85 columns

In [15]:
data_one_period = data.drop_duplicates(subset=['Company'], keep='last')
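`drop_duplicates` with `keep='last'` retains each company's final row, i.e. its last observed Time, which is what gets merged back onto the company averages below. A toy sketch of the behaviour:

```python
import pandas as pd

# Toy frame: company 1 observed twice, company 2 once (values are made up)
df = pd.DataFrame({'Company': [1, 1, 2],
                   'Time':    [1, 2, 1],
                   'x1':      [0.1, 0.2, 0.3]})

# keep='last' keeps the final (latest-Time) row per company
last = df.drop_duplicates(subset=['Company'], keep='last')
print(last['Time'].tolist())  # [2, 1]
```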

In [16]:
time_to_merge = pd.DataFrame(data_one_period[['Company','Time']])
time_to_merge.head()

Out[16]:
Company Time
3 1 4
17 2 14
18 3 1
32 4 14
46 5 14
In [17]:
data_average_with_time = pd.merge(time_to_merge, data_average, on='Company')
data_average_with_time.head()

Out[17]:
Company Time Financial Distress x1 x2 x3 x4 x5 x6 x7 ... x74 x75 x76 x77 x78 x79 x80 x81 x82 x83
0 1 4 -0.334323 1.179250 -0.011305 0.869128 0.940075 0.035843 0.126302 0.564090 ... 92.050750 33.5625 32.486500 16.791750 15.750000 1.500000 22.0 -0.177584 31.5 50.5
1 2 14 1.966056 1.539892 0.204816 0.628511 0.931229 0.302304 0.251645 1.068073 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 29.0 1.895985 13.5 33.5
2 3 1 -1.659900 0.874400 -0.034676 0.793500 0.609520 -0.002632 -0.086847 0.506090 ... 85.437000 27.0700 26.102000 16.000000 16.000000 0.200000 25.0 -0.303170 8.0 37.0
3 4 14 0.839656 1.553275 0.138410 0.462178 0.759583 0.185367 0.168315 0.472444 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 12.0 0.748936 34.5 50.5
4 5 14 1.969673 1.127500 0.107643 0.743549 0.449420 0.108686 0.089244 0.664047 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 23.0 1.921633 18.5 43.5

5 rows × 86 columns

In [18]:
plt.figure(figsize=(17,8))
sns.distplot(data_average['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress of the data average Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [19]:
plt.figure(figsize=(17,8))
sns.distplot(data_average[data_average['Financial Distress'] < 10]['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram of data average without outliers')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [20]:
pd.qcut(data_average['Financial Distress'], 3).value_counts()

Out[20]:
(0.929, 32.813]                141
(-5.6850000000000005, 0.25]    141
(0.25, 0.929]                  140
Name: Financial Distress, dtype: int64
In [21]:
data_average['status'] = np.where(
    data_average['Financial Distress'] < 0.25, 0,
    np.where(data_average['Financial Distress'] < 0.929, 1, 2),
)
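The nested np.where above reproduces the tertile split from the qcut output; pd.qcut can also assign the labels directly. A hedged sketch on toy values (the thresholds 0.25 and 0.929 are the bin edges shown above; the series is made up):

```python
import numpy as np
import pandas as pd

# Toy values standing in for the 'Financial Distress' column
fd = pd.Series([-1.0, 0.1, 0.3, 0.5, 1.0, 2.0])

# Manual thresholds, as in the notebook
status_where = np.where(fd < 0.25, 0, np.where(fd < 0.929, 1, 2))
print(status_where.tolist())  # [0, 0, 1, 1, 2, 2]

# pd.qcut assigns tertile labels directly from the data's own quantiles
status_qcut = pd.qcut(fd, 3, labels=[0, 1, 2]).astype(int)
print(status_qcut.tolist())  # [0, 0, 1, 1, 2, 2]
```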

In [22]:
data_average = data_average.drop(['Financial Distress', 'Company'], axis=1)

In [23]:
data_average.head()

Out[23]:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... x75 x76 x77 x78 x79 x80 x81 x82 x83 status
0 1.179250 -0.011305 0.869128 0.940075 0.035843 0.126302 0.564090 -0.018738 -0.163632 -0.027849 ... 33.5625 32.486500 16.791750 15.750000 1.500000 22.0 -0.177584 31.5 50.5 0
1 1.539892 0.204816 0.628511 0.931229 0.302304 0.251645 1.068073 0.218296 0.583685 0.238587 ... 92.1600 89.237286 17.770857 15.142857 -2.721429 29.0 1.895985 13.5 33.5 2
2 0.874400 -0.034676 0.793500 0.609520 -0.002632 -0.086847 0.506090 -0.056892 -0.167930 -0.034521 ... 27.0700 26.102000 16.000000 16.000000 0.200000 25.0 -0.303170 8.0 37.0 0
3 1.553275 0.138410 0.462178 0.759583 0.185367 0.168315 0.472444 0.171318 0.245349 0.195789 ... 92.1600 89.237286 17.770857 15.142857 -2.721429 12.0 0.748936 34.5 50.5 1
4 1.127500 0.107643 0.743549 0.449420 0.108686 0.089244 0.664047 0.249389 0.443456 0.123553 ... 92.1600 89.237286 17.770857 15.142857 -2.721429 23.0 1.921633 18.5 43.5 2

5 rows × 84 columns

Modeling

First we create the PyCaret setup with the desired parameters. The configuration report can be viewed below; the chosen options are highlighted.

In [24]:
classification = setup(
    data_average, target='status', session_id=42,
    normalize=True,
    transformation=True,
    ignore_low_variance=True,
    remove_multicollinearity=True, multicollinearity_threshold=0.75,
    train_size=0.7,
    silent=False,
)


Setup Succesfully Completed!

Description Value
0 session_id 42
1 Target Type Multiclass
2 Label Encoded None
3 Original Data (422, 84)
4 Missing Values False
5 Numeric Features 83
6 Categorical Features 0
7 Ordinal Features False
8 High Cardinality Features False
9 High Cardinality Method None
10 Sampled Data (422, 84)
11 Transformed Train Set (295, 46)
12 Transformed Test Set (127, 46)
13 Numeric Imputer mean
14 Categorical Imputer constant
15 Normalize True
16 Normalize Method zscore
17 Transformation True
18 Transformation Method yeo-johnson
19 PCA False
20 PCA Method None
21 PCA Components None
22 Ignore Low Variance True
23 Combine Rare Levels False
24 Rare Level Threshold None
25 Numeric Binning False
26 Remove Outliers False
27 Outliers Threshold None
28 Remove Multicollinearity True
29 Multicollinearity Threshold 0.750000
30 Clustering False
31 Clustering Iteration None
32 Polynomial Features False
33 Polynomial Degree None
34 Trignometry Features False
35 Polynomial Threshold None
36 Group Features False
37 Feature Selection False
38 Features Selection Threshold None
39 Feature Interaction False
40 Feature Ratio False
41 Interaction Threshold None
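The multicollinearity removal in the setup drops features whose pairwise correlation exceeds the 0.75 threshold, which is part of why the transformed train set has only 46 columns. A related manual diagnostic is the variance inflation factor (imported at the top of this notebook but otherwise unused). A self-contained numpy sketch on synthetic data, mirroring what statsmodels' variance_inflation_factor computes:

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = 0.9 * a + rng.normal(scale=0.1, size=200)  # nearly collinear with a
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

print([round(vif(X, i), 1) for i in range(3)])  # a and b are large, c is near 1
```

Columns like a and b here would be flagged by any multicollinearity filter; c would survive.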

We compare the models on several metrics. AUC is not reported for multiclass classification, which is why that column shows zeros. The highest results are highlighted.

In [25]:
compare_models()

Out[25]:
Model Accuracy AUC Recall Prec. F1 Kappa
0 Random Forest Classifier 0.847400 0.000000 0.846300 0.858700 0.845800 0.771000
1 Extreme Gradient Boosting 0.837100 0.000000 0.836700 0.858600 0.838100 0.755800
2 Gradient Boosting Classifier 0.837000 0.000000 0.836300 0.852100 0.838800 0.755600
3 Light Gradient Boosting Machine 0.836900 0.000000 0.837000 0.856100 0.838400 0.755700
4 CatBoost Classifier 0.833700 0.000000 0.833700 0.851200 0.834500 0.750600
5 Ada Boost Classifier 0.817200 0.000000 0.817800 0.841500 0.817700 0.726100
6 Extra Trees Classifier 0.810100 0.000000 0.809300 0.831200 0.809600 0.715000
7 Logistic Regression 0.793200 0.000000 0.793000 0.810300 0.786700 0.689900
8 SVM - Linear Kernel 0.773000 0.000000 0.772600 0.782100 0.770700 0.659600
9 Linear Discriminant Analysis 0.732400 0.000000 0.732600 0.748300 0.729400 0.598900
10 Decision Tree Classifier 0.728700 0.000000 0.728100 0.762900 0.724200 0.592900
11 Ridge Classifier 0.715200 0.000000 0.715900 0.719300 0.707600 0.573200
12 Quadratic Discriminant Analysis 0.678700 0.000000 0.678100 0.701900 0.680400 0.518000
13 K Neighbors Classifier 0.647600 0.000000 0.647000 0.699800 0.652800 0.471400
14 Naive Bayes 0.610900 0.000000 0.612600 0.713100 0.602500 0.417400

We take a deeper look at some of the models. Cross-validation uses 10 folds by default, and the mean and standard deviation across folds are reported.
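The per-fold table that create_model prints can be approximated in sklearn with stratified 10-fold cross-validation; a hedged sketch on synthetic data (make_classification stands in for the prepared dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class problem as a stand-in for the prepared data
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(len(scores), round(scores.mean(), 3), round(scores.std(), 3))
```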

In [26]:
rf = create_model('rf')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
1 0.8000 0.0 0.8000 0.8000 0.7949 0.7000
2 0.9000 0.0 0.9000 0.9231 0.9019 0.8500
3 0.8333 0.0 0.8333 0.8357 0.8331 0.7500
4 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
5 0.8966 0.0 0.8963 0.8959 0.8947 0.8446
6 0.8966 0.0 0.8926 0.8985 0.8960 0.8444
7 0.8276 0.0 0.8333 0.8387 0.8198 0.7425
8 0.7241 0.0 0.7222 0.7649 0.7329 0.5872
9 0.8621 0.0 0.8519 0.8851 0.8486 0.7914
Mean 0.8474 0.0 0.8463 0.8587 0.8458 0.7710
SD 0.0516 0.0 0.0512 0.0462 0.0502 0.0770
In [27]:
tuned_rf = tune_model('rf')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8308 0.8020 0.7000
1 0.8000 0.0 0.8000 0.8141 0.7989 0.7000
2 0.9000 0.0 0.9000 0.9231 0.9019 0.8500
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8333 0.0 0.8333 0.8320 0.8313 0.7500
5 0.8276 0.0 0.8259 0.8305 0.8231 0.7411
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.7931 0.0 0.8000 0.8759 0.7893 0.6926
8 0.7931 0.0 0.7926 0.8096 0.7958 0.6904
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8338 0.0 0.8333 0.8533 0.8331 0.7509
SD 0.0356 0.0 0.0343 0.0340 0.0361 0.0529

The evaluate function lets us view some of the results of the model after it has been tuned in the above code. Not all of the plots are available in this instance.

In [28]:
evaluate_model(tuned_rf)

In [29]:
gb = create_model('gbc')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8333 0.0 0.8333 0.8500 0.8357 0.7500
1 0.8333 0.0 0.8333 0.8424 0.8364 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8276 0.0 0.8259 0.8300 0.8273 0.7411
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8630 0.8723 0.8613 0.7936
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8276 0.0 0.8222 0.8303 0.8239 0.7401
Mean 0.8370 0.0 0.8363 0.8521 0.8388 0.7556
SD 0.0410 0.0 0.0406 0.0351 0.0389 0.0609
In [30]:
tuned_gb = tune_model('gbc')

Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7718 0.7397 0.6000
1 0.7333 0.0 0.7333 0.7235 0.7249 0.6000
2 0.9333 0.0 0.9333 0.9444 0.9346 0.9000
3 0.8333 0.0 0.8333 0.8500 0.8357 0.7500
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.9310 0.0 0.9296 0.9310 0.9310 0.8964
7 0.8621 0.0 0.8667 0.9045 0.8646 0.7943
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.7931 0.0 0.7852 0.7994 0.7846 0.6876
Mean 0.8238 0.0 0.8237 0.8410 0.8236 0.7358
SD 0.0703 0.0 0.0705 0.0684 0.0710 0.1054
In [31]:
evaluate_model(tuned_gb)

In [32]:
dt = create_model('dt')

Accuracy AUC Recall Prec. F1 Kappa
0 0.6333 0.0 0.6333 0.7208 0.6212 0.4500
1 0.7000 0.0 0.7000 0.7175 0.6985 0.5500
2 0.7667 0.0 0.7667 0.8380 0.7593 0.6500
3 0.7667 0.0 0.7667 0.7685 0.7640 0.6500
4 0.8000 0.0 0.8000 0.8214 0.7963 0.7000
5 0.7586 0.0 0.7630 0.7931 0.7618 0.6394
6 0.7241 0.0 0.7185 0.7685 0.7214 0.5827
7 0.7931 0.0 0.7926 0.8022 0.7846 0.6893
8 0.7241 0.0 0.7222 0.7440 0.7307 0.5872
9 0.6207 0.0 0.6185 0.6552 0.6046 0.4304
Mean 0.7287 0.0 0.7281 0.7629 0.7242 0.5929
SD 0.0589 0.0 0.0595 0.0523 0.0622 0.0885

We will try bagging, boosting, and blending of decision trees. You can read more on the topic here: https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9
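The bagging and boosting ideas can be sketched directly in sklearn on synthetic data (a minimal, hedged equivalent of what ensemble_model does; make_classification stands in for the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)

models = {
    'tree': DecisionTreeClassifier(random_state=42),
    # Bagging: average many trees fit on bootstrap resamples
    'bagged': BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                n_estimators=10, random_state=42),
    # Boosting: fit trees sequentially, reweighting misclassified samples
    'boosted': AdaBoostClassifier(DecisionTreeClassifier(random_state=42),
                                  n_estimators=20, random_state=42),
}

results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, score in results.items():
    print(name, round(score, 3))
```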

In [33]:
bagged_dt = ensemble_model(dt)

Accuracy AUC Recall Prec. F1 Kappa
0 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
1 0.8333 0.0 0.8333 0.8424 0.8364 0.7500
2 0.9333 0.0 0.9333 0.9444 0.9346 0.9000
3 0.8333 0.0 0.8333 0.8357 0.8331 0.7500
4 0.8667 0.0 0.8667 0.8694 0.8623 0.8000
5 0.8621 0.0 0.8630 0.8859 0.8597 0.7929
6 0.8621 0.0 0.8630 0.8621 0.8621 0.7929
7 0.7241 0.0 0.7296 0.7189 0.7044 0.5879
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.7931 0.0 0.7852 0.7994 0.7846 0.6876
Mean 0.8333 0.0 0.8333 0.8426 0.8310 0.7501
SD 0.0574 0.0 0.0570 0.0589 0.0611 0.0858
In [34]:
print(bagged_dt)

OneVsRestClassifier(estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=42,
splitter='best'),
bootstrap=True,
bootstrap_features=False,
max_features=1.0,
max_samples=1.0,
n_estimators=10, n_jobs=None,
oob_score=False,
random_state=42, verbose=0,
warm_start=False),
n_jobs=None)

In [35]:
boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators=20)

Accuracy AUC Recall Prec. F1 Kappa
0 0.6333 0.0 0.6333 0.6899 0.6120 0.4500
1 0.7333 0.0 0.7333 0.7478 0.7302 0.6000
2 0.7667 0.0 0.7667 0.8380 0.7593 0.6500
3 0.8000 0.0 0.8000 0.8214 0.7963 0.7000
4 0.9000 0.0 0.9000 0.8993 0.8982 0.8500
5 0.7241 0.0 0.7296 0.7385 0.7224 0.5879
6 0.8276 0.0 0.8222 0.8632 0.8260 0.7397
7 0.6207 0.0 0.6185 0.6148 0.6166 0.4304
8 0.6897 0.0 0.6889 0.7124 0.6958 0.5356
9 0.6897 0.0 0.6889 0.7123 0.6814 0.5348
Mean 0.7385 0.0 0.7381 0.7638 0.7338 0.6078
SD 0.0828 0.0 0.0826 0.0842 0.0854 0.1240
In [36]:
tuned_bagged_dt = tune_model('dt', ensemble=True, method='Bagging')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
1 0.8333 0.0 0.8333 0.8320 0.8313 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.9000 0.0 0.9000 0.9061 0.9015 0.8500
5 0.8276 0.0 0.8296 0.8584 0.8198 0.7415
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.9045 0.8574 0.7943
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8506 0.0 0.8504 0.8685 0.8503 0.7760
SD 0.0359 0.0 0.0356 0.0345 0.0352 0.0535
In [37]:
blend_soft = blend_models(method = 'soft')

Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7718 0.7397 0.6000
1 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
2 0.8333 0.0 0.8333 0.8889 0.8375 0.7500
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.9000 0.0 0.9000 0.8993 0.8982 0.8500
5 0.8276 0.0 0.8259 0.8305 0.8231 0.7411
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.8793 0.8609 0.7940
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8372 0.0 0.8367 0.8549 0.8382 0.7559
SD 0.0496 0.0 0.0494 0.0411 0.0470 0.0741
In [38]:
cb = create_model('catboost')

Accuracy AUC Recall Prec. F1 Kappa
0 0.7667 0.0 0.7667 0.7980 0.7709 0.6500
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.9000 0.0 0.9000 0.9231 0.9019 0.8500
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.8793 0.8609 0.7940
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8276 0.0 0.8222 0.8303 0.8239 0.7401
Mean 0.8337 0.0 0.8337 0.8512 0.8345 0.7506
SD 0.0447 0.0 0.0446 0.0390 0.0439 0.0668
In [39]:
egb = create_model('xgboost')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8439 0.8052 0.7000
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7926 0.8011 0.7924 0.6898
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.9045 0.8574 0.7943
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8798 0.8573 0.7917
Mean 0.8371 0.0 0.8367 0.8586 0.8381 0.7558
SD 0.0374 0.0 0.0371 0.0374 0.0355 0.0557
In [40]:
lgb = create_model('lightgbm')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8393 0.8056 0.7000
1 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.9000 0.0 0.9000 0.9061 0.9015 0.8500
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8621 0.0 0.8667 0.8879 0.8603 0.7940
6 0.8276 0.0 0.8259 0.8370 0.8308 0.7411
7 0.7931 0.0 0.7963 0.8128 0.7902 0.6909
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8369 0.0 0.8370 0.8561 0.8384 0.7557
SD 0.0490 0.0 0.0483 0.0402 0.0468 0.0728
In [41]:
tuned_lgb = tune_model('lightgbm')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8308 0.8020 0.7000
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8276 0.0 0.8333 0.8597 0.8202 0.7425
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.7931 0.0 0.7963 0.8314 0.7958 0.6915
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8276 0.0 0.8222 0.8303 0.8239 0.7401
Mean 0.8268 0.0 0.8270 0.8491 0.8279 0.7405
SD 0.0430 0.0 0.0421 0.0351 0.0411 0.0637

Finally we blend several models using two different voting methods, soft and hard. You can read more about ensemble methods at https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/.
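Soft voting averages the members' predicted probabilities while hard voting takes a majority vote on the predicted labels. A hedged sklearn sketch on synthetic data (the member models are illustrative, not the exact blend used here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)

members = [('rf', RandomForestClassifier(random_state=42)),
           ('lr', LogisticRegression(max_iter=1000))]

results = {}
for method in ('soft', 'hard'):  # soft averages probabilities, hard counts votes
    vote = VotingClassifier(estimators=members, voting=method)
    results[method] = cross_val_score(vote, X, y, cv=5).mean()
    print(method, round(results[method], 3))
```

Note that soft voting requires every member to implement predict_proba.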

In [42]:
blend_specific_soft = blend_models(estimator_list = [rf, egb, lgb, gb], method = 'soft')
# CatBoost is left out, as it is not supported by blend_models

Accuracy AUC Recall Prec. F1 Kappa
0 0.8333 0.0 0.8333 0.8604 0.8379 0.7500
1 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8276 0.0 0.8296 0.8527 0.8278 0.7420
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.8793 0.8609 0.7940
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8438 0.0 0.8437 0.8625 0.8454 0.7659
SD 0.0422 0.0 0.0415 0.0334 0.0395 0.0625
In [43]:
blend_specific_hard = blend_models(estimator_list = [rf,egb, lgb, gb], method = 'hard')

Accuracy AUC Recall Prec. F1 Kappa
0 0.8333 0.0 0.8333 0.8604 0.8379 0.7500
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7926 0.8011 0.7924 0.6898
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8966 0.0 0.9000 0.9224 0.8948 0.8455
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8439 0.0 0.8433 0.8605 0.8450 0.7659
SD 0.0388 0.0 0.0388 0.0391 0.0373 0.0579

We can preview the agreement and disagreement of the models in a plot; this is not available for every model.
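Stacking trains a meta-model on the base models' predictions rather than voting over them. A hedged sklearn sketch of the idea on synthetic data (sklearn's StackingClassifier is analogous to, not identical to, PyCaret's stack_models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)

# Base models' out-of-fold predictions feed a logistic-regression meta-model
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('dt', DecisionTreeClassifier(random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),
)
score = cross_val_score(stack, X, y, cv=5).mean()
print(round(score, 3))
```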

In [44]:
stack_soft = stack_models([rf,egb, lgb, gb, cb], plot=True)

Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7607 0.7387 0.6000
1 0.8333 0.0 0.8333 0.8350 0.8329 0.7500
2 0.8667 0.0 0.8667 0.8674 0.8624 0.8000
3 0.8000 0.0 0.8000 0.8148 0.8038 0.7000
4 0.8333 0.0 0.8333 0.8485 0.8360 0.7500
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.8621 0.0 0.8556 0.8695 0.8597 0.7921
7 0.6897 0.0 0.6926 0.6931 0.6749 0.5364
8 0.7586 0.0 0.7556 0.7686 0.7623 0.6381
9 0.7241 0.0 0.7222 0.7303 0.7260 0.5865
Mean 0.7894 0.0 0.7889 0.8017 0.7886 0.6844
SD 0.0579 0.0 0.0570 0.0573 0.0592 0.0864
In [45]:
stack_hard = stack_models([rf,egb, lgb, gb, cb],  method='hard', plot=True)

Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7607 0.7387 0.6000
1 0.8333 0.0 0.8333 0.8350 0.8329 0.7500
2 0.8667 0.0 0.8667 0.8674 0.8624 0.8000
3 0.8000 0.0 0.8000 0.8148 0.8038 0.7000
4 0.8333 0.0 0.8333 0.8485 0.8360 0.7500
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.8621 0.0 0.8556 0.8695 0.8597 0.7921
7 0.6897 0.0 0.6926 0.6931 0.6749 0.5364
8 0.7586 0.0 0.7556 0.7686 0.7623 0.6381
9 0.7241 0.0 0.7222 0.7303 0.7260 0.5865
Mean 0.7894 0.0 0.7889 0.8017 0.7886 0.6844
SD 0.0579 0.0 0.0570 0.0573 0.0592 0.0864

Test the models on a holdout sample

Finally, we test some of the models on the holdout sample. The predictions can be viewed below.
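PyCaret's predict_model scores the holdout that was set aside by the 70/30 split in setup. The equivalent workflow can be sketched in plain sklearn on synthetic data (assumed stand-ins, not the notebook's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=42)

# 70/30 split mirrors the train_size=0.7 used in the PyCaret setup
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                          stratify=y, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
pred = model.predict(X_te)
acc = accuracy_score(y_te, pred)
kappa = cohen_kappa_score(y_te, pred)
print(round(acc, 3), round(kappa, 3))
```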

In [46]:
gb_holdout_pred = predict_model(gb)

Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.7559 0 0.7551 0.757 0.7564 0.6339
Out[46]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9896
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 0.9908
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.9869
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 0.9968
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 0.9850

5 rows × 49 columns

In [47]:
lgb_holdout_pred = predict_model(lgb)

Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.7717 0 0.7709 0.7835 0.7752 0.6576
Out[47]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9993
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 0.9900
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.9991
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 0.9995
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 0.9852

5 rows × 49 columns

In [48]:
tuned_gb_holdout_pred = predict_model(tuned_gb)

Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.8031 0 0.8021 0.801 0.8015 0.7046
Out[48]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9974
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 1.0000
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.9940
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 1.0000
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 1.0000

5 rows × 49 columns

In [49]:
rf_holdout_pred = predict_model(tuned_rf)

Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.8031 0 0.8021 0.8016 0.7988 0.7045
Out[49]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9315
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 0.7414
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.8973
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 0.9752
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 0.8137

5 rows × 49 columns

Conclusion

The models produced by the PyCaret library show better results than similar models produced by sklearn, and the differences between the training and testing scores are smaller. Some of the ensemble methods create better models, while others worsen the results. There are instances where the test accuracy is up to 15% higher using PyCaret. This discrepancy is suspicious and needs to be investigated further. Because sklearn is older and more widely used, we place more trust in its results. Further research should examine the methods and documentation of PyCaret more closely to find the reasons behind the stark differences.
