Exploring Financial Distress part 2
Authors:

Team 4:

  • Stephen Panev
  • Marin St
  • Dayana Hristova
  • Dimitar Lyubchev

Libraries Used
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from pycaret.classification import *

Read the data and data prep

We are using the financial distress data of companies, part of a Kaggle dataset available at https://www.kaggle.com/shebrahimi/financial-distress. The analysis below is a remake of the previous notebook's analysis, this time produced with the new Python library PyCaret (https://pycaret.org/), which is presented as requiring less code. For this reason we will not be as descriptive as in the last notebook. The results will be compared to those produced by sklearn.
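To ground the "less code" claim: the whole workflow below boils down to a handful of PyCaret calls. An illustrative outline only; the exact arguments appear in the cells that follow.

# Illustrative outline (actual arguments are shown later in this notebook):
# clf = setup(df, target='status')   # one-time preprocessing and train/test split
# compare_models()                   # cross-validated leaderboard of candidate models
# rf = create_model('rf')            # fit a single model with 10-fold CV
# tuned_rf = tune_model('rf')        # hyperparameter tuning
# predict_model(tuned_rf)            # score the holdout set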

In [2]:
data = pd.read_csv('data/Financial Distress.csv')
In [3]:
data.shape
Out[3]:
(3672, 86)
In [4]:
data.columns
Out[4]:
Index(['Company', 'Time', 'Financial Distress', 'x1', 'x2', 'x3', 'x4', 'x5',
       'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16',
       'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26',
       'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36',
       'x37', 'x38', 'x39', 'x40', 'x41', 'x42', 'x43', 'x44', 'x45', 'x46',
       'x47', 'x48', 'x49', 'x50', 'x51', 'x52', 'x53', 'x54', 'x55', 'x56',
       'x57', 'x58', 'x59', 'x60', 'x61', 'x62', 'x63', 'x64', 'x65', 'x66',
       'x67', 'x68', 'x69', 'x70', 'x71', 'x72', 'x73', 'x74', 'x75', 'x76',
       'x77', 'x78', 'x79', 'x80', 'x81', 'x82', 'x83'],
      dtype='object')
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3672 entries, 0 to 3671
Data columns (total 86 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Company             3672 non-null   int64  
 1   Time                3672 non-null   int64  
 2   Financial Distress  3672 non-null   float64
 3   x1                  3672 non-null   float64
 4   x2                  3672 non-null   float64
 5   x3                  3672 non-null   float64
 6   x4                  3672 non-null   float64
 7   x5                  3672 non-null   float64
 8   x6                  3672 non-null   float64
 9   x7                  3672 non-null   float64
 10  x8                  3672 non-null   float64
 11  x9                  3672 non-null   float64
 12  x10                 3672 non-null   float64
 13  x11                 3672 non-null   float64
 14  x12                 3672 non-null   float64
 15  x13                 3672 non-null   float64
 16  x14                 3672 non-null   float64
 17  x15                 3672 non-null   float64
 18  x16                 3672 non-null   float64
 19  x17                 3672 non-null   float64
 20  x18                 3672 non-null   float64
 21  x19                 3672 non-null   float64
 22  x20                 3672 non-null   float64
 23  x21                 3672 non-null   float64
 24  x22                 3672 non-null   float64
 25  x23                 3672 non-null   float64
 26  x24                 3672 non-null   float64
 27  x25                 3672 non-null   float64
 28  x26                 3672 non-null   float64
 29  x27                 3672 non-null   float64
 30  x28                 3672 non-null   float64
 31  x29                 3672 non-null   float64
 32  x30                 3672 non-null   float64
 33  x31                 3672 non-null   float64
 34  x32                 3672 non-null   float64
 35  x33                 3672 non-null   float64
 36  x34                 3672 non-null   float64
 37  x35                 3672 non-null   float64
 38  x36                 3672 non-null   float64
 39  x37                 3672 non-null   float64
 40  x38                 3672 non-null   float64
 41  x39                 3672 non-null   float64
 42  x40                 3672 non-null   float64
 43  x41                 3672 non-null   float64
 44  x42                 3672 non-null   float64
 45  x43                 3672 non-null   float64
 46  x44                 3672 non-null   float64
 47  x45                 3672 non-null   float64
 48  x46                 3672 non-null   float64
 49  x47                 3672 non-null   float64
 50  x48                 3672 non-null   float64
 51  x49                 3672 non-null   float64
 52  x50                 3672 non-null   float64
 53  x51                 3672 non-null   float64
 54  x52                 3672 non-null   float64
 55  x53                 3672 non-null   float64
 56  x54                 3672 non-null   float64
 57  x55                 3672 non-null   float64
 58  x56                 3672 non-null   float64
 59  x57                 3672 non-null   float64
 60  x58                 3672 non-null   float64
 61  x59                 3672 non-null   float64
 62  x60                 3672 non-null   float64
 63  x61                 3672 non-null   float64
 64  x62                 3672 non-null   float64
 65  x63                 3672 non-null   float64
 66  x64                 3672 non-null   float64
 67  x65                 3672 non-null   float64
 68  x66                 3672 non-null   float64
 69  x67                 3672 non-null   float64
 70  x68                 3672 non-null   float64
 71  x69                 3672 non-null   float64
 72  x70                 3672 non-null   float64
 73  x71                 3672 non-null   float64
 74  x72                 3672 non-null   float64
 75  x73                 3672 non-null   float64
 76  x74                 3672 non-null   float64
 77  x75                 3672 non-null   float64
 78  x76                 3672 non-null   float64
 79  x77                 3672 non-null   float64
 80  x78                 3672 non-null   float64
 81  x79                 3672 non-null   float64
 82  x80                 3672 non-null   int64  
 83  x81                 3672 non-null   float64
 84  x82                 3672 non-null   int64  
 85  x83                 3672 non-null   int64  
dtypes: float64(81), int64(5)
memory usage: 2.4 MB
In [6]:
data['Company'].unique()
Out[6]:
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
       183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
       196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208,
       209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
       222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
       235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247,
       248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260,
       261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273,
       274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286,
       287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
       300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
       313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325,
       326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338,
       339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351,
       352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364,
       365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377,
       378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390,
       391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403,
       404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416,
       417, 418, 419, 420, 421, 422])
In [7]:
data['Time'].unique()
Out[7]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
In [8]:
plt.figure(figsize=(18,10))
sns.distplot(data['Financial Distress'],kde=False, rug=True, color = 'blue');
plt.title('Financial Distress Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
In [9]:
data['Financial Distress'].describe()
Out[9]:
count    3672.000000
mean        1.040257
std         2.652227
min        -8.631700
25%         0.172275
50%         0.583805
75%         1.351750
max       128.400000
Name: Financial Distress, dtype: float64
In [10]:
plt.figure(figsize=(18,10))
sns.boxplot(x=data['Financial Distress'], color = 'green')
plt.title('Financial Distress Boxplot')
plt.xlabel('Values')
plt.grid(True)
plt.show()
In [11]:
max(data['Financial Distress'])
Out[11]:
128.4
In [12]:
plt.figure(figsize=(17,8))
sns.distplot(data[data['Financial Distress'] < 100]['Financial Distress'], kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram without outlier')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
In [13]:
data_average = data.groupby("Company", as_index=False).mean() 
data_average = data_average.drop(['Time'], axis=1)
data_average.shape
Out[13]:
(422, 85)
In [14]:
data_average.head()
Out[14]:
Company Financial Distress x1 x2 x3 x4 x5 x6 x7 x8 ... x74 x75 x76 x77 x78 x79 x80 x81 x82 x83
0 1 -0.334323 1.179250 -0.011305 0.869128 0.940075 0.035843 0.126302 0.564090 -0.018738 ... 92.050750 33.5625 32.486500 16.791750 15.750000 1.500000 22.0 -0.177584 31.5 50.5
1 2 1.966056 1.539892 0.204816 0.628511 0.931229 0.302304 0.251645 1.068073 0.218296 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 29.0 1.895985 13.5 33.5
2 3 -1.659900 0.874400 -0.034676 0.793500 0.609520 -0.002632 -0.086847 0.506090 -0.056892 ... 85.437000 27.0700 26.102000 16.000000 16.000000 0.200000 25.0 -0.303170 8.0 37.0
3 4 0.839656 1.553275 0.138410 0.462178 0.759583 0.185367 0.168315 0.472444 0.171318 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 12.0 0.748936 34.5 50.5
4 5 1.969673 1.127500 0.107643 0.743549 0.449420 0.108686 0.089244 0.664047 0.249389 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 23.0 1.921633 18.5 43.5

5 rows × 85 columns

In [15]:
data_one_period = data.drop_duplicates(subset=['Company'], keep='last')
In [16]:
time_to_merge = pd.DataFrame(data_one_period[['Company','Time']])
time_to_merge.head()
Out[16]:
Company Time
3 1 4
17 2 14
18 3 1
32 4 14
46 5 14
In [17]:
data_average_with_time = pd.merge(time_to_merge,data_average, on='Company')
data_average_with_time.head()
Out[17]:
Company Time Financial Distress x1 x2 x3 x4 x5 x6 x7 ... x74 x75 x76 x77 x78 x79 x80 x81 x82 x83
0 1 4 -0.334323 1.179250 -0.011305 0.869128 0.940075 0.035843 0.126302 0.564090 ... 92.050750 33.5625 32.486500 16.791750 15.750000 1.500000 22.0 -0.177584 31.5 50.5
1 2 14 1.966056 1.539892 0.204816 0.628511 0.931229 0.302304 0.251645 1.068073 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 29.0 1.895985 13.5 33.5
2 3 1 -1.659900 0.874400 -0.034676 0.793500 0.609520 -0.002632 -0.086847 0.506090 ... 85.437000 27.0700 26.102000 16.000000 16.000000 0.200000 25.0 -0.303170 8.0 37.0
3 4 14 0.839656 1.553275 0.138410 0.462178 0.759583 0.185367 0.168315 0.472444 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 12.0 0.748936 34.5 50.5
4 5 14 1.969673 1.127500 0.107643 0.743549 0.449420 0.108686 0.089244 0.664047 ... 86.854643 92.1600 89.237286 17.770857 15.142857 -2.721429 23.0 1.921633 18.5 43.5

5 rows × 86 columns

In [18]:
plt.figure(figsize=(17,8))
sns.distplot(data_average['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram of the data average')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
In [19]:
plt.figure(figsize=(17,8))
sns.distplot(data_average[data_average['Financial Distress'] < 10]['Financial Distress'],kde=False, rug=True, color = 'blue')
plt.title('Financial Distress Histogram of the data average without outliers')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
In [20]:
pd.qcut(data_average['Financial Distress'], 3).value_counts()
Out[20]:
(0.929, 32.813]                141
(-5.6850000000000005, 0.25]    141
(0.25, 0.929]                  140
Name: Financial Distress, dtype: int64
In [21]:
# Bin companies into terciles of 'Financial Distress' using the quantile
# edges reported by qcut above (0.25 and 0.929): 0 = lowest, 1 = middle, 2 = highest
data_average['status'] = np.where(data_average['Financial Distress'] < 0.25, 0,
                         np.where(data_average['Financial Distress'] < 0.929, 1, 2))
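An equivalent, more compact labelling (a sketch; it relies on the tercile edges matching the qcut output above, which they do here):

data_average['status'] = pd.qcut(data_average['Financial Distress'], 3, labels=[0, 1, 2]).astype(int)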
In [22]:
data_average = data_average.drop(['Financial Distress', 'Company'], axis=1)
In [23]:
data_average.head()
Out[23]:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... x75 x76 x77 x78 x79 x80 x81 x82 x83 status
0 1.179250 -0.011305 0.869128 0.940075 0.035843 0.126302 0.564090 -0.018738 -0.163632 -0.027849 ... 33.5625 32.486500 16.791750 15.750000 1.500000 22.0 -0.177584 31.5 50.5 0
1 1.539892 0.204816 0.628511 0.931229 0.302304 0.251645 1.068073 0.218296 0.583685 0.238587 ... 92.1600 89.237286 17.770857 15.142857 -2.721429 29.0 1.895985 13.5 33.5 2
2 0.874400 -0.034676 0.793500 0.609520 -0.002632 -0.086847 0.506090 -0.056892 -0.167930 -0.034521 ... 27.0700 26.102000 16.000000 16.000000 0.200000 25.0 -0.303170 8.0 37.0 0
3 1.553275 0.138410 0.462178 0.759583 0.185367 0.168315 0.472444 0.171318 0.245349 0.195789 ... 92.1600 89.237286 17.770857 15.142857 -2.721429 12.0 0.748936 34.5 50.5 1
4 1.127500 0.107643 0.743549 0.449420 0.108686 0.089244 0.664047 0.249389 0.443456 0.123553 ... 92.1600 89.237286 17.770857 15.142857 -2.721429 23.0 1.921633 18.5 43.5 2

5 rows × 84 columns

Modeling

First we create the PyCaret setup with the desired parameters. The configuration report can be viewed below; the options we enabled are highlighted.

In [24]:
classification = setup(data_average, target = 'status', session_id = 42,
                       normalize = True,
                       transformation = True,
                       ignore_low_variance = True,
                       remove_multicollinearity = True, multicollinearity_threshold = 0.75,
                       train_size = 0.7,
                       silent = False)
 
Setup Succesfully Completed!
Description Value
0 session_id 42
1 Target Type Multiclass
2 Label Encoded None
3 Original Data (422, 84)
4 Missing Values False
5 Numeric Features 83
6 Categorical Features 0
7 Ordinal Features False
8 High Cardinality Features False
9 High Cardinality Method None
10 Sampled Data (422, 84)
11 Transformed Train Set (295, 46)
12 Transformed Test Set (127, 46)
13 Numeric Imputer mean
14 Categorical Imputer constant
15 Normalize True
16 Normalize Method zscore
17 Transformation True
18 Transformation Method yeo-johnson
19 PCA False
20 PCA Method None
21 PCA Components None
22 Ignore Low Variance True
23 Combine Rare Levels False
24 Rare Level Threshold None
25 Numeric Binning False
26 Remove Outliers False
27 Outliers Threshold None
28 Remove Multicollinearity True
29 Multicollinearity Threshold 0.750000
30 Clustering False
31 Clustering Iteration None
32 Polynomial Features False
33 Polynomial Degree None
34 Trignometry Features False
35 Polynomial Threshold None
36 Group Features False
37 Feature Selection False
38 Features Selection Threshold None
39 Feature Interaction False
40 Feature Ratio False
41 Interaction Threshold None

We compare the available models on several metrics. AUC is not reported for multiclass classification, which is why that column is zero. The highest scores are highlighted.

In [25]:
compare_models()
Out[25]:
Model Accuracy AUC Recall Prec. F1 Kappa
0 Random Forest Classifier 0.847400 0.000000 0.846300 0.858700 0.845800 0.771000
1 Extreme Gradient Boosting 0.837100 0.000000 0.836700 0.858600 0.838100 0.755800
2 Gradient Boosting Classifier 0.837000 0.000000 0.836300 0.852100 0.838800 0.755600
3 Light Gradient Boosting Machine 0.836900 0.000000 0.837000 0.856100 0.838400 0.755700
4 CatBoost Classifier 0.833700 0.000000 0.833700 0.851200 0.834500 0.750600
5 Ada Boost Classifier 0.817200 0.000000 0.817800 0.841500 0.817700 0.726100
6 Extra Trees Classifier 0.810100 0.000000 0.809300 0.831200 0.809600 0.715000
7 Logistic Regression 0.793200 0.000000 0.793000 0.810300 0.786700 0.689900
8 SVM - Linear Kernel 0.773000 0.000000 0.772600 0.782100 0.770700 0.659600
9 Linear Discriminant Analysis 0.732400 0.000000 0.732600 0.748300 0.729400 0.598900
10 Decision Tree Classifier 0.728700 0.000000 0.728100 0.762900 0.724200 0.592900
11 Ridge Classifier 0.715200 0.000000 0.715900 0.719300 0.707600 0.573200
12 Quadratic Discriminant Analysis 0.678700 0.000000 0.678100 0.701900 0.680400 0.518000
13 K Neighbors Classifier 0.647600 0.000000 0.647000 0.699800 0.652800 0.471400
14 Naive Bayes 0.610900 0.000000 0.612600 0.713100 0.602500 0.417400

We take a deeper look at some of the models. Cross-validation uses 10 folds by default; the per-fold scores are shown together with their mean and standard deviation.
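If a different number of folds is desired, the default can be overridden (a small sketch; fold is an argument of create_model in this PyCaret version, and rf_cv5 is a hypothetical name):

rf_cv5 = create_model('rf', fold=5)  # same model, 5-fold cross-validation instead of 10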

In [26]:
rf = create_model('rf')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
1 0.8000 0.0 0.8000 0.8000 0.7949 0.7000
2 0.9000 0.0 0.9000 0.9231 0.9019 0.8500
3 0.8333 0.0 0.8333 0.8357 0.8331 0.7500
4 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
5 0.8966 0.0 0.8963 0.8959 0.8947 0.8446
6 0.8966 0.0 0.8926 0.8985 0.8960 0.8444
7 0.8276 0.0 0.8333 0.8387 0.8198 0.7425
8 0.7241 0.0 0.7222 0.7649 0.7329 0.5872
9 0.8621 0.0 0.8519 0.8851 0.8486 0.7914
Mean 0.8474 0.0 0.8463 0.8587 0.8458 0.7710
SD 0.0516 0.0 0.0512 0.0462 0.0502 0.0770
In [27]:
tuned_rf = tune_model('rf')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8308 0.8020 0.7000
1 0.8000 0.0 0.8000 0.8141 0.7989 0.7000
2 0.9000 0.0 0.9000 0.9231 0.9019 0.8500
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8333 0.0 0.8333 0.8320 0.8313 0.7500
5 0.8276 0.0 0.8259 0.8305 0.8231 0.7411
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.7931 0.0 0.8000 0.8759 0.7893 0.6926
8 0.7931 0.0 0.7926 0.8096 0.7958 0.6904
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8338 0.0 0.8333 0.8533 0.8331 0.7509
SD 0.0356 0.0 0.0343 0.0340 0.0361 0.0529

The evaluate_model function lets us view various diagnostics for the model tuned in the code above. Not all of the plots are available in this instance.
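evaluate_model is an interactive widget; individual diagnostics can also be rendered directly (a sketch using PyCaret's plot_model with its standard plot names):

plot_model(tuned_rf, plot='confusion_matrix')  # confusion matrix on the holdout set
plot_model(tuned_rf, plot='feature')           # feature importance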

In [28]:
evaluate_model(tuned_rf)
In [29]:
gb = create_model('gbc')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8333 0.0 0.8333 0.8500 0.8357 0.7500
1 0.8333 0.0 0.8333 0.8424 0.8364 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8276 0.0 0.8259 0.8300 0.8273 0.7411
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8630 0.8723 0.8613 0.7936
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8276 0.0 0.8222 0.8303 0.8239 0.7401
Mean 0.8370 0.0 0.8363 0.8521 0.8388 0.7556
SD 0.0410 0.0 0.0406 0.0351 0.0389 0.0609
In [30]:
tuned_gb = tune_model('gbc')
Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7718 0.7397 0.6000
1 0.7333 0.0 0.7333 0.7235 0.7249 0.6000
2 0.9333 0.0 0.9333 0.9444 0.9346 0.9000
3 0.8333 0.0 0.8333 0.8500 0.8357 0.7500
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.9310 0.0 0.9296 0.9310 0.9310 0.8964
7 0.8621 0.0 0.8667 0.9045 0.8646 0.7943
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.7931 0.0 0.7852 0.7994 0.7846 0.6876
Mean 0.8238 0.0 0.8237 0.8410 0.8236 0.7358
SD 0.0703 0.0 0.0705 0.0684 0.0710 0.1054
In [31]:
evaluate_model(tuned_gb)
In [32]:
dt = create_model('dt')
Accuracy AUC Recall Prec. F1 Kappa
0 0.6333 0.0 0.6333 0.7208 0.6212 0.4500
1 0.7000 0.0 0.7000 0.7175 0.6985 0.5500
2 0.7667 0.0 0.7667 0.8380 0.7593 0.6500
3 0.7667 0.0 0.7667 0.7685 0.7640 0.6500
4 0.8000 0.0 0.8000 0.8214 0.7963 0.7000
5 0.7586 0.0 0.7630 0.7931 0.7618 0.6394
6 0.7241 0.0 0.7185 0.7685 0.7214 0.5827
7 0.7931 0.0 0.7926 0.8022 0.7846 0.6893
8 0.7241 0.0 0.7222 0.7440 0.7307 0.5872
9 0.6207 0.0 0.6185 0.6552 0.6046 0.4304
Mean 0.7287 0.0 0.7281 0.7629 0.7242 0.5929
SD 0.0589 0.0 0.0595 0.0523 0.0622 0.0885

We will try bagging, boosting, and blending of decision trees; a conceptual sketch follows. You can read more on the topic at https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9
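For intuition, this is roughly what ensemble_model builds on top of the decision tree, expressed in plain sklearn (a sketch, not PyCaret's exact construction; the print(bagged_dt) output below confirms the bagging case):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

base = DecisionTreeClassifier(random_state=42)
# Bagging: each tree is fit on a bootstrap resample; predictions are averaged.
bagged = BaggingClassifier(base_estimator=base, n_estimators=10, random_state=42)
# Boosting: trees are fit sequentially, upweighting previously misclassified samples.
boosted = AdaBoostClassifier(base_estimator=base, n_estimators=20, random_state=42)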

In [33]:
bagged_dt = ensemble_model(dt)
Accuracy AUC Recall Prec. F1 Kappa
0 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
1 0.8333 0.0 0.8333 0.8424 0.8364 0.7500
2 0.9333 0.0 0.9333 0.9444 0.9346 0.9000
3 0.8333 0.0 0.8333 0.8357 0.8331 0.7500
4 0.8667 0.0 0.8667 0.8694 0.8623 0.8000
5 0.8621 0.0 0.8630 0.8859 0.8597 0.7929
6 0.8621 0.0 0.8630 0.8621 0.8621 0.7929
7 0.7241 0.0 0.7296 0.7189 0.7044 0.5879
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.7931 0.0 0.7852 0.7994 0.7846 0.6876
Mean 0.8333 0.0 0.8333 0.8426 0.8310 0.7501
SD 0.0574 0.0 0.0570 0.0589 0.0611 0.0858
In [34]:
print(bagged_dt)
OneVsRestClassifier(estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                                      class_weight=None,
                                                                                      criterion='gini',
                                                                                      max_depth=None,
                                                                                      max_features=None,
                                                                                      max_leaf_nodes=None,
                                                                                      min_impurity_decrease=0.0,
                                                                                      min_impurity_split=None,
                                                                                      min_samples_leaf=1,
                                                                                      min_samples_split=2,
                                                                                      min_weight_fraction_leaf=0.0,
                                                                                      presort='deprecated',
                                                                                      random_state=42,
                                                                                      splitter='best'),
                                                bootstrap=True,
                                                bootstrap_features=False,
                                                max_features=1.0,
                                                max_samples=1.0,
                                                n_estimators=10, n_jobs=None,
                                                oob_score=False,
                                                random_state=42, verbose=0,
                                                warm_start=False),
                    n_jobs=None)
In [35]:
boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators=20)
Accuracy AUC Recall Prec. F1 Kappa
0 0.6333 0.0 0.6333 0.6899 0.6120 0.4500
1 0.7333 0.0 0.7333 0.7478 0.7302 0.6000
2 0.7667 0.0 0.7667 0.8380 0.7593 0.6500
3 0.8000 0.0 0.8000 0.8214 0.7963 0.7000
4 0.9000 0.0 0.9000 0.8993 0.8982 0.8500
5 0.7241 0.0 0.7296 0.7385 0.7224 0.5879
6 0.8276 0.0 0.8222 0.8632 0.8260 0.7397
7 0.6207 0.0 0.6185 0.6148 0.6166 0.4304
8 0.6897 0.0 0.6889 0.7124 0.6958 0.5356
9 0.6897 0.0 0.6889 0.7123 0.6814 0.5348
Mean 0.7385 0.0 0.7381 0.7638 0.7338 0.6078
SD 0.0828 0.0 0.0826 0.0842 0.0854 0.1240
In [36]:
tuned_bagged_dt = tune_model('dt', ensemble=True, method='Bagging')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
1 0.8333 0.0 0.8333 0.8320 0.8313 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.9000 0.0 0.9000 0.9061 0.9015 0.8500
5 0.8276 0.0 0.8296 0.8584 0.8198 0.7415
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.9045 0.8574 0.7943
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8506 0.0 0.8504 0.8685 0.8503 0.7760
SD 0.0359 0.0 0.0356 0.0345 0.0352 0.0535
In [37]:
blend_soft = blend_models(method = 'soft')
Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7718 0.7397 0.6000
1 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
2 0.8333 0.0 0.8333 0.8889 0.8375 0.7500
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.9000 0.0 0.9000 0.8993 0.8982 0.8500
5 0.8276 0.0 0.8259 0.8305 0.8231 0.7411
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.8793 0.8609 0.7940
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8372 0.0 0.8367 0.8549 0.8382 0.7559
SD 0.0496 0.0 0.0494 0.0411 0.0470 0.0741
In [38]:
cb = create_model('catboost')
Accuracy AUC Recall Prec. F1 Kappa
0 0.7667 0.0 0.7667 0.7980 0.7709 0.6500
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.9000 0.0 0.9000 0.9231 0.9019 0.8500
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.8793 0.8609 0.7940
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8276 0.0 0.8222 0.8303 0.8239 0.7401
Mean 0.8337 0.0 0.8337 0.8512 0.8345 0.7506
SD 0.0447 0.0 0.0446 0.0390 0.0439 0.0668
In [39]:
egb = create_model('xgboost')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8439 0.8052 0.7000
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7926 0.8011 0.7924 0.6898
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.9045 0.8574 0.7943
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8798 0.8573 0.7917
Mean 0.8371 0.0 0.8367 0.8586 0.8381 0.7558
SD 0.0374 0.0 0.0371 0.0374 0.0355 0.0557
In [40]:
lgb = create_model('lightgbm')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8393 0.8056 0.7000
1 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.9000 0.0 0.9000 0.9061 0.9015 0.8500
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8621 0.0 0.8667 0.8879 0.8603 0.7940
6 0.8276 0.0 0.8259 0.8370 0.8308 0.7411
7 0.7931 0.0 0.7963 0.8128 0.7902 0.6909
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8369 0.0 0.8370 0.8561 0.8384 0.7557
SD 0.0490 0.0 0.0483 0.0402 0.0468 0.0728
In [41]:
tuned_lgb = tune_model('lightgbm')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8000 0.0 0.8000 0.8308 0.8020 0.7000
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8276 0.0 0.8333 0.8597 0.8202 0.7425
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.7931 0.0 0.7963 0.8314 0.7958 0.6915
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8276 0.0 0.8222 0.8303 0.8239 0.7401
Mean 0.8268 0.0 0.8270 0.8491 0.8279 0.7405
SD 0.0430 0.0 0.0421 0.0351 0.0411 0.0637

Finally we blend several models using two different voting methods, soft and hard; a conceptual sketch follows. You can read more about ensemble methods at https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/.
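Conceptually, blending corresponds to sklearn's VotingClassifier (a sketch under that assumption, not PyCaret's internals): soft voting averages the predicted class probabilities, while hard voting takes a majority vote on the predicted labels.

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier

members = [('rf', RandomForestClassifier(random_state=42)),
           ('gbc', GradientBoostingClassifier(random_state=42))]
blend_soft_sketch = VotingClassifier(estimators=members, voting='soft')  # average probabilities
blend_hard_sketch = VotingClassifier(estimators=members, voting='hard')  # majority label vote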

In [42]:
blend_specific_soft = blend_models(estimator_list = [rf, egb, lgb, gb], method = 'soft')
# catboost is left out because this function does not support it
Accuracy AUC Recall Prec. F1 Kappa
0 0.8333 0.0 0.8333 0.8604 0.8379 0.7500
1 0.8667 0.0 0.8667 0.8727 0.8682 0.8000
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.8276 0.0 0.8296 0.8527 0.8278 0.7420
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8621 0.0 0.8667 0.8793 0.8609 0.7940
8 0.7241 0.0 0.7259 0.7706 0.7339 0.5887
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8438 0.0 0.8437 0.8625 0.8454 0.7659
SD 0.0422 0.0 0.0415 0.0334 0.0395 0.0625
In [43]:
blend_specific_hard = blend_models(estimator_list = [rf,egb, lgb, gb], method = 'hard')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8333 0.0 0.8333 0.8604 0.8379 0.7500
1 0.8333 0.0 0.8333 0.8426 0.8341 0.7500
2 0.8667 0.0 0.8667 0.9048 0.8704 0.8000
3 0.8667 0.0 0.8667 0.8788 0.8677 0.8000
4 0.8667 0.0 0.8667 0.8667 0.8667 0.8000
5 0.7931 0.0 0.7926 0.8011 0.7924 0.6898
6 0.8621 0.0 0.8593 0.8750 0.8644 0.7925
7 0.8966 0.0 0.9000 0.9224 0.8948 0.8455
8 0.7586 0.0 0.7593 0.7893 0.7655 0.6394
9 0.8621 0.0 0.8556 0.8637 0.8566 0.7921
Mean 0.8439 0.0 0.8433 0.8605 0.8450 0.7659
SD 0.0388 0.0 0.0388 0.0391 0.0373 0.0579

We can preview the agreement and disagreement of the models in a plot; this is not available for every model. A conceptual sketch of stacking itself is shown below.
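In sklearn terms, stacking trains base models and feeds their predictions as features to a meta-learner (a sketch, not necessarily PyCaret's exact procedure):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack_sketch = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('gbc', GradientBoostingClassifier(random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000))  # meta-learner on base predictions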

In [44]:
stack_soft = stack_models([rf,egb, lgb, gb, cb], plot=True)
Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7607 0.7387 0.6000
1 0.8333 0.0 0.8333 0.8350 0.8329 0.7500
2 0.8667 0.0 0.8667 0.8674 0.8624 0.8000
3 0.8000 0.0 0.8000 0.8148 0.8038 0.7000
4 0.8333 0.0 0.8333 0.8485 0.8360 0.7500
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.8621 0.0 0.8556 0.8695 0.8597 0.7921
7 0.6897 0.0 0.6926 0.6931 0.6749 0.5364
8 0.7586 0.0 0.7556 0.7686 0.7623 0.6381
9 0.7241 0.0 0.7222 0.7303 0.7260 0.5865
Mean 0.7894 0.0 0.7889 0.8017 0.7886 0.6844
SD 0.0579 0.0 0.0570 0.0573 0.0592 0.0864
In [45]:
stack_hard = stack_models([rf,egb, lgb, gb, cb],  method='hard', plot=True)
Accuracy AUC Recall Prec. F1 Kappa
0 0.7333 0.0 0.7333 0.7607 0.7387 0.6000
1 0.8333 0.0 0.8333 0.8350 0.8329 0.7500
2 0.8667 0.0 0.8667 0.8674 0.8624 0.8000
3 0.8000 0.0 0.8000 0.8148 0.8038 0.7000
4 0.8333 0.0 0.8333 0.8485 0.8360 0.7500
5 0.7931 0.0 0.7963 0.8292 0.7890 0.6904
6 0.8621 0.0 0.8556 0.8695 0.8597 0.7921
7 0.6897 0.0 0.6926 0.6931 0.6749 0.5364
8 0.7586 0.0 0.7556 0.7686 0.7623 0.6381
9 0.7241 0.0 0.7222 0.7303 0.7260 0.5865
Mean 0.7894 0.0 0.7889 0.8017 0.7886 0.6844
SD 0.0579 0.0 0.0570 0.0573 0.0592 0.0864

Test the models on a holdout sample

Finally, we test some of the models on the holdout sample. The predictions can be viewed below.

In [46]:
gb_holdout_pred = predict_model(gb)
gb_holdout_pred.head()
Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.7559 0 0.7551 0.757 0.7564 0.6339
Out[46]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9896
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 0.9908
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.9869
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 0.9968
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 0.9850

5 rows × 49 columns

In [47]:
lgb_holdout_pred = predict_model(lgb)
lgb_holdout_pred.head()
Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.7717 0 0.7709 0.7835 0.7752 0.6576
Out[47]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9993
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 0.9900
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.9991
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 0.9995
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 0.9852

5 rows × 49 columns

In [48]:
tuned_gb_holdout_pred = predict_model(tuned_gb)
tuned_gb_holdout_pred.head()
Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.8031 0 0.8021 0.801 0.8015 0.7046
Out[48]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9974
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 1.0000
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.9940
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 1.0000
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 1.0000

5 rows × 49 columns

In [49]:
rf_holdout_pred = predict_model(tuned_rf)
rf_holdout_pred.head()
Model Accuracy AUC Recall Prec. F1 Kappa
0 One Vs Rest Classifier 0.8031 0 0.8021 0.8016 0.7988 0.7045
Out[49]:
x4 x9 x11 x12 x15 x17 x18 x19 x22 x23 ... x72 x73 x74 x78 x80 x82 x83 status Label Score
0 -0.235880 -0.962146 0.303608 -0.186713 -0.273263 -0.207795 -1.845254 -0.005883 -0.449907 -1.406979 ... 0.863079 1.862990 0.875683 0.662868 0.789812 0.713907 -0.025607 0 0 0.9315
1 0.294453 0.456432 0.500978 -0.186581 -0.270637 -0.205210 -0.285810 -0.050554 -0.068685 0.264763 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 -0.183361 -1.277243 2 1 0.7414
2 0.234019 1.956273 -1.621492 -0.186284 4.162752 -0.200972 -0.261811 -0.077996 -0.926736 0.901133 ... -0.256419 -0.544432 -0.065215 -0.246617 -0.672648 -0.376299 -1.666447 2 2 0.8973
3 0.232117 -1.710759 0.742225 -0.186687 -0.271420 -0.216866 -1.163790 -0.064286 -0.304550 -0.478708 ... 1.421555 1.862951 0.552170 0.662770 -0.794697 0.663312 -0.062555 0 0 0.9752
4 -0.477883 -0.102368 -0.513372 -0.186612 -0.272248 -0.209478 -0.182208 -0.037135 -0.205190 0.026181 ... 0.113475 -0.111385 -0.150287 -0.180539 1.225575 1.018293 0.804925 1 1 0.8137

5 rows × 49 columns

Conclusion

The models produced with the PyCaret library show better results than similar models produced with sklearn, and the differences between the training and testing scores are smaller. Some of the ensemble methods improve the models, while others worsen the results. In some instances the test accuracy is up to 15% higher with PyCaret. This discrepancy is suspicious and needs to be investigated further. Because sklearn is older and more widely used, we place more trust in its results. Further research should examine the methods and documentation of PyCaret more closely to find the reasons behind these stark differences.
