Home Credit Group is a financial institution that specializes in consumer lending, especially to people with little credit history. To determine a reasonable principal for each applicant, and a repayment schedule that will help clients successfully repay their loans, Home Credit Group wants to use data about the applicant to predict how likely they are to repay. The company recently hosted a Kaggle competition to predict loan repayment probability from (anonymized) applicant information. In this post we’ll use that data to try to predict loan repayment ability.
Outline
- Data Loading and Cleaning
- Manual Feature Engineering
- Feature Encoding
- Baseline Predictions
- Calibration
- Resampling
- Final Predictions and Feature Importance
First let’s load the packages we’ll use.
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import auc, roc_curve, roc_auc_score, make_scorer
from sklearn.model_selection import cross_val_score, cross_val_predict, StratifiedKFold
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from xgboost import XGBClassifier
from xgboost import plot_importance
from hashlib import sha256
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
# Plot settings
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
sns.set()
Data Loading and Cleaning
Let’s load both the training and test data.
# Load applications data
train = pd.read_csv('application_train.csv')
test = pd.read_csv('application_test.csv')
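Once both files are loaded, it can help to confirm that the two tables line up before going further. The sketch below is only an illustration: `compare_frames` is a hypothetical helper (not part of pandas or of the original analysis), and the toy DataFrames stand in for the real CSV files with invented values.

```python
import pandas as pd


def compare_frames(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Summarize shapes and column differences between two DataFrames."""
    train_cols = set(train.columns)
    test_cols = set(test.columns)
    return {
        'train_shape': train.shape,
        'test_shape': test.shape,
        # Columns present in one table but not the other
        'only_in_train': sorted(train_cols - test_cols),
        'only_in_test': sorted(test_cols - train_cols),
    }


# Toy stand-ins for the real files (values invented for illustration)
toy_train = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'TARGET': [0, 1, 0],
    'AMT_CREDIT': [1.0, 2.0, 3.0],
})
toy_test = pd.DataFrame({
    'SK_ID_CURR': [4, 5],
    'AMT_CREDIT': [4.0, 5.0],
})

summary = compare_frames(toy_train, toy_test)
print(summary)
```

On the real data, calling `compare_frames(train, test)` would presumably report `TARGET` as the only column missing from the test set, since that is the label being predicted.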
And now we can take a look at the data we’re working with.
train.head()
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ... | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
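The first few rows already show a number of NaN entries. One compact way to see how much of each column is missing is a single table sorted by missing percentage. This is a minimal sketch, not the method used in the rest of the post: `missing_summary` is a hypothetical helper, and the toy DataFrame borrows a few real column names but uses invented missingness.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        'n_missing': n_missing,
        'pct_missing': 100 * n_missing / len(df),
        'dtype': df.dtypes.astype(str),
    })
    return summary.sort_values('pct_missing', ascending=False)


# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    'OWN_CAR_AGE': [np.nan, np.nan, 26.0, np.nan],
    'AMT_ANNUITY': [24700.5, 35698.5, 6750.0, np.nan],
    'CODE_GENDER': ['M', 'F', 'M', 'F'],
})

result = missing_summary(toy)
print(result)
```

Assuming the `train` DataFrame loaded earlier, `missing_summary(train).head(20)` would list the twenty columns with the most missing values.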
# Print info about each column in the train dataset
for col in train:
    print(col)
    Nnan = train[col].isnull().sum()
    print('Number empty: ', Nnan)
    print('Percent empty: ', 100*Nnan/train.shape[0])
    print(train[col].describe())
    if train[col].dtype == object:
        print('Categories and Count:')
        print(train[col].value_counts().to_string(header=None))
    print()
SK_ID_CURR
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 278180.518577
std 102790.175348
min 100002.000000
25% 189145.500000
50% 278202.000000
75% 367142.500000
max 456255.000000
Name: SK_ID_CURR, dtype: float64
TARGET
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.080729
std 0.272419
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: TARGET, dtype: float64
NAME_CONTRACT_TYPE
Number empty: 0
Percent empty: 0.0
count 307511
unique 2
top Cash loans
freq 278232
Name: NAME_CONTRACT_TYPE, dtype: object
Categories and Count:
Cash loans 278232
Revolving loans 29279
CODE_GENDER
Number empty: 0
Percent empty: 0.0
count 307511
unique 3
top F
freq 202448
Name: CODE_GENDER, dtype: object
Categories and Count:
F 202448
M 105059
XNA 4
FLAG_OWN_CAR
Number empty: 0
Percent empty: 0.0
count 307511
unique 2
top N
freq 202924
Name: FLAG_OWN_CAR, dtype: object
Categories and Count:
N 202924
Y 104587
FLAG_OWN_REALTY
Number empty: 0
Percent empty: 0.0
count 307511
unique 2
top Y
freq 213312
Name: FLAG_OWN_REALTY, dtype: object
Categories and Count:
Y 213312
N 94199
CNT_CHILDREN
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.417052
std 0.722121
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 19.000000
Name: CNT_CHILDREN, dtype: float64
AMT_INCOME_TOTAL
Number empty: 0
Percent empty: 0.0
count 3.075110e+05
mean 1.687979e+05
std 2.371231e+05
min 2.565000e+04
25% 1.125000e+05
50% 1.471500e+05
75% 2.025000e+05
max 1.170000e+08
Name: AMT_INCOME_TOTAL, dtype: float64
AMT_CREDIT
Number empty: 0
Percent empty: 0.0
count 3.075110e+05
mean 5.990260e+05
std 4.024908e+05
min 4.500000e+04
25% 2.700000e+05
50% 5.135310e+05
75% 8.086500e+05
max 4.050000e+06
Name: AMT_CREDIT, dtype: float64
AMT_ANNUITY
Number empty: 12
Percent empty: 0.0039022994299390914
count 307499.000000
mean 27108.573909
std 14493.737315
min 1615.500000
25% 16524.000000
50% 24903.000000
75% 34596.000000
max 258025.500000
Name: AMT_ANNUITY, dtype: float64
AMT_GOODS_PRICE
Number empty: 278
Percent empty: 0.09040327012692229
count 3.072330e+05
mean 5.383962e+05
std 3.694465e+05
min 4.050000e+04
25% 2.385000e+05
50% 4.500000e+05
75% 6.795000e+05
max 4.050000e+06
Name: AMT_GOODS_PRICE, dtype: float64
NAME_TYPE_SUITE
Number empty: 1292
Percent empty: 0.42014757195677555
count 306219
unique 7
top Unaccompanied
freq 248526
Name: NAME_TYPE_SUITE, dtype: object
Categories and Count:
Unaccompanied 248526
Family 40149
Spouse, partner 11370
Children 3267
Other_B 1770
Other_A 866
Group of people 271
NAME_INCOME_TYPE
Number empty: 0
Percent empty: 0.0
count 307511
unique 8
top Working
freq 158774
Name: NAME_INCOME_TYPE, dtype: object
Categories and Count:
Working 158774
Commercial associate 71617
Pensioner 55362
State servant 21703
Unemployed 22
Student 18
Businessman 10
Maternity leave 5
NAME_EDUCATION_TYPE
Number empty: 0
Percent empty: 0.0
count 307511
unique 5
top Secondary / secondary special
freq 218391
Name: NAME_EDUCATION_TYPE, dtype: object
Categories and Count:
Secondary / secondary special 218391
Higher education 74863
Incomplete higher 10277
Lower secondary 3816
Academic degree 164
NAME_FAMILY_STATUS
Number empty: 0
Percent empty: 0.0
count 307511
unique 6
top Married
freq 196432
Name: NAME_FAMILY_STATUS, dtype: object
Categories and Count:
Married 196432
Single / not married 45444
Civil marriage 29775
Separated 19770
Widow 16088
Unknown 2
NAME_HOUSING_TYPE
Number empty: 0
Percent empty: 0.0
count 307511
unique 6
top House / apartment
freq 272868
Name: NAME_HOUSING_TYPE, dtype: object
Categories and Count:
House / apartment 272868
With parents 14840
Municipal apartment 11183
Rented apartment 4881
Office apartment 2617
Co-op apartment 1122
REGION_POPULATION_RELATIVE
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.020868
std 0.013831
min 0.000290
25% 0.010006
50% 0.018850
75% 0.028663
max 0.072508
Name: REGION_POPULATION_RELATIVE, dtype: float64
DAYS_BIRTH
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean -16036.995067
std 4363.988632
min -25229.000000
25% -19682.000000
50% -15750.000000
75% -12413.000000
max -7489.000000
Name: DAYS_BIRTH, dtype: float64
DAYS_EMPLOYED
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 63815.045904
std 141275.766519
min -17912.000000
25% -2760.000000
50% -1213.000000
75% -289.000000
max 365243.000000
Name: DAYS_EMPLOYED, dtype: float64
DAYS_REGISTRATION
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean -4986.120328
std 3522.886321
min -24672.000000
25% -7479.500000
50% -4504.000000
75% -2010.000000
max 0.000000
Name: DAYS_REGISTRATION, dtype: float64
DAYS_ID_PUBLISH
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean -2994.202373
std 1509.450419
min -7197.000000
25% -4299.000000
50% -3254.000000
75% -1720.000000
max 0.000000
Name: DAYS_ID_PUBLISH, dtype: float64
OWN_CAR_AGE
Number empty: 202929
Percent empty: 65.9908100848425
count 104582.000000
mean 12.061091
std 11.944812
min 0.000000
25% 5.000000
50% 9.000000
75% 15.000000
max 91.000000
Name: OWN_CAR_AGE, dtype: float64
FLAG_MOBIL
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.999997
std 0.001803
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_MOBIL, dtype: float64
FLAG_EMP_PHONE
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.819889
std 0.384280
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_EMP_PHONE, dtype: float64
FLAG_WORK_PHONE
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.199368
std 0.399526
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_WORK_PHONE, dtype: float64
FLAG_CONT_MOBILE
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.998133
std 0.043164
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_CONT_MOBILE, dtype: float64
FLAG_PHONE
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.281066
std 0.449521
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000
Name: FLAG_PHONE, dtype: float64
FLAG_EMAIL
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.056720
std 0.231307
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_EMAIL, dtype: float64
OCCUPATION_TYPE
Number empty: 96391
Percent empty: 31.345545362604916
count 211120
unique 18
top Laborers
freq 55186
Name: OCCUPATION_TYPE, dtype: object
Categories and Count:
Laborers 55186
Sales staff 32102
Core staff 27570
Managers 21371
Drivers 18603
High skill tech staff 11380
Accountants 9813
Medicine staff 8537
Security staff 6721
Cooking staff 5946
Cleaning staff 4653
Private service staff 2652
Low-skill Laborers 2093
Waiters/barmen staff 1348
Secretaries 1305
Realty agents 751
HR staff 563
IT staff 526
CNT_FAM_MEMBERS
Number empty: 2
Percent empty: 0.000650383238323182
count 307509.000000
mean 2.152665
std 0.910682
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 20.000000
Name: CNT_FAM_MEMBERS, dtype: float64
REGION_RATING_CLIENT
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 2.052463
std 0.509034
min 1.000000
25% 2.000000
50% 2.000000
75% 2.000000
max 3.000000
Name: REGION_RATING_CLIENT, dtype: float64
REGION_RATING_CLIENT_W_CITY
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 2.031521
std 0.502737
min 1.000000
25% 2.000000
50% 2.000000
75% 2.000000
max 3.000000
Name: REGION_RATING_CLIENT_W_CITY, dtype: float64
WEEKDAY_APPR_PROCESS_START
Number empty: 0
Percent empty: 0.0
count 307511
unique 7
top TUESDAY
freq 53901
Name: WEEKDAY_APPR_PROCESS_START, dtype: object
Categories and Count:
TUESDAY 53901
WEDNESDAY 51934
MONDAY 50714
THURSDAY 50591
FRIDAY 50338
SATURDAY 33852
SUNDAY 16181
HOUR_APPR_PROCESS_START
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 12.063419
std 3.265832
min 0.000000
25% 10.000000
50% 12.000000
75% 14.000000
max 23.000000
Name: HOUR_APPR_PROCESS_START, dtype: float64
REG_REGION_NOT_LIVE_REGION
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.015144
std 0.122126
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_REGION_NOT_LIVE_REGION, dtype: float64
REG_REGION_NOT_WORK_REGION
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.050769
std 0.219526
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_REGION_NOT_WORK_REGION, dtype: float64
LIVE_REGION_NOT_WORK_REGION
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.040659
std 0.197499
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: LIVE_REGION_NOT_WORK_REGION, dtype: float64
REG_CITY_NOT_LIVE_CITY
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.078173
std 0.268444
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_CITY_NOT_LIVE_CITY, dtype: float64
REG_CITY_NOT_WORK_CITY
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.230454
std 0.421124
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_CITY_NOT_WORK_CITY, dtype: float64
LIVE_CITY_NOT_WORK_CITY
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.179555
std 0.383817
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: LIVE_CITY_NOT_WORK_CITY, dtype: float64
ORGANIZATION_TYPE
Number empty: 0
Percent empty: 0.0
count 307511
unique 58
top Business Entity Type 3
freq 67992
Name: ORGANIZATION_TYPE, dtype: object
Categories and Count:
Business Entity Type 3 67992
XNA 55374
Self-employed 38412
Other 16683
Medicine 11193
Business Entity Type 2 10553
Government 10404
School 8893
Trade: type 7 7831
Kindergarten 6880
Construction 6721
Business Entity Type 1 5984
Transport: type 4 5398
Trade: type 3 3492
Industry: type 9 3368
Industry: type 3 3278
Security 3247
Housing 2958
Industry: type 11 2704
Military 2634
Bank 2507
Agriculture 2454
Police 2341
Transport: type 2 2204
Postal 2157
Security Ministries 1974
Trade: type 2 1900
Restaurant 1811
Services 1575
University 1327
Industry: type 7 1307
Transport: type 3 1187
Industry: type 1 1039
Hotel 966
Electricity 950
Industry: type 4 877
Trade: type 6 631
Industry: type 5 599
Insurance 597
Telecom 577
Emergency 560
Industry: type 2 458
Advertising 429
Realtor 396
Culture 379
Industry: type 12 369
Trade: type 1 348
Mobile 317
Legal Services 305
Cleaning 260
Transport: type 1 201
Industry: type 6 112
Industry: type 10 109
Religion 85
Industry: type 13 67
Trade: type 4 64
Trade: type 5 49
Industry: type 8 24
EXT_SOURCE_1
Number empty: 173378
Percent empty: 56.38107254699832
count 134133.000000
mean 0.502130
std 0.211062
min 0.014568
25% 0.334007
50% 0.505998
75% 0.675053
max 0.962693
Name: EXT_SOURCE_1, dtype: float64
EXT_SOURCE_2
Number empty: 660
Percent empty: 0.21462646864665003
count 3.068510e+05
mean 5.143927e-01
std 1.910602e-01
min 8.173617e-08
25% 3.924574e-01
50% 5.659614e-01
75% 6.636171e-01
max 8.549997e-01
Name: EXT_SOURCE_2, dtype: float64
EXT_SOURCE_3
Number empty: 60965
Percent empty: 19.825307062186393
count 246546.000000
mean 0.510853
std 0.194844
min 0.000527
25% 0.370650
50% 0.535276
75% 0.669057
max 0.896010
Name: EXT_SOURCE_3, dtype: float64
APARTMENTS_AVG
Number empty: 156061
Percent empty: 50.749729277977046
count 151450.00000
mean 0.11744
std 0.10824
min 0.00000
25% 0.05770
50% 0.08760
75% 0.14850
max 1.00000
Name: APARTMENTS_AVG, dtype: float64
BASEMENTAREA_AVG
Number empty: 179943
Percent empty: 58.515955526794166
count 127568.000000
mean 0.088442
std 0.082438
min 0.000000
25% 0.044200
50% 0.076300
75% 0.112200
max 1.000000
Name: BASEMENTAREA_AVG, dtype: float64
YEARS_BEGINEXPLUATATION_AVG
Number empty: 150007
Percent empty: 48.781019215572776
count 157504.000000
mean 0.977735
std 0.059223
min 0.000000
25% 0.976700
50% 0.981600
75% 0.986600
max 1.000000
Name: YEARS_BEGINEXPLUATATION_AVG, dtype: float64
YEARS_BUILD_AVG
Number empty: 204488
Percent empty: 66.49778381911541
count 103023.000000
mean 0.752471
std 0.113280
min 0.000000
25% 0.687200
50% 0.755200
75% 0.823200
max 1.000000
Name: YEARS_BUILD_AVG, dtype: float64
COMMONAREA_AVG
Number empty: 214865
Percent empty: 69.87229725115525
count 92646.000000
mean 0.044621
std 0.076036
min 0.000000
25% 0.007800
50% 0.021100
75% 0.051500
max 1.000000
Name: COMMONAREA_AVG, dtype: float64
ELEVATORS_AVG
Number empty: 163891
Percent empty: 53.29597965601231
count 143620.000000
mean 0.078942
std 0.134576
min 0.000000
25% 0.000000
50% 0.000000
75% 0.120000
max 1.000000
Name: ELEVATORS_AVG, dtype: float64
ENTRANCES_AVG
Number empty: 154828
Percent empty: 50.34876801155081
count 152683.000000
mean 0.149725
std 0.100049
min 0.000000
25% 0.069000
50% 0.137900
75% 0.206900
max 1.000000
Name: ENTRANCES_AVG, dtype: float64
FLOORSMAX_AVG
Number empty: 153020
Percent empty: 49.76082156410665
count 154491.000000
mean 0.226282
std 0.144641
min 0.000000
25% 0.166700
50% 0.166700
75% 0.333300
max 1.000000
Name: FLOORSMAX_AVG, dtype: float64
FLOORSMIN_AVG
Number empty: 208642
Percent empty: 67.84862980511267
count 98869.000000
mean 0.231894
std 0.161380
min 0.000000
25% 0.083300
50% 0.208300
75% 0.375000
max 1.000000
Name: FLOORSMIN_AVG, dtype: float64
LANDAREA_AVG
Number empty: 182590
Percent empty: 59.376737742714894
count 124921.000000
mean 0.066333
std 0.081184
min 0.000000
25% 0.018700
50% 0.048100
75% 0.085600
max 1.000000
Name: LANDAREA_AVG, dtype: float64
LIVINGAPARTMENTS_AVG
Number empty: 210199
Percent empty: 68.35495315614726
count 97312.000000
mean 0.100775
std 0.092576
min 0.000000
25% 0.050400
50% 0.075600
75% 0.121000
max 1.000000
Name: LIVINGAPARTMENTS_AVG, dtype: float64
LIVINGAREA_AVG
Number empty: 154350
Percent empty: 50.193326417591564
count 153161.000000
mean 0.107399
std 0.110565
min 0.000000
25% 0.045300
50% 0.074500
75% 0.129900
max 1.000000
Name: LIVINGAREA_AVG, dtype: float64
NONLIVINGAPARTMENTS_AVG
Number empty: 213514
Percent empty: 69.43296337366793
count 93997.000000
mean 0.008809
std 0.047732
min 0.000000
25% 0.000000
50% 0.000000
75% 0.003900
max 1.000000
Name: NONLIVINGAPARTMENTS_AVG, dtype: float64
NONLIVINGAREA_AVG
Number empty: 169682
Percent empty: 55.17916432257708
count 137829.000000
mean 0.028358
std 0.069523
min 0.000000
25% 0.000000
50% 0.003600
75% 0.027700
max 1.000000
Name: NONLIVINGAREA_AVG, dtype: float64
APARTMENTS_MODE
Number empty: 156061
Percent empty: 50.749729277977046
count 151450.000000
mean 0.114231
std 0.107936
min 0.000000
25% 0.052500
50% 0.084000
75% 0.143900
max 1.000000
Name: APARTMENTS_MODE, dtype: float64
BASEMENTAREA_MODE
Number empty: 179943
Percent empty: 58.515955526794166
count 127568.000000
mean 0.087543
std 0.084307
min 0.000000
25% 0.040700
50% 0.074600
75% 0.112400
max 1.000000
Name: BASEMENTAREA_MODE, dtype: float64
YEARS_BEGINEXPLUATATION_MODE
Number empty: 150007
Percent empty: 48.781019215572776
count 157504.000000
mean 0.977065
std 0.064575
min 0.000000
25% 0.976700
50% 0.981600
75% 0.986600
max 1.000000
Name: YEARS_BEGINEXPLUATATION_MODE, dtype: float64
YEARS_BUILD_MODE
Number empty: 204488
Percent empty: 66.49778381911541
count 103023.000000
mean 0.759637
std 0.110111
min 0.000000
25% 0.699400
50% 0.764800
75% 0.823600
max 1.000000
Name: YEARS_BUILD_MODE, dtype: float64
COMMONAREA_MODE
Number empty: 214865
Percent empty: 69.87229725115525
count 92646.000000
mean 0.042553
std 0.074445
min 0.000000
25% 0.007200
50% 0.019000
75% 0.049000
max 1.000000
Name: COMMONAREA_MODE, dtype: float64
ELEVATORS_MODE
Number empty: 163891
Percent empty: 53.29597965601231
count 143620.000000
mean 0.074490
std 0.132256
min 0.000000
25% 0.000000
50% 0.000000
75% 0.120800
max 1.000000
Name: ELEVATORS_MODE, dtype: float64
ENTRANCES_MODE
Number empty: 154828
Percent empty: 50.34876801155081
count 152683.000000
mean 0.145193
std 0.100977
min 0.000000
25% 0.069000
50% 0.137900
75% 0.206900
max 1.000000
Name: ENTRANCES_MODE, dtype: float64
FLOORSMAX_MODE
Number empty: 153020
Percent empty: 49.76082156410665
count 154491.000000
mean 0.222315
std 0.143709
min 0.000000
25% 0.166700
50% 0.166700
75% 0.333300
max 1.000000
Name: FLOORSMAX_MODE, dtype: float64
FLOORSMIN_MODE
Number empty: 208642
Percent empty: 67.84862980511267
count 98869.000000
mean 0.228058
std 0.161160
min 0.000000
25% 0.083300
50% 0.208300
75% 0.375000
max 1.000000
Name: FLOORSMIN_MODE, dtype: float64
LANDAREA_MODE
Number empty: 182590
Percent empty: 59.376737742714894
count 124921.000000
mean 0.064958
std 0.081750
min 0.000000
25% 0.016600
50% 0.045800
75% 0.084100
max 1.000000
Name: LANDAREA_MODE, dtype: float64
LIVINGAPARTMENTS_MODE
Number empty: 210199
Percent empty: 68.35495315614726
count 97312.000000
mean 0.105645
std 0.097880
min 0.000000
25% 0.054200
50% 0.077100
75% 0.131300
max 1.000000
Name: LIVINGAPARTMENTS_MODE, dtype: float64
LIVINGAREA_MODE
Number empty: 154350
Percent empty: 50.193326417591564
count 153161.000000
mean 0.105975
std 0.111845
min 0.000000
25% 0.042700
50% 0.073100
75% 0.125200
max 1.000000
Name: LIVINGAREA_MODE, dtype: float64
NONLIVINGAPARTMENTS_MODE
Number empty: 213514
Percent empty: 69.43296337366793
count 93997.000000
mean 0.008076
std 0.046276
min 0.000000
25% 0.000000
50% 0.000000
75% 0.003900
max 1.000000
Name: NONLIVINGAPARTMENTS_MODE, dtype: float64
NONLIVINGAREA_MODE
Number empty: 169682
Percent empty: 55.17916432257708
count 137829.000000
mean 0.027022
std 0.070254
min 0.000000
25% 0.000000
50% 0.001100
75% 0.023100
max 1.000000
Name: NONLIVINGAREA_MODE, dtype: float64
APARTMENTS_MEDI
Number empty: 156061
Percent empty: 50.749729277977046
count 151450.000000
mean 0.117850
std 0.109076
min 0.000000
25% 0.058300
50% 0.086400
75% 0.148900
max 1.000000
Name: APARTMENTS_MEDI, dtype: float64
BASEMENTAREA_MEDI
Number empty: 179943
Percent empty: 58.515955526794166
count 127568.000000
mean 0.087955
std 0.082179
min 0.000000
25% 0.043700
50% 0.075800
75% 0.111600
max 1.000000
Name: BASEMENTAREA_MEDI, dtype: float64
YEARS_BEGINEXPLUATATION_MEDI
Number empty: 150007
Percent empty: 48.781019215572776
count 157504.000000
mean 0.977752
std 0.059897
min 0.000000
25% 0.976700
50% 0.981600
75% 0.986600
max 1.000000
Name: YEARS_BEGINEXPLUATATION_MEDI, dtype: float64
YEARS_BUILD_MEDI
Number empty: 204488
Percent empty: 66.49778381911541
count 103023.000000
mean 0.755746
std 0.112066
min 0.000000
25% 0.691400
50% 0.758500
75% 0.825600
max 1.000000
Name: YEARS_BUILD_MEDI, dtype: float64
COMMONAREA_MEDI
Number empty: 214865
Percent empty: 69.87229725115525
count 92646.000000
mean 0.044595
std 0.076144
min 0.000000
25% 0.007900
50% 0.020800
75% 0.051300
max 1.000000
Name: COMMONAREA_MEDI, dtype: float64
ELEVATORS_MEDI
Number empty: 163891
Percent empty: 53.29597965601231
count 143620.000000
mean 0.078078
std 0.134467
min 0.000000
25% 0.000000
50% 0.000000
75% 0.120000
max 1.000000
Name: ELEVATORS_MEDI, dtype: float64
ENTRANCES_MEDI
Number empty: 154828
Percent empty: 50.34876801155081
count 152683.000000
mean 0.149213
std 0.100368
min 0.000000
25% 0.069000
50% 0.137900
75% 0.206900
max 1.000000
Name: ENTRANCES_MEDI, dtype: float64
FLOORSMAX_MEDI
Number empty: 153020
Percent empty: 49.76082156410665
count 154491.000000
mean 0.225897
std 0.145067
min 0.000000
25% 0.166700
50% 0.166700
75% 0.333300
max 1.000000
Name: FLOORSMAX_MEDI, dtype: float64
FLOORSMIN_MEDI
Number empty: 208642
Percent empty: 67.84862980511267
count 98869.000000
mean 0.231625
std 0.161934
min 0.000000
25% 0.083300
50% 0.208300
75% 0.375000
max 1.000000
Name: FLOORSMIN_MEDI, dtype: float64
LANDAREA_MEDI
Number empty: 182590
Percent empty: 59.376737742714894
count 124921.000000
mean 0.067169
std 0.082167
min 0.000000
25% 0.018700
50% 0.048700
75% 0.086800
max 1.000000
Name: LANDAREA_MEDI, dtype: float64
LIVINGAPARTMENTS_MEDI
Number empty: 210199
Percent empty: 68.35495315614726
count 97312.000000
mean 0.101954
std 0.093642
min 0.000000
25% 0.051300
50% 0.076100
75% 0.123100
max 1.000000
Name: LIVINGAPARTMENTS_MEDI, dtype: float64
LIVINGAREA_MEDI
Number empty: 154350
Percent empty: 50.193326417591564
count 153161.000000
mean 0.108607
std 0.112260
min 0.000000
25% 0.045700
50% 0.074900
75% 0.130300
max 1.000000
Name: LIVINGAREA_MEDI, dtype: float64
NONLIVINGAPARTMENTS_MEDI
Number empty: 213514
Percent empty: 69.43296337366793
count 93997.000000
mean 0.008651
std 0.047415
min 0.000000
25% 0.000000
50% 0.000000
75% 0.003900
max 1.000000
Name: NONLIVINGAPARTMENTS_MEDI, dtype: float64
NONLIVINGAREA_MEDI
Number empty: 169682
Percent empty: 55.17916432257708
count 137829.000000
mean 0.028236
std 0.070166
min 0.000000
25% 0.000000
50% 0.003100
75% 0.026600
max 1.000000
Name: NONLIVINGAREA_MEDI, dtype: float64
FONDKAPREMONT_MODE
Number empty: 210295
Percent empty: 68.38617155158677
count 97216
unique 4
top reg oper account
freq 73830
Name: FONDKAPREMONT_MODE, dtype: object
Categories and Count:
reg oper account 73830
reg oper spec account 12080
not specified 5687
org spec account 5619
HOUSETYPE_MODE
Number empty: 154297
Percent empty: 50.176091261776
count 153214
unique 3
top block of flats
freq 150503
Name: HOUSETYPE_MODE, dtype: object
Categories and Count:
block of flats 150503
specific housing 1499
terraced house 1212
TOTALAREA_MODE
Number empty: 148431
Percent empty: 48.26851722377411
count 159080.000000
mean 0.102547
std 0.107462
min 0.000000
25% 0.041200
50% 0.068800
75% 0.127600
max 1.000000
Name: TOTALAREA_MODE, dtype: float64
WALLSMATERIAL_MODE
Number empty: 156341
Percent empty: 50.8407829313423
count 151170
unique 7
top Panel
freq 66040
Name: WALLSMATERIAL_MODE, dtype: object
Categories and Count:
Panel 66040
Stone, brick 64815
Block 9253
Wooden 5362
Mixed 2296
Monolithic 1779
Others 1625
EMERGENCYSTATE_MODE
Number empty: 145755
Percent empty: 47.39830445089769
count 161756
unique 2
top No
freq 159428
Name: EMERGENCYSTATE_MODE, dtype: object
Categories and Count:
No 159428
Yes 2328
OBS_30_CNT_SOCIAL_CIRCLE
Number empty: 1021
Percent empty: 0.3320206431639844
count 306490.000000
mean 1.422245
std 2.400989
min 0.000000
25% 0.000000
50% 0.000000
75% 2.000000
max 348.000000
Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64
DEF_30_CNT_SOCIAL_CIRCLE
Number empty: 1021
Percent empty: 0.3320206431639844
count 306490.000000
mean 0.143421
std 0.446698
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 34.000000
Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: float64
OBS_60_CNT_SOCIAL_CIRCLE
Number empty: 1021
Percent empty: 0.3320206431639844
count 306490.000000
mean 1.405292
std 2.379803
min 0.000000
25% 0.000000
50% 0.000000
75% 2.000000
max 344.000000
Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: float64
DEF_60_CNT_SOCIAL_CIRCLE
Number empty: 1021
Percent empty: 0.3320206431639844
count 306490.000000
mean 0.100049
std 0.362291
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 24.000000
Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: float64
DAYS_LAST_PHONE_CHANGE
Number empty: 1
Percent empty: 0.000325191619161591
count 307510.000000
mean -962.858788
std 826.808487
min -4292.000000
25% -1570.000000
50% -757.000000
75% -274.000000
max 0.000000
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
FLAG_DOCUMENT_2
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000042
std 0.006502
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_2, dtype: float64
FLAG_DOCUMENT_3
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.710023
std 0.453752
min 0.000000
25% 0.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_DOCUMENT_3, dtype: float64
FLAG_DOCUMENT_4
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000081
std 0.009016
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_4, dtype: float64
FLAG_DOCUMENT_5
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.015115
std 0.122010
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_5, dtype: float64
FLAG_DOCUMENT_6
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.088055
std 0.283376
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_6, dtype: float64
FLAG_DOCUMENT_7
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000192
std 0.013850
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_7, dtype: float64
FLAG_DOCUMENT_8
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.081376
std 0.273412
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_8, dtype: float64
FLAG_DOCUMENT_9
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.003896
std 0.062295
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_9, dtype: float64
FLAG_DOCUMENT_10
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000023
std 0.004771
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_10, dtype: float64
FLAG_DOCUMENT_11
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.003912
std 0.062424
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_11, dtype: float64
FLAG_DOCUMENT_12
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000007
std 0.002550
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_12, dtype: float64
FLAG_DOCUMENT_13
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.003525
std 0.059268
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_13, dtype: float64
FLAG_DOCUMENT_14
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.002936
std 0.054110
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_14, dtype: float64
FLAG_DOCUMENT_15
Number empty: 0
Percent empty: 0.0
count 307511.00000
mean 0.00121
std 0.03476
min 0.00000
25% 0.00000
50% 0.00000
75% 0.00000
max 1.00000
Name: FLAG_DOCUMENT_15, dtype: float64
FLAG_DOCUMENT_16
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.009928
std 0.099144
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_16, dtype: float64
FLAG_DOCUMENT_17
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000267
std 0.016327
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_17, dtype: float64
FLAG_DOCUMENT_18
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.008130
std 0.089798
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_18, dtype: float64
FLAG_DOCUMENT_19
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000595
std 0.024387
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_19, dtype: float64
FLAG_DOCUMENT_20
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000507
std 0.022518
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_20, dtype: float64
FLAG_DOCUMENT_21
Number empty: 0
Percent empty: 0.0
count 307511.000000
mean 0.000335
std 0.018299
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_21, dtype: float64
AMT_REQ_CREDIT_BUREAU_HOUR
Number empty: 41519
Percent empty: 13.501630835970095
count 265992.000000
mean 0.006402
std 0.083849
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 4.000000
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
AMT_REQ_CREDIT_BUREAU_DAY
Number empty: 41519
Percent empty: 13.501630835970095
count 265992.000000
mean 0.007000
std 0.110757
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 9.000000
Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64
AMT_REQ_CREDIT_BUREAU_WEEK
Number empty: 41519
Percent empty: 13.501630835970095
count 265992.000000
mean 0.034362
std 0.204685
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 8.000000
Name: AMT_REQ_CREDIT_BUREAU_WEEK, dtype: float64
AMT_REQ_CREDIT_BUREAU_MON
Number empty: 41519
Percent empty: 13.501630835970095
count 265992.000000
mean 0.267395
std 0.916002
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 27.000000
Name: AMT_REQ_CREDIT_BUREAU_MON, dtype: float64
AMT_REQ_CREDIT_BUREAU_QRT
Number empty: 41519
Percent empty: 13.501630835970095
count 265992.000000
mean 0.265474
std 0.794056
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 261.000000
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64
AMT_REQ_CREDIT_BUREAU_YEAR
Number empty: 41519
Percent empty: 13.501630835970095
count 265992.000000
mean 1.899974
std 1.869295
min 0.000000
25% 0.000000
50% 1.000000
75% 3.000000
max 25.000000
Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64
# Print info about each column in the test dataset
for col in test:
    print(col)
    Nnan = test[col].isnull().sum()
    print('Number empty: ', Nnan)
    print('Percent empty: ', 100*Nnan/test.shape[0])
    print(test[col].describe())
    if test[col].dtype==object:
        print('Categories and Count:')
        print(test[col].value_counts().to_string(header=None))
    print()
SK_ID_CURR
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 277796.676350
std 103169.547296
min 100001.000000
25% 188557.750000
50% 277549.000000
75% 367555.500000
max 456250.000000
Name: SK_ID_CURR, dtype: float64
NAME_CONTRACT_TYPE
Number empty: 0
Percent empty: 0.0
count 48744
unique 2
top Cash loans
freq 48305
Name: NAME_CONTRACT_TYPE, dtype: object
Categories and Count:
Cash loans 48305
Revolving loans 439
CODE_GENDER
Number empty: 0
Percent empty: 0.0
count 48744
unique 2
top F
freq 32678
Name: CODE_GENDER, dtype: object
Categories and Count:
F 32678
M 16066
FLAG_OWN_CAR
Number empty: 0
Percent empty: 0.0
count 48744
unique 2
top N
freq 32311
Name: FLAG_OWN_CAR, dtype: object
Categories and Count:
N 32311
Y 16433
FLAG_OWN_REALTY
Number empty: 0
Percent empty: 0.0
count 48744
unique 2
top Y
freq 33658
Name: FLAG_OWN_REALTY, dtype: object
Categories and Count:
Y 33658
N 15086
CNT_CHILDREN
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.397054
std 0.709047
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 20.000000
Name: CNT_CHILDREN, dtype: float64
AMT_INCOME_TOTAL
Number empty: 0
Percent empty: 0.0
count 4.874400e+04
mean 1.784318e+05
std 1.015226e+05
min 2.694150e+04
25% 1.125000e+05
50% 1.575000e+05
75% 2.250000e+05
max 4.410000e+06
Name: AMT_INCOME_TOTAL, dtype: float64
AMT_CREDIT
Number empty: 0
Percent empty: 0.0
count 4.874400e+04
mean 5.167404e+05
std 3.653970e+05
min 4.500000e+04
25% 2.606400e+05
50% 4.500000e+05
75% 6.750000e+05
max 2.245500e+06
Name: AMT_CREDIT, dtype: float64
AMT_ANNUITY
Number empty: 24
Percent empty: 0.049236829148202856
count 48720.000000
mean 29426.240209
std 16016.368315
min 2295.000000
25% 17973.000000
50% 26199.000000
75% 37390.500000
max 180576.000000
Name: AMT_ANNUITY, dtype: float64
AMT_GOODS_PRICE
Number empty: 0
Percent empty: 0.0
count 4.874400e+04
mean 4.626188e+05
std 3.367102e+05
min 4.500000e+04
25% 2.250000e+05
50% 3.960000e+05
75% 6.300000e+05
max 2.245500e+06
Name: AMT_GOODS_PRICE, dtype: float64
NAME_TYPE_SUITE
Number empty: 911
Percent empty: 1.8689479730838667
count 47833
unique 7
top Unaccompanied
freq 39727
Name: NAME_TYPE_SUITE, dtype: object
Categories and Count:
Unaccompanied 39727
Family 5881
Spouse, partner 1448
Children 408
Other_B 211
Other_A 109
Group of people 49
NAME_INCOME_TYPE
Number empty: 0
Percent empty: 0.0
count 48744
unique 7
top Working
freq 24533
Name: NAME_INCOME_TYPE, dtype: object
Categories and Count:
Working 24533
Commercial associate 11402
Pensioner 9273
State servant 3532
Student 2
Businessman 1
Unemployed 1
NAME_EDUCATION_TYPE
Number empty: 0
Percent empty: 0.0
count 48744
unique 5
top Secondary / secondary special
freq 33988
Name: NAME_EDUCATION_TYPE, dtype: object
Categories and Count:
Secondary / secondary special 33988
Higher education 12516
Incomplete higher 1724
Lower secondary 475
Academic degree 41
NAME_FAMILY_STATUS
Number empty: 0
Percent empty: 0.0
count 48744
unique 5
top Married
freq 32283
Name: NAME_FAMILY_STATUS, dtype: object
Categories and Count:
Married 32283
Single / not married 7036
Civil marriage 4261
Separated 2955
Widow 2209
NAME_HOUSING_TYPE
Number empty: 0
Percent empty: 0.0
count 48744
unique 6
top House / apartment
freq 43645
Name: NAME_HOUSING_TYPE, dtype: object
Categories and Count:
House / apartment 43645
With parents 2234
Municipal apartment 1617
Rented apartment 718
Office apartment 407
Co-op apartment 123
REGION_POPULATION_RELATIVE
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.021226
std 0.014428
min 0.000253
25% 0.010006
50% 0.018850
75% 0.028663
max 0.072508
Name: REGION_POPULATION_RELATIVE, dtype: float64
DAYS_BIRTH
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean -16068.084605
std 4325.900393
min -25195.000000
25% -19637.000000
50% -15785.000000
75% -12496.000000
max -7338.000000
Name: DAYS_BIRTH, dtype: float64
DAYS_EMPLOYED
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 67485.366322
std 144348.507136
min -17463.000000
25% -2910.000000
50% -1293.000000
75% -296.000000
max 365243.000000
Name: DAYS_EMPLOYED, dtype: float64
DAYS_REGISTRATION
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean -4967.652716
std 3552.612035
min -23722.000000
25% -7459.250000
50% -4490.000000
75% -1901.000000
max 0.000000
Name: DAYS_REGISTRATION, dtype: float64
DAYS_ID_PUBLISH
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean -3051.712949
std 1569.276709
min -6348.000000
25% -4448.000000
50% -3234.000000
75% -1706.000000
max 0.000000
Name: DAYS_ID_PUBLISH, dtype: float64
OWN_CAR_AGE
Number empty: 32312
Percent empty: 66.28918430986378
count 16432.000000
mean 11.786027
std 11.462889
min 0.000000
25% 4.000000
50% 9.000000
75% 15.000000
max 74.000000
Name: OWN_CAR_AGE, dtype: float64
FLAG_MOBIL
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.999979
std 0.004529
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_MOBIL, dtype: float64
FLAG_EMP_PHONE
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.809720
std 0.392526
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_EMP_PHONE, dtype: float64
FLAG_WORK_PHONE
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.204702
std 0.403488
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_WORK_PHONE, dtype: float64
FLAG_CONT_MOBILE
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.998400
std 0.039971
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_CONT_MOBILE, dtype: float64
FLAG_PHONE
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.263130
std 0.440337
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000
Name: FLAG_PHONE, dtype: float64
FLAG_EMAIL
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.162646
std 0.369046
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_EMAIL, dtype: float64
OCCUPATION_TYPE
Number empty: 15605
Percent empty: 32.014196619071065
count 33139
unique 18
top Laborers
freq 8655
Name: OCCUPATION_TYPE, dtype: object
Categories and Count:
Laborers 8655
Sales staff 5072
Core staff 4361
Managers 3574
Drivers 2773
High skill tech staff 1854
Accountants 1628
Medicine staff 1316
Security staff 915
Cooking staff 894
Cleaning staff 656
Private service staff 455
Low-skill Laborers 272
Secretaries 213
Waiters/barmen staff 178
Realty agents 138
HR staff 104
IT staff 81
CNT_FAM_MEMBERS
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 2.146767
std 0.890423
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 21.000000
Name: CNT_FAM_MEMBERS, dtype: float64
REGION_RATING_CLIENT
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 2.038159
std 0.522694
min 1.000000
25% 2.000000
50% 2.000000
75% 2.000000
max 3.000000
Name: REGION_RATING_CLIENT, dtype: float64
REGION_RATING_CLIENT_W_CITY
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 2.012596
std 0.515804
min -1.000000
25% 2.000000
50% 2.000000
75% 2.000000
max 3.000000
Name: REGION_RATING_CLIENT_W_CITY, dtype: float64
WEEKDAY_APPR_PROCESS_START
Number empty: 0
Percent empty: 0.0
count 48744
unique 7
top TUESDAY
freq 9751
Name: WEEKDAY_APPR_PROCESS_START, dtype: object
Categories and Count:
TUESDAY 9751
WEDNESDAY 8457
THURSDAY 8418
MONDAY 8406
FRIDAY 7250
SATURDAY 4603
SUNDAY 1859
HOUR_APPR_PROCESS_START
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 12.007365
std 3.278172
min 0.000000
25% 10.000000
50% 12.000000
75% 14.000000
max 23.000000
Name: HOUR_APPR_PROCESS_START, dtype: float64
REG_REGION_NOT_LIVE_REGION
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.018833
std 0.135937
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_REGION_NOT_LIVE_REGION, dtype: float64
REG_REGION_NOT_WORK_REGION
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.055166
std 0.228306
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_REGION_NOT_WORK_REGION, dtype: float64
LIVE_REGION_NOT_WORK_REGION
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.042036
std 0.200673
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: LIVE_REGION_NOT_WORK_REGION, dtype: float64
REG_CITY_NOT_LIVE_CITY
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.077466
std 0.267332
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_CITY_NOT_LIVE_CITY, dtype: float64
REG_CITY_NOT_WORK_CITY
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.224664
std 0.417365
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: REG_CITY_NOT_WORK_CITY, dtype: float64
LIVE_CITY_NOT_WORK_CITY
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.174216
std 0.379299
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: LIVE_CITY_NOT_WORK_CITY, dtype: float64
ORGANIZATION_TYPE
Number empty: 0
Percent empty: 0.0
count 48744
unique 58
top Business Entity Type 3
freq 10840
Name: ORGANIZATION_TYPE, dtype: object
Categories and Count:
Business Entity Type 3 10840
XNA 9274
Self-employed 5920
Other 2707
Medicine 1716
Government 1508
Business Entity Type 2 1479
Trade: type 7 1303
School 1287
Construction 1039
Kindergarten 1038
Business Entity Type 1 887
Transport: type 4 884
Trade: type 3 578
Military 530
Industry: type 9 499
Industry: type 3 489
Security 472
Transport: type 2 448
Police 441
Housing 435
Industry: type 11 416
Bank 374
Security Ministries 341
Services 302
Postal 294
Agriculture 292
Restaurant 284
Trade: type 2 242
University 221
Industry: type 7 217
Industry: type 1 178
Transport: type 3 174
Industry: type 4 167
Electricity 156
Hotel 134
Trade: type 6 122
Industry: type 5 97
Telecom 95
Emergency 91
Insurance 80
Industry: type 2 77
Industry: type 12 77
Realtor 72
Advertising 71
Trade: type 1 64
Culture 61
Legal Services 53
Mobile 45
Cleaning 43
Transport: type 1 35
Industry: type 6 27
Industry: type 10 24
Trade: type 4 14
Religion 12
Trade: type 5 9
Industry: type 13 6
Industry: type 8 3
EXT_SOURCE_1
Number empty: 20532
Percent empty: 42.12210733628754
count 28212.000000
mean 0.501180
std 0.205142
min 0.013458
25% 0.343695
50% 0.506771
75% 0.665956
max 0.939145
Name: EXT_SOURCE_1, dtype: float64
EXT_SOURCE_2
Number empty: 8
Percent empty: 0.016412276382734285
count 48736.000000
mean 0.518021
std 0.181278
min 0.000008
25% 0.408066
50% 0.558758
75% 0.658497
max 0.855000
Name: EXT_SOURCE_2, dtype: float64
EXT_SOURCE_3
Number empty: 8668
Percent empty: 17.782701460692596
count 40076.000000
mean 0.500106
std 0.189498
min 0.000527
25% 0.363945
50% 0.519097
75% 0.652897
max 0.882530
Name: EXT_SOURCE_3, dtype: float64
APARTMENTS_AVG
Number empty: 23887
Percent empty: 49.00500574429673
count 24857.000000
mean 0.122388
std 0.113112
min 0.000000
25% 0.061900
50% 0.092800
75% 0.148500
max 1.000000
Name: APARTMENTS_AVG, dtype: float64
BASEMENTAREA_AVG
Number empty: 27641
Percent empty: 56.7064664368948
count 21103.000000
mean 0.090065
std 0.081536
min 0.000000
25% 0.046700
50% 0.078100
75% 0.113400
max 1.000000
Name: BASEMENTAREA_AVG, dtype: float64
YEARS_BEGINEXPLUATATION_AVG
Number empty: 22856
Percent empty: 46.88987362547185
count 25888.000000
mean 0.978828
std 0.049318
min 0.000000
25% 0.976700
50% 0.981600
75% 0.986600
max 1.000000
Name: YEARS_BEGINEXPLUATATION_AVG, dtype: float64
YEARS_BUILD_AVG
Number empty: 31818
Percent empty: 65.27572624322994
count 16926.000000
mean 0.751137
std 0.113188
min 0.000000
25% 0.687200
50% 0.755200
75% 0.816400
max 1.000000
Name: YEARS_BUILD_AVG, dtype: float64
COMMONAREA_AVG
Number empty: 33495
Percent empty: 68.71614967996061
count 15249.000000
mean 0.047624
std 0.082868
min 0.000000
25% 0.008100
50% 0.022700
75% 0.053900
max 1.000000
Name: COMMONAREA_AVG, dtype: float64
ELEVATORS_AVG
Number empty: 25189
Percent empty: 51.67610372558674
count 23555.000000
mean 0.085168
std 0.139164
min 0.000000
25% 0.000000
50% 0.000000
75% 0.160000
max 1.000000
Name: ELEVATORS_AVG, dtype: float64
ENTRANCES_AVG
Number empty: 23579
Percent empty: 48.373133103561464
count 25165.000000
mean 0.151777
std 0.100669
min 0.000000
25% 0.074500
50% 0.137900
75% 0.206900
max 1.000000
Name: ENTRANCES_AVG, dtype: float64
FLOORSMAX_AVG
Number empty: 23321
Percent empty: 47.84383719021828
count 25423.000000
mean 0.233706
std 0.147361
min 0.000000
25% 0.166700
50% 0.166700
75% 0.333300
max 1.000000
Name: FLOORSMAX_AVG, dtype: float64
FLOORSMIN_AVG
Number empty: 32466
Percent empty: 66.60512063023141
count 16278.000000
mean 0.238423
std 0.164976
min 0.000000
25% 0.104200
50% 0.208300
75% 0.375000
max 1.000000
Name: FLOORSMIN_AVG, dtype: float64
LANDAREA_AVG
Number empty: 28254
Percent empty: 57.96405711472181
count 20490.000000
mean 0.067192
std 0.081909
min 0.000000
25% 0.019000
50% 0.048300
75% 0.086800
max 1.000000
Name: LANDAREA_AVG, dtype: float64
LIVINGAPARTMENTS_AVG
Number empty: 32780
Percent empty: 67.24930247825374
count 15964.000000
mean 0.105885
std 0.098284
min 0.000000
25% 0.050400
50% 0.075600
75% 0.126900
max 1.000000
Name: LIVINGAPARTMENTS_AVG, dtype: float64
LIVINGAREA_AVG
Number empty: 23552
Percent empty: 48.317741670769735
count 25192.000000
mean 0.112286
std 0.114860
min 0.000000
25% 0.048575
50% 0.077000
75% 0.137600
max 1.000000
Name: LIVINGAREA_AVG, dtype: float64
NONLIVINGAPARTMENTS_AVG
Number empty: 33347
Percent empty: 68.41252256688003
count 15397.000000
mean 0.009231
std 0.048749
min 0.000000
25% 0.000000
50% 0.000000
75% 0.005100
max 1.000000
Name: NONLIVINGAPARTMENTS_AVG, dtype: float64
NONLIVINGAREA_AVG
Number empty: 26084
Percent empty: 53.512227145905136
count 22660.000000
mean 0.029387
std 0.072007
min 0.000000
25% 0.000000
50% 0.003800
75% 0.029000
max 1.000000
Name: NONLIVINGAREA_AVG, dtype: float64
APARTMENTS_MODE
Number empty: 23887
Percent empty: 49.00500574429673
count 24857.000000
mean 0.119078
std 0.113465
min 0.000000
25% 0.058800
50% 0.085100
75% 0.150200
max 1.000000
Name: APARTMENTS_MODE, dtype: float64
BASEMENTAREA_MODE
Number empty: 27641
Percent empty: 56.7064664368948
count 21103.000000
mean 0.088998
std 0.082655
min 0.000000
25% 0.042500
50% 0.077000
75% 0.113550
max 1.000000
Name: BASEMENTAREA_MODE, dtype: float64
YEARS_BEGINEXPLUATATION_MODE
Number empty: 22856
Percent empty: 46.88987362547185
count 25888.000000
mean 0.978292
std 0.053782
min 0.000000
25% 0.976200
50% 0.981600
75% 0.986600
max 1.000000
Name: YEARS_BEGINEXPLUATATION_MODE, dtype: float64
YEARS_BUILD_MODE
Number empty: 31818
Percent empty: 65.27572624322994
count 16926.000000
mean 0.758327
std 0.110117
min 0.000000
25% 0.692900
50% 0.758300
75% 0.823600
max 1.000000
Name: YEARS_BUILD_MODE, dtype: float64
COMMONAREA_MODE
Number empty: 33495
Percent empty: 68.71614967996061
count 15249.000000
mean 0.045223
std 0.081169
min 0.000000
25% 0.007600
50% 0.020300
75% 0.051700
max 1.000000
Name: COMMONAREA_MODE, dtype: float64
ELEVATORS_MODE
Number empty: 25189
Percent empty: 51.67610372558674
count 23555.000000
mean 0.080570
std 0.137509
min 0.000000
25% 0.000000
50% 0.000000
75% 0.120800
max 1.000000
Name: ELEVATORS_MODE, dtype: float64
ENTRANCES_MODE
Number empty: 23579
Percent empty: 48.373133103561464
count 25165.000000
mean 0.147161
std 0.101748
min 0.000000
25% 0.069000
50% 0.137900
75% 0.206900
max 1.000000
Name: ENTRANCES_MODE, dtype: float64
FLOORSMAX_MODE
Number empty: 23321
Percent empty: 47.84383719021828
count 25423.000000
mean 0.229390
std 0.146485
min 0.000000
25% 0.166700
50% 0.166700
75% 0.333300
max 1.000000
Name: FLOORSMAX_MODE, dtype: float64
FLOORSMIN_MODE
Number empty: 32466
Percent empty: 66.60512063023141
count 16278.000000
mean 0.233854
std 0.165034
min 0.000000
25% 0.083300
50% 0.208300
75% 0.375000
max 1.000000
Name: FLOORSMIN_MODE, dtype: float64
LANDAREA_MODE
Number empty: 28254
Percent empty: 57.96405711472181
count 20490.000000
mean 0.065914
std 0.082880
min 0.000000
25% 0.016525
50% 0.046200
75% 0.085600
max 1.000000
Name: LANDAREA_MODE, dtype: float64
LIVINGAPARTMENTS_MODE
Number empty: 32780
Percent empty: 67.24930247825374
count 15964.000000
mean 0.110874
std 0.103980
min 0.000000
25% 0.055100
50% 0.081700
75% 0.132200
max 1.000000
Name: LIVINGAPARTMENTS_MODE, dtype: float64
LIVINGAREA_MODE
Number empty: 23552
Percent empty: 48.317741670769735
count 25192.000000
mean 0.110687
std 0.116699
min 0.000000
25% 0.045600
50% 0.075100
75% 0.130600
max 1.000000
Name: LIVINGAREA_MODE, dtype: float64
NONLIVINGAPARTMENTS_MODE
Number empty: 33347
Percent empty: 68.41252256688003
count 15397.000000
mean 0.008358
std 0.046657
min 0.000000
25% 0.000000
50% 0.000000
75% 0.003900
max 1.000000
Name: NONLIVINGAPARTMENTS_MODE, dtype: float64
NONLIVINGAREA_MODE
Number empty: 26084
Percent empty: 53.512227145905136
count 22660.000000
mean 0.028161
std 0.073504
min 0.000000
25% 0.000000
50% 0.001200
75% 0.024500
max 1.000000
Name: NONLIVINGAREA_MODE, dtype: float64
APARTMENTS_MEDI
Number empty: 23887
Percent empty: 49.00500574429673
count 24857.000000
mean 0.122809
std 0.114184
min 0.000000
25% 0.062500
50% 0.092600
75% 0.149900
max 1.000000
Name: APARTMENTS_MEDI, dtype: float64
BASEMENTAREA_MEDI
Number empty: 27641
Percent empty: 56.7064664368948
count 21103.000000
mean 0.089529
std 0.081022
min 0.000000
25% 0.046150
50% 0.077800
75% 0.113000
max 1.000000
Name: BASEMENTAREA_MEDI, dtype: float64
YEARS_BEGINEXPLUATATION_MEDI
Number empty: 22856
Percent empty: 46.88987362547185
count 25888.000000
mean 0.978822
std 0.049663
min 0.000000
25% 0.976700
50% 0.981600
75% 0.986600
max 1.000000
Name: YEARS_BEGINEXPLUATATION_MEDI, dtype: float64
YEARS_BUILD_MEDI
Number empty: 31818
Percent empty: 65.27572624322994
count 16926.000000
mean 0.754344
std 0.111998
min 0.000000
25% 0.691400
50% 0.758500
75% 0.818900
max 1.000000
Name: YEARS_BUILD_MEDI, dtype: float64
COMMONAREA_MEDI
Number empty: 33495
Percent empty: 68.71614967996061
count 15249.000000
mean 0.047420
std 0.082892
min 0.000000
25% 0.008000
50% 0.022300
75% 0.053800
max 1.000000
Name: COMMONAREA_MEDI, dtype: float64
ELEVATORS_MEDI
Number empty: 25189
Percent empty: 51.67610372558674
count 23555.000000
mean 0.084128
std 0.139014
min 0.000000
25% 0.000000
50% 0.000000
75% 0.160000
max 1.000000
Name: ELEVATORS_MEDI, dtype: float64
ENTRANCES_MEDI
Number empty: 23579
Percent empty: 48.373133103561464
count 25165.000000
mean 0.151200
std 0.100931
min 0.000000
25% 0.069000
50% 0.137900
75% 0.206900
max 1.000000
Name: ENTRANCES_MEDI, dtype: float64
FLOORSMAX_MEDI
Number empty: 23321
Percent empty: 47.84383719021828
count 25423.000000
mean 0.233154
std 0.147629
min 0.000000
25% 0.166700
50% 0.166700
75% 0.333300
max 1.000000
Name: FLOORSMAX_MEDI, dtype: float64
FLOORSMIN_MEDI
Number empty: 32466
Percent empty: 66.60512063023141
count 16278.000000
mean 0.237846
std 0.165241
min 0.000000
25% 0.083300
50% 0.208300
75% 0.375000
max 1.000000
Name: FLOORSMIN_MEDI, dtype: float64
LANDAREA_MEDI
Number empty: 28254
Percent empty: 57.96405711472181
count 20490.000000
mean 0.068069
std 0.082869
min 0.000000
25% 0.019000
50% 0.048800
75% 0.088000
max 1.000000
Name: LANDAREA_MEDI, dtype: float64
LIVINGAPARTMENTS_MEDI
Number empty: 32780
Percent empty: 67.24930247825374
count 15964.000000
mean 0.107063
std 0.099737
min 0.000000
25% 0.051300
50% 0.077000
75% 0.126600
max 1.000000
Name: LIVINGAPARTMENTS_MEDI, dtype: float64
LIVINGAREA_MEDI
Number empty: 23552
Percent empty: 48.317741670769735
count 25192.000000
mean 0.113368
std 0.116503
min 0.000000
25% 0.049000
50% 0.077600
75% 0.137425
max 1.000000
Name: LIVINGAREA_MEDI, dtype: float64
NONLIVINGAPARTMENTS_MEDI
Number empty: 33347
Percent empty: 68.41252256688003
count 15397.000000
mean 0.008979
std 0.048148
min 0.000000
25% 0.000000
50% 0.000000
75% 0.003900
max 1.000000
Name: NONLIVINGAPARTMENTS_MEDI, dtype: float64
NONLIVINGAREA_MEDI
Number empty: 26084
Percent empty: 53.512227145905136
count 22660.000000
mean 0.029296
std 0.072998
min 0.000000
25% 0.000000
50% 0.003100
75% 0.028025
max 1.000000
Name: NONLIVINGAREA_MEDI, dtype: float64
FONDKAPREMONT_MODE
Number empty: 32797
Percent empty: 67.28417856556705
count 15947
unique 4
top reg oper account
freq 12124
Name: FONDKAPREMONT_MODE, dtype: object
Categories and Count:
reg oper account 12124
reg oper spec account 1990
org spec account 920
not specified 913
HOUSETYPE_MODE
Number empty: 23619
Percent empty: 48.45519448547513
count 25125
unique 3
top block of flats
freq 24659
Name: HOUSETYPE_MODE, dtype: object
Categories and Count:
block of flats 24659
specific housing 262
terraced house 204
TOTALAREA_MODE
Number empty: 22624
Percent empty: 46.41391761037256
count 26120.000000
mean 0.107129
std 0.111420
min 0.000000
25% 0.043200
50% 0.070700
75% 0.135700
max 1.000000
Name: TOTALAREA_MODE, dtype: float64
WALLSMATERIAL_MODE
Number empty: 23893
Percent empty: 49.017314951583785
count 24851
unique 7
top Panel
freq 11269
Name: WALLSMATERIAL_MODE, dtype: object
Categories and Count:
Panel 11269
Stone, brick 10434
Block 1428
Wooden 794
Mixed 353
Monolithic 289
Others 284
EMERGENCYSTATE_MODE
Number empty: 22209
Percent empty: 45.56253077301822
count 26535
unique 2
top No
freq 26179
Name: EMERGENCYSTATE_MODE, dtype: object
Categories and Count:
No 26179
Yes 356
OBS_30_CNT_SOCIAL_CIRCLE
Number empty: 29
Percent empty: 0.05949450188741178
count 48715.000000
mean 1.447644
std 3.608053
min 0.000000
25% 0.000000
50% 0.000000
75% 2.000000
max 354.000000
Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64
DEF_30_CNT_SOCIAL_CIRCLE
Number empty: 29
Percent empty: 0.05949450188741178
count 48715.000000
mean 0.143652
std 0.514413
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 34.000000
Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: float64
OBS_60_CNT_SOCIAL_CIRCLE
Number empty: 29
Percent empty: 0.05949450188741178
count 48715.000000
mean 1.435738
std 3.580125
min 0.000000
25% 0.000000
50% 0.000000
75% 2.000000
max 351.000000
Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: float64
DEF_60_CNT_SOCIAL_CIRCLE
Number empty: 29
Percent empty: 0.05949450188741178
count 48715.000000
mean 0.101139
std 0.403791
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 24.000000
Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: float64
DAYS_LAST_PHONE_CHANGE
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean -1077.766228
std 878.920740
min -4361.000000
25% -1766.250000
50% -863.000000
75% -363.000000
max 0.000000
Name: DAYS_LAST_PHONE_CHANGE, dtype: float64
FLAG_DOCUMENT_2
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_2, dtype: float64
FLAG_DOCUMENT_3
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.786620
std 0.409698
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: FLAG_DOCUMENT_3, dtype: float64
FLAG_DOCUMENT_4
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.000103
std 0.010128
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_4, dtype: float64
FLAG_DOCUMENT_5
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.014751
std 0.120554
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_5, dtype: float64
FLAG_DOCUMENT_6
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.087477
std 0.282536
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_6, dtype: float64
FLAG_DOCUMENT_7
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.000041
std 0.006405
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_7, dtype: float64
FLAG_DOCUMENT_8
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.088462
std 0.283969
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_8, dtype: float64
FLAG_DOCUMENT_9
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.004493
std 0.066879
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_9, dtype: float64
FLAG_DOCUMENT_10
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_10, dtype: float64
FLAG_DOCUMENT_11
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.001169
std 0.034176
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_11, dtype: float64
FLAG_DOCUMENT_12
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_12, dtype: float64
FLAG_DOCUMENT_13
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_13, dtype: float64
FLAG_DOCUMENT_14
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_14, dtype: float64
FLAG_DOCUMENT_15
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_15, dtype: float64
FLAG_DOCUMENT_16
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_16, dtype: float64
FLAG_DOCUMENT_17
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_17, dtype: float64
FLAG_DOCUMENT_18
Number empty: 0
Percent empty: 0.0
count 48744.000000
mean 0.001559
std 0.039456
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: FLAG_DOCUMENT_18, dtype: float64
FLAG_DOCUMENT_19
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_19, dtype: float64
FLAG_DOCUMENT_20
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_20, dtype: float64
FLAG_DOCUMENT_21
Number empty: 0
Percent empty: 0.0
count 48744.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: FLAG_DOCUMENT_21, dtype: float64
AMT_REQ_CREDIT_BUREAU_HOUR
Number empty: 6049
Percent empty: 12.409732479894961
count 42695.000000
mean 0.002108
std 0.046373
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 2.000000
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
AMT_REQ_CREDIT_BUREAU_DAY
Number empty: 6049
Percent empty: 12.409732479894961
count 42695.000000
mean 0.001803
std 0.046132
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 2.000000
Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64
AMT_REQ_CREDIT_BUREAU_WEEK
Number empty: 6049
Percent empty: 12.409732479894961
count 42695.000000
mean 0.002787
std 0.054037
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 2.000000
Name: AMT_REQ_CREDIT_BUREAU_WEEK, dtype: float64
AMT_REQ_CREDIT_BUREAU_MON
Number empty: 6049
Percent empty: 12.409732479894961
count 42695.000000
mean 0.009299
std 0.110924
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 6.000000
Name: AMT_REQ_CREDIT_BUREAU_MON, dtype: float64
AMT_REQ_CREDIT_BUREAU_QRT
Number empty: 6049
Percent empty: 12.409732479894961
count 42695.000000
mean 0.546902
std 0.693305
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 7.000000
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64
AMT_REQ_CREDIT_BUREAU_YEAR
Number empty: 6049
Percent empty: 12.409732479894961
count 42695.000000
mean 1.983769
std 1.838873
min 0.000000
25% 0.000000
50% 2.000000
75% 3.000000
max 17.000000
Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64
The column containing the values we are trying to predict, TARGET, doesn’t contain any missing values. The value of TARGET is 0 when the loan was repaid successfully, and 1 when there were problems repaying the loan. Many more loans were successfully repaid than not, which means that the dataset is imbalanced in terms of our dependent variable. That is something we’ll have to watch out for when we build a predictive model later:
# Show target distribution
train['TARGET'].value_counts()
0 282686
1 24825
Name: TARGET, dtype: int64
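To put a number on that imbalance, here is a minimal sketch that turns the two counts above into a rate and a ratio. The helper name imbalance_summary is made up for this illustration and is not part of the original analysis; the counts are copied by hand from the output above.

```python
def imbalance_summary(n_negative, n_positive):
    """Summarize class imbalance for a binary target.

    n_negative: number of rows with TARGET == 0 (loan repaid)
    n_positive: number of rows with TARGET == 1 (repayment problems)
    """
    total = n_negative + n_positive
    return {
        # Fraction of all loans that had repayment problems
        'positive_rate': n_positive / total,
        # How many repaid loans there are for each problem loan
        'neg_per_pos': n_negative / n_positive,
    }

# Counts copied from the value_counts() output above
summary = imbalance_summary(282686, 24825)
print('Fraction with repayment problems: {:.4f}'.format(summary['positive_rate']))
print('Repaid loans per problem loan:    {:.2f}'.format(summary['neg_per_pos']))
```

Only about 8% of the loans in the training set had repayment problems, so there are roughly 11 repaid loans for every problem loan. In the notebook itself the two counts could be read from train['TARGET'].value_counts() rather than typed in by hand.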
There are a lot of categorical columns. Let’s check that, for each column, every category we see in the training set also appears in the test set, and vice versa.
for col in test:
    if test[col].dtype==object:
        print(col)
        print('Num Unique in Train:', train[col].nunique())
        print('Num Unique in Test: ', test[col].nunique())
        print('Unique in Train:', sorted([str(e) for e in train[col].unique().tolist()]))
        print('Unique in Test: ', sorted([str(e) for e in test[col].unique().tolist()]))
        print()
NAME_CONTRACT_TYPE
Num Unique in Train: 2
Num Unique in Test: 2
Unique in Train: ['Cash loans', 'Revolving loans']
Unique in Test: ['Cash loans', 'Revolving loans']
CODE_GENDER
Num Unique in Train: 3
Num Unique in Test: 2
Unique in Train: ['F', 'M', 'XNA']
Unique in Test: ['F', 'M']
FLAG_OWN_CAR
Num Unique in Train: 2
Num Unique in Test: 2
Unique in Train: ['N', 'Y']
Unique in Test: ['N', 'Y']
FLAG_OWN_REALTY
Num Unique in Train: 2
Num Unique in Test: 2
Unique in Train: ['N', 'Y']
Unique in Test: ['N', 'Y']
NAME_TYPE_SUITE
Num Unique in Train: 7
Num Unique in Test: 7
Unique in Train: ['Children', 'Family', 'Group of people', 'Other_A', 'Other_B', 'Spouse, partner', 'Unaccompanied', 'nan']
Unique in Test: ['Children', 'Family', 'Group of people', 'Other_A', 'Other_B', 'Spouse, partner', 'Unaccompanied', 'nan']
NAME_INCOME_TYPE
Num Unique in Train: 8
Num Unique in Test: 7
Unique in Train: ['Businessman', 'Commercial associate', 'Maternity leave', 'Pensioner', 'State servant', 'Student', 'Unemployed', 'Working']
Unique in Test: ['Businessman', 'Commercial associate', 'Pensioner', 'State servant', 'Student', 'Unemployed', 'Working']
NAME_EDUCATION_TYPE
Num Unique in Train: 5
Num Unique in Test: 5
Unique in Train: ['Academic degree', 'Higher education', 'Incomplete higher', 'Lower secondary', 'Secondary / secondary special']
Unique in Test: ['Academic degree', 'Higher education', 'Incomplete higher', 'Lower secondary', 'Secondary / secondary special']
NAME_FAMILY_STATUS
Num Unique in Train: 6
Num Unique in Test: 5
Unique in Train: ['Civil marriage', 'Married', 'Separated', 'Single / not married', 'Unknown', 'Widow']
Unique in Test: ['Civil marriage', 'Married', 'Separated', 'Single / not married', 'Widow']
NAME_HOUSING_TYPE
Num Unique in Train: 6
Num Unique in Test: 6
Unique in Train: ['Co-op apartment', 'House / apartment', 'Municipal apartment', 'Office apartment', 'Rented apartment', 'With parents']
Unique in Test: ['Co-op apartment', 'House / apartment', 'Municipal apartment', 'Office apartment', 'Rented apartment', 'With parents']
OCCUPATION_TYPE
Num Unique in Train: 18
Num Unique in Test: 18
Unique in Train: ['Accountants', 'Cleaning staff', 'Cooking staff', 'Core staff', 'Drivers', 'HR staff', 'High skill tech staff', 'IT staff', 'Laborers', 'Low-skill Laborers', 'Managers', 'Medicine staff', 'Private service staff', 'Realty agents', 'Sales staff', 'Secretaries', 'Security staff', 'Waiters/barmen staff', 'nan']
Unique in Test: ['Accountants', 'Cleaning staff', 'Cooking staff', 'Core staff', 'Drivers', 'HR staff', 'High skill tech staff', 'IT staff', 'Laborers', 'Low-skill Laborers', 'Managers', 'Medicine staff', 'Private service staff', 'Realty agents', 'Sales staff', 'Secretaries', 'Security staff', 'Waiters/barmen staff', 'nan']
WEEKDAY_APPR_PROCESS_START
Num Unique in Train: 7
Num Unique in Test: 7
Unique in Train: ['FRIDAY', 'MONDAY', 'SATURDAY', 'SUNDAY', 'THURSDAY', 'TUESDAY', 'WEDNESDAY']
Unique in Test: ['FRIDAY', 'MONDAY', 'SATURDAY', 'SUNDAY', 'THURSDAY', 'TUESDAY', 'WEDNESDAY']
ORGANIZATION_TYPE
Num Unique in Train: 58
Num Unique in Test: 58
Unique in Train: ['Advertising', 'Agriculture', 'Bank', 'Business Entity Type 1', 'Business Entity Type 2', 'Business Entity Type 3', 'Cleaning', 'Construction', 'Culture', 'Electricity', 'Emergency', 'Government', 'Hotel', 'Housing', 'Industry: type 1', 'Industry: type 10', 'Industry: type 11', 'Industry: type 12', 'Industry: type 13', 'Industry: type 2', 'Industry: type 3', 'Industry: type 4', 'Industry: type 5', 'Industry: type 6', 'Industry: type 7', 'Industry: type 8', 'Industry: type 9', 'Insurance', 'Kindergarten', 'Legal Services', 'Medicine', 'Military', 'Mobile', 'Other', 'Police', 'Postal', 'Realtor', 'Religion', 'Restaurant', 'School', 'Security', 'Security Ministries', 'Self-employed', 'Services', 'Telecom', 'Trade: type 1', 'Trade: type 2', 'Trade: type 3', 'Trade: type 4', 'Trade: type 5', 'Trade: type 6', 'Trade: type 7', 'Transport: type 1', 'Transport: type 2', 'Transport: type 3', 'Transport: type 4', 'University', 'XNA']
Unique in Test: ['Advertising', 'Agriculture', 'Bank', 'Business Entity Type 1', 'Business Entity Type 2', 'Business Entity Type 3', 'Cleaning', 'Construction', 'Culture', 'Electricity', 'Emergency', 'Government', 'Hotel', 'Housing', 'Industry: type 1', 'Industry: type 10', 'Industry: type 11', 'Industry: type 12', 'Industry: type 13', 'Industry: type 2', 'Industry: type 3', 'Industry: type 4', 'Industry: type 5', 'Industry: type 6', 'Industry: type 7', 'Industry: type 8', 'Industry: type 9', 'Insurance', 'Kindergarten', 'Legal Services', 'Medicine', 'Military', 'Mobile', 'Other', 'Police', 'Postal', 'Realtor', 'Religion', 'Restaurant', 'School', 'Security', 'Security Ministries', 'Self-employed', 'Services', 'Telecom', 'Trade: type 1', 'Trade: type 2', 'Trade: type 3', 'Trade: type 4', 'Trade: type 5', 'Trade: type 6', 'Trade: type 7', 'Transport: type 1', 'Transport: type 2', 'Transport: type 3', 'Transport: type 4', 'University', 'XNA']
FONDKAPREMONT_MODE
Num Unique in Train: 4
Num Unique in Test: 4
Unique in Train: ['nan', 'not specified', 'org spec account', 'reg oper account', 'reg oper spec account']
Unique in Test: ['nan', 'not specified', 'org spec account', 'reg oper account', 'reg oper spec account']
HOUSETYPE_MODE
Num Unique in Train: 3
Num Unique in Test: 3
Unique in Train: ['block of flats', 'nan', 'specific housing', 'terraced house']
Unique in Test: ['block of flats', 'nan', 'specific housing', 'terraced house']
WALLSMATERIAL_MODE
Num Unique in Train: 7
Num Unique in Test: 7
Unique in Train: ['Block', 'Mixed', 'Monolithic', 'Others', 'Panel', 'Stone, brick', 'Wooden', 'nan']
Unique in Test: ['Block', 'Mixed', 'Monolithic', 'Others', 'Panel', 'Stone, brick', 'Wooden', 'nan']
EMERGENCYSTATE_MODE
Num Unique in Train: 2
Num Unique in Test: 2
Unique in Train: ['No', 'Yes', 'nan']
Unique in Test: ['No', 'Yes', 'nan']
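Scanning those lists by eye is error-prone, so here is a minimal sketch of the same check done with set differences. It uses small made-up lists (not the real columns) and a hypothetical helper called category_mismatches; with the real data one could build the sets from something like train[col].dropna().unique().

```python
# Toy stand-ins for categorical columns (hypothetical values, not the real data)
train_cols = {
    'CODE_GENDER': ['F', 'M', 'XNA', 'F'],
    'FLAG_OWN_CAR': ['N', 'Y', 'N', 'Y'],
}
test_cols = {
    'CODE_GENDER': ['M', 'F', 'F'],
    'FLAG_OWN_CAR': ['Y', 'N', 'N'],
}

def category_mismatches(train_cols, test_cols):
    """Return {column: (only_in_train, only_in_test)} for columns whose categories differ."""
    mismatches = {}
    for col in train_cols:
        train_cats = set(train_cols[col])
        test_cats = set(test_cols[col])
        if train_cats != test_cats:
            mismatches[col] = (sorted(train_cats - test_cats),
                               sorted(test_cats - train_cats))
    return mismatches

mismatches = category_mismatches(train_cols, test_cols)
print(mismatches)
```

Only the columns whose category sets differ show up in the result, which makes the handful of problem columns (gender, income type, and family status in our case) easier to spot.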
We’ll merge the test and training datasets, and create a column which indicates whether a sample is in the test or train dataset. That way, we can perform operations (label encoding, one-hot encoding, etc) on all the data together instead of doing it once to the training data and once to the test data.
# Merge test and train into all application data
train_o = train.copy()
train['Test'] = False
test['Test'] = True
test['TARGET'] = np.nan
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
app = pd.concat([train, test], ignore_index=True)
The gender column contains whether the loan applicant was male or female. The training dataset contains 4 values which weren’t empty but were labelled XNA. Normally we would want to create a new column to represent when the gender value is null. However, since the test dataset has only M and F entries, and because there are only 4 entries with a gender of XNA in the training set, we’ll remove those entries from the training set.
# Remove entries with gender = XNA
app = app[app['CODE_GENDER'] != 'XNA']
The NAME_INCOME_TYPE column also contained entries for applicants who were on Maternity leave, but no such applicants were in the test set. There were only 5 such applicants in the training set, so we’ll remove these from the training set.
# Remove entries with income type = maternity leave
app = app[app['NAME_INCOME_TYPE'] != 'Maternity leave']
Similarly, in the NAME_FAMILY_STATUS column, there were 2 entries in the training set with values of Unknown, and no entries with that value in the test set. So, we’ll remove those too.
# Remove entries with unknown family status
app = app[app['NAME_FAMILY_STATUS'] != 'Unknown']
There were some funky values in the DAYS_EMPLOYED column:
app['DAYS_EMPLOYED'].hist()
plt.xlabel('DAYS_EMPLOYED')
plt.ylabel('Count')
plt.show()
350,000 days? That’s like 1,000 years! Looks like all the reasonable values represent the number of days between when the applicant was employed and the date of the loan application. The unreasonable values are all exactly 365,243, so we’ll set those to NaN.
# Show distribution of reasonable values
app.loc[app['DAYS_EMPLOYED']<200000, 'DAYS_EMPLOYED'].hist()
plt.xlabel('DAYS_EMPLOYED (which are less than 200,000)')
plt.ylabel('Count')
plt.show()
# Show all unique outlier values
app.loc[app['DAYS_EMPLOYED']>200000, 'DAYS_EMPLOYED'].unique()
array([365243])
# Set unreasonable values to nan
# (assign the result back; an in-place replace on a selected column may not modify app)
app['DAYS_EMPLOYED'] = app['DAYS_EMPLOYED'].replace(365243, np.nan)
Manual Feature Engineering
We’ll add some features which may be informative as to how likely an applicant is to repay their loan:
- The proportion of the applicant’s life they have been employed. If a 23-year-old has only been employed for 4 years, this is fine. If a 50-year-old has only ever been employed for 4 years, they may have trouble repaying their loan.
- The ratio of income to credit. An applicant whose income is high relative to the amount of credit will likely find it easier to repay their loan.
- The ratio of income to annuity.
- The ratio of income to annuity, scaled by age.
- The ratio of credit to annuity. If an applicant has a high level of credit relative to their annuity, they may have trouble repaying their loan.
- The ratio of credit to annuity, scaled by age. If a young person doesn’t have much annuity this doesn’t really mean they’re less likely to repay their loan.
- The ratio of income to family size, that is, the income available per family member.
app['PROPORTION_LIFE_EMPLOYED'] = app['DAYS_EMPLOYED'] / app['DAYS_BIRTH']
app['INCOME_TO_CREDIT_RATIO'] = app['AMT_INCOME_TOTAL'] / app['AMT_CREDIT']
app['INCOME_TO_ANNUITY_RATIO'] = app['AMT_INCOME_TOTAL'] / app['AMT_ANNUITY']
app['INCOME_TO_ANNUITY_RATIO_BY_AGE'] = app['INCOME_TO_ANNUITY_RATIO'] * app['DAYS_BIRTH']
app['CREDIT_TO_ANNUITY_RATIO'] = app['AMT_CREDIT'] / app['AMT_ANNUITY']
app['CREDIT_TO_ANNUITY_RATIO_BY_AGE'] = app['CREDIT_TO_ANNUITY_RATIO'] * app['DAYS_BIRTH']
app['INCOME_TO_FAMILYSIZE_RATIO'] = app['AMT_INCOME_TOTAL'] / app['CNT_FAM_MEMBERS']
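To get a feel for what these features look like, here is a small worked example for a single made-up applicant. The numbers are hypothetical, engineer_features is just an illustrative helper, and the sketch assumes (as appears to be the case in the Kaggle data) that DAYS_BIRTH and DAYS_EMPLOYED are stored as negative day offsets from the application date.

```python
# Hypothetical applicant (made-up numbers, not a row from the dataset).
# Assumption: DAYS_* values are negative day offsets from the application date.
applicant = {
    'DAYS_BIRTH': -10950,          # about 30 years old
    'DAYS_EMPLOYED': -1460,        # employed for about 4 years
    'AMT_INCOME_TOTAL': 180000.0,
    'AMT_CREDIT': 540000.0,
    'AMT_ANNUITY': 27000.0,
    'CNT_FAM_MEMBERS': 3.0,
}

def engineer_features(a):
    """Compute the same ratios as above, for a single applicant stored in a dict."""
    f = {}
    f['PROPORTION_LIFE_EMPLOYED'] = a['DAYS_EMPLOYED'] / a['DAYS_BIRTH']
    f['INCOME_TO_CREDIT_RATIO'] = a['AMT_INCOME_TOTAL'] / a['AMT_CREDIT']
    f['INCOME_TO_ANNUITY_RATIO'] = a['AMT_INCOME_TOTAL'] / a['AMT_ANNUITY']
    f['INCOME_TO_ANNUITY_RATIO_BY_AGE'] = f['INCOME_TO_ANNUITY_RATIO'] * a['DAYS_BIRTH']
    f['CREDIT_TO_ANNUITY_RATIO'] = a['AMT_CREDIT'] / a['AMT_ANNUITY']
    f['CREDIT_TO_ANNUITY_RATIO_BY_AGE'] = f['CREDIT_TO_ANNUITY_RATIO'] * a['DAYS_BIRTH']
    f['INCOME_TO_FAMILYSIZE_RATIO'] = a['AMT_INCOME_TOTAL'] / a['CNT_FAM_MEMBERS']
    return f

features = engineer_features(applicant)
for name, value in features.items():
    print(name, round(value, 4))
```

Under that sign assumption, the two negative day counts cancel in PROPORTION_LIFE_EMPLOYED (giving a positive fraction), while the two "by age" features come out negative because they are multiplied by DAYS_BIRTH. That is harmless for a tree-based model, which only cares about the ordering of values.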
Feature Encoding
Some columns are non-numerical and will have to be encoded to numeric types so that our predictive algorithm can handle them. We’ll encode cyclical variables (like day of the week) into 2 dimensions, encode features with only two possible classes by assigning them 0 or 1, and one-hot encode categorical features with more than two classes.
The column WEEKDAY_APPR_PROCESS_START contains categorical information corresponding to the day of the week. We could encode these categories as the values 1-7 (say, Monday=1 through Sunday=7), but this would imply that Sunday and Monday are as far apart as two days can be, when in fact they are adjacent. We could also one-hot encode the column into 7 new columns, but that would create 7 additional dimensions. Seeing as the week is cyclical, we’ll encode this information into two dimensions using polar coordinates. That is, we’ll represent the days of the week as points on a circle. That way, each day gets its own distinct encoding, neighboring days stay close together, and we only add two dimensions.
# Create map from categories to polar projection
DOW_map = {
    'MONDAY': 0,
    'TUESDAY': 1,
    'WEDNESDAY': 2,
    'THURSDAY': 3,
    'FRIDAY': 4,
    'SATURDAY': 5,
    'SUNDAY': 6,
}
DOW_map1 = {k: np.cos(2*np.pi*v/7.0) for k, v in DOW_map.items()}
DOW_map2 = {k: np.sin(2*np.pi*v/7.0) for k, v in DOW_map.items()}
# Show encoding of days of week -> circle
days = ['MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY', 'SUNDAY']
tt = np.linspace(0, 2*np.pi, 200)
xx = np.cos(tt)
yy = np.sin(tt)
plt.plot(xx,yy)
plt.gca().axis('equal')
plt.xlabel('Encoded Dimension 1')
plt.ylabel('Encoded Dimension 2')
plt.title('2D Projection of days of the week')
for day in days:
    plt.text(DOW_map1[day], DOW_map2[day], day, ha='center')
plt.show()
# WEEKDAY_APPR_PROCESS_START to polar coords
col = 'WEEKDAY_APPR_PROCESS_START'
app[col+'_1'] = app[col].map(DOW_map1)
app[col+'_2'] = app[col].map(DOW_map2)
app.drop(columns=col, inplace=True)
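As a quick sanity check on why the circular encoding is attractive, the sketch below compares distances between days under a plain integer encoding and under the cos/sin encoding. The helper names (encode, circle_distance, integer_distance) are made up for this illustration; the mapping itself mirrors DOW_map above.

```python
import math

days = ['MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY', 'SUNDAY']

def encode(day):
    """Map a day name to a point on the unit circle (same idea as DOW_map1/DOW_map2)."""
    angle = 2 * math.pi * days.index(day) / 7.0
    return math.cos(angle), math.sin(angle)

def circle_distance(a, b):
    """Euclidean distance between two days in the 2D circular encoding."""
    (x1, y1), (x2, y2) = encode(a), encode(b)
    return math.hypot(x1 - x2, y1 - y2)

def integer_distance(a, b):
    """Distance between two days under a plain 0-6 integer encoding."""
    return abs(days.index(a) - days.index(b))

print('Integer encoding:  SUN-MON =', integer_distance('SUNDAY', 'MONDAY'),
      ' MON-TUE =', integer_distance('MONDAY', 'TUESDAY'))
print('Circular encoding: SUN-MON =', round(circle_distance('SUNDAY', 'MONDAY'), 4),
      ' MON-TUE =', round(circle_distance('MONDAY', 'TUESDAY'), 4))
```

With the integer encoding, Sunday and Monday end up six units apart, while on the circle every pair of adjacent days is separated by the same distance, including Sunday and Monday.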
For the housing-related features (e.g. LIVINGAPARTMENTS_MODE, BASEMENTAREA_AVG, etc) there are combinations of some PREFIX (e.g. LIVINGAPARTMENTS, BASEMENTAREA, etc) and some POSTFIX (e.g. MODE, MEDI, AVG, etc) into a variable PREFIX_POSTFIX. However, if one value for a given PREFIX is empty, the other values for that PREFIX will also be empty.
For each column which has some empty values, we want to add an indicator column which is 1 if the value in the corresponding column is empty, and 0 otherwise. However, if we do this with the housing-related features, we’ll end up with a bunch of duplicate columns! This is because the same samples have null values across all the POSTFIX columns for a given PREFIX. The same problem crops up with the CREDIT_BUREAU-related features. To handle this problem, after creating the null indicator columns, we’ll check for duplicate columns and merge them.
So, first we’ll add columns to indicate where there are empty values in each other column.
# Add indicator columns for empty values
for col in app:
    if col!='Test' and col!='TARGET':
        app_null = app[col].isnull()
        if app_null.sum()>0:
            app[col+'_ISNULL'] = app_null
Then we can label encode categorical features with only 2 possible values (that is, turn the labels into either 0 or 1).
# Label encoder
le = LabelEncoder()
# Label encode binary features in training set
for col in app:
    if col!='Test' and col!='TARGET' and app[col].dtype==object and app[col].nunique()==2:
        if col+'_ISNULL' in app.columns: #missing values here?
            app.loc[app[col+'_ISNULL'], col] = 'NaN'
        app[col] = le.fit_transform(app[col])
        if col+'_ISNULL' in app.columns: #re-remove missing vals
            app.loc[app[col+'_ISNULL'], col] = np.nan
Then we’ll one-hot encode the categorical features which have more than 2 possible values.
# Get categorical features to encode
cat_features = []
for col in app:
    if col!='Test' and col!='TARGET' and app[col].dtype==object and app[col].nunique()>2:
        cat_features.append(col)
# One-hot encode categorical features in train set
app = pd.get_dummies(app, columns=cat_features)
And finally we’ll remove duplicate columns. We’ll hash the columns and check if the hashes match before checking if all the values actually match, because it’s a lot faster than comparing \( O(N^2) \) columns elementwise.
# Hash columns
hashes = dict()
for col in app:
    hashes[col] = sha256(app[col].values).hexdigest()
# Get list of duplicate column lists
Ncol = app.shape[1] #number of columns
dup_list = []
dup_labels = -np.ones(Ncol)
for i1 in range(Ncol):
    if dup_labels[i1]<0: #if not already merged,
        col1 = app.columns[i1]
        t_dup = [] #list of duplicates matching col1
        for i2 in range(i1+1, Ncol):
            col2 = app.columns[i2]
            if ( dup_labels[i2]<0 #not already merged
                 and hashes[col1]==hashes[col2] #hashes match
                 and app[col1].equals(app[col2])): #cols are equal
                #then this is actually a duplicate
                t_dup.append(col2)
                dup_labels[i2] = i1
        if len(t_dup)>0: #duplicates of col1 were found!
            t_dup.append(col1)
            dup_list.append(t_dup)
# Merge duplicate columns
for iM in range(len(dup_list)):
    new_name = 'Merged'+str(iM)
    app[new_name] = app[dup_list[iM][0]].copy()
    app.drop(columns=dup_list[iM], inplace=True)
    print('Merged', dup_list[iM], 'into', new_name)
Merged ['INCOME_TO_ANNUITY_RATIO_ISNULL', 'INCOME_TO_ANNUITY_RATIO_BY_AGE_ISNULL', 'CREDIT_TO_ANNUITY_RATIO_ISNULL', 'CREDIT_TO_ANNUITY_RATIO_BY_AGE_ISNULL', 'AMT_ANNUITY_ISNULL'] into Merged0
Merged ['AMT_REQ_CREDIT_BUREAU_HOUR_ISNULL', 'AMT_REQ_CREDIT_BUREAU_MON_ISNULL', 'AMT_REQ_CREDIT_BUREAU_QRT_ISNULL', 'AMT_REQ_CREDIT_BUREAU_WEEK_ISNULL', 'AMT_REQ_CREDIT_BUREAU_YEAR_ISNULL', 'AMT_REQ_CREDIT_BUREAU_DAY_ISNULL'] into Merged1
Merged ['APARTMENTS_MEDI_ISNULL', 'APARTMENTS_MODE_ISNULL', 'APARTMENTS_AVG_ISNULL'] into Merged2
Merged ['BASEMENTAREA_MEDI_ISNULL', 'BASEMENTAREA_MODE_ISNULL', 'BASEMENTAREA_AVG_ISNULL'] into Merged3
Merged ['COMMONAREA_MEDI_ISNULL', 'COMMONAREA_MODE_ISNULL', 'COMMONAREA_AVG_ISNULL'] into Merged4
Merged ['PROPORTION_LIFE_EMPLOYED_ISNULL', 'DAYS_EMPLOYED_ISNULL'] into Merged5
Merged ['DEF_60_CNT_SOCIAL_CIRCLE_ISNULL', 'OBS_30_CNT_SOCIAL_CIRCLE_ISNULL', 'OBS_60_CNT_SOCIAL_CIRCLE_ISNULL', 'DEF_30_CNT_SOCIAL_CIRCLE_ISNULL'] into Merged6
Merged ['ELEVATORS_MEDI_ISNULL', 'ELEVATORS_MODE_ISNULL', 'ELEVATORS_AVG_ISNULL'] into Merged7
Merged ['ENTRANCES_MEDI_ISNULL', 'ENTRANCES_MODE_ISNULL', 'ENTRANCES_AVG_ISNULL'] into Merged8
Merged ['FLOORSMAX_MEDI_ISNULL', 'FLOORSMAX_MODE_ISNULL', 'FLOORSMAX_AVG_ISNULL'] into Merged9
Merged ['FLOORSMIN_MEDI_ISNULL', 'FLOORSMIN_MODE_ISNULL', 'FLOORSMIN_AVG_ISNULL'] into Merged10
Merged ['LANDAREA_MEDI_ISNULL', 'LANDAREA_MODE_ISNULL', 'LANDAREA_AVG_ISNULL'] into Merged11
Merged ['LIVINGAPARTMENTS_MEDI_ISNULL', 'LIVINGAPARTMENTS_MODE_ISNULL', 'LIVINGAPARTMENTS_AVG_ISNULL'] into Merged12
Merged ['LIVINGAREA_MEDI_ISNULL', 'LIVINGAREA_MODE_ISNULL', 'LIVINGAREA_AVG_ISNULL'] into Merged13
Merged ['NONLIVINGAPARTMENTS_MEDI_ISNULL', 'NONLIVINGAPARTMENTS_MODE_ISNULL', 'NONLIVINGAPARTMENTS_AVG_ISNULL'] into Merged14
Merged ['NONLIVINGAREA_MEDI_ISNULL', 'NONLIVINGAREA_MODE_ISNULL', 'NONLIVINGAREA_AVG_ISNULL'] into Merged15
Merged ['YEARS_BEGINEXPLUATATION_MEDI_ISNULL', 'YEARS_BEGINEXPLUATATION_MODE_ISNULL', 'YEARS_BEGINEXPLUATATION_AVG_ISNULL'] into Merged16
Merged ['YEARS_BUILD_MEDI_ISNULL', 'YEARS_BUILD_MODE_ISNULL', 'YEARS_BUILD_AVG_ISNULL'] into Merged17
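The hash-then-verify idea is easier to see on a tiny example. The sketch below is a variation on the loop above rather than the same code: it groups toy columns (plain Python lists with made-up names) by their SHA-256 digest, and then confirms equality within each group so that a hash collision could never cause a wrong merge.

```python
from collections import defaultdict
from hashlib import sha256

# Toy null-indicator columns (hypothetical names and values)
columns = {
    'A_AVG_ISNULL':  [True, False, False, True],
    'A_MODE_ISNULL': [True, False, False, True],
    'A_MEDI_ISNULL': [True, False, False, True],
    'B_ISNULL':      [False, False, True, True],
}

def find_duplicate_groups(columns):
    """Return lists of column names whose values are identical."""
    # Step 1: bucket columns by a hash of their contents (cheap comparison)
    by_hash = defaultdict(list)
    for name, values in columns.items():
        digest = sha256(repr(values).encode('utf-8')).hexdigest()
        by_hash[digest].append(name)
    # Step 2: within each bucket, confirm the values really are equal
    groups = []
    for names in by_hash.values():
        first = names[0]
        same = [n for n in names if columns[n] == columns[first]]
        if len(same) > 1:
            groups.append(same)
    return groups

groups = find_duplicate_groups(columns)
print(groups)
```

Because each column is hashed once and only columns in the same bucket are compared element by element, the expensive comparisons are limited to columns that are very likely to be duplicates anyway.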
Baseline Predictions
As a baseline, let’s use XGBoost with all the default parameters to predict the probabilities of applicants having trouble repaying their loans.
# Split data back into test + train
# (copy so the in-place operations below act on independent DataFrames)
train = app.loc[~app['Test'], :].copy()
test = app.loc[app['Test'], :].copy()
# Make SK_ID_CURR the index
train.set_index('SK_ID_CURR', inplace=True)
test.set_index('SK_ID_CURR', inplace=True)
# Ensure all data is stored as floats
train = train.astype(np.float32)
test = test.astype(np.float32)
# Target labels
train_y = train['TARGET']
# Remove test/train indicator column and target column
train.drop(columns=['Test', 'TARGET'], inplace=True)
test.drop(columns=['Test', 'TARGET'], inplace=True)
# Classification pipeline
xgb_pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('imputer', SimpleImputer(strategy='median')),
    ('classifier', XGBClassifier())
])
# Cross-validated AUROC
# (using scikit-learn's built-in 'roc_auc' scorer; make_scorer's
# needs_proba argument is deprecated in recent scikit-learn releases)
scores = cross_val_score(xgb_pipeline, train, train_y,
                         cv=3, scoring='roc_auc')
print('Mean AUROC:', scores.mean())
# Fit to training data
xgb_fit = xgb_pipeline.fit(train, train_y)
# Predict default probabilities of test data
test_pred = xgb_fit.predict_proba(test)
# Save predictions to file
df_out = pd.DataFrame()
df_out['SK_ID_CURR'] = test.index
df_out['TARGET'] = test_pred[:,1]
df_out.to_csv('xgboost_baseline.csv', index=False)
Mean AUROC: 0.7550236128842096
Calibration
One problem with tree-based models is that their predicted probabilities tend to be overconfident. That is, when the actual probability of class=1 is closer to 0.5, the model predicts probabilities closer to 0 or 1 than 0.5. We can measure the extent of this overconfidence (or underconfidence) of our classifier by looking at its calibration curve. The calibration curve groups samples into bins by their predicted probability, and plots the mean predicted probability in each bin against the actual fraction of positive samples in that bin. A model which is perfectly calibrated should show a calibration curve which lies on the identity (y=x) line.
# Predict probabilities for the training data
train_pred = cross_val_predict(xgb_pipeline,
                               train,
                               y=train_y,
                               method='predict_proba')
train_pred = train_pred[:,1] #only want p(default)
# Show calibration curve
fraction_of_positives, mean_predicted_value = \
    calibration_curve(train_y, train_pred, n_bins=10)
plt.figure()
plt.plot([0, 1], [0, 1], 'k:',
         label='Perfectly Calibrated')
plt.plot(mean_predicted_value,
         fraction_of_positives, 's-',
         label='XGBoost Predictions')
plt.legend()
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration curve for baseline XGBoost model')
plt.show()
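To make the two axes of that plot concrete, here is a dependency-free sketch of the binning behind a calibration curve. It is written in the spirit of calibration_curve with equal-width bins, but it is a simplified illustration (calibration_bins is a made-up helper, and the labels and probabilities are toy values), so bin-edge handling may differ from scikit-learn’s.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Return (fraction_of_positives, mean_predicted_value) for each non-empty bin."""
    positives = [0.0] * n_bins
    prob_sums = [0.0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        positives[b] += y
        prob_sums[b] += p
        counts[b] += 1
    frac_pos = [positives[b] / counts[b] for b in range(n_bins) if counts[b] > 0]
    mean_pred = [prob_sums[b] / counts[b] for b in range(n_bins) if counts[b] > 0]
    return frac_pos, mean_pred

# Toy predictions from a deliberately overconfident "model"
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_prob = [0.05, 0.08, 0.12, 0.15, 0.85, 0.90, 0.95, 0.88]

frac_pos, mean_pred = calibration_bins(y_true, y_prob, n_bins=2)
for fp, mp in zip(frac_pos, mean_pred):
    print('mean predicted:', round(mp, 3), ' fraction of positives:', round(fp, 3))
```

In this toy case the low bin predicts about 0.10 on average while a quarter of its samples are positive, and the high bin predicts about 0.90 while only three quarters are positive. Both points sit off the identity line, toward the extremes, which is what overconfidence looks like on a calibration plot.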
The model is pretty well calibrated as is, except at higher predicted probabilities. We can better calibrate our model by adjusting predicted probabilities to more accurately reflect the probability of loan default.
There are two commonly-used methods for model calibration:
- Sigmoid calibration (aka Platt’s scaling, which transforms the model’s predictions using a sigmoid so they more accurately reflect the actual probabilities)
- Isotonic calibration (which calibrates the model’s predictions using a method based on isotonic regression)
We’ll try both methods, and see if either betters the calibration of our model.
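For intuition about the isotonic option, the core of isotonic regression can be sketched with the classic pool-adjacent-violators procedure: walk through the outcomes in order of increasing model score, and whenever the running averages would decrease, pool the offending neighbors into a single block. This is only a bare-bones illustration with made-up outcomes (isotonic_fit is a hypothetical helper), not what CalibratedClassifierCV does internally line for line.

```python
def isotonic_fit(y_sorted):
    """Pool-adjacent-violators: non-decreasing least-squares fit to y_sorted.

    y_sorted holds the observed outcomes (0 or 1), ordered by increasing model score.
    """
    blocks = []  # each block is [sum_of_values, number_of_values]
    for y in y_sorted:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every sample in a block gets the block mean
    return fitted

# Outcomes of six hypothetical applicants, ordered from lowest to highest score
outcomes = [0, 1, 0, 0, 1, 1]
calibrated = isotonic_fit(outcomes)
print([round(p, 3) for p in calibrated])
```

The result is a step function that never decreases as the score increases. Sigmoid calibration instead fits a smooth two-parameter curve, which is less flexible but needs far less data.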
# Classification pipeline w/ isotonic calibration
# (the estimator is passed positionally because its keyword name
# differs between scikit-learn versions)
calib_pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('imputer', SimpleImputer(strategy='median')),
    ('classifier', CalibratedClassifierCV(
        XGBClassifier(),
        method='isotonic'))
])
# Classification pipeline w/ sigmoid calibration
sig_pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('imputer', SimpleImputer(strategy='median')),
    ('classifier', CalibratedClassifierCV(
        XGBClassifier(),
        method='sigmoid'))
])
# Predict probabilities w/ isotonic calibration
calib_pred = cross_val_predict(calib_pipeline,
                               train,
                               y=train_y,
                               method='predict_proba')
calib_pred = calib_pred[:,1] #only want p(default)
# Predict probabilities w/ sigmoid calibration
sig_pred = cross_val_predict(sig_pipeline,
                             train,
                             y=train_y,
                             method='predict_proba')
sig_pred = sig_pred[:,1] #only want p(default)
# Show calibration curve
fop_calib, mpv_calib = \
    calibration_curve(train_y, calib_pred, n_bins=10)
fop_sig, mpv_sig = \
    calibration_curve(train_y, sig_pred, n_bins=10)
plt.figure()
plt.plot([0, 1], [0, 1], 'k:',
         label='Perfectly Calibrated')
plt.plot(mean_predicted_value,
         fraction_of_positives, 's-',
         label='XGBoost Predictions')
plt.plot(mpv_calib, fop_calib, 's-',
         label='Calibrated Predictions - isotonic')
plt.plot(mpv_sig, fop_sig, 's-',
         label='Calibrated Predictions - sigmoid')
plt.legend()
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration curve for Calibrated XGBoost model')
plt.show()
# Cross-validated AUROC for isotonic
print('Mean AUROC with isotonic calibration:',
      roc_auc_score(train_y, calib_pred))
# Cross-validated AUROC for sigmoid
print('Mean AUROC with sigmoid calibration:',
      roc_auc_score(train_y, sig_pred))
Mean AUROC with isotonic calibration: 0.7557712988933571
Mean AUROC with sigmoid calibration: 0.755912952365782
Sigmoid calibration didn’t appear to work very well in this case. Isotonic calibration didn’t work perfectly either; however, it did appear to improve the model’s discrimination a little (the model without calibration has slightly poorer discrimination in that it is more likely to predict probabilities which are close to 0.5). Isotonic calibration is usually only recommended if one has \( \gg 1000 \) datapoints, which we do (the training set contains around 300,000 datapoints), so we’ll go ahead and use isotonic calibration. Now we can output our predictions after calibrating.
# Fit to the training data
calib_fit = calib_pipeline.fit(train, train_y)
# Predict default probabilities of the test data
test_pred = calib_fit.predict_proba(test)
# Save predictions to file
df_out = pd.DataFrame()
df_out['SK_ID_CURR'] = test.index
df_out['TARGET'] = test_pred[:,1]
df_out.to_csv('xgboost_calibrated.csv', index=False)
Resampling
The target class is very imbalanced: many more people successfully repaid their loans than had trouble repaying.
# Show distribution of target variable
sns.countplot(x='TARGET', data=app)
plt.title('Number of applicants who had trouble repaying')
plt.show()
We’ll use the imbalanced-learn package to re-sample our dataset such that the classes are balanced. There are several different common methods we could use for re-sampling:
- Random over-sampling (randomly repeat minority class examples in the training data)
- Random under-sampling (randomly drop majority class examples from the training data)
- Synthetic minority oversampling technique (SMOTE, generate additional synthetic training examples which are similar to the minority class)
We’ll try all three techniques, and see if any of the techniques give better predictive performance in terms of the AUROC.
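Before handing things over to imbalanced-learn, here is a rough, dependency-free sketch of what the first two strategies do. The helper names (split_by_class, random_oversample, random_undersample) and the toy data are invented for this illustration; the library’s RandomOverSampler and RandomUnderSampler offer many more options.

```python
import random
from collections import Counter

def split_by_class(X, y):
    """Group the rows of X by their class label."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class

def random_oversample(X, y, seed=0):
    """Repeat randomly-chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = split_by_class(X, y)
    n_target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(n_target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * n_target)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = split_by_class(X, y)
    n_target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        X_out.extend(rng.sample(rows, n_target))
        y_out.extend([label] * n_target)
    return X_out, y_out

# Toy imbalanced data: 8 samples of class 0, 2 samples of class 1
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print('Oversampled: ', Counter(y_over))
print('Undersampled:', Counter(y_under))
```

Note that resampling should only ever be applied to the training portion of the data. Resampling the samples we evaluate on would distort the class balance the model will face in practice.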
# A sampler that doesn't re-sample!
class DummySampler(object):
    def sample(self, X, y):
        return X, y
    def fit(self, X, y):
        return self
    def fit_resample(self, X, y): #same method name as imblearn samplers
        return self.sample(X, y)
# List of samplers to test
samplers = [
    ['Oversampling', RandomOverSampler()],
    ['Undersampling', RandomUnderSampler()],
    ['SMOTE', SMOTE()],
    ['No resampling', DummySampler()]
]
# Preprocessing pipeline
pre_pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('imputer', SimpleImputer(strategy='median'))
])
# Classifier
classifier = CalibratedClassifierCV(
    XGBClassifier(),
    method='isotonic')
# Compute AUROC and plot ROC for each type of sampler
plt.figure()
cv = StratifiedKFold(n_splits=3)
for name, sampler in samplers:
    # Cross-validated predictions on training set
    probas = np.zeros(train.shape[0]) # to store predicted probabilities
    for tr, te in cv.split(train, train_y):
        train_pre = pre_pipeline.fit_transform(train.iloc[tr]) #fit preprocessing on training fold
        test_pre = pre_pipeline.transform(train.iloc[te]) #apply the fitted preprocessing to test fold
        train_s, train_y_s = sampler.fit_resample(train_pre, train_y.iloc[tr]) #resample train fold
        probas_ = classifier.fit(train_s, train_y_s).predict_proba(test_pre) #predict test fold
        probas[te] = probas_[:,1]
    # Print AUROC value
    print(name, 'AUROC:', roc_auc_score(train_y, probas))
    # Plot ROC curve for this sampler
    fpr, tpr, threshs = roc_curve(train_y, probas)
    plt.plot(fpr, tpr, label=name)
plt.plot([0, 1], [0, 1], label='Chance')
plt.legend()
plt.show()
Oversampling AUROC: 0.7554566031903476
Undersampling AUROC: 0.754677367433835
SMOTE AUROC: 0.692312181508109
No resampling AUROC: 0.7544329991005927
Unfortunately it looks like none of the sampling techniques actually helped improve the AUROC score! The SMOTE resampling technique did even more poorly than simply under- or over-sampling. This is probably because SMOTE generates samples by interpolating between training samples in feature-space, but most of our features are binary. So, interpolation isn’t really adding diversity to the training data, it’s just adding noise and making it more difficult for our classification algorithm to decide where to put a threshold in that dimension.
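To see why interpolation is awkward for binary features, consider the simplified sketch below. It is not the full SMOTE algorithm (which picks one of a sample’s k nearest minority-class neighbors before interpolating); it only shows the interpolation step, on two made-up binary feature vectors, with a hypothetical helper called interpolate.

```python
import random

def interpolate(a, b, rng):
    """Create a synthetic point somewhere on the line segment between a and b."""
    t = rng.random()  # random position along the segment, in [0, 1)
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)
a = [1, 0, 1, 0]  # two hypothetical minority-class samples
b = [1, 1, 0, 0]  # with purely binary features
synthetic = interpolate(a, b, rng)
print(synthetic)
```

Wherever the two parents agree, the synthetic sample simply copies their value. Wherever they disagree, it gets a fractional value that no real applicant could have, so a tree has to place its split somewhere among values that don’t correspond to either category.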
Feature Importance
For our final predictions, we’ll use the isotonic calibrated model with no resampling (which we’ve already used to make predictions on the test data, back in the calibration section), since resampling didn’t appear to help increase the performance of our model. We can view how important each feature was to the model’s predictions by using XGBoost’s plot_importance function.
# Fit XGBoost model on the training data
# (fit on the DataFrame itself so the plot keeps the feature names;
# XGBoost handles missing values natively, so no imputation is needed here)
model = XGBClassifier()
model.fit(train, train_y)
# Show feature importances
plt.figure(figsize=(6, 15))
plot_importance(model, height=0.5, ax=plt.gca())
plt.show()
The three most important factors by far were the three “external sources.” Presumably these were credit scores or some other similar reliability measure from sources outside Home Credit. The credit-to-annuity ratio was also very important, as were other factors such as employment length, age, and gender. Gender?
# Show default probability by gender
plt.figure()
sns.barplot(x='CODE_GENDER', y="TARGET", data=train_o)
plt.show()
Indeed, female applicants only default on their loans around 7% of the time, while male applicants default around 10% of the time.
Conclusion
Now that we’ve built a working predictive model, it would normally be time to put it into production. However, there are a few things we should consider before doing so. Firstly, because this model could have a direct effect on a large number of individuals’ financial lives, we need to ensure our model is equitable, and isn’t discriminating by proxy against certain groups based on race, gender, ethnicity, etc. Also, because we used applicants’ personal information to train the model, we should ensure that those applicants have been informed their information would be used for such a purpose, that they have given consent, and that we have minimized the personally identifiable information present in the dataset. Still other ethical issues exist which we would want to address before putting our model into production. There are tools to help us ensure our model and data practices are more ethical, such as checklists like Deon and toolkits like the Ethics and Algorithms Toolkit.
Another thing to prepare for when considering putting a predictive model into deployment is the possibility of covariate shift or concept drift. We would want to have a monitoring system in place for a deployed model which could alert us when our data inputs appear to be changing over time, or when our model is no longer fitting the data as well as it used to (or, generally, when things are changing unexpectedly).
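As one example of what such monitoring could look like, the sketch below computes a population stability index (PSI), a drift measure often used in credit scoring, between a baseline sample of a feature and a newer sample. This is not something we used in this post: the function name, binning scheme, and toy data are all assumptions made for illustration, and a real monitoring setup would need alert thresholds chosen for the application.

```python
import math

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample and a newer sample of a single feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant baseline: everything lands in the first bin

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            b = int((v - lo) / width)
            b = max(0, min(b, n_bins - 1))  # clip values outside the baseline range
            counts[b] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Toy data: a baseline feature, an identical new batch, and a shifted new batch
baseline = [i / 100 for i in range(100)]
same = list(baseline)
shifted = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(baseline, same)
psi_shifted = population_stability_index(baseline, shifted)
print('PSI (no shift):', psi_same)
print('PSI (shifted): ', round(psi_shifted, 3))
```

A PSI of zero means the binned distributions match exactly, and the value grows as the new data drifts away from the baseline. Running a check like this per feature on each new batch of applications would be one simple way to flag covariate shift.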
Finally, remember that the point of building a predictive model to estimate how likely applicants are to pay back their loans is not just for Home Credit Group to use that information to accept or reject applicants. Rather, they want to be able to predict which principal and payment plan would be the best option for each applicant. An even more useful model would be one which predicted loan repayment probability given not only the applicant information, but also information about the proposed principal and payment schedule. This way, Home Credit Group could use the model as a tool to decide not only whether to accept or reject applicants, but to determine the specifics of a loan which would be best for each of their applicants.