Elo is a Brazilian debit and credit card brand. They offer credit and prepaid transactions, and have paired up with merchants in order to offer promotions to cardholders. To make those promotions more relevant and personalized, in a recent Kaggle competition Elo challenged Kagglers to predict customer loyalty based on transaction history. Presumably they plan to use a loyalty-predicting model to decide which promotions to offer each customer, based on how a given offer is predicted to affect that card owner's loyalty.
In this post, we’ll load, clean, and explore the raw data from the Elo Merchant Category Recommendation challenge. The name of the challenge is a bit misleading, seeing as the immediate goal is to predict customer loyalty - not to recommend merchants to card owners. In future posts, we’ll work on feature engineering, feature selection, modeling, and understanding our model’s predictions.
Let’s first load the packages we’ll use:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns
# Plot settings
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
sns.set()
# Tools to show 2d histogram and summarize a dataframe
!pip install git+http://github.com/brendanhasz/dsutils.git
from dsutils.printing import print_table, describe_df
from dsutils.eda import countplot2d
Data Loading and Cleaning
There are 5 different data tables in the Elo dataset:
train.csv
test.csv
merchants.csv
historical_transactions.csv
new_merchant_transactions.csv
The first file (train.csv) contains information about the credit card accounts and the loyalty score for each. test.csv contains the same information, but about different accounts, and it does not contain loyalty scores - because those are the accounts for which we are trying to predict the loyalty! merchants.csv has information about the merchants where the credit cards are being used to make transactions. Finally, the historical_transactions.csv and new_merchant_transactions.csv files contain lists of all the individual transactions which were made between the credit cards and the merchants, and some features of those transactions (time and date, purchase amount, etc).
Eventually, we'll want to build a predictive model, train it on the cards data in train.csv and the information in the merchants and transactions datasets associated with those cards, and then predict the loyalty scores for the cards in test.csv (using the data in the merchants and transactions datasets associated with cards in test.csv). But first, we need to load and clean the data!
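To make that eventual workflow concrete, here's a rough sketch of how the tables will fit together. Nothing here is meant to run yet - transactions and model are placeholders for things we'll build in later posts, and the aggregations shown are just examples:
# Sketch of the eventual workflow (placeholders only - not run in this post):
# trans_feats = transactions.groupby('card_id')['purchase_amount'].agg(['count', 'mean'])
# X_train = train.merge(trans_feats, on='card_id', how='left')
# model.fit(X_train.drop(columns=['target', 'card_id']), X_train['target'])
# X_test = test.merge(trans_feats, on='card_id', how='left')
# test_predictions = model.predict(X_test.drop(columns=['card_id']))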
Cards Data
Let's start by taking a look at the cards dataset, which contains information about the credit card accounts for which we'll be predicting the loyalty score. There are two files (test.csv and train.csv) which contain information about the cards - the only difference is that train.csv has the loyalty score for each card, while test.csv doesn't (because we are going to try to predict the loyalty score for those accounts). Let's load both files:
# Load card data
dtypes = {
    'card_id': 'str',
    'target': 'float32',
    'first_active_month': 'str',
    'feature_1': 'uint8',
    'feature_2': 'uint8',
    'feature_3': 'uint8',
}
train = pd.read_csv('../input/train.csv',
                    usecols=dtypes.keys(),
                    dtype=dtypes)
del dtypes['target']
test = pd.read_csv('../input/test.csv',
                   usecols=dtypes.keys(),
                   dtype=dtypes)
There are no nulls anywhere in train.csv:
describe_df(train)
Rows: 201917
Columns: 6
Memory usage: 4644171 Bytes
Column Dtype Nulls Min Mean Max Mode Uniques
first_active_month object 0 2011-11 2018-02 2017-09 75 unique values
card_id object 0 C_ID_00007093c1 C_ID_fffffd5772 C_ID_00007093c1 201917 unique values
feature_1 uint8 0 1 3.1053 5 3 [5 4 2 1 3]
feature_2 uint8 0 1 1.7454 3 1 [2 1 3]
feature_3 uint8 0 0 0.56557 1 1 [1 0]
target float32 0 -33.21928 -0.39363 17.965069 196910 unique values
But, there is a single row in test.csv which has a null value in the first_active_month column:
describe_df(test)
Rows: 123623
Columns: 5
Memory usage: 2348917 Bytes
Column Dtype Nulls Min Mean Max Mode Uniques
first_active_month object 1 2011-11 2018-01 2017-09 75 unique values
card_id object 0 C_ID_0001238066 C_ID_ffff756266 C_ID_0001238066 123623 unique values
feature_1 uint8 0 1 3.1093 5 3 [3 2 5 1 4]
feature_2 uint8 0 1 1.7418 3 1 [3 1 2]
feature_3 uint8 0 0 0.56438 1 1 [1 0]
To ensure we process the test and training data identically, we'll merge them into a single dataframe called cards. Since the test dataset has no target values (the loyalty score we are trying to predict), we'll set the target to NaN for the test data.
# Add target col to test
test['target'] = np.nan
# Merge test and train
cards = pd.concat([train, test])
As a quick sanity check, let's make sure there aren't any card_ids which are in both test and train:
print('Num unique in train: ', train['card_id'].nunique())
print('Num unique in test: ', test['card_id'].nunique())
print('The sum: ', train['card_id'].nunique()+test['card_id'].nunique())
print('Num unique in merged: ', cards['card_id'].nunique())
Num unique in train: 201917
Num unique in test: 123623
The sum: 325540
Num unique in merged: 325540
OK good, there aren’t. Now we can delete the original dataframes.
del train, test
Let’s take a look at the cards data.
cards.sample(10)
first_active_month | card_id | feature_1 | feature_2 | feature_3 | target | |
---|---|---|---|---|---|---|
169379 | 2017-10 | C_ID_567d82ef36 | 2 | 1 | 0 | 0.615178 |
108554 | 2016-10 | C_ID_227cd7034e | 3 | 1 | 1 | -2.484916 |
27697 | 2017-09 | C_ID_f373d896b8 | 1 | 1 | 0 | 3.046826 |
96331 | 2016-08 | C_ID_dfa0add68c | 3 | 2 | 1 | -2.363163 |
127928 | 2016-12 | C_ID_e9f7b3f773 | 3 | 1 | 1 | -4.481597 |
73586 | 2017-01 | C_ID_52a6c10ab6 | 5 | 1 | 1 | NaN |
186489 | 2017-10 | C_ID_aec2b848c3 | 2 | 3 | 0 | -0.022860 |
124401 | 2017-09 | C_ID_4e98b574cd | 4 | 1 | 0 | -1.480891 |
171637 | 2017-11 | C_ID_63623956d8 | 3 | 2 | 1 | 0.748818 |
53716 | 2017-11 | C_ID_f0bd39f9f1 | 5 | 1 | 1 | 0.708641 |
The card_ids always start with C_ID_. That's kind of a waste of space… All the card_ids are the same length:
cards['card_id'].apply(len).unique()
array([15])
If we cut off the prefix, the remaining strings appear to be all hexadecimal (digits 0-9 and a-f):
cards['card_id'].str.slice(5, 15).sample(10)
62166 fe65673a57
39739 d9ce8dcb66
82124 54ef8f34fc
20822 27a11a90b0
198797 d57fb31b71
116945 52080b2444
162159 add2b4ec65
7178 94c4d274b7
564 ab4fa4d1a4
12111 aff4236261
Name: card_id, dtype: object
But let’s check that this is the case for every entry:
(cards['card_id']
.str.slice(5, 15)
.apply(lambda x: all(e in '0123456789abcdef' for e in x))
.all())
True
Indeed, all the card_ids are hexadecimal. To save space, we could convert the card_ids to integers like this:
#cards['card_id'] = cards['card_id'].apply(lambda x: int(x, 16))
There are 5 bytes worth of hex info in the card ids (each hexadecimal digit represents 4 bits, so the 10 hexadecimal digits in the card ids represent 40 bits = 5 bytes of information). Inconveniently, the values span the full 5-byte range - so we'd have to use a 64-bit integer to represent them (which is 8 bytes, instead of a 32-bit int, which is only 4 bytes). However, that would be a waste of space, seeing as there are only 325,540 unique card_ids, which could easily be represented in 4 bytes by a uint32 (which stores values up to ~4 billion). The inefficiency of using a 64-bit representation for something where a 32-bit representation would do… bothers me. *eye twitches*
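As a quick sanity check on that arithmetic, a uint32 tops out far above the number of unique cards we actually have:
# A uint32 holds values up to ~4.29 billion - plenty for 325,540 unique cards
print(np.iinfo('uint32').max)      # 4294967295
print(cards['card_id'].nunique())  # 325540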
To use a 32-bit integer, we’ll create a map between the card_id and a unique integer which identifies it, and then map the string values to integer values.
# Create a map from card_id to unique int
card_id_map = dict(zip(
    cards['card_id'].values,
    cards['card_id'].astype('category').cat.codes.values
))
# Map the values
cards['card_id'] = cards['card_id'].map(card_id_map).astype('uint32')
Now our card_ids are 32-bit integers:
cards.sample(10)
first_active_month | card_id | feature_1 | feature_2 | feature_3 | target | |
---|---|---|---|---|---|---|
22073 | 2016-04 | 147496 | 4 | 2 | 0 | NaN |
47869 | 2017-09 | 324058 | 3 | 1 | 1 | NaN |
36251 | 2016-07 | 240816 | 3 | 2 | 1 | 1.346112 |
123084 | 2017-11 | 122847 | 5 | 1 | 1 | NaN |
164459 | 2017-04 | 140541 | 2 | 1 | 0 | -1.534803 |
10001 | 2016-07 | 123276 | 2 | 2 | 0 | NaN |
54630 | 2017-09 | 152158 | 4 | 3 | 0 | NaN |
91594 | 2016-11 | 86657 | 5 | 2 | 1 | NaN |
697 | 2017-04 | 160097 | 2 | 1 | 0 | NaN |
175982 | 2017-03 | 15589 | 2 | 1 | 0 | 0.198213 |
Next, we'll convert the first_active_month from a string to a datetime.
# Convert first_active_month to datetime
cards['first_active_month'] = pd.to_datetime(cards['first_active_month'],
format='%Y-%m')
Finally, we'll set the index to be the card_id.
# Make card_id the index
cards.set_index('card_id', inplace=True)
cards.sample(10)
first_active_month | feature_1 | feature_2 | feature_3 | target | |
---|---|---|---|---|---|
card_id | |||||
321941 | 2017-09-01 | 1 | 3 | 0 | NaN |
224276 | 2017-04-01 | 3 | 2 | 1 | 1.316305 |
23765 | 2016-12-01 | 3 | 2 | 1 | NaN |
299417 | 2017-03-01 | 2 | 2 | 0 | 0.164603 |
180218 | 2017-06-01 | 3 | 2 | 1 | -1.829190 |
167345 | 2017-07-01 | 4 | 3 | 0 | 0.469027 |
24617 | 2017-03-01 | 4 | 2 | 0 | 0.257355 |
96063 | 2017-05-01 | 5 | 2 | 1 | -0.527315 |
159802 | 2017-10-01 | 2 | 3 | 0 | NaN |
314344 | 2017-06-01 | 2 | 1 | 0 | NaN |
Mismatch between Merchants and Transactions Data
Both the transactions data (historical_transactions.csv and new_merchant_transactions.csv) and the merchants data (merchants.csv) have the following columns:
city_id
state_id
category_1
category_2
subsector_id
merchant_category_id
merchant_id
Obviously we want them both to have merchant_id so that we can connect the tables, but it's unclear whether the other columns in the transactions table represent information about the merchant or the transaction. It seems that merchant_category_id and subsector_id, at least, should definitely represent information about the merchant, and therefore should be the same between transactions involving the same merchant_id, and should also match the information in merchants.csv. If city_id and state_id represent, say, the corporate mailing address of the merchant, then those should also remain constant. On the other hand, if they represent the city and state where the transaction occurred, there may be many different values for the same merchant_id. Since category_1 and category_2 are in both datasets, it seems likely that they also represent information about the merchant.
However, let’s verify whether that’s true. First let’s load only the columns of interest:
# Datatypes of each column
dtypes = {
    'city_id': 'int16',
    'category_1': 'str',
    'merchant_category_id': 'int16',
    'merchant_id': 'str',
    'category_2': 'float16',
    'state_id': 'int8',
    'subsector_id': 'int8',
}
# Load the data
hist_trans = pd.read_csv('../input/historical_transactions.csv',
                         usecols=dtypes.keys(),
                         dtype=dtypes)
new_trans = pd.read_csv('../input/new_merchant_transactions.csv',
                        usecols=dtypes.keys(),
                        dtype=dtypes)
merchants = pd.read_csv('../input/merchants.csv',
                        usecols=dtypes.keys(),
                        dtype=dtypes)
# Merge new_merchant and historical transactions
trans = pd.concat([hist_trans, new_trans])
del hist_trans
del new_trans
Then we can check that the values of city_id, state_id, category_1, category_2, subsector_id, and merchant_category_id match between transactions which have the same merchant_id.
# For each column, count merchant_ids w/ >1 unique vals
gbo = trans.groupby('merchant_id')
nuniques = []
cols = []
for col in trans:
    if col == 'merchant_id': continue
    nuniques.append((gbo[col].nunique() > 1).sum())
    cols.append(col)
print_table(['Column', 'Number unique'],
            [cols, nuniques])
Column Number unique
city_id 25207
category_1 3494
merchant_category_id 82636
category_2 7844
state_id 17435
subsector_id 44373
Hmm, there’s a bunch of mismatches. A sizeable chunk of the merchants (there are only ~335,000) have features which change in the transactions data. How many of the transactions have merchant-specific values which don’t match the data in the merchants table?
# Join trans w/ merchants on merchant_id
# Check that all feature_transactions==feature_merchants
df = trans.merge(merchants, how='outer', on='merchant_id',
                 suffixes=('', '_merchants'))
cols = []
mismatches = []
for col in trans:
    if 'merchant_id' in col: continue
    sames = ((df[col] == df[col+'_merchants']) |
             (df[col].isnull() & df[col+'_merchants'].isnull())).sum()
    cols.append(col)
    mismatches.append(df.shape[0]-sames)
# Print the number of mismatches
print_table(['Column', 'Num mismatches'],
            [cols, mismatches])
print('Total number of transactions: ', df.shape[0])
# Clean up
del trans, merchants, df, sames
Column Num mismatches
city_id 9556796
category_1 3320439
merchant_category_id 5312999
category_2 4428102
state_id 4538880
subsector_id 4084017
Total number of transactions: 32328675
So, it appears that there is a pretty respectable mismatch between the merchant-specific values in the transactions data and in the merchants data. This is probably because the merchant’s properties may have changed in the time between the transaction and the time the merchants dataset was compiled. Therefore, we’ll use the values in the transactions table (and not the values in the merchants table) when creating features for our predictive model, because the values in the transactions table are more likely to reflect the merchant at the time when it was important - when the transaction occurred.
Merchants Data
Again, the merchants dataset (in merchants.csv) contains information about the merchants which card owners are making transactions with. Let's load the data, but because of the overlap with the transactions data just discussed, we won't load the columns which are in the transactions datasets.
# Datatypes of each column
# (don't load cols which are in transactions data)
dtypes = {
    'merchant_id': 'str',
    'merchant_group_id': 'uint32',
    'numerical_1': 'float32',
    'numerical_2': 'float32',
    'most_recent_sales_range': 'str',
    'most_recent_purchases_range': 'str',
    'avg_sales_lag3': 'float32',
    'avg_purchases_lag3': 'float32',
    'active_months_lag3': 'uint8',
    'avg_sales_lag6': 'float32',
    'avg_purchases_lag6': 'float32',
    'active_months_lag6': 'uint8',
    'avg_sales_lag12': 'float32',
    'avg_purchases_lag12': 'float32',
    'active_months_lag12': 'uint8',
    'category_4': 'str',
}
# Load the data
merchants = pd.read_csv('../input/merchants.csv',
                        usecols=dtypes.keys(),
                        dtype=dtypes)
Let’s take a look at the merchants data.
merchants.sample(10)
merchant_id | merchant_group_id | numerical_1 | numerical_2 | most_recent_sales_range | most_recent_purchases_range | avg_sales_lag3 | avg_purchases_lag3 | active_months_lag3 | avg_sales_lag6 | avg_purchases_lag6 | active_months_lag6 | avg_sales_lag12 | avg_purchases_lag12 | active_months_lag12 | category_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
105248 | M_ID_e879807296 | 2079 | -0.047556 | -0.047556 | E | D | 1.24 | 1.124378 | 3 | 1.24 | 1.152985 | 6 | 1.32 | 1.094075 | 12 | N |
275294 | M_ID_e2356c182b | 50115 | -0.057471 | -0.057471 | D | D | 0.79 | 0.834593 | 3 | 0.60 | 0.701092 | 6 | 0.75 | 0.836692 | 12 | Y |
198484 | M_ID_a21d4ed69a | 3648 | -0.037641 | -0.037641 | D | D | 1.05 | 1.094181 | 3 | 1.00 | 1.009369 | 6 | 0.99 | 0.992621 | 12 | N |
5902 | M_ID_8bfcf03fae | 18245 | -0.057471 | -0.057471 | E | E | 1.53 | 3.666667 | 3 | 8.49 | 18.250000 | 6 | 11.89 | 28.625000 | 12 | N |
119340 | M_ID_7e4b42b796 | 35 | 0.101168 | 0.101168 | E | E | 1.37 | 1.430769 | 3 | 1.48 | 1.592308 | 6 | 1.51 | 1.595910 | 12 | Y |
183878 | M_ID_2b0b769211 | 60289 | -0.057471 | -0.057471 | D | D | 1.05 | 1.028902 | 3 | 1.12 | 1.034682 | 6 | 1.15 | 1.071291 | 12 | Y |
207545 | M_ID_c88793ae2c | 58298 | -0.057471 | -0.057471 | D | D | 0.92 | 0.836364 | 3 | 0.64 | 0.687879 | 6 | 0.53 | 0.633333 | 12 | N |
42534 | M_ID_5194dda334 | 22398 | -0.057471 | -0.057471 | E | E | 1.41 | 1.444444 | 3 | 1.75 | 1.777778 | 6 | 1.65 | 1.708333 | 12 | N |
325227 | M_ID_7a4ab98f25 | 5441 | 0.002019 | 0.002019 | C | C | 0.88 | 0.905232 | 3 | 0.79 | 0.903220 | 6 | 0.87 | 1.015002 | 12 | N |
244815 | M_ID_9684efbe70 | 79120 | -0.047556 | -0.047556 | D | D | 0.98 | 0.990826 | 3 | 0.99 | 1.029817 | 6 | 0.97 | 1.066896 | 12 | Y |
We'll map the merchant_ids to integers like we did for the card_ids. We'll also transform some columns to be more manageable, and encode categorical columns to integers.
# Map merchant_id to integer
merch_id_map = dict(zip(
    merchants['merchant_id'].values,
    merchants['merchant_id'].astype('category').cat.codes.values
))

def preprocess_merch_data(df):
    # Convert merchant ID to numbers
    df['merchant_id'] = df['merchant_id'].map(merch_id_map).astype('float32')
    # Inverse transforms
    inversions = [
        'avg_sales_lag3',
        'avg_sales_lag6',
        'avg_sales_lag12',
        'avg_purchases_lag3',
        'avg_purchases_lag6',
        'avg_purchases_lag12',
    ]
    for col in inversions:
        df[col] = 1.0/df[col]
    # Encode categorical columns
    bool_map = {'Y': 1, 'N': 0}
    five_map = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
    conversions = [
        ('category_4', bool_map, 'uint8'),
        ('most_recent_sales_range', five_map, 'uint8'),
        ('most_recent_purchases_range', five_map, 'uint8'),
    ]
    for col, mapper, new_type in conversions:
        df[col] = df[col].map(mapper).astype(new_type)

# Preprocess the merchants data
preprocess_merch_data(merchants)
There are no duplicate rows in the merchants dataset:
print("Number of duplicate rows in merchants.csv: %d"
% merchants.duplicated().sum())
Number of duplicate rows in merchants.csv: 0
But, there are duplicate merchant_ids:
print("Number of duplicate merchant_ids: %d" %
merchants['merchant_id'].duplicated().sum())
Number of duplicate merchant_ids: 63
Let’s take a look at some of the duplicate entries to see which ones look like they should be removed.
# Show some of the duplicates
duplicates = merchants['merchant_id'].duplicated(keep=False)
merchants[duplicates].sort_values('merchant_id').head(6)
merchant_id | merchant_group_id | numerical_1 | numerical_2 | most_recent_sales_range | most_recent_purchases_range | avg_sales_lag3 | avg_purchases_lag3 | active_months_lag3 | avg_sales_lag6 | avg_purchases_lag6 | active_months_lag6 | avg_sales_lag12 | avg_purchases_lag12 | active_months_lag12 | category_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
111 | 307.0 | 35 | -0.057471 | -0.057471 | 4 | 4 | 1.515151 | 1.500000 | 3 | 1.515151 | 1.500000 | 3 | 1.515151 | 1.500000 | 3 | 1 |
112 | 307.0 | 35 | 0.408530 | 0.408530 | 2 | 3 | 0.854701 | 0.901186 | 3 | 0.854701 | 0.919170 | 6 | 0.840336 | 1.002988 | 12 | 1 |
110970 | 854.0 | 35 | -0.057471 | -0.057471 | 0 | 0 | 1.075269 | 1.053421 | 3 | 1.162791 | 1.147823 | 6 | 1.315789 | 1.328655 | 12 | 1 |
110969 | 854.0 | 35 | -0.047556 | -0.057471 | 4 | 4 | 0.144300 | 0.078708 | 3 | 0.118765 | 0.063069 | 6 | 0.116686 | 0.061653 | 7 | 1 |
70494 | 10108.0 | 35 | 0.031763 | 0.031763 | 4 | 4 | 0.793651 | 0.875486 | 3 | 0.833333 | 0.797872 | 6 | 0.869565 | 0.791872 | 12 | 0 |
70493 | 10108.0 | 35 | -0.057471 | -0.057471 | 4 | 4 | 1.000000 | 1.000000 | 1 | 1.000000 | 1.000000 | 1 | 1.000000 | 1.000000 | 1 | 0 |
Unfortunately it’s not super clear which of the entries is invalid, outdated, or incomplete. If one entry had empty values and another didn’t then we could just remove the entry with empty values. As it is, we’ll just remove all duplicate entries but the first.
# Drop duplicate entries
merchants.drop_duplicates(subset='merchant_id',
keep='first', inplace=True)
Transactions Data
The transactions datasets contain information about transactions which card owners have made with the merchants in the merchants dataset. historical_transactions.csv contains historical transactions (transactions since the card account was opened), while new_merchant_transactions.csv contains transactions from the period leading up to the time the loyalty score was calculated.
In order to make predictions about the loyalty of card owners, we’ll have to use their transaction histories to come up with informative features about each customer. But, before we can do that, we have to load and clean the data!
Since there are many transactions per customer (or per merchant), the transactions datasets are much larger than the cards and merchants datasets. We've been doing this the whole time, but this is where it really pays off: specifying the datatypes to use for each column when calling pandas' read_csv speeds up loading considerably and keeps the memory footprint down. Let's load the historical and new transactions using this method.
# Datatypes of each column
dtypes = {
    'authorized_flag': 'str',
    'card_id': 'str',
    'city_id': 'int16',
    'category_1': 'str',
    'installments': 'int8',
    'category_3': 'str',
    'merchant_category_id': 'int16',
    'merchant_id': 'str',
    'month_lag': 'int8',
    'purchase_amount': 'float32',
    'purchase_date': 'str',
    'category_2': 'float32',
    'state_id': 'int8',
    'subsector_id': 'int8',
}
# Load the data
hist_trans = pd.read_csv('../input/historical_transactions.csv',
                         usecols=dtypes.keys(),
                         dtype=dtypes)
new_trans = pd.read_csv('../input/new_merchant_transactions.csv',
                        usecols=dtypes.keys(),
                        dtype=dtypes)
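If you're curious how much the explicit dtypes actually save, one way to check (a sketch - it means reading a file twice, so it's left commented out here) is to compare memory_usage with and without them:
# Compare memory footprint with and without explicit dtypes (sketch, not run):
# df_default = pd.read_csv('../input/historical_transactions.csv',
#                          usecols=dtypes.keys())
# print('Default dtypes:  %d bytes' % df_default.memory_usage(deep=True).sum())
# print('Explicit dtypes: %d bytes' % hist_trans.memory_usage(deep=True).sum())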
We also need to convert the card_id to an integer as before, convert merchant_id to an integer in the same way, convert the purchase_date column (which has been loaded as a string) to datetime format, and encode the categorical columns.
def preprocess_trans_data(df):
    # Convert card_id and merchant_id to numbers
    df['card_id'] = df['card_id'].map(card_id_map).astype('uint32')
    df['merchant_id'] = df['merchant_id'].map(merch_id_map).astype('float32')
    # Convert purchase_date to datetime
    df['purchase_date'] = df['purchase_date'].str.slice(0, 19)
    df['purchase_date'] = pd.to_datetime(df['purchase_date'],
                                         format='%Y-%m-%d %H:%M:%S')
    # Encode categorical columns
    bool_map = {'Y': 1, 'N': 0}
    three_map = {'A': 0, 'B': 1, 'C': 2}
    conversions = [
        ('authorized_flag', bool_map, 'uint8'),
        ('category_1', bool_map, 'uint8'),
        ('category_3', three_map, 'float32'),  # has NaNs, so have to use float
    ]
    for col, mapper, new_type in conversions:
        df[col] = df[col].map(mapper).astype(new_type)

# Preprocess the transactions data
preprocess_trans_data(hist_trans)
preprocess_trans_data(new_trans)
However, there are a few funky (yet valid) values in the dataset. Taking a quick look at the dataframe, there are a few values which stand out as odd:
describe_df(hist_trans)
Rows: 29112361
Columns: 14
Memory usage: 1106269798 Bytes
Column Dtype Nulls Min Mean Max Mode Uniques
authorized_flag uint8 0 0 0.91355 1 1 [1 0]
card_id uint32 0 0 1.6251e+05 325539 78027 325540 unique values
city_id int16 0 -1 129.33 347 69 308 unique values
category_1 uint8 0 0 0.071586 1 0 [0 1]
installments int8 0 -25 0.64188 12 0 15 unique values
category_3 float32 178159 0.0 0.53111 2.0 [0. 1. 2.]
merchant_category_id int16 0 -1 481.01 891 705 327 unique values
merchant_id float32 138481 0.0 1.5751e+05 334632.0 326311 unique values
month_lag int8 0 -13 -4.4873 0 -2 14 unique values
purchase_amount float32 0 -0.7469078 0.061346 6010604.0 215014 unique values
purchase_date datetime64[ns] 0 2017-01-01 00:00:08 2018-02-28 23:59:51 16395300 unique values
category_2 float32 2652864 1.0 1.777 5.0 [1. 3. 5. 2. 4.]
state_id int8 0 -1 10.567 24 9 25 unique values
subsector_id int8 0 -1 26.848 41 33 41 unique values
There are values of -1 in the city_id, installments, and state_id columns, while the vast majority of these features' ranges are non-negative. Especially for installments, it makes no sense to have a negative number of payment installments. Let's look at the distribution of each of these columns individually, starting with installments:
# Histogram of installments
plt.hist(hist_trans['installments'],
bins=np.arange(-25.5, 12.5, 1.0))
plt.axvline(x=0, color='k', linestyle='--')
plt.ylabel('Count')
plt.xlabel('installments')
plt.yscale('log', nonposy='clip')
plt.show()
Hmm, values of -1 and -25? Those are probably to indicate missing values, so we’ll set the installments with negative numbers to NaN:
# Set negative installments to nan
hist_trans.loc[hist_trans['installments']<0, 'installments'] = np.nan
new_trans.loc[new_trans['installments']<0, 'installments'] = np.nan
There's a similar problem for city_id and state_id:
# Histogram of city_id and state_id
plt.subplot(121)
plt.hist(hist_trans['city_id'],
bins=np.arange(-10.5, 12.5, 1.0))
plt.axvline(x=0, color='k', linestyle='--')
plt.ylabel('Count')
plt.xlabel('city_id')
plt.yscale('log', nonposy='clip')
plt.subplot(122)
plt.hist(hist_trans['state_id'],
bins=np.arange(-10.5, 12.5, 1.0))
plt.axvline(x=0, color='k', linestyle='--')
plt.ylabel('Count')
plt.xlabel('state_id')
plt.yscale('log', nonposy='clip')
plt.show()
We’ll set those to NaN too, seeing as they likely indicate missing values.
# Set negative ids to nan
hist_trans.loc[hist_trans['city_id']<0, 'city_id'] = np.nan
new_trans.loc[new_trans['city_id']<0, 'city_id'] = np.nan
hist_trans.loc[hist_trans['state_id']<0, 'state_id'] = np.nan
new_trans.loc[new_trans['state_id']<0, 'state_id'] = np.nan
Exploratory Data Analysis
Now that we’ve loaded and cleaned the data, let’s do some EDA!
Cards
The cards dataset (split into test.csv and train.csv) contains information about the credit card accounts for which we're trying to predict the customer loyalty. Pretty much all this dataset contains is the card ID, the loyalty score, the month the account was first active, and three anonymized categorical features.
describe_df(cards)
Rows: 325540
Columns: 5
Memory usage: 7487420 Bytes
Column Dtype Nulls Min Mean Max Mode Uniques
first_active_month datetime64[ns] 1 2011-11-01 00:00:00 2018-02-01 00:00:00 76 unique values
feature_1 uint8 0 1 3.1068 5 3 [5 4 2 1 3]
feature_2 uint8 0 1 1.744 3 1 [2 1 3]
feature_3 uint8 0 0 0.56512 1 1 [1 0]
target float32 123623 -33.21928 -0.39363 17.965069 196910 unique values
Let’s start by taking a look at the distribution of loyalty scores, which is the target variable we’ll be trying to predict when we build a predictive model.
# Loyalty score (aka the target)
cards['target'].hist(bins=50)
plt.xlabel('Loyalty score')
plt.ylabel('Number of cards')
plt.title('Loyalty Score Distribution')
plt.show()
Yikes. The first, most obvious thing is those outliers down around -33. About 1% of the training data have these outlier values:
print('Percent of cards with values < -20: %0.3f'
      % (100*(cards['target']<-20).sum() /
         cards['target'].notnull().sum()))
Percent of cards with values < -20: 1.093
And all of these outlier values are exactly the same:
cards.loc[cards['target']<-20, 'target'].unique()
array([-33.21928024])
Those outliers are definitely something we'll have to pay attention to when building a predictive model. With a normal linear model (like, say, a non-robust linear regression), the outliers would have an enormous effect on the model. We might want to perform some kind of outlier detection and train a separate model only on non-outlier data.
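One simple option along those lines (just a sketch of the idea - we won't commit to it here) is to add a binary flag marking the outlier cards, which a separate classifier could later try to predict:
# Flag the outlier cards (sketch - not applied here)
# cards['outlier'] = (cards['target'] < -20).astype('uint8')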
Another thing to note is the shape of the distribution - it’s not normally distributed (the tails are too heavy and the central part is too sharp). To me, it looks more like a log ratio distribution:
# Show a log ratio distribution
a = np.random.rand(10000)
b = np.random.rand(10000)
c = np.log(a/b)
plt.hist(c, bins=np.linspace(-35, 20, 50))
plt.title('Log Ratio Distribution')
plt.show()
This suggests that the loyalty score may actually be the log of the ratio of two numbers, for example "number of new purchases" to "number of historical purchases". The outlier values then may be cases when the denominator was zero. Since the log of zero is negative infinity, a small positive number may have been added to the denominator to prevent infinite loyalty scores. A small number like 1e-14 ends up being pretty close to the outlier values when the log is taken:
np.log(1e-14)
-32.23619130191664
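Working backwards, we can also ask what value in the denominator would produce exactly the outlier score - it comes out to a few parts in 10^15, which is at least consistent with the small-epsilon guess:
# What denominator would give exactly the outlier value?
print(np.exp(-33.21928))  # roughly 3.7e-15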
The cards dataset also contains the month in which the account was first active. Looking at the distribution for that variable, we can see a progressive ramping up of the number of cards, with peaks just before the new year.
# Plot first_active_month distribution
fam = cards['first_active_month'].value_counts().sort_index()
fam.plot()
plt.ylabel('Number of Cards')
plt.xlabel('first_active_month')
plt.show()
We can also look at the loyalty score (our target variable) as a function of the month the account was first active. The confidence intervals in the plot below shrink the closer we get to 2018, because there are more samples at later timepoints (which you can see in the plot above). There doesn’t seem to be a huge change in the loyalty score as a function of first active month, except perhaps a slight uptick in loyalty scores over the course of 2017.
# first_active_month vs loyalty score
sns.lineplot(x='first_active_month', y='target', data=cards)
plt.show()
Lastly, the cards dataset contains three categorical columns (with intentionally obfuscated titles): feature_1, feature_2, and feature_3.
# category counts
plt.figure(figsize=(6.4, 10))
plt.subplot(311)
sns.countplot(x='feature_1', data=cards)
plt.subplot(312)
sns.countplot(x='feature_2', data=cards)
plt.subplot(313)
sns.countplot(x='feature_3', data=cards)
plt.tight_layout()
plt.show()
The three categorical distributions are reasonably well-balanced (at least, the category counts are all within an order of magnitude of each other).
If we look at the loyalty score (the target variable) as a function of these categories, it appears that there are small differences in the loyalty score as each of the three category values increases (except perhaps when feature_2=2).
# Features vs loyalty score (error bars show confidence intervals)
plt.figure(figsize=(6.4, 10))
plt.subplot(311)
sns.barplot(x='feature_1', y='target', data=cards)
plt.subplot(312)
sns.barplot(x='feature_2', y='target', data=cards)
plt.subplot(313)
sns.barplot(x='feature_3', y='target', data=cards)
plt.tight_layout()
plt.show()
However, the error bars plotted above are the estimated confidence intervals of the mean - the standard deviations are much larger. If we set the error bars to show the standard deviation, we can see the raw variance in the data is huge compared to the differences between categories:
# Features vs loyalty score with std dev
plt.figure(figsize=(6.4, 10))
plt.subplot(311)
sns.barplot(x='feature_1', y='target',
data=cards, ci='sd')
plt.subplot(312)
sns.barplot(x='feature_2', y='target',
data=cards, ci='sd')
plt.subplot(313)
sns.barplot(x='feature_3', y='target',
data=cards, ci='sd')
plt.tight_layout()
plt.show()
Merchants
The merchants dataset has information about every merchant that any credit card account made a transaction with. Now that we’ve cleaned and loaded that dataset, let’s take a look again at the basic information about each column.
describe_df(merchants)
Rows: 334633
Columns: 16
Memory usage: 26101374 Bytes
Column Dtype Nulls Min Mean Max Mode Uniques
merchant_id float32 0 0.0 1.6732e+05 334632.0 334633 unique values
merchant_group_id uint32 0 1 31033 112586 35 109389 unique values
numerical_1 float32 0 -0.05747065 0.010345 183.7351 949 unique values
numerical_2 float32 0 -0.05747065 0.0070865 182.07932 941 unique values
most_recent_sales_range uint8 0 0 3.3885 4 4 [4 3 2 1 0]
most_recent_purchases_range uint8 0 0 3.3828 4 4 [4 3 2 1 0]
avg_sales_lag3 float64 13 -2.5 1.0017 3.0303 3372 unique values
avg_purchases_lag3 float64 0 0 0.97086 2.9985 99890 unique values
active_months_lag3 uint8 0 1 2.9941 3 3 [3 2 1]
avg_sales_lag6 float64 13 -1.3514 1.0153 6.25 4507 unique values
avg_purchases_lag6 float64 0 0 0.96975 5.9864 135004 unique values
active_months_lag6 uint8 0 1 5.9474 6 6 [6 2 1 3 4 5]
avg_sales_lag12 float64 13 -1.7544 1.011 11.111 5009 unique values
avg_purchases_lag12 float64 0 0 0.96304 10.17 172513 unique values
active_months_lag12 uint8 0 1 11.599 12 12 12 unique values
category_4 uint8 0 0 0.28704 1 0 [0 1]
The merchant_group_id column contains the group each merchant belongs to. Presumably this corresponds to a business group (e.g. "Walmart", where individual merchants are individual stores), and not some sort of business-sector identifier. One single group has around 50,000 merchants in it, and around 3,000 groups have more than 10 merchants, but most have fewer than 10:
# Show number of merchants per merchant group
mpmg = merchants['merchant_group_id'].value_counts().values
plt.plot(np.arange(len(mpmg))+1, mpmg)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Unique merchant group')
plt.ylabel('Number of merchants')
plt.show()
The merchants dataset also contains two anonymized features: numerical_1 and numerical_2. The two distributions are very similar and are both very skewed (notice the log Y axis).
# Raw distributions for numerical cols
plt.figure()
plt.subplot(121)
merchants['numerical_1'].hist()
plt.yscale('log')
plt.ylabel('Number of merchants')
plt.xlabel('numerical_1')
plt.subplot(122)
merchants['numerical_2'].hist()
plt.yscale('log')
plt.xlabel('numerical_2')
plt.tight_layout()
plt.show()
Not only do the two distributions look very similar, but the two variables are highly correlated:
rho, pv = spearmanr(merchants['numerical_1'].values,
merchants['numerical_2'].values)
print('Correlation coefficient = %0.3f' % rho)
Correlation coefficient = 0.891
In fact, for 90% of the merchants, the two values are exactly the same!
(merchants['numerical_1'] == merchants['numerical_2']).mean()
0.89946
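Given how redundant the two columns are, one option for feature engineering later (a sketch, not applied here) is to keep numerical_1 plus their mostly-zero difference, rather than both raw columns:
# Keep one column plus the difference (sketch - not applied here)
# merchants['numerical_diff'] = (merchants['numerical_1']
#                                - merchants['numerical_2'])
# merchants.drop('numerical_2', axis=1, inplace=True)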
The most_recent_sales_range and most_recent_purchases_range columns contain the bin each merchant falls into in terms of recent sale amounts and purchase amounts, respectively. There are a lot more merchants with higher category values here, suggesting that the lower the category value, the higher the dollar amount (assuming the merchant-sales relationship is log-like, which is common).
# most_recent_sales_range and most_recent_purchases_range
plt.figure()
plt.subplot(121)
sns.countplot(x='most_recent_sales_range',
data=merchants)
plt.yscale('log')
plt.subplot(122)
sns.countplot(x='most_recent_purchases_range',
data=merchants)
plt.yscale('log')
plt.tight_layout()
plt.show()
Not surprisingly, the two categorical columns are correlated - a merchant's sales likely correspond at least somewhat closely to its purchases.
rho, pv = spearmanr(merchants['most_recent_sales_range'].values,
merchants['most_recent_purchases_range'].values)
print('Correlation coefficient = %0.3f' % rho)
Correlation coefficient = 0.663
But, the categories aren’t perfectly correlated:
# Show joint counts
countplot2d('most_recent_sales_range',
'most_recent_purchases_range',
merchants, log=True)
The avg_{sales,purchases}_lag{3,6,12} columns store normalized sales and purchases in the past 3, 6, and 12 months. During the data loading and cleaning, we took the inverse of these columns so they'd be less ridiculously skewed. So, in the plots below, values closer to 0 indicate larger average purchases or sales.
def plothist3(df, cols, bins=30):
    # Plot histograms of three columns of a dataframe
    plt.figure(figsize=(6.4, 10))
    for i, lab in enumerate(cols):
        plt.subplot(3, 1, i+1)
        df[lab].hist(bins=bins)
        plt.xlabel(lab)
    plt.tight_layout()
    plt.show()
plothist3(merchants,
['avg_sales_lag3', 'avg_sales_lag6', 'avg_sales_lag12'],
bins=np.linspace(-1, 3, 41))
plothist3(merchants,
['avg_purchases_lag3',
'avg_purchases_lag6',
'avg_purchases_lag12'],
bins=np.linspace(-1, 3, 41))
The active_months_lag columns contain how many of the past 3, 6, or 12 months the merchant has been active. The vast majority of merchants are active every month (notice the log Y axis in the plots below), though there are around 1,000 merchants who were only active during one or two months in the past year!
plothist3(merchants,
['active_months_lag3',
'active_months_lag6',
'active_months_lag12'],
bins=np.linspace(0.5, 12.5, 13))
The last column in the dataset is an anonymized binary column called category_4.
# category_4
sns.countplot(x='category_4', data=merchants)
plt.show()
Transactions
Again, the transactions datasets contain records of each transaction made with a card in the cards dataset and a merchant in the merchants dataset. They are split into two separate datasets: historical transactions and "new merchant" transactions. The "new" transactions are transactions which occurred in a period leading up to the time the loyalty score was calculated, and the historical transactions are all transactions before that (back to the start date of the dataset, at least).
To explore the transactions in the dataset collectively, here we’ll merge historical and new transactions. However, when building a predictive model, we’ll want to keep them separate, so that our model is given information as to when the loyalty score was calculated.
# Merge new and historical transactions
trans = pd.concat([hist_trans, new_trans])
# Show info about the merged table
describe_df(trans)
Rows: 31075392
Columns: 14
Memory usage: 2050975872 Bytes
Column Dtype Nulls Min Mean Max Mode Uniques
authorized_flag uint8 0 0 0.91901 1 1 [1 0]
card_id uint32 0 0 1.6255e+05 325539 78027 325540 unique values
city_id float64 2577973 1 141.46 347 307 unique values
category_1 uint8 0 0 0.069094 1 0 [0 1]
installments float64 234271 0 0.65705 12 13 unique values
category_3 float32 234081 0.0 0.53503 2.0 [0. 1. 2.]
merchant_category_id int16 0 -1 477.85 891 705 331 unique values
merchant_id float32 164697 0.0 1.5495e+05 334632.0 334633 unique values
month_lag int8 0 -13 -4.1106 2 -2 16 unique values
purchase_amount float32 0 -0.7469078 0.022308 6010604.0 221246 unique values
purchase_date datetime64[ns] 0 2017-01-01 00:00:08 2018-04-30 23:59:59 17717516 unique values
category_2 float32 2764609 1.0 1.7651 5.0 [1. 3. 5. 2. 4.]
state_id float64 2764609 1 11.718 24 24 unique values
subsector_id int8 0 -1 26.793 41 33 41 unique values
We'll take a look at the distribution of values for each feature. First off, the authorized_flag column, which stores whether the transaction was authorized or rejected.
# authorized_flag
sns.countplot(x='authorized_flag', data=trans)
plt.title('Number of authorized transactions')
plt.show()
Most transactions were authorized, but there’s a sizeable chunk of transactions which were declined. The proportion of authorized transactions may end up being an important feature in our predictive model, seeing as card owners who are having their transactions repeatedly declined may be less “loyal” to the card company. If there were an insignificant number of declined transactions we might just want to toss the column entirely - but, seeing as there are so many, we should keep it in case it ends up being an important predictor of loyalty.
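As a preview of the kind of per-card feature we'll build in the next post, the authorization rate is just a groupby away (a sketch - we won't compute it here):
# Fraction of each card's transactions that were authorized (sketch, not run)
# auth_rate = trans.groupby('card_id')['authorized_flag'].mean()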
Next, we'll look at card_id. This is essentially the index of the cards dataset, so by counting the number of transactions per unique card_id, we can see how many transactions were made on each card. This too seems likely to be an important feature for predicting customer loyalty.
# card_id
plt.plot(trans['card_id'].value_counts().values)
plt.xlabel('Unique card_id')
plt.ylabel('Number of transactions')
plt.title('Number of transactions per card_id')
plt.show()
It looks like there are a few customers (card_ids) with a very high number of transactions, but most card_ids have fewer than 500 or so transactions.
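The raw transaction count per card is another cheap feature to keep in mind for later - a one-line sketch:
# Number of transactions per card (sketch, not run)
# n_trans_per_card = trans.groupby('card_id').size()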
Let's take a look at the number of transactions per city and state. There's one city with a very high number of transactions, but most have between ten thousand and a million transactions (except for a few with a very low number of transactions).
# city_id
plt.figure(figsize=(5, 40))
sns.countplot(y='city_id', data=trans,
order=trans['city_id'].value_counts().index)
plt.yticks(fontsize=8)
plt.xscale('log')
plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.title('Number of transactions per City')
plt.show()
The number of transactions per state shows a similar relationship: one state has a little over ten million transactions, and the rest have between 100 thousand and one million (except for one which has fewer than 10,000!).
# state_id
plt.figure(figsize=(8, 4))
sns.countplot(x='state_id', data=trans,
order=trans['state_id'].value_counts().index)
plt.ylabel('Count')
plt.xlabel('state_id')
plt.yscale('log')
plt.gca().xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.title('Number of transactions per State')
plt.show()
There are three anonymized categorical variables included in the transactions dataset: category_1, category_2, and category_3. The category names are intentionally completely noninformative, presumably in order to preserve either anonymity or proprietary information. However, we can still look at the distributions for each of these three variables.
# category_1
sns.countplot(x='category_1', data=trans)
plt.show()
# category_2
sns.countplot(x='category_2', data=hist_trans)
plt.gca().xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.show()
# category_3
sns.countplot(x='category_3', data=trans)
plt.gca().xaxis.set_major_formatter(FormatStrFormatter('%.0f'))
plt.show()
Hmm, the distribution for category_1 looks a lot like the authorized_flag distribution, just flipped (i.e. 1 instead of 0 and vice-versa). Are the two features redundant?
# Joint distribution of authorized_flag vs category_1
N, e1, e2 = np.histogram2d(trans['authorized_flag'],
trans['category_1'],
bins=[[-0.5, 0.5, 1.5], [-0.5, 0.5, 1.5]])
sns.heatmap(N.astype('int64'), annot=True, fmt='d')
plt.xlabel('authorized_flag')
plt.ylabel('category_1')
plt.show()
OK, authorized_flag and category_1 aren't identical, just imbalanced in a similar way. If they had been identical we would have wanted to remove one or the other feature. As is, we'll keep both - especially considering we don't know what category_1 is representing.
Next up, we’ll look at the number of installments for each transaction - that is, how many payments were made in order to pay off a given purchase.
# installments
plt.hist(trans['installments'], bins=np.arange(-0.5, 12.5, 1.0))
plt.axvline(x=0, color='k', linestyle='--')
plt.ylabel('Count')
plt.xlabel('installments')
plt.yscale('log', nonposy='clip')
plt.show()
The maximum number of installments is only 11 because the dataset spans only one year - presumably installments are made on a monthly basis, and so no customers have made more than 11 installments over the course of the year.
The merchant_category_id column contains categorical information about the type of merchant with which the transaction was made. These categories have also been anonymized, so we have only integer values in this column, but presumably these integers correspond to merchant categories such as "fast food", "general contracting", "telecommunication", or "department stores". The ids may even correspond to merchant category codes. Let's take a look at the number of transactions per merchant category:
# merchant_category_id
plt.figure(figsize=(5, 40))
sns.countplot(y='merchant_category_id', data=trans,
order=trans['merchant_category_id'].value_counts().index)
plt.yticks(fontsize=8)
plt.gca().set(xscale='log')
plt.show()
Similarly, the subsector_id likely contains information about the type of merchant with which the transaction occurred. The value may be an anonymized version of the Brazilian equivalent of NAICS codes.
# subsector_id
plt.hist(hist_trans['subsector_id'], bins=np.arange(0.5, 41.5, 1.0))
plt.ylabel('Count')
plt.xlabel('subsector_id')
plt.yscale('log')
plt.show()
The merchant_id column contains the ID of the merchant with which the transaction occurred. The number of transactions per merchant shows a pretty clean log-log relationship (until we get to merchants with fewer than ~30 transactions ever):
# merchant_id
plt.plot(np.arange(1, trans['merchant_id'].nunique()+1),
trans['merchant_id'].value_counts().values)
plt.xlabel('Unique merchant_id')
plt.ylabel('Number of transactions')
plt.title('Number of transactions per merchant_id')
plt.xscale('log')
plt.yscale('log')
plt.show()
There's also a pretty clear log-log relationship between the number of purchases and the purchase amount, until we get to amounts with fewer than ~10 purchases (the dollar units are normalized, so some values are negative, and so I've offset the x-axis below to allow for a log x scale, and binned the purchase amounts).
# purchase_amount
counts, be = np.histogram(hist_trans['purchase_amount'],
bins=np.arange(-1, 6010605, 5))
plt.plot(be[:-1]-min(be[:-1])+1, counts)
plt.ylabel('Count')
plt.xlabel('purchase_amount')
plt.xscale('log')
plt.yscale('log')
plt.show()
The purchase_date column contains when the purchase occurred. This will likely be an important feature for our predictive model. Card owners who use their cards regularly (or whose use is increasing) are probably more likely to have a higher loyalty score. Also, in general, our model will need to know when each transaction occurred! Let's take a look at how many transactions occurred as a function of the date:
# Function to plot transactions over time
def transactions_over_time(df):
    # Count transactions per day
    tpd = pd.DataFrame()
    tpd['Transactions'] = (
        df['purchase_date'].dt.year*10000 +
        df['purchase_date'].dt.month*100 +
        df['purchase_date'].dt.day
    ).value_counts()
    tpd['Date'] = pd.to_datetime(tpd.index, format='%Y%m%d')
    tpd = tpd.sort_values('Date')  # plot in chronological order
    tpd.plot('Date', 'Transactions')
    plt.ylabel('Number of transactions')
    plt.show()
# purchase_date
transactions_over_time(trans)
Huh. The most obvious thing is the sudden dropoff in March 2018, and perhaps a smaller one in January 2018. This is actually just due to how the data has been split up into new and historical transactions. The historical transactions simply end in mid-March:
# purchase_date
transactions_over_time(hist_trans)
And the new transactions don't start (for most cards) until mid-March, but the number of transactions is far lower. I'm not actually sure why that is - it could be that not all new transactions were included in the dataset.
# purchase_date
transactions_over_time(new_trans)
The second interesting thing to notice is that there's a pretty significant weekly cycle (creating the high-frequency ups and downs in the plots above). Looking at the number of transactions as a function of day of the week, we can see that the number of transactions ramps up over the course of the week, plummets on Sunday, and then starts climbing again on Monday.
counts = (trans['purchase_date']
.dt.dayofweek.value_counts())
plt.bar(x=counts.index, height=counts.values)
plt.ylabel('Number of transactions')
plt.xticks(range(7), ['M', 'T', 'W', 'Th', 'F', 'Sa', 'Su'])
plt.show()
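That weekly cycle suggests another possible per-card feature for later - for example, the fraction of a card's purchases that happen on the weekend (a sketch, not run here):
# Fraction of each card's purchases made on Saturday or Sunday (sketch)
# weekend = trans['purchase_date'].dt.dayofweek >= 5
# weekend_frac = weekend.groupby(trans['card_id']).mean()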
Finally, the month_lag column contains the number of months between the transaction date and the time when the loyalty score was calculated. Again, because the dataset spans only about a year, we don't see many month_lag values less than -12. In the historical transactions there are none greater than 0, since those all occurred before the loyalty score was calculated; the new transactions are the ones with month_lag values of 1 and 2.
# month_lag
plt.hist(trans['month_lag'], bins=np.arange(-13.5, 0.5, 1.0))
plt.ylabel('Count')
plt.xlabel('month_lag')
plt.yscale('log')
plt.show()
Conclusion
Now that we’ve cleaned and gotten an understanding of the data, the next step is to connect the various data tables together and create features which a predictive model can use to predict customer loyalty. In the next post, we’ll focus on feature engineering!