Breaking the Black Box in Machine Learning — Part 2: Working with Categorical Features
Written by Ieshaka Karunaratne, Senior Data Scientist at OCTAVE and Dr. Rajitha Navarathna, Principal Data Scientist at OCTAVE
(resource: GitHub — slundberg/shap: A game theoretic approach to explain the output of any machine learning model.)
In the first part of this series, we saw how to use the SHAP library to break the black box of a machine learning model. If you haven't read it yet, you can find it here: https://medium.com/octave-john-keells-group/breaking-the-black-box-in-machine-learning-dc1d30c1b13f
In this article, we look at how to use SHAP to break the black box when working with algorithms such as LightGBM or CatBoost, which can consume categorical variables (columns with string values) directly, without encoding them. One option would be to encode the categorical features and proceed as we do with other machine learning algorithms. However, the ability to use categorical features without encoding is a standout feature of these algorithms, and it would be a waste to give it up just to run SHAP. So, does this mean we cannot use SHAP if we use this option?
Absolutely not! There is a workaround we can use. Before getting into it, let's first train a model that uses this feature. For this task, I'll be using the Titanic dataset, which you can download here (titanic_dataset.csv · GitHub).
· Importing the required libraries
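A minimal set of imports along these lines covers the whole walkthrough (shap is imported up front since we will try it later):

```python
import pandas as pd
import lightgbm as lgb
import shap

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```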
· Reading the data
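Assuming the file from the link above is saved locally as titanic_dataset.csv:

```python
# Load the Titanic data into a pandas DataFrame
df = pd.read_csv("titanic_dataset.csv")
df.head()
```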
· Checking the data types of the columns
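For example:

```python
# The string columns ("Name", "Sex", "Ticket", "Cabin", "Embarked")
# show up as the generic object dtype at this point
df.dtypes
```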
· Checking if the columns have missing values
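Something like:

```python
# Count the missing values per column; in the standard Titanic data,
# "Age", "Cabin" and "Embarked" are the columns with gaps
df.isnull().sum()
```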
· Since we don't have to care much about the accuracy of the model for this article, we'll fill the missing values of the "Age" column with its mean and the missing values of the two categorical columns with the string "None"
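A sketch of that imputation, assuming the two categorical columns with missing values are "Cabin" and "Embarked" (as in the standard Titanic data):

```python
# Mean-impute the numeric column, and use the literal string "None"
# as an explicit "missing" category for the two categorical columns
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Cabin"] = df["Cabin"].fillna("None")
df["Embarked"] = df["Embarked"].fillna("None")
```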
· Separating the categorical and numerical features and converting the categorical features to category type
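One way to do this; the exact feature lists are an assumption (identifier-like columns such as "PassengerId", "Name" and "Ticket" are dropped here):

```python
cat_features = ["Sex", "Cabin", "Embarked"]
num_features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

# LightGBM treats pandas "category" columns as categorical features
# automatically, with no label or one-hot encoding needed
for col in cat_features:
    df[col] = df[col].astype("category")

X = df[cat_features + num_features]
y = df["Survived"]
```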
· Splitting the data into train, validation, and test sets
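For example, a 70/15/15 split (the ratios and the random_state are arbitrary choices):

```python
# Hold out 30% first, then split that portion equally into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
```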
· Training the model
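A minimal training sketch using the scikit-learn style API (the hyperparameters are illustrative, not tuned; the callbacks argument assumes a reasonably recent LightGBM version):

```python
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=42)

# The "category" dtype columns are consumed directly as categorical features
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
```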
· Scoring the model with the test data set
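Something like:

```python
# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```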
There's no need to worry too much about the accuracy of the model, as it isn't the main objective of this article (LightGBM models tend to overfit when the training data is small). We now have a trained model that uses categorical features (columns with string values), and we need to break its black box.
Why can't we simply call the SHAP library to get the SHAP values for the features, as we did in the first article? Let's try that now.
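The usual call looks like this; with string-valued categorical features it fails with an error along the lines of "could not convert string to float":

```python
explainer = shap.TreeExplainer(model)

# This raises an error: SHAP tries to cast the input matrix to float,
# which is impossible for the string-valued categorical columns
shap_values = explainer.shap_values(X_test)
```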
What has happened here is that SHAP tries to convert the features into float values in order to calculate the SHAP values; since we have string values, it cannot perform that conversion. At the time of writing, the SHAP library cannot handle string values when computing SHAP values.
So how can we get the feature contributions (SHAP values)? Luckily, both LightGBM and CatBoost give us a workaround, built into the algorithms themselves. We only have to call the model's predict method and set an additional argument to True; in LightGBM's case, it is pred_contrib. When we pass pred_contrib=True, the output we receive is the feature contribution (SHAP value) of each variable, with the expected value appended as the last entry.
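With LightGBM's scikit-learn API, the call looks like this; for binary classification the result has one column per feature plus a final column holding the expected (base) value:

```python
# pred_contrib=True returns per-feature contributions instead of predictions
contribs = model.predict(X_test, pred_contrib=True)
print(contribs.shape)  # (n_samples, n_features + 1)

# Split the output into SHAP values and the expected value
shap_values = contribs[:, :-1]
expected_value = contribs[0, -1]  # same base value for every row
```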
Now we can use these values to draw the SHAP plots that I discussed in the first article.
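For example, a summary plot for a global view and a force plot for a single prediction (shap.initjs() enables the interactive force plot in a notebook; note that the summary plot cannot color points by value for the non-numeric features):

```python
shap.initjs()

# Global view: distribution of each feature's impact across the test set
shap.summary_plot(shap_values, X_test)

# Local view: explanation of the first test-set prediction
shap.force_plot(expected_value, shap_values[0, :], X_test.iloc[0, :])
```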
To wrap up: this is the workaround we can use to break the black box when using the LightGBM or CatBoost algorithms with categorical features.
Happy learning!!!