Making Exploratory Data Analysis Easy:
Written by Ieshaka Karunaratne, Senior Data Scientist at OCTAVE and Dr. Rajitha Navarathna, Principal Data Scientist at OCTAVE
A significant share of a data scientist’s or data analyst’s time is spent on exploratory data analysis (EDA) and data cleaning. This is an important part of the data science life cycle: if we feed bad data into a model, its output will be inaccurate too. This article looks at a Python library called ‘klib’, written by Andreas Kanz, that assists with EDA and data cleaning. With this library you can:
- Check for missing values in data
- Check the correlation between the features
- Perform basic cleaning of the data
- Check the distribution of the categorical and numerical features
Let’s get hands-on with the library.
You can install the library by running “pip install klib”.
Let’s import all the libraries needed.
To test the functionality of the library, we need a sample dataset. I have chosen the well-known Titanic dataset, but any other dataset will work as well.
Now let’s see ‘klib’ in action.
When we get a new dataset, the first thing most of us do is check for missing values. ‘klib’ gives us, in a single line of code, a plot that shows which columns have missing values.
From this plot you can see that the Cabin and Age columns have the greatest number of missing values. The plot also reveals rows in which most of the features are missing (there are no such rows in this example).
Now let’s see what we can do with categorical data. ‘klib’ has a one-line call that plots how the categorical features are distributed in the dataset. From this plot we can see how many unique values each categorical feature has and how frequent they are.
Now let’s see what we can do with numerical features. The first thing that comes to mind with numerical features is to check the correlation among them. ‘klib’ has two functions for this: one returns the correlation matrix and the other draws the correlation plot.
Here, negative correlations are marked in red and positive correlations in black for easy identification.
Another thing we do with numerical features is check their distribution. ‘klib’ has a one-line call for this as well.
Now let’s move on to data cleaning. ‘klib’ has a cleaning function that does several things when called:
· Cleaning the column names: this gives the column names a common format, e.g. it removes special characters and leading and trailing white space and converts the names to lower case.
· Dropping empty and almost empty columns and rows: this drops any column or row in which more than 90% of the values are null (90% is the default). We can change the thresholds by passing the appropriate arguments to the function.
· Dropping single-valued columns: this drops columns that have the same value in every row.
· Dropping duplicate rows: this drops duplicate rows. If you are working with a dataset where duplicate rows are meaningful, you can skip this step by passing the ‘drop_duplicates’ parameter as ‘False’.
· Optimizing the data types: this converts each column to an appropriate data type that reduces memory usage.
All of the steps above are carried out when we call the ‘data_cleaning’ function from ‘klib’. Without it, we would have to write many more lines of code to achieve the same result.
Let’s check the memory usage and the data types of the columns before and after performing the data cleaning with ‘klib’.
You can see that the data types of the columns have changed to reduce memory usage.
These are some of the functions we can use to make our EDA less time-consuming. You can read more about this library on GitHub (akanz1/klib: Easy to use Python library of customized functions for cleaning and analyzing data). We hope this article helps you speed up your EDA.
Happy learning!!!