A Complete Guide to Exploratory Data Analysis on Structured Data
Exploratory Data Analysis (EDA) is used to analyze and understand data sets and to unearth hidden patterns and insights. In Data Science, it is often the preliminary step in any advanced analytics task.
There are numerous objectives of conducting an EDA:
● Maximizing the insight that can be gained from a data set.
● Uncovering the underlying structure.
● Extracting important variables.
● Detecting outliers and anomalies.
● Testing underlying assumptions.
● Developing and testing hypotheses.
● Determining the optimal factor settings.
If done properly, EDA answers the following core questions:
- How do we dissect a data set?
- What do we look for in data?
- How do we look at data?
- How do we interpret data?
The objective of this article is not to explain these techniques in detail, but to provide a guideline into which technique to apply when. The types of techniques that one might wish to apply would depend on the type of data and the objective of the task. We can divide the data into two types: structured and unstructured. In this article, we will only be focusing on structured data.
The following matrix can be used as a framework when selecting the EDA techniques based on the type of EDA and whether the requirement is a graphical or a non-graphical representation.
Each of the techniques listed in this matrix will be discussed with examples in the following section.
Types of EDA
In this section, a marketing campaign dataset available on Kaggle is used to discuss the relevant techniques with examples of their usage. This data set contains customer information, such as demographics and purchase history, along with each customer's response to a particular marketing campaign. We will analyze this dataset to understand it and derive useful insights using various EDA techniques.
1. Univariate Analysis
Univariate Analysis is the simplest form of analyzing data. It does not deal with causes or relationships; its main purpose is to describe the data by summarizing it, finding patterns, and drawing conclusions from the findings. A few techniques under this analysis are:
● Central Tendency Measures (mean, median, and mode)
● Dispersion/Spread of Data (range, minimum, maximum, quartiles, variance, and standard deviation)
Moreover, frequency distribution tables, histograms, pie charts, and bar charts can be used to visualize these patterns.
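The central tendency and dispersion measures above can be sketched with pandas. This is a minimal illustration on a made-up `income` series, not the actual Kaggle campaign data:

```python
import pandas as pd

# Hypothetical stand-in for an Income column from the campaign data.
income = pd.Series([28_000, 35_000, 41_000, 52_000,
                    52_000, 61_000, 75_000, 120_000])

# Central tendency measures
mean = income.mean()
median = income.median()
mode = income.mode().iloc[0]   # most frequent value

# Dispersion/spread measures
spread = {
    "range": income.max() - income.min(),
    "q1": income.quantile(0.25),
    "q3": income.quantile(0.75),
    "variance": income.var(),
    "std": income.std(),
}

print(mean, median, mode)
print(spread)
```

`income.hist()` (or a bar/pie chart for categorical columns) would then visualize the same distribution graphically.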
2. Multivariate Analysis
The main objectives of Multivariate Analysis are:
● Identifying correlation and dependencies of variables
● Constructing and testing a Hypothesis
● Gaining insights for feature selection and engineering
Correlation matrices, line charts, scatterplots and heatmaps can be used to visualize these patterns.
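A correlation matrix is often the first multivariate step. The sketch below uses a small invented frame whose column names (Income, Spend, Age) are illustrative only:

```python
import pandas as pd

# Toy data: spending rises with income, age falls with income here.
df = pd.DataFrame({
    "Income": [30, 45, 60, 75, 90],
    "Spend":  [5, 9, 14, 18, 23],
    "Age":    [55, 40, 35, 30, 25],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()
print(corr.round(2))
```

Rendering `corr` as a heatmap (for example with seaborn's `heatmap`) makes the strong positive Income–Spend and negative Income–Age relationships visible at a glance.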
3. Clustering
Clustering reduces a large data set to meaningful subgroups. The division is based on the similarity of the objects across a set of specified characteristics.
Four main rules apply when developing clusters:
● Should be different
● Should be reachable
● Should be measurable
● Should be profitable (large enough to be noteworthy)
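As a sketch of how such subgroups can be found, the example below runs k-means on two synthetic customer groups (low income/low spend vs. high income/high spend); the data and group centers are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up customer segments in (income, spend) space.
rng = np.random.default_rng(0)
low = rng.normal(loc=[30, 5], scale=2, size=(50, 2))
high = rng.normal(loc=[90, 25], scale=2, size=(50, 2))
X = np.vstack([low, high])

# Ask k-means for two clusters; each customer gets a cluster label.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

In practice the number of clusters is unknown and is chosen with diagnostics such as the elbow method or silhouette scores, after which each cluster is profiled against the four rules above.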
4. Pareto Analysis
Commonly known as the ‘80/20 rule’, Pareto Analysis statistically separates the limited number of input factors that have the greatest impact on an outcome from the rest. This principle applies to many day-to-day scenarios in the real world.
Vilfredo Pareto, an Italian economist and sociologist, introduced this theory in 1896 by demonstrating that 80% of the wealth of Italy was held by 20% of the population, while the remaining 20% of the wealth was distributed among the other 80% of the population.
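The 80/20 idea can be checked directly on a ranked measure. The figures below are invented revenue-per-customer values, chosen only to illustrate the calculation:

```python
import numpy as np

# Hypothetical revenue per customer.
revenue = np.array([60, 300, 40, 500, 30, 25, 20, 15, 6, 4], dtype=float)

# Rank customers from largest to smallest contribution.
revenue = np.sort(revenue)[::-1]

# Cumulative share of total revenue as we move down the ranking.
cum_share = np.cumsum(revenue) / revenue.sum()

# Share of total revenue contributed by the top 20% of customers.
top20_share = cum_share[int(len(revenue) * 0.2) - 1]
print(top20_share)
```

A Pareto chart is the same idea drawn graphically: a descending bar chart of the contributions with the cumulative-share curve overlaid.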
5. Frequent Itemset Mining
Characterized by ‘thinking outside the basket’, this technique is widely used in Market Basket Analysis. It is a rule-based Machine Learning method for discovering relations between variables in large databases by analyzing patterns and co-occurrences. Apriori and FP-Growth are two common algorithms used for Association Rule Mining. When applying Frequent Itemset Mining in EDA, the set of attribute values in a record plays the role of the items in a market basket, just as the items bought together in a transaction form that basket's item list.
6. Outlier Detection
This is the process of identifying unexpected items or events in data sets that differ from the norm. Anomalous data can indicate critical incidents, such as a technical glitch; potential opportunities, for instance a change in consumer behavior; or simply erroneous data. Although there are many methods for identifying outliers, it is important to apply domain knowledge when interpreting them.
Visualization:
● Scatter plots
Machine Learning models:
● One-class SVM
● DBSCAN (a density-based clustering method)
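Before reaching for the model-based methods above, a simple statistical rule often suffices. This sketch flags values outside 1.5×IQR of the quartiles; the spend figures are invented, with one deliberately injected anomaly:

```python
import numpy as np

# Hypothetical customer spend values; 9500 is an injected anomaly.
spend = np.array([120, 150, 130, 145, 160, 155, 140, 9_500], dtype=float)

# Interquartile-range (IQR) rule: flag points far outside the quartiles.
q1, q3 = np.percentile(spend, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = spend[(spend < lower) | (spend > upper)]
print(outliers)
```

Whether a flagged point is a glitch, an opportunity, or bad data is then a domain-knowledge question, not a statistical one.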
7. Volatility
Volatility is a commonly used technique in stock analysis and an ideal measure of variation for time-series data. It quantifies how much the data changes from one interval to the next, and there are many ways to calculate it. For example, historic volatility can be computed at a daily level for a sample stock data set.
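One common formulation of historic volatility is the standard deviation of daily log returns, annualized over roughly 252 trading days. The closing prices below are made up for illustration:

```python
import numpy as np

# Hypothetical daily closing prices for a sample stock.
close = np.array([100.0, 101.5, 100.8, 102.2, 103.0, 102.5, 104.1])

# Daily log returns: ln(P_t / P_{t-1}).
log_returns = np.diff(np.log(close))

daily_vol = log_returns.std(ddof=1)      # sample std of daily returns
annual_vol = daily_vol * np.sqrt(252)    # annualized over ~252 trading days
print(daily_vol, annual_vol)
```

Computed over a rolling window, the same quantity shows how a stock's riskiness evolves through time.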
There are numerous EDA techniques, and it may be impractical to conduct all of these analyses during the short life cycle of a Data Science project. Moreover, not every technique is useful for every analytics task. Hence, it is important to select the most suitable techniques based on the problem and the dataset at hand.
Written by: Chamodi Adikaram, Data Scientist.