The Different Outlier Types and the Importance of Detecting Them
Written by Dinusha Dissanayake, Senior Data Scientist at OCTAVE and Dr. Rajitha Navarathna, Principal Data Scientist at OCTAVE
Data provides enormous value in today’s world. It is recorded everywhere and analyzed to extract useful information, and new approaches and tools for analysis are created every day. One drawback of analysis is that data isn’t always perfect: it can contain anomalies or inconsistencies that jeopardize the entire process.
In the literature, there are many definitions of outliers. The details differ, but an outlier can generally be described as
“a data point that is significantly dissimilar to other data points or a point that does not imitate the expected typical behavior of the other points”.
In simpler terms, outliers are data points that fall outside of an expected distribution.
As the old adage “every coin has two sides” suggests, an outlier can be either a bad data point that disrupts analysis or a unique point that leads to interesting findings.
In the first case, outliers are a nuisance for data analysts trying to identify trends in the data because they distort or bias the results. It is crucial to detect and eliminate them during the exploratory analysis phase, before digging deep into model construction, since this allows for more precise insights.
On the other hand, analysts might gain crucial information by spotting outliers, which can help them make better data-driven decisions. It can be worthwhile to dig deeper into an outlier to find out what makes it so special: it may signal an emerging trend, a unique element, or the reason behind a scientific finding.
This kind of outlier detection is used in a variety of applications, including fraud detection, medical diagnosis, network intrusion detection, military surveillance, fault detection in safety-critical systems, detection of mechanical faults or changes in system behavior, detection of human error, and data mining applications such as spotting sudden deviations in sales data.
Types of outliers
Outlier detection, as previously indicated, has a wide range of applications and benefits. It’s crucial to understand the different sorts of outliers before spotting them. If we start looking for outliers in a population blindly, there’s a great possibility we’ll miss some of the outlier types.
Type 1: Global outliers:
Global outliers, also known as “point anomalies,” deviate significantly from the rest of the data. Regardless of the feature, a data point that deviates from the global distribution is considered an outlier. This is the simplest type of outlier and is relatively easy to identify compared with the others.
This can be explained with a real-life example.
Let’s say the price per item in a pharmacy is between 100 and 5,000. If any transaction shows a price of 10,500, it can be treated as an outlier.
Most probably, it is a human error made while recording the data.
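As a rough illustration of the pharmacy example, the sketch below flags global outliers with the interquartile range (IQR) rule. The IQR rule is one common choice, not a method this article prescribes, and the function name find_global_outliers, the multiplier k=1.5, and the sample prices are all assumptions made up for the example.

```python
from statistics import quantiles


def find_global_outliers(values, k=1.5):
    """Return the values lying outside the IQR fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the whole population
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical item prices; most fall in the expected 100-5,000 range.
prices = [120, 450, 980, 1500, 2300, 3100, 4800, 10500, 760, 2100]
print(find_global_outliers(prices))  # [10500]
```

Because the fences are computed from the whole population, no background knowledge is needed here, which is why this type is the easiest to detect.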
Type 2: Contextual outliers:
A contextual outlier, or conditional outlier, is a data point that is anomalous in a specific context or under a specific condition. This means the same data point would be seen as normal in a different context.
Contextual outliers are hard to spot without background information. The context has to be defined at the problem formulation stage, in terms of the target domain.
To take a common example, a temperature of 2 degrees might be normal in winter in the UK, but in summer it is an outlier. Similarly, a temperature of 1 degree is normal in Antarctica but is an outlier in Sri Lanka.
It is hard to spot these outliers without knowing the context. For example, if you did not know which season the temperatures were recorded in, or what temperatures are typical in each country, these data points might look valid when viewed against the whole population of data.
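The temperature example can be sketched by comparing each reading with the typical value of its own context instead of the whole population. This is only a minimal illustration: the function name find_contextual_outliers, the median-plus-tolerance rule, the tolerance of 10 degrees, and the sample readings are all assumptions, not values taken from real data.

```python
from collections import defaultdict
from statistics import median


def find_contextual_outliers(readings, tolerance=10.0):
    """Flag (context, value) pairs that lie far from the median of their own context."""
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)

    # One typical value per context, instead of one for the whole population.
    typical = {context: median(values) for context, values in by_context.items()}
    return [(c, v) for c, v in readings if abs(v - typical[c]) > tolerance]


# Hypothetical temperature readings labelled with their context.
readings = [
    ("UK winter", 2), ("UK winter", 4), ("UK winter", 1), ("UK winter", 3),
    ("UK summer", 21), ("UK summer", 19), ("UK summer", 2),
    ("UK summer", 23), ("UK summer", 20),
]
print(find_contextual_outliers(readings))  # [('UK summer', 2)]
```

The same value of 2 is accepted in the winter group and flagged in the summer group. A check on the pooled values has no way to tell the two readings apart, which is why the context label has to be part of the data.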
Type 3: Collective outliers:
A collective outlier is a collection of data points that, taken together, differs from the rest of the data set. The individual data objects need not be outliers themselves, but when seen as a whole, they may behave as outliers. To detect this type of outlier, we again need background information, such as the relationship between the data points.
Across all of the above types, knowing the context is an advantage, since global or collective outliers may also be incorrectly identified or missed without it.
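One simple kind of collective outlier is a “flat line”: a run of readings that are each unremarkable on their own but suspicious as a group, because the series normally fluctuates. The sketch below looks only for such runs of identical consecutive values. It assumes that the relationship between data points is their order in a sequence, and the function name find_flat_runs, the min_length threshold, and the sample series are hypothetical.

```python
from itertools import groupby


def find_flat_runs(series, min_length=4):
    """Return (start_index, end_index, value) for each run of identical
    consecutive values that is at least min_length long."""
    runs = []
    index = 0
    for value, group in groupby(series):  # groups consecutive equal values
        length = len(list(group))
        if length >= min_length:
            runs.append((index, index + length - 1, value))
        index += length
    return runs


# Hypothetical sensor series: the value 10 is normal by itself,
# but five identical readings in a row are unusual for this series.
series = [10, 12, 9, 11, 10, 13, 10, 10, 10, 10, 10, 12, 9, 11]
print(find_flat_runs(series))  # [(6, 10, 10)]
```

A point-by-point check would accept every value in this series, since 10 also appears as a normal reading elsewhere. Only the grouping of the points reveals the anomaly.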
Conclusion
Data and analysis are becoming an increasingly important aspect of today’s environment, which is characterized by a data-driven decision-making culture. When working with data, it is impossible to avoid identifying and dealing with outliers, as outliers can cause a significant gap in data quality. Furthermore, in other circumstances, such as fraud detection, outliers might be the primary focus of decision-making. As outliers can distort trends and have a significant impact on the final outcome, outlier detection is an important tool in a data-driven decision-making culture.