Feature Selection with Univariate Filtering

OCTAVE - John Keells Group
4 min read · Jan 17, 2023


Written by Dinusha Dissanayake, Senior Data Scientist at OCTAVE and Dr. Rajitha Navarathna, Principal Data Scientist at OCTAVE

Photo from Unsplash by Jacek

Food becomes more flavourful when the proper ingredients are used in the right proportions. Likewise, an analytical model becomes more effective and efficient when the right set of features is used.

The variables we use to estimate the target variable in a model are called features. In structured data, features usually correspond to columns, and it is not uncommon for models to have over 1,000 of them. Some of these variables will be valuable in revealing hidden business value, but keeping needless features in the model often leads to poor outcomes.

Image by Author: Impact of higher number of features

As a result, choosing the right set of features to utilise in a model is critical. The term used for this is feature selection.

Different ways of Feature Selection

Feature selection can be divided into three categories:

  1. Filtering methods
  2. Wrapper methods
  3. Embedded methods

Filtering methods

There are two types of techniques within filtering methods:

  1. Univariate
  2. Multivariate

Univariate approaches examine each feature on its own: each feature is scored against a set of criteria, ranked, and selected accordingly. The disadvantage is that relationships between features are ignored. Multivariate feature selection addresses this by considering the feature space as a whole.

However, before applying multivariate feature selection, it is always better to start with univariate methods, as they also help with feature understanding and data cleansing.

Basic Filtering approaches

Basic filtering approaches for univariate feature selection include simple data-quality checks as well as correlation-based and statistics-based methods.

Let’s look into this through an example. Assume the set of features below is used for an analysis:

Image by Author

We can observe the following in this example:

· Tenure has a high rate of missing values, so one can decide to drop this feature.

· Area has a constant value with no variation, so keeping it adds no value.

· Indexes, names, and other identifier-like columns in which almost every value is unique carry no information for determining the target variable. In the example, Name is a primary key and contains nothing useful for deriving the target value.

· Year of birth and age are highly correlated; keeping both adds no information, so one of them can be removed.

As shown above, a simple analysis of each feature on its own can already remove several features that add no value.
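As a rough illustration only, these checks could be expressed with pandas; the DataFrame and its column names below are hypothetical, loosely modelled on the example:

import pandas as pd

# Hypothetical data frame; column names follow the example above, values are made up
df = pd.DataFrame({
    "name": ["Amal", "Bimal", "Chamari", "Dilan"],
    "tenure": [5.0, None, None, None],
    "area": ["North", "North", "North", "North"],
    "year_of_birth": [1990, 1985, 2000, 1978],
    "age": [33, 38, 23, 45],
})

# 1. Columns with a high rate of missing values (e.g. more than 50% missing)
missing_rate = df.isna().mean()
too_many_missing = missing_rate[missing_rate > 0.5].index.tolist()

# 2. Constant columns: no variation, so no information
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# 3. Identifier-like string columns where every value is unique (e.g. primary keys)
obj_cols = df.select_dtypes(include="object").columns
id_like_cols = [c for c in obj_cols if df[c].nunique() == len(df)]

# 4. Highly correlated numeric pairs: keep only one variable of each such pair
corr = df[["year_of_birth", "age"]].corr().abs()
redundant_pair = corr.loc["year_of_birth", "age"] > 0.9

print(too_many_missing)   # ['tenure']
print(constant_cols)      # ['area']
print(id_like_cols)       # ['name']
print(redundant_pair)     # True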

Statistics-based Filter Methods

Another set of univariate filtering methods relies on statistical tests. A few of the most commonly used tests are described below, each followed by a short illustrative code sketch.

· F test:

The F test for linear regression determines whether an independent variable is significant, whereas the ANOVA F test determines whether the target variable is independent of the feature. Features that are independent of the target variable are unlikely to be very relevant.
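As one possible illustration (not part of the original example), scikit-learn's SelectKBest can be paired with the ANOVA F test scorer f_classif on synthetic data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 8 features, of which only 3 are informative
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Score each feature independently with the ANOVA F test and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print("F scores:", np.round(selector.scores_, 2))
print("Selected feature indices:", selector.get_support(indices=True))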

· Chi square test (χ²):

Using their frequency distributions, the Chi-squared test determines whether the occurrences of a certain feature and a specific class are independent. In feature selection, only the features that are highly dependent on the target variable are selected. A higher χ² value indicates that the null hypothesis of independence should be rejected.
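A minimal sketch using scikit-learn's chi2 scorer on made-up count data; note that this scorer requires non-negative feature values such as counts or frequencies:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy count data: rows are samples, columns are non-negative feature frequencies
X = np.array([
    [3, 0, 1, 2],
    [4, 1, 0, 3],
    [0, 5, 2, 0],
    [1, 4, 3, 1],
    [3, 0, 0, 2],
    [0, 6, 1, 0],
])
y = np.array([0, 0, 1, 1, 0, 1])  # class labels

# A higher chi-squared score means stronger dependence between the feature
# and the class, i.e. stronger evidence against the independence hypothesis
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("Chi-squared scores:", np.round(selector.scores_, 2))
print("p-values:", np.round(selector.pvalues_, 4))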

· Mutual information (MI):

This metric assesses how much information the presence or absence of a feature contributes to making a correct prediction. The MI value is always non-negative: a larger value suggests greater dependency between the two variables, whereas a value of 0 indicates independence.
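A small sketch, again on synthetic data, using scikit-learn's mutual_info_classif to estimate MI between each feature and the target:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data in which only some features carry information about the target
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

# Estimated MI between each feature and the target; values are non-negative,
# 0 indicates independence and larger values indicate stronger dependency
mi = mutual_info_classif(X, y, random_state=0)
print("Mutual information per feature:", np.round(mi, 3))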

· Pearson’s correlation

This metric measures the linear relationship between two continuous variables. It ranges from -1 to +1, with 0 indicating no linear correlation, negative values indicating a negative relationship, and positive values a positive relationship.
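A brief illustration with scipy.stats.pearsonr on synthetic data that has a roughly linear relationship:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)  # roughly linear relationship

# r close to +1 or -1 means a strong linear relationship; 0 means no linear relationship
r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")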

· Spearman’s rank correlation

This determines the strength of the monotonic relationship between two ordinal (categorical) or continuous variables, based on their ranks.
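A brief illustration with scipy.stats.spearmanr on synthetic data with a monotonic but non-linear relationship:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.exp(x) + rng.normal(scale=0.1, size=200)  # monotonic but non-linear

# Spearman works on ranks, so it captures monotonic (not just linear) relationships
rho, p_value = spearmanr(x, y)
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3g}")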

Conclusion

If none of the methods described reveal a relationship between a feature and the target variable, there may be little value in keeping it as a model feature. However, one downside of univariate analysis is that it evaluates only one variable at a time, which may hide a relationship between the target variable and a combination of attributes. It is important to keep that in mind when choosing features.

OCTAVE - John Keells Group

OCTAVE, the John Keells Group Centre of Excellence for Data and Advanced Analytics, is the cornerstone of the Group’s data-driven decision making.
