Option 1: Reducing features
We identify three families of techniques for reducing features: feature selection, feature extraction, and embeddings.
Below, we present a number of methodologies for each family.
Not all of the features collected will be useful for the predictive task at hand. Feature selection techniques reduce the number of variables fed to the machine learning model. A very large number of feature selection techniques exist; here we cover some common ones that can be implemented with the sklearn library:
Variance Threshold: removes features whose variance falls below a given threshold.
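As a minimal sketch with sklearn's VarianceThreshold, using a toy matrix (the data here is illustrative, not from the notebook):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is constant (zero variance),
# so it carries no information for prediction.
X = np.array([
    [1.0, 2.0, 0.5],
    [1.0, 4.0, 0.1],
    [1.0, 6.0, 0.9],
    [1.0, 8.0, 0.3],
])

# Drop every feature whose variance is at or below the threshold.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)  # (4, 2): the constant column is removed
```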
Univariate feature selection: keeps the features that perform best according to a univariate statistical test.
The k best features can be selected according to the score, the false positive rate, the false discovery rate, or the family-wise error rate. We could also select the features scoring above a given percentile.
The univariate tests include:
Computing mutual information, ANOVA and chi-squared statistics for classification tasks with discrete target variables.
Computing mutual information and F-statistic for regression with a continuous target variable.
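A short sketch of selecting the k best features with an ANOVA F-test, using the iris dataset as a stand-in classification task:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-scores
# (f_classif suits a classification task with a discrete target;
# mutual_info_classif or chi2 could be swapped in as score_func).
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)  # (150, 2)
```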
Recursive feature elimination: an external estimator is trained on the dataset and its features are ranked according to their impact on the prediction. The least important features are then removed, and the process is repeated until the desired number of features remains.
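A sketch with sklearn's RFE; the choice of a decision tree as the external estimator and of the breast-cancer dataset is illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# External estimator whose feature_importances_ rank the features.
estimator = DecisionTreeClassifier(random_state=0)

# Drop the least important feature at each iteration until 5 remain.
rfe = RFE(estimator=estimator, n_features_to_select=5, step=1)
rfe.fit(X, y)

print(rfe.support_.sum())  # 5 features kept out of 30
```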
Model-based feature selection: any model that assigns an importance score to features can be used to rank them. Examples include Lasso for regression, and logistic regression or SVMs for classification; decision trees are also a possible choice.
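A sketch using SelectFromModel with Lasso on sklearn's diabetes regression dataset (the dataset and alpha value are illustrative assumptions):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Lasso's L1 penalty drives some coefficients to exactly zero;
# SelectFromModel keeps only the features with weights above the
# (near-zero) default threshold used for L1-penalized models.
selector = SelectFromModel(estimator=Lasso(alpha=1.0))
X_new = selector.fit_transform(X, y)

print(X_new.shape[1], "features kept out of", X.shape[1])
```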
Sequential feature selection: this algorithm either adds (forward selection) or removes (backward selection) one feature at a time, greedily picking the best candidate at each step.
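A sketch of forward selection with sklearn's SequentialFeatureSelector; the k-nearest-neighbors estimator and the iris dataset are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start with no features and greedily add the one
# that most improves cross-validated accuracy, until 2 are selected.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)

print(sfs.get_support().sum())  # 2 features selected
```

Passing direction="backward" instead would start from the full feature set and remove one feature at a time.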
You can find an example of feature selection for data minimization in the notebook, which can be visualized here, or downloaded as the following file: