Option 1: Reducing features

We identify three families of techniques for reducing features: feature selection, feature extraction, and embeddings.

Below, we present a number of methodologies for each family.

Not all of the features collected will be useful for the predictive task at hand. Feature selection techniques help reduce the number of variables fed to the machine learning model. A very large number of feature selection techniques is available; here we cover some common ones that can be implemented with the sklearn library (short code sketches follow the list below):

  • Variance Threshold: it removes low-variance features from the dataset.

  • Univariate feature selection: it keeps the features that perform best according to a univariate statistical test.

    1. The k best features can be selected according to the score, the false positive rate, the false discovery rate, or the family-wise error rate. We could also select the features scoring above a given percentile.

    2. The univariate tests include the chi-squared test, the ANOVA F-value, and mutual information.

  • Recursive feature elimination: an external estimator is trained on the dataset and its features are ranked according to their impact on the prediction. The least important features are then removed, and this procedure is repeated until the desired number of features is reached.

  • Model-based feature selection: any model that assigns a measure of feature importance can be used to score features. Examples include Lasso for regression, and Logistic Regression or SVMs for classification. Decision trees are also a possible choice.

  • Sequential feature selection: this algorithm either adds (forward selection) or removes (backward selection) one feature at a time, iteratively picking the best candidate at each step.
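
As an illustration, here is a minimal sketch of variance thresholding with scikit-learn's VarianceThreshold; the 0.01 threshold and the toy data are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the third column is constant, so its variance is zero.
X = np.array([[0.0, 2.1, 1.0],
              [1.0, 1.9, 1.0],
              [0.5, 2.0, 1.0],
              [0.2, 2.2, 1.0]])

# Drop every feature whose variance falls below the chosen threshold.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)         # (4, 2): the constant column was removed
print(selector.get_support())  # boolean mask of the retained features
```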
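For univariate feature selection, the sketch below uses SelectKBest and SelectPercentile with the ANOVA F-test on the Iris dataset; the choice of k=2, the 50th percentile, and the scoring function are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score.
k_best = SelectKBest(score_func=f_classif, k=2)
X_k = k_best.fit_transform(X, y)
print(X_k.shape)  # (150, 2)

# Alternatively, keep the features scoring above the 50th percentile.
top_half = SelectPercentile(score_func=f_classif, percentile=50)
X_p = top_half.fit_transform(X, y)
print(X_p.shape)  # (150, 2)
```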
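The following is a sketch of recursive feature elimination with RFE; the logistic regression estimator, the synthetic data, and the target of 5 features are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which only 5 are informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Fit the estimator, rank features by their coefficients, drop the weakest,
# and repeat until only 5 features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_reduced = rfe.fit_transform(X, y)

print(X_reduced.shape)  # (500, 5)
print(rfe.support_)     # boolean mask of the retained features
```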
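For model-based feature selection, here is a sketch using SelectFromModel with a Lasso regressor (one of the scorers mentioned above); the synthetic data and the alpha value are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data with 15 features, 4 of them informative.
X, y = make_regression(n_samples=300, n_features=15, n_informative=4,
                       noise=5.0, random_state=0)

# Lasso drives the coefficients of uninformative features towards zero;
# SelectFromModel keeps the features whose coefficient exceeds its threshold.
selector = SelectFromModel(estimator=Lasso(alpha=1.0))
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)         # only the features with non-negligible coefficients remain
print(selector.get_support())  # boolean mask of the retained features
```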
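Finally, a sketch of sequential (forward) feature selection with SequentialFeatureSelector; the k-nearest-neighbours estimator and the target of 2 features are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start from no features and greedily add, at each step,
# the feature that most improves the cross-validated score.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
X_reduced = sfs.fit_transform(X, y)

print(X_reduced.shape)    # (150, 2)
print(sfs.get_support())  # boolean mask of the selected features
```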

You can find an example of feature selection for data minimization in the accompanying notebook, which can be viewed online or downloaded.
